Contributing to the Portal - Journey through the data dictionary
I was excited to start on this journey of submission to the TSHS Resources Portal and then "life" got in the way. Isn't that how it always is? The good news is that this blog series is going to keep me accountable. So, on a Friday afternoon I decided to tackle something that seemed easy -- the data dictionary. We have published on these data so there is already a data dictionary of sorts. I simply have to translate it into the template table.
The template is straightforward. There is even an example. The challenge is that my chosen dataset is not simple. Let me recap.
The dataset that I'm looking to upload is The Cancer Genome Atlas - Clinical Data Resource (TCGA-CDR). This is a compilation of the key clinical variables collected across the 33 different tumor types studied by TCGA. The purpose of the TCGA-CDR is to create a consistent dataset that all users of TCGA can reference. We took the time to curate and QC the data. We also derived and assessed four common survival endpoint variables. The hope is that this will allow reproducible outcomes research for users of TCGA molecular data.
As we describe in the associated paper, it was not trivial to create this compilation. However, in the end we focused on 33 collected and derived variables to create a useful reference dataset across all tumor types.
In translating to the template, my first challenge is the number of levels for many of the categorical variables. Of note, the cancer "type" variable has 33 levels by default. For better or worse, our data dictionary pointed to our abbreviations list for the full definition of each. Anyone who has wrestled with TCGA molecular data is intimately familiar with the type codes it uses. For instance, I use LGG and GBM almost exclusively ("lower grade glioma" and "glioblastoma"). But the codes alone are not helpful for the classroom, so I put all of the full definitions into the template.
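For anyone doing a similar translation, here is a minimal Python sketch of expanding codes into the "code = definition" entries a data dictionary needs. It is only an illustration, not how I actually filled in the template: the names `TYPE_LABELS` and `describe_levels` are made up for this example, and only the two codes mentioned above are included.

```python
# Hypothetical lookup table: only the two codes named in this post.
# The full abbreviations list has 33 entries.
TYPE_LABELS = {
    "LGG": "lower grade glioma",
    "GBM": "glioblastoma",
}


def describe_levels(codes, labels):
    """Return one 'CODE = definition' string per code, in the given order."""
    rows = []
    for code in codes:
        if code not in labels:
            # Fail loudly so an undefined abbreviation never reaches the dictionary.
            raise KeyError(f"No definition for type code {code!r}")
        rows.append(f"{code} = {labels[code]}")
    return rows


if __name__ == "__main__":
    for row in describe_levels(["LGG", "GBM"], TYPE_LABELS):
        print(row)
```

Raising an error on an unknown code is a deliberate choice here: a missing definition is exactly the kind of thing a classroom user would trip over.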
I crunched through the rest of the 33 variables in the same fashion, translating our data dictionary into the data dictionary template. I realized that much of our data is text. Some is more free-form, such as the location of a progressive event, and some is more consistent, such as gender or stage. A few are quirky, such as histological grade, which is granular for most tumor types (I, II, III, IV, X), but the bladder cancer team only collected "high" and "low."
There is only one variable that I will likely recode prior to uploading: the redaction indicator. Right now it's either "redacted" or <blank>, and I've made a note to change it to 1 = redacted, 0 = not redacted. The rest will be uploaded as is, quirks and all. This dataset may be a bit messier than some in the Portal. However, it lends itself to plenty of discussion about data cleaning, clinical data form generation, and outcome variable definitions. The associated paper brings in some more advanced survival topics as well.
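The redaction recode is simple enough to sketch in a few lines of Python. This is a rough illustration under my own assumptions, not the final script: the function name `recode_redaction`, the field name `redaction`, and the example rows are all hypothetical, and I am assuming that matching should ignore case and surrounding whitespace.

```python
def recode_redaction(value):
    """Map the text indicator to 1 = redacted, 0 = not redacted.

    Assumes the field is either "redacted" (any capitalization) or blank/None.
    """
    text = (value or "").strip().lower()
    if text == "redacted":
        return 1
    if text == "":
        return 0
    # Anything else is unexpected and should be reviewed by hand.
    raise ValueError(f"Unexpected redaction value: {value!r}")


if __name__ == "__main__":
    # Made-up rows for illustration only; these are not real records.
    rows = [
        {"id": "example-1", "redaction": "Redacted"},
        {"id": "example-2", "redaction": ""},
    ]
    for row in rows:
        row["redaction"] = recode_redaction(row["redaction"])
        print(row)
```

Refusing to guess at unexpected values keeps the recode honest: if a third value ever shows up, I want to see it rather than silently code it as 0.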
Walking through this exercise reminds me of why the TSHS Resources Portal is such an excellent tool. Yes, it has taken me a couple of hours to pull this together. However, this will be a one-and-done project. Once this dataset has made it into the portal I will not have to do this preparation exercise again... and neither will anyone else! Also, I have lived and breathed TCGA for over 7 years. So, I am very thankful to have peer reviewers look at the data dictionary to make sure I didn't skip over something that I take for granted.
Now that the data dictionary is done, I've got to move on to the "Introduction." We have a paper so it should be simple enough, right? Sounds like another good Friday-afternoon task. Stay tuned....
A note on permissions: As I have mentioned previously, these data are publicly available. They are posted as supplemental data for the paper, which is open access. However, it's not clear whether I have permission to re-post them. Since my last post, I've been in touch with the corresponding author, who reached out to the editor, who pointed me to a form, "Obtaining Permission to Use Cell Press Material." No one has concerns about me using these data, but I've got to get official permission from the journal. The check boxes on the form assume that I'll use a graph or table in a presentation or similar, so hopefully I can express my scenario clearly enough. More to come on this.