Extracting Clinical Information Using the GenomicDataCommons Package

This short post introduces the gdc_clinical() function recently added to the GenomicDataCommons package.

The rich data model at the NCI Genomic Data Commons (GDC) includes clinical and biospecimen details. A recently added feature to the NCI GDC Data Portal is the ability to download tab-delimited files or JSON files for clinical and biospecimen details of samples. The details available in these simplified formats are also available via the GDC API.

library(GenomicDataCommons)
library(magrittr)  # attached in the session below; provides set_names()

The clinical information at the GDC is encapsulated in the cases records. Here, I introduce the gdc_clinical() function from the GenomicDataCommons package, which takes GDC case IDs as input and returns a list of four related data.frames:

  • main: important case metadata
  • demographic: basic demographic information
  • exposures: zero or more documented exposures
  • diagnoses: zero or more diagnoses per case

As an example application, we will examine the clinical details for 100 lung adenocarcinoma patients from TCGA (“TCGA-LUAD”). The case IDs are available through a basic cases() query, filtered to include only cases that belong to the TCGA-LUAD project (matched on the project.project_id field). The query to get these case IDs looks like:

case_ids = cases() %>%
    filter(~ project.project_id == 'TCGA-LUAD') %>%
    results(size=100) %>%
    ids()
head(case_ids)
## [1] "ffb0c0b7-165e-4439-b3e6-62431f40b7fe"
## [2] "f98ecd8a-b878-4f53-b911-20cd8e17281c"
## [3] "ccfdad76-cc45-447f-bed8-ede8f6a8844d"
## [4] "9fcdccae-676e-4071-93c3-23d2d3ab0c00"
## [5] "0232d299-4cdf-4fd7-9a5e-8d13c208b40c"
## [6] "bc4c4079-b449-485d-84e4-a40496e563e8"

These case_ids, each representing a single case (patient) in the GDC, can be fed directly to gdc_clinical().

clin_res = gdc_clinical(case_ids)

The result is a list of data.frames:

names(clin_res)
## [1] "demographic" "diagnoses"   "exposures"   "main"

The dimensions of these data.frames are instructive.

sapply(clin_res, dim) %>%
    t() %>%
    data.frame() %>%
    set_names(c('rows','columns'))
##             rows columns
## demographic  100      11
## diagnoses     82      23
## exposures     82      13
## main         100       8
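The row counts above show that not every table has one row per case. As a rough sketch of how one might check this, the base R snippet below counts, for each table, how many cases have no row, exactly one row, or several rows. The toy_res list and the rows_per_case() helper are hypothetical stand-ins with invented IDs and values, not the real gdc_clinical() result; the duplicated diagnosis for case "a" is there only to illustrate the "several rows" situation.

```r
# Toy stand-in for the list returned by gdc_clinical(); IDs and values are invented.
toy_res <- list(
  main = data.frame(case_id = c("a", "b", "c"),
                    primary_site = "Lung", stringsAsFactors = FALSE),
  demographic = data.frame(case_id = c("a", "b", "c"),
                           gender = c("female", "male", "female"),
                           stringsAsFactors = FALSE),
  exposures = data.frame(case_id = c("a", "c"),
                         years_smoked = c(20, 35), stringsAsFactors = FALSE),
  diagnoses = data.frame(case_id = c("a", "a", "c"),
                         tumor_stage = c("stage ia", "stage iib", "stage iia"),
                         stringsAsFactors = FALSE)
)

# Hypothetical helper: count rows per case in one table,
# keeping cases that have zero rows.
rows_per_case <- function(df, all_ids) {
  table(factor(df$case_id, levels = all_ids))
}

all_ids <- toy_res$main$case_id

# One row per table: number of cases with no, one, or several records.
coverage <- t(sapply(toy_res, function(df) {
  n <- rows_per_case(df, all_ids)
  c(none = sum(n == 0), one = sum(n == 1), many = sum(n > 1))
}))
coverage
```

Applied to the real clin_res, a non-zero "many" column would be a hint that joining on case_id will produce more than one row per case.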

And the column names for each are helpful to examine:

sapply(clin_res, colnames)
## $demographic
##  [1] "updated_datetime" "created_datetime" "gender"          
##  [4] "year_of_birth"    "submitter_id"     "state"           
##  [7] "race"             "demographic_id"   "ethnicity"       
## [10] "year_of_death"    "case_id"         
## 
## $diagnoses
##  [1] "case_id"                          
##  [2] "classification_of_tumor"          
##  [3] "last_known_disease_status"        
##  [4] "updated_datetime"                 
##  [5] "primary_diagnosis"                
##  [6] "submitter_id"                     
##  [7] "tumor_stage"                      
##  [8] "age_at_diagnosis"                 
##  [9] "vital_status"                     
## [10] "morphology"                       
## [11] "days_to_death"                    
## [12] "days_to_last_known_disease_status"
## [13] "created_datetime"                 
## [14] "state"                            
## [15] "days_to_recurrence"               
## [16] "diagnosis_id"                     
## [17] "tumor_grade"                      
## [18] "tissue_or_organ_of_origin"        
## [19] "days_to_birth"                    
## [20] "progression_or_recurrence"        
## [21] "prior_malignancy"                 
## [22] "site_of_resection_or_biopsy"      
## [23] "days_to_last_follow_up"           
## 
## $exposures
##  [1] "case_id"            "cigarettes_per_day" "weight"            
##  [4] "updated_datetime"   "alcohol_history"    "alcohol_intensity" 
##  [7] "bmi"                "years_smoked"       "submitter_id"      
## [10] "created_datetime"   "state"              "exposure_id"       
## [13] "height"            
## 
## $main
## [1] "updated_datetime" "submitter_id"     "case_id"         
## [4] "id"               "disease_type"     "created_datetime"
## [7] "state"            "primary_site"
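Several column names recur across the four tables. The sketch below copies the column names printed above into plain character vectors (so it runs without querying the GDC) and uses base R set operations to list the names shared by all tables and those specific to each one. The variable names cols, shared, and specific are illustrative only.

```r
# Column names copied from the sapply(clin_res, colnames) output above.
cols <- list(
  demographic = c("updated_datetime", "created_datetime", "gender",
                  "year_of_birth", "submitter_id", "state", "race",
                  "demographic_id", "ethnicity", "year_of_death", "case_id"),
  diagnoses = c("case_id", "classification_of_tumor",
                "last_known_disease_status", "updated_datetime",
                "primary_diagnosis", "submitter_id", "tumor_stage",
                "age_at_diagnosis", "vital_status", "morphology",
                "days_to_death", "days_to_last_known_disease_status",
                "created_datetime", "state", "days_to_recurrence",
                "diagnosis_id", "tumor_grade", "tissue_or_organ_of_origin",
                "days_to_birth", "progression_or_recurrence",
                "prior_malignancy", "site_of_resection_or_biopsy",
                "days_to_last_follow_up"),
  exposures = c("case_id", "cigarettes_per_day", "weight",
                "updated_datetime", "alcohol_history", "alcohol_intensity",
                "bmi", "years_smoked", "submitter_id", "created_datetime",
                "state", "exposure_id", "height"),
  main = c("updated_datetime", "submitter_id", "case_id", "id",
           "disease_type", "created_datetime", "state", "primary_site")
)

# Names present in every table.
shared <- Reduce(intersect, cols)
sort(shared)

# Names that remain in each table once the shared ones are set aside.
specific <- lapply(cols, setdiff, y = shared)
lengths(specific)
```

Besides case_id, the shared names are bookkeeping fields (created_datetime, updated_datetime, state, submitter_id), so they need to be disambiguated whenever the tables are combined.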

Note that each data.frame contains a case_id column by design, so the tables can be joined to each other as needed. Here the data relationships are not too complicated, but many-to-many relationships can arise, and these are hard to handle in a fully general way without some understanding of downstream use (what do we want to do with the clinical information?). In this relatively simple case, we can create a “master” data.frame by joining all the records from each data.frame.

library(dplyr)
full_clin = with(clin_res,
     main %>%
     left_join(demographic, by = "case_id") %>%
     left_join(exposures, by = "case_id") %>%
     left_join(diagnoses, by = "case_id"))

Above, I used the tidyverse approach, chaining dplyr left_join() calls. Base R merge() would also work.
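For readers who prefer base R, here is a minimal sketch of the merge() alternative on small made-up tables (the IDs and values are invented, and the tables share no column other than case_id to keep the output simple). Setting all.x = TRUE keeps every row of the left table, which mirrors a left join.

```r
# Made-up tables standing in for the elements of clin_res.
main <- data.frame(case_id = c("c", "a", "b"),
                   primary_site = c("Lung", "Lung", "Lung"),
                   stringsAsFactors = FALSE)
demographic <- data.frame(case_id = c("a", "b", "c"),
                          gender = c("female", "male", "female"),
                          stringsAsFactors = FALSE)
exposures <- data.frame(case_id = c("a", "c"),
                        years_smoked = c(20, 35),
                        stringsAsFactors = FALSE)
diagnoses <- data.frame(case_id = c("a", "c"),
                        tumor_stage = c("stage ia", "stage iib"),
                        stringsAsFactors = FALSE)

# Fold the tables together, left-joining each one onto the running result.
full_toy <- Reduce(
  function(x, y) merge(x, y, by = "case_id", all.x = TRUE),
  list(main, demographic, exposures, diagnoses)
)
full_toy
```

One difference to keep in mind: merge() sorts the result by the key column by default, whereas left_join() keeps the row order of the left table.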

dim(full_clin)
## [1] 100  52
colnames(full_clin)
##  [1] "updated_datetime.x"               
##  [2] "submitter_id.x"                   
##  [3] "case_id"                          
##  [4] "id"                               
##  [5] "disease_type"                     
##  [6] "created_datetime.x"               
##  [7] "state.x"                          
##  [8] "primary_site"                     
##  [9] "updated_datetime.y"               
## [10] "created_datetime.y"               
## [11] "gender"                           
## [12] "year_of_birth"                    
## [13] "submitter_id.y"                   
## [14] "state.y"                          
## [15] "race"                             
## [16] "demographic_id"                   
## [17] "ethnicity"                        
## [18] "year_of_death"                    
## [19] "cigarettes_per_day"               
## [20] "weight"                           
## [21] "updated_datetime.x.x"             
## [22] "alcohol_history"                  
## [23] "alcohol_intensity"                
## [24] "bmi"                              
## [25] "years_smoked"                     
## [26] "submitter_id.x.x"                 
## [27] "created_datetime.x.x"             
## [28] "state.x.x"                        
## [29] "exposure_id"                      
## [30] "height"                           
## [31] "classification_of_tumor"          
## [32] "last_known_disease_status"        
## [33] "updated_datetime.y.y"             
## [34] "primary_diagnosis"                
## [35] "submitter_id.y.y"                 
## [36] "tumor_stage"                      
## [37] "age_at_diagnosis"                 
## [38] "vital_status"                     
## [39] "morphology"                       
## [40] "days_to_death"                    
## [41] "days_to_last_known_disease_status"
## [42] "created_datetime.y.y"             
## [43] "state.y.y"                        
## [44] "days_to_recurrence"               
## [45] "diagnosis_id"                     
## [46] "tumor_grade"                      
## [47] "tissue_or_organ_of_origin"        
## [48] "days_to_birth"                    
## [49] "progression_or_recurrence"        
## [50] "prior_malignancy"                 
## [51] "site_of_resection_or_biopsy"      
## [52] "days_to_last_follow_up"
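The repeated .x and .y suffixes above come from columns such as state and submitter_id that appear in every table. One possible way to keep the joined names readable, sketched here on made-up data, is to prefix each non-key column with its table name before joining. The prefix_columns() helper is hypothetical and not part of GenomicDataCommons.

```r
# Hypothetical helper: prepend a table name to every column except the key.
prefix_columns <- function(df, prefix, key = "case_id") {
  rename <- names(df) != key
  names(df)[rename] <- paste(prefix, names(df)[rename], sep = ".")
  df
}

# Made-up tables that share the bookkeeping columns seen in the real data.
toy <- list(
  main = data.frame(case_id = c("a", "b"),
                    state = c("released", "released"),
                    submitter_id = c("S-1", "S-2"),
                    stringsAsFactors = FALSE),
  demographic = data.frame(case_id = c("a", "b"),
                           state = c("released", "released"),
                           gender = c("female", "male"),
                           stringsAsFactors = FALSE)
)

# Prefix each table with its own name, then left-join on case_id.
prefixed <- Map(prefix_columns, toy, names(toy))
joined <- Reduce(
  function(x, y) merge(x, y, by = "case_id", all.x = TRUE),
  prefixed
)
names(joined)
```

With this approach each column in the joined table records which source table it came from, at the cost of longer names.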

In conclusion, gdc_clinical() from the GenomicDataCommons package is a high-level function for retrieving unified and harmonized clinical information for any case in the NCI GDC repository.

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_0.7.7              GenomicDataCommons_1.4.3
## [3] magrittr_1.5             knitr_1.20              
## 
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.10.1 tidyselect_0.2.5           
##  [3] xfun_0.3                    purrr_0.2.5                
##  [5] lattice_0.20-35             htmltools_0.3.6            
##  [7] stats4_3.5.1                yaml_2.2.0                 
##  [9] rlang_0.3.0.1               pillar_1.3.0               
## [11] glue_1.3.0                  BiocParallel_1.14.2        
## [13] rappdirs_0.3.1              BiocGenerics_0.26.0        
## [15] bindrcpp_0.2.2              matrixStats_0.54.0         
## [17] GenomeInfoDbData_1.1.0      bindr_0.1.1                
## [19] stringr_1.3.1               zlibbioc_1.26.0            
## [21] blogdown_0.8                codetools_0.2-15           
## [23] evaluate_0.12               Biobase_2.40.0             
## [25] IRanges_2.14.12             GenomeInfoDb_1.16.0        
## [27] parallel_3.5.1              curl_3.2                   
## [29] Rcpp_0.12.19                readr_1.1.1                
## [31] backports_1.1.2             DelayedArray_0.6.6         
## [33] S4Vectors_0.18.3            jsonlite_1.5               
## [35] XVector_0.20.0              hms_0.4.2                  
## [37] digest_0.6.18               stringi_1.2.4              
## [39] bookdown_0.7                GenomicRanges_1.32.7       
## [41] grid_3.5.1                  rprojroot_1.3-2            
## [43] tools_3.5.1                 bitops_1.0-6               
## [45] RCurl_1.95-4.11             lazyeval_0.2.1             
## [47] tibble_1.4.2                crayon_1.3.4               
## [49] pkgconfig_2.0.2             Matrix_1.2-14              
## [51] xml2_1.2.0                  assertthat_0.2.0           
## [53] rmarkdown_1.10              httr_1.3.1                 
## [55] R6_2.3.0                    compiler_3.5.1