Matched tumor/normal pairs--a use case for the GenomicDataCommons Bioconductor package

The NCI Genomic Data Commons (GDC) is a reboot of the approach that NCI uses to manage and expose genomic and associated clinical and experimental metadata. I have been working on a Bioconductor package that interfaces with the GDC API to provide search and data retrieval from within R.

In the first of what will likely be a set of use cases for the GenomicDataCommons, I am going to address a question that came up on twitter from @sleight82

Asking for a non-tweeter: can you find matched control samples?

{{ tweet 837898737540198400 }}

The answer is, “Yes.” I am going to assume that what the “non-tweeter” friend meant was that he or she was interested in finding matched tumor/normal data related to, for example, gene expression, and that the interest is in a specific disease category or TCGA. So, I am going to answer the more specific question:

Can you find matched primary tumor/solid tissue normal samples and associated RNA-Seq gene expression files from TCGA breast cancer cases?

library(GenomicDataCommons)
## Loading required package: magrittr
## 
## Attaching package: 'GenomicDataCommons'
## The following object is masked from 'package:stats':
## 
##     filter

The GDC API puts some constraints on what can be done directly. But, since we are working in R, we have a toolbox that allows us to build a solution out of component parts. The strategy that I am going to employ is a three-step approach

  1. Find RNA-Seq gene expression files derived from “Solid Tissue Normal”
  2. Find RNA-Seq gene expression files derived from “Primary Tumor”
  3. Limit cases from #1 and #2 that have gene expression results from both normal and tumor.

In this code block, find all “HTSeq - Counts” files that are derived from “Solid Tissue Normal” in the project “TCGA-BRCA”. I used a combination of files() %>% select(available_fields('files') %>% results() to get an overview of the data available in the files() endpoint, followed by grep_fields() and available_values() to determine how to build filters.

nl_ge_files = files() %>%
    filter(~   cases.samples.sample_type=='Solid Tissue Normal' &
               cases.project.project_id == 'TCGA-BRCA' &
               analysis.workflow_type == "HTSeq - Counts") %>%
    GenomicDataCommons::select(c(default_fields('files'),
                                 'cases.case_id',
                                 'cases.samples.sample_type')) %>%
    results_all() %>%
    as.data.frame()

And do the same, but now looking for gene expression from tumors.

tm_ge_files = files() %>%
    filter(~   cases.samples.sample_type=='Primary Tumor' &
               cases.project.project_id == 'TCGA-BRCA' &
               analysis.workflow_type == "HTSeq - Counts") %>%
    GenomicDataCommons::select(c(default_fields('files'),
                                 'cases.case_id',
                                 'cases.samples.sample_type')) %>%
    results_all() %>%
    as.data.frame()

Now, we have two data frames describing the normal- and tumor-derived TCGA-BRCA gene expression files. Note that I asked the GDC to provide cases.case_id as part of the record using select(). By looking for the intersection of cases between these two sets of files, we can find cases that contain files derived from both tumor and normal tissue.

matchedcases = intersect(tm_ge_files$cases.case_id,
                         nl_ge_files$cases.case_id)
# how many matched cases?
length(matchedcases)
## [1] 112

We can now create a data.frame that contains file information for only the matched samples:

matched_tn_ge_file_info = rbind(subset(nl_ge_files,cases.case_id %in% matchedcases),
                                subset(tm_ge_files,cases.case_id %in% matchedcases))
head(matched_tn_ge_file_info)
##                                file_id                      data_type
## 1 0168550b-f361-49e2-832e-913e9e783240 Gene Expression Quantification
## 2 0389c08d-927d-4254-8d7f-f44963594db3 Gene Expression Quantification
## 3 06024e67-5afb-4bd7-98af-1a5ef7abd554 Gene Expression Quantification
## 4 091a502c-544e-421b-8024-62a4807cccf5 Gene Expression Quantification
## 5 0b1a91bc-78c2-4ca8-9067-cd18804a42a5 Gene Expression Quantification
## 6 0bcdef87-7c2c-4c26-a30a-7cefee80d0ca Gene Expression Quantification
##                   updated_datetime                 created_datetime
## 1 2016-09-29T17:33:56.586700-05:00 2016-05-30T18:48:02.044327-05:00
## 2 2016-09-29T17:37:26.925468-05:00 2016-05-29T11:14:20.886223-05:00
## 3 2016-09-29T17:41:15.388673-05:00 2016-05-29T10:24:00.271641-05:00
## 4 2016-09-29T17:45:55.150313-05:00 2016-05-30T18:35:58.038072-05:00
## 5 2016-09-29T17:49:06.065235-05:00 2016-05-30T18:31:01.014992-05:00
## 6 2016-09-29T17:50:12.284644-05:00 2016-05-26T21:20:48.935001-05:00
##                                              file_name
## 1 2eeb46a1-0422-4a77-b58d-3b7d57d51b44.htseq.counts.gz
## 2 2826bcb9-5ddb-49c5-96fd-a3dba5885e7a.htseq.counts.gz
## 3 3a04fb30-378c-48a0-a02e-85cf81b31b43.htseq.counts.gz
## 4 54a62dcd-c9a6-4a97-85c9-62e364766fca.htseq.counts.gz
## 5 95c0cf0b-7077-4a93-9590-3948e55c6e31.htseq.counts.gz
## 6 c3ed7c7a-b80a-4de2-8dfb-b60bcf336375.htseq.counts.gz
##                             md5sum data_format access     state
## 1 b11432b3e7e9ece3a8a31d84d7ddc693         TXT   open submitted
## 2 c4f5255f3e0d553ac2a2dbfd4eae8610         TXT   open submitted
## 3 bb66f1f847107d7d2ad310e6fa720712         TXT   open submitted
## 4 c4beca6838ff9edb43505a11e6547898         TXT   open submitted
## 5 7d2a48c37c4d834528c5de69695087ae         TXT   open submitted
## 6 2954040a9b69d1e8f5dd45e7a0db70e0         TXT   open submitted
##             data_category file_size
## 1 Transcriptome Profiling    258320
## 2 Transcriptome Profiling    257186
## 3 Transcriptome Profiling    255526
## 4 Transcriptome Profiling    255703
## 5 Transcriptome Profiling    257562
## 6 Transcriptome Profiling    254835
##                                 submitter_id            type file_state
## 1 2eeb46a1-0422-4a77-b58d-3b7d57d51b44_count gene_expression  processed
## 2 2826bcb9-5ddb-49c5-96fd-a3dba5885e7a_count gene_expression  processed
## 3 3a04fb30-378c-48a0-a02e-85cf81b31b43_count gene_expression  processed
## 4 54a62dcd-c9a6-4a97-85c9-62e364766fca_count gene_expression  processed
## 5 95c0cf0b-7077-4a93-9590-3948e55c6e31_count gene_expression  processed
## 6 c3ed7c7a-b80a-4de2-8dfb-b60bcf336375_count gene_expression  processed
##   experimental_strategy  acl                        cases.case_id
## 1               RNA-Seq open cab38ea8-4555-413c-b81e-4621aac4ef85
## 2               RNA-Seq open e666081b-caaf-4e8b-9446-807be71201a5
## 3               RNA-Seq open 4a032bad-e726-48f2-8f39-e3acc109cc91
## 4               RNA-Seq open a32cb96a-78bf-456e-a6ab-2d47b2c67ad4
## 5               RNA-Seq open 994ca1f5-ad10-44ec-aa21-71fc2940653b
## 6               RNA-Seq open 2021ed1f-dc75-4701-b8b8-1386466e4802
##         cases.samples
## 1 Solid Tissue Normal
## 2 Solid Tissue Normal
## 3 Solid Tissue Normal
## 4 Solid Tissue Normal
## 5 Solid Tissue Normal
## 6 Solid Tissue Normal

Since the gene expression data are not that big, we can use the GDC API to download the data files directly. I use the BiocParallel package to facilitate parallel downloads.

library(BiocParallel)
register(MulticoreParam())
destdir = tempdir() # use a temp directory for example, change for yourself
fnames = bplapply(as.character(matched_tn_ge_file_info$file_id),
                  function(file_id) {
                      gdcdata(file_id, progress = FALSE,
                              destination_dir = destdir,
                              overwrite = TRUE)
                  })

Now, fnames should be a list of file names that can be read into and processed with R.

sessionInfo()
## R Under development (unstable) (2016-10-26 r71594)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
## [1] GenomicDataCommons_0.99.8 magrittr_1.5             
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9       xml2_1.1.1        knitr_1.15.1     
##  [4] xtable_1.8-2      R6_2.2.0          stringr_1.2.0    
##  [7] httr_1.2.1        tools_3.4.0       data.table_1.10.4
## [10] miniUI_0.1.1      htmltools_0.3.5   assertthat_0.1   
## [13] yaml_2.1.14       lazyeval_0.2.0    rprojroot_1.2    
## [16] digest_0.6.12     tibble_1.2        bookdown_0.3     
## [19] shiny_1.0.0       readr_1.0.0       curl_2.3         
## [22] evaluate_0.10     mime_0.5          rmarkdown_1.3    
## [25] blogdown_0.0.18   stringi_1.1.2     methods_3.4.0    
## [28] backports_1.0.5   jsonlite_1.3      httpuv_1.3.3
comments powered by Disqus