Matched tumor/normal pairs--a use case for the GenomicDataCommons Bioconductor package
The NCI Genomic Data Commons (GDC) is a reboot of the approach that NCI uses to manage and expose genomic data and the associated clinical and experimental metadata. I have been working on a Bioconductor package that interfaces with the GDC API to provide search and data retrieval from within R.
In the first of what will likely be a set of use cases for the GenomicDataCommons package, I am going to address a question that came up on Twitter from @sleight82:
Asking for a non-tweeter: can you find matched control samples?
— Kenneth Daily (@sleight82) March 4, 2017
The answer is, “Yes.” I am going to assume that the “non-tweeter” friend was interested in finding matched tumor/normal data related to, for example, gene expression, and that the interest is in a specific TCGA disease category. So, I am going to answer the more specific question:
Can you find matched primary tumor/solid tissue normal samples and associated RNA-Seq gene expression files from TCGA breast cancer cases?
The GDC API puts some constraints on what can be done directly. But since we are working in R, we have a toolbox that allows us to build a solution out of component parts. The strategy that I am going to employ is a three-step approach:
- Find RNA-Seq gene expression files derived from “Solid Tissue Normal”
- Find RNA-Seq gene expression files derived from “Primary Tumor”
- Limit to cases from the first two steps that have gene expression results from both normal and tumor.
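Before touching the GDC, the three steps can be sketched on a toy table. Everything in the data frame below is made up (hypothetical file and case IDs), so this only illustrates the set logic and says nothing about the GDC API itself.

```r
# Toy file table: every ID here is invented for illustration.
toy_files <- data.frame(
  file_id     = c("f1", "f2", "f3", "f4", "f5"),
  case_id     = c("caseA", "caseB", "caseA", "caseC", "caseB"),
  sample_type = c("Solid Tissue Normal", "Solid Tissue Normal",
                  "Primary Tumor", "Primary Tumor", "Primary Tumor"),
  stringsAsFactors = FALSE
)

# Steps 1 and 2: split the files by the tissue they were derived from.
toy_normal <- toy_files[toy_files$sample_type == "Solid Tissue Normal", ]
toy_tumor  <- toy_files[toy_files$sample_type == "Primary Tumor", ]

# Step 3: keep only cases that appear in both sets, then the files for them.
toy_matched_cases <- intersect(toy_normal$case_id, toy_tumor$case_id)
toy_matched_files <- toy_files[toy_files$case_id %in% toy_matched_cases, ]

print(toy_matched_cases)
print(toy_matched_files$file_id)
```

Here caseC has a tumor file but no normal file, so it drops out along with its file f4.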
The code block below finds all “HTSeq - Counts” files that are derived from “Solid Tissue Normal” samples in the project “TCGA-BRCA”. I used files() %>% select(available_fields('files')) %>% results() to get an overview of the data available in the files() endpoint, followed by grep_fields() and available_values() to determine how to build filters.
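library(GenomicDataCommons)
library(dplyr)  # provides %>%, bind_rows(), and as_tibble()

# Normal-derived HTSeq count files from TCGA-BRCA, with case and sample details expanded
nl_ge_files = files() %>%
    GenomicDataCommons::filter(~ cases.samples.sample_type == 'Solid Tissue Normal' &
        cases.project.project_id == 'TCGA-BRCA' &
        analysis.workflow_type == "HTSeq - Counts") %>%
    expand(c('cases','cases.samples')) %>%
    results_all() %>%
    as_tibble()  # assumed final step: the source line ends in a dangling pipe, and the output further down is a tibble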
And do the same, but now looking for gene expression from tumors.
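# Tumor-derived HTSeq count files from TCGA-BRCA, built the same way
tm_ge_files = files() %>%
    GenomicDataCommons::filter(~ cases.samples.sample_type == 'Primary Tumor' &
        cases.project.project_id == 'TCGA-BRCA' &
        analysis.workflow_type == "HTSeq - Counts") %>%
    expand(c('cases','cases.samples')) %>%
    results_all() %>%
    as_tibble()  # assumed final step: the source line ends in a dangling pipe, and the output further down is a tibble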
Now we have two data frames describing the normal- and tumor-derived TCGA-BRCA gene expression files. Note that I asked the GDC to include case details, including case_id, as part of each record using expand(). By looking for the intersection of cases between these two sets of files, we can find cases that contain files derived from both tumor and normal tissue.
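# Flatten the per-file case records; .id keeps each file ID alongside its case
nl_cases = bind_rows(nl_ge_files$cases, .id='file_id')
tm_cases = bind_rows(tm_ge_files$cases, .id='file_id')
# Cases with both a normal-derived and a tumor-derived file
matchedcases = intersect(nl_cases$case_id,
                         tm_cases$case_id)
# how many matched cases?
length(matchedcases)
## [1] 112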
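The bind_rows() calls above rely on the .id argument, which copies the name of each list element into a new column. A minimal sketch on a made-up list (the names fileA and fileB and the case IDs are hypothetical stand-ins for a real cases column) shows the effect:

```r
library(dplyr)

# Hypothetical stand-in for a `cases` list column: one small data frame per
# file, with the list names playing the role of file IDs.
toy_cases <- list(
  fileA = data.frame(case_id = "case1", stringsAsFactors = FALSE),
  fileB = data.frame(case_id = "case2", stringsAsFactors = FALSE)
)

# .id = "file_id" copies each list name into a new file_id column.
toy_flat <- bind_rows(toy_cases, .id = "file_id")
print(toy_flat)
```

The result has one row per case record, so a file linked to more than one case would simply contribute several rows with the same file_id.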
We can now create a data.frame that contains file information for only the matched samples. Note that each element of the cases list is named by its file ID, which is how the file_id column in nl_cases and tm_cases was filled in.
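# File IDs belonging to matched cases (extracted as plain character vectors)
matched_nl_files = nl_cases$file_id[nl_cases$case_id %in% matchedcases]
matched_tm_files = tm_cases$file_id[tm_cases$case_id %in% matchedcases]
# Stack the normal and tumor file records for the matched cases
matched_tn_ge_file_info = rbind(subset(nl_ge_files, file_id %in% matched_nl_files),
                                subset(tm_ge_files, file_id %in% matched_tm_files))
head(matched_tn_ge_file_info)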
## # A tibble: 6 x 17
## data_type updated_datetime file_name submitter_id file_id file_size cases
## <chr> <chr> <chr> <chr> <chr> <int> <lis>
## 1 Gene Exp… 2018-09-08T01:0… 6598740f… 6598740f-64… 7d62fa… 261441 <dat…
## 2 Gene Exp… 2018-09-08T01:0… d7047346… d7047346-49… ec3848… 258388 <dat…
## 3 Gene Exp… 2018-09-08T01:0… e84e835c… e84e835c-ce… af091e… 252106 <dat…
## 4 Gene Exp… 2018-09-08T01:0… 2c6a02ee… 2c6a02ee-c8… eed6ce… 258278 <dat…
## 5 Gene Exp… 2018-09-08T01:0… 73bcbb60… 73bcbb60-38… bde7dd… 253859 <dat…
## 6 Gene Exp… 2018-09-08T01:0… f1afa28b… f1afa28b-fa… 61da65… 259028 <dat…
## # ... with 10 more variables: id <chr>, created_datetime <chr>,
## # md5sum <chr>, data_format <chr>, acl <list>, access <chr>,
## # state <chr>, data_category <chr>, type <chr>,
## # experimental_strategy <chr>
Since the gene expression data are not that big, we can use the GDC API to download the data files directly. The GenomicDataCommons uses a cache for files, so the first time the code below is run, data will be downloaded. After that, if the cache is in place, the cached files will be used.
# Download (or fetch from the cache) each matched file; returns the local file names
fnames = lapply(as.character(matched_tn_ge_file_info$file_id),
                function(file_id) {
                    gdcdata(file_id, progress = FALSE)
                })
Now, fnames should be a list of file names that can be read into and processed with R.
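As a rough sketch of that last step, the files could be combined into a single count matrix. This assumes each file is a two-column, tab-separated table without a header (gene ID, then count), with summary rows whose IDs start with "__", which is the usual HTSeq output; check this against the actual downloads. The helper read_counts() and the toy files are hypothetical and exist only so the example runs without any download.

```r
# Write two tiny made-up count files so the sketch runs offline.
toy_dir <- tempfile("toy_counts_")
dir.create(toy_dir)
toy_paths <- file.path(toy_dir, c("sample1.counts", "sample2.counts"))
writeLines(c("geneA\t10", "geneB\t0", "__no_feature\t5"), toy_paths[1])
writeLines(c("geneA\t7", "geneB\t3", "__no_feature\t2"), toy_paths[2])

# Hypothetical helper: read one two-column (gene ID, count) file.
read_counts <- function(path) {
  tab <- read.delim(path, header = FALSE,
                    col.names = c("gene_id", "count"),
                    stringsAsFactors = FALSE)
  # Drop summary rows such as "__no_feature".
  tab <- tab[!startsWith(tab$gene_id, "__"), ]
  setNames(tab$count, tab$gene_id)
}

# One column per file; assumes every file lists the same genes in the same order.
count_matrix <- sapply(toy_paths, read_counts)
colnames(count_matrix) <- basename(toy_paths)
print(count_matrix)
```

With real data, unlist(fnames) would take the place of toy_paths; read.delim() should also handle gzip-compressed files transparently.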
