GenomicDataCommons Example: UUID to TCGA and TARGET Barcode Translation

One of the features of the NCI Genomic Data Commons is that everything has a unique identifier in the form of a UUID. However, because many legacy projects and much of the literature do not use UUIDs but instead use TCGA sample barcodes, one simple use case for the GenomicDataCommons package is to map from the UUID for a file or a set of files back to the associated TCGA barcode(s).

Given a set of file ids (which I simulate here by fetching some from the GDC API), I build a small function that maps those file UUIDs back to the associated TCGA barcodes. Because the GDC also contains data from TARGET as well as Foundation Medicine, the same code will return associated legacy identifiers for those types of samples as well.

Get started by loading the GenomicDataCommons library:

library(GenomicDataCommons)
library(magrittr)  # provides the %>% pipe used throughout

Fetch some file IDs:

file_uuids = files() %>% results(size=15) %>% ids()
head(file_uuids)
## [1] "440e9ec5-8e61-4f75-b1d1-616941d9456d"
## [2] "978d124d-ae39-4fbb-9796-58babc948acb"
## [3] "3b13b023-0325-4d6c-850f-0f5534f7d42b"
## [4] "228f601c-b657-479b-bcf4-f25bb2c823e4"
## [5] "b245bdab-a585-4a55-8fb3-d92e1d12fb6b"
## [6] "25bcb389-4435-4365-a771-1e220086402f"

The TCGA barcodes are nested in the file records, but we can access them through the cases.samples.submitter_id field. The available_fields and available_values functions in the GenomicDataCommons package are useful for examining the available fields and their associated values to find information of interest. Another common approach is to fetch all available fields for a single record and then examine the result using str.

fres = files() %>%
    select(available_fields('files')) %>%
    results(size = 1)
str(fres)
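
As a complement to inspecting a full record with str, the field-discovery helpers can be queried directly. This is a minimal sketch that assumes a working network connection to the GDC API; the variable names and the choice of cases.project.program.name as the example field are illustrative only.

```r
library(GenomicDataCommons)

# All field names that can be selected on the 'files' endpoint
file_fields = available_fields('files')
length(file_fields)

# Narrow the list to fields whose name mentions 'submitter_id'
submitter_fields = grep_fields('files', 'submitter_id')
submitter_fields

# Values observed for one field (the field chosen here is just an example)
available_values('files', 'cases.project.program.name')
```

Searching by pattern is usually quicker than reading the full field list, which runs to several hundred entries for files.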

In the code below, I simply start with a files() query against the GenomicDataCommons API, filter to include only those files that match the supplied file_ids, and then gather the cases.samples.submitter_id values and file UUIDs into a data frame. The most complicated part (and the most fragile, since it will break if the GDC changes its data model) is the lapply statement that extracts the barcodes from the nested results.


TCGAtranslateID = function(file_ids, legacy = FALSE) {
    info = files(legacy = legacy) %>%
        filter( ~ file_id %in% file_ids) %>%
        select('cases.samples.submitter_id') %>%
        results_all()
    # The mess of code below is to extract TCGA barcodes
    # id_list will contain a list (one item for each file_id)
    # of TCGA barcodes of the form 'TCGA-XX-YYYY-ZZZ'
    id_list = lapply(info$cases,function(a) {
        # drill through the nested cases -> samples -> submitter_id levels
        a[[1]][[1]][[1]]})
    # so we can later expand to a data.frame of the right size
    barcodes_per_file = sapply(id_list,length)
    # And build the data.frame
    return(data.frame(file_id = rep(ids(info),barcodes_per_file),
                      submitter_id = unlist(id_list)))
}
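
The extract-and-expand step can be tried offline on a small mock of the nested structure. This sketch uses base R only; mock_cases, its shape, the file names, and the barcodes are all made up for illustration, and real GDC responses may be nested differently.

```r
# Hypothetical stand-in for info$cases: one element per file, with the
# barcodes three levels down. The IDs are invented, not real GDC records.
mock_cases = list(
    "file-a" = list(samples = list(list(submitter_id = "TCGA-XX-0001-01A"))),
    "file-b" = list(samples = list(list(
        submitter_id = c("TCGA-XX-0002-10A", "TCGA-XX-0002-01A")))),
    "file-c" = list(samples = list(list(submitter_id = character(0))))
)

# Same triple indexing as in the function above
id_list = lapply(mock_cases, function(a) a[[1]][[1]][[1]])

# Number of barcodes attached to each file
barcodes_per_file = sapply(id_list, length)

# Repeat each file ID once per barcode so both columns line up
long = data.frame(
    file_id = rep(names(id_list), barcodes_per_file),
    submitter_id = unlist(id_list, use.names = FALSE),
    stringsAsFactors = FALSE
)
long
```

A file with two barcodes contributes two rows, and a file with none (file-c here) contributes no rows at all, which is why the output can have more or fewer rows than the number of file IDs supplied.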

Now, we can translate our example file_uuids:

res = TCGAtranslateID(file_uuids)
head(res)
##                                file_id     submitter_id
## 1 440e9ec5-8e61-4f75-b1d1-616941d9456d TCGA-24-1467-01A
## 2 978d124d-ae39-4fbb-9796-58babc948acb TCGA-13-1496-10A
## 3 978d124d-ae39-4fbb-9796-58babc948acb TCGA-13-1496-01A
## 4 3b13b023-0325-4d6c-850f-0f5534f7d42b TCGA-13-0908-10A
## 5 228f601c-b657-479b-bcf4-f25bb2c823e4 TCGA-61-2097-01A
## 6 b245bdab-a585-4a55-8fb3-d92e1d12fb6b TCGA-61-1998-10A
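
Because one file can map to several barcodes, it is sometimes handy to regroup the long table by file. The sketch below rebuilds the six rows printed above by hand (so that it runs without the API) and uses base R's split; res_head, barcodes_by_file, and one_per_file are illustrative names.

```r
# The six rows shown above, re-entered manually so no network is needed
res_head = data.frame(
    file_id = c("440e9ec5-8e61-4f75-b1d1-616941d9456d",
                "978d124d-ae39-4fbb-9796-58babc948acb",
                "978d124d-ae39-4fbb-9796-58babc948acb",
                "3b13b023-0325-4d6c-850f-0f5534f7d42b",
                "228f601c-b657-479b-bcf4-f25bb2c823e4",
                "b245bdab-a585-4a55-8fb3-d92e1d12fb6b"),
    submitter_id = c("TCGA-24-1467-01A", "TCGA-13-1496-10A",
                     "TCGA-13-1496-01A", "TCGA-13-0908-10A",
                     "TCGA-61-2097-01A", "TCGA-61-1998-10A"),
    stringsAsFactors = FALSE
)

# Named list: file UUID -> character vector of its barcodes
barcodes_by_file = split(res_head$submitter_id, res_head$file_id)

# All barcodes for the file that appears twice in the table
barcodes_by_file[["978d124d-ae39-4fbb-9796-58babc948acb"]]

# One semicolon-separated string per file, e.g. for a summary column
one_per_file = sapply(barcodes_by_file, paste, collapse = ";")
```

Swapping the two arguments to split gives the reverse lookup, from barcode to the file UUIDs that reference it.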

This little function is a bit “niche”, but it does illustrate how one can leverage GenomicDataCommons package functionality to create useful higher-level functionality like ID mapping.

  • EDIT [01-02-2018]: Added legacy flag to function to allow mapping of legacy file UUIDs. See comment below for rationale.
Sean Davis
National Cancer Institute, NIH