Sean Davis @ NCI

Infrastructure-as-Code: Building the Bioconductor Conference AMI With Packer

One of the main features of the annual Bioconductor Conference is the proportion of time spent working with code in workshops. To support these workshops, we ask presenters to supply R Markdown materials, which we collate into the workshop materials. Using literate programming approaches like R Markdown ensures that the workflows are self-consistent and work as expected. In addition to the R Markdown materials, we also need a consistent computing environment that can support reasonably large computation, provide high-performance network and file system access, and scale essentially without limit (we expect to have >150 participants, each with their own machine).

Extracting Clinical Information Using the GenomicDataCommons Package

This short post introduces the gdc_clinical() function recently added to the GenomicDataCommons package. The rich data model at the NCI Genomic Data Commons (GDC) includes clinical and biospecimen details. A recently added feature of the NCI GDC Data Portal is the ability to download tab-delimited or JSON files containing the clinical and biospecimen details of samples. The details available in these simplified formats are also available via the GDC API.

Create a basic Apache Spark cluster in the cloud (in 5 minutes)

Apache Spark in a few words: Apache Spark is a software and data science platform that is purpose-built for large- to massive-scale data processing. Spark supports processing data in batch mode (run as a pipeline) or interactively, either from a command-line shell or in the popular notebook style of coding. While Scala is the native language for Spark, language bindings exist for Python, R, and Java as well.

Leveraging Bioconductor for somatic variant analysis of TCGA data

The NCI Genomic Data Commons (GDC) now contains the authoritative source of data from The Cancer Genome Atlas (TCGA) as well as several other projects of import to the cancer research community. One of the available assays produces somatic variant calls, formally identified by comparing tumor reads and normal reads to find variants relative to the human reference genome that are not present in the normal genome of the patient. Unfortunately, the process for discovering these variants is less precise than the process for finding germline variants.
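
As a rough illustration of the tumor/normal comparison described above, the sketch below treats each sample's variant calls as a set and keeps those seen only in the tumor. The `Variant` tuple, the `somatic_candidates` name, and the example calls are all hypothetical. Real somatic callers work from read-level evidence with statistical models rather than a simple set difference, which is part of why the results are less precise.

```python
from typing import NamedTuple, Set


class Variant(NamedTuple):
    """A variant relative to the reference genome (illustrative only)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def somatic_candidates(tumor: Set[Variant], normal: Set[Variant]) -> Set[Variant]:
    """Return variants called in the tumor but absent from the matched normal."""
    return tumor - normal


# Made-up calls for one hypothetical patient.
tumor_calls = {
    Variant("chr1", 1000, "A", "T"),  # also in the normal sample: germline
    Variant("chr2", 2000, "G", "C"),  # tumor only: somatic candidate
}
normal_calls = {Variant("chr1", 1000, "A", "T")}

somatic = somatic_candidates(tumor_calls, normal_calls)
print(sorted(somatic))
```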

GenomicDataCommons Example: UUID to TCGA and TARGET Barcode Translation

One of the features of the NCI Genomic Data Commons is that everything has a unique identifier in the form of a UUID. However, because many legacy projects and much of the literature do not use UUIDs but instead use TCGA sample barcodes, one simple use case for the GenomicDataCommons package is to map from the UUID for a file or a set of files back to the associated TCGA barcode(s).
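
The post itself works through the R GenomicDataCommons package. For readers outside R, the same lookup can be sketched as a request body for the GDC API `files` endpoint. This is a minimal sketch under assumptions: the endpoint URL and the field names `file_id` and `cases.samples.submitter_id` are taken to match the GDC data model, `build_barcode_query` is a hypothetical helper, and no request is actually sent.

```python
import json
from typing import Dict, List

# Assumed endpoint of the public GDC API; nothing is sent in this sketch.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_barcode_query(file_uuids: List[str]) -> Dict[str, object]:
    """Build a request body asking for the sample barcodes tied to file UUIDs."""
    return {
        # Match any file whose UUID is in the supplied list.
        "filters": {
            "op": "in",
            "content": {"field": "file_id", "value": list(file_uuids)},
        },
        # Assumed field names: the file UUID plus the sample-level barcode.
        "fields": "file_id,cases.samples.submitter_id",
        "format": "JSON",
        "size": len(file_uuids),
    }


# Placeholder UUID, not a real GDC file.
query = build_barcode_query(["00000000-0000-0000-0000-000000000000"])
print(json.dumps(query, indent=2))
```

Posting this body as JSON to the endpoint would, under the assumptions above, return one record per file with the associated barcode(s) nested under `cases`.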

Protect Against Secrets in Git Repositories

I made a mistake and am going to share it here. Please be gentle when judging me. As penance, I spent some time learning how to systematically avoid making the same mistake, and I share that solution here. The prelude: I had been working on some code that I thought was going to be throw-away example code for loading a large dataset into ElasticSearch. That said, I have been saved often enough by using a version control system (now, always git) that I use it all the time.

November Bioinformatics and Data Science Papers

I am starting to make a short list of papers that interested me for the month. In creating this list, I make no claims about these being the “best” papers, the most interesting, or even that they are “good” papers. The list simply serves as an external brain for me and may include some papers that are of interest to others. Besides the usual single manuscripts, November publications included a complete issue of Cancer Research focused on computational resources.

A computable Bioconductor build report

Bioconductor spends a substantial amount of effort to build its catalog of software each day. Reporting of these results is critical for developers, users, and project leaders to understand the software “health” of the project. The Bioconductor build reports are generally available as HTML pages that are navigable with bookmarks and link out to detailed reports of errors, etc. However, the build reports are not readily computable, so mining the reports, automated processing by developers, and learning about failure modes automatically are challenging.

Agricultural genomics may benefit from human genomic data and software engineering

As a government employee, I have been given some fantastic opportunities to interact with other government employees and agencies doing really important research in service to the country. Over the past couple of days, I have been attending a great symposium to provide an updated Blueprint for animal genetics and genomics. Discussion was wide-ranging, but largely focused on genomics, informatics, and translation to and from phenotypes. High-throughput phenotyping (think wearables for plants and cows and drone videos of cattle herds) seems like a growth area.

Matched tumor/normal pairs--a use case for the GenomicDataCommons Bioconductor package

Introduction: The NCI Genomic Data Commons (GDC) is a reboot of the approach that NCI uses to manage and expose genomic and associated clinical and experimental metadata. I have been working on a Bioconductor package that interfaces with the GDC API to provide search and data retrieval from within R. In the first of what will likely be a set of use cases for the GenomicDataCommons, I am going to address a question that came up on Twitter from @sleight82