Sean Davis @ NCI

Create a basic Apache Spark cluster in the cloud (in 5 minutes)

Apache Spark in a few words: Apache Spark is a software and data science platform purpose-built for large- to massive-scale data processing. Spark can process data in batch mode (run as a pipeline) or interactively, either from a command-line shell or in the popular notebook style of coding. While Scala is the native language for Spark, language bindings exist for Python, R, and Java as well.

Leveraging Bioconductor for somatic variant analysis of TCGA data

The NCI Genomic Data Commons (GDC) is now the authoritative source of data from The Cancer Genome Atlas (TCGA) as well as several other projects of import to the cancer research community. One of the available assays produces somatic variant calls, which are identified by comparing tumor reads with normal reads to find variants, relative to the human reference genome, that are not present in the normal genome of the patient. Unfortunately, discovering these variants is less precise than finding germline variants.

GenomicDataCommons Example: UUID to TCGA and TARGET Barcode Translation

One of the features of the NCI Genomic Data Commons is that everything has a unique identifier in the form of a UUID. However, because many legacy projects and much of the literature do not use UUIDs but instead use TCGA sample barcodes, one simple use case for the GenomicDataCommons package is to map from the UUID for a file or a set of files back to the associated TCGA barcode(s).
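As a rough illustration of that mapping, here is a minimal Python sketch that talks to the GDC REST API directly rather than going through the GenomicDataCommons R package. It assumes the public `https://api.gdc.cancer.gov/files` endpoint accepts a JSON body with `filters`, `fields`, and `size`, and that sample barcodes come back under `cases.samples.submitter_id`. The function names (`build_query`, `extract_barcodes`, `uuids_to_barcodes`) are made up for this example.

```python
import json
import urllib.request

# Assumption: the public GDC files endpoint.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_query(file_uuids):
    """Build a JSON-serializable body asking for the barcodes of some files."""
    uuids = list(file_uuids)
    return {
        # Keep only the files whose file_id is one of the given UUIDs.
        "filters": {
            "op": "in",
            "content": {"field": "file_id", "value": uuids},
        },
        # Assumption: sample barcodes live in cases.samples.submitter_id.
        "fields": "file_id,cases.samples.submitter_id",
        "format": "JSON",
        "size": len(uuids),
    }


def extract_barcodes(response):
    """Map each file UUID in a parsed response to its sorted sample barcodes."""
    mapping = {}
    for hit in response.get("data", {}).get("hits", []):
        barcodes = set()
        for case in hit.get("cases", []):
            for sample in case.get("samples", []):
                if "submitter_id" in sample:
                    barcodes.add(sample["submitter_id"])
        mapping[hit["file_id"]] = sorted(barcodes)
    return mapping


def uuids_to_barcodes(file_uuids):
    """Query the GDC (needs network access) and return {UUID: [barcodes]}."""
    body = json.dumps(build_query(file_uuids)).encode("utf-8")
    request = urllib.request.Request(
        GDC_FILES_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return extract_barcodes(json.load(reply))
```

Calling `uuids_to_barcodes` with a list of file UUIDs should return a dictionary keyed by UUID. A file can map to more than one barcode (for example, a tumor/normal pair), which is why each value is a list.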

Protect Against Secrets in Git Repositories

I made a mistake and am going to share it here. Please be gentle when judging me. As penance, I spent some time learning how to systematically avoid making the same mistake, and I share that solution here. The prelude: I had been working on some code that I thought was going to be throw-away example code for loading a large dataset into Elasticsearch. That said, I have been saved often enough by a version control system (now, always git) that I use it all the time.

November Bioinformatics and Data Science Papers

I am starting to make a short list of papers that interested me for the month. In creating this list, I make no claims about these being the “best” papers, the most interesting, or even that they are “good” papers. The list simply serves as an external brain for me and may include some papers that are of interest to others. Besides the usual single manuscripts, November publications included a complete issue of Cancer Research focused on computational resources.

A computable Bioconductor build report

Bioconductor spends a substantial amount of effort building its catalog of software each day. Reporting these results is critical for developers, users, and project leaders to understand the software “health” of the project. The Bioconductor build reports are generally available as HTML pages that are navigable with bookmarks and link out to detailed reports of errors, etc. However, the build reports are not readily computable, so mining the reports, processing them automatically, and learning about failure modes are all challenging.

Agricultural genomics may benefit from human genomic data and software engineering

As a government employee, I have been given some fantastic opportunities to interact with other government employees and agencies doing really important research in service to the country. Over the past couple of days, I have been attending a great symposium to provide an updated Blueprint for animal genetics and genomics. Discussion was wide-ranging, but largely focused on genomics, informatics, and translation to and from phenotypes. High-throughput phenotyping (think wearables for plants and cows and drone videos of cattle herds) seems like a growth area.

Approaches to accessing data

I have been attending the biannual Clinical Informatics for Cancer Centers (CI4CC) conference and there has been a fair amount of discussion of as a resource for enhancing patient engagement, trial recruitment, and results reporting. There are a number of approaches to search and access in bulk data. The ones that I address here are: The CTRP RESTful API. The API. The Access to Aggregate Content of ClinicalTrials.

Hello R Markdown

R Markdown: This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see You can embed an R code chunk like this:

    summary(cars)
    ##      speed           dist
    ##  Min.   : 4.0   Min.   :  2.00
    ##  1st Qu.:12.0   1st Qu.: 26.00
    ##  Median :15.0   Median : 36.00
    ##  Mean   :15.4   Mean   : 42.98
    ##  3rd Qu.