data science

Container Notes

General notes about using containers

Bioconductor: software for interpreting high-throughput biological data

This talk presents a very quick overview of the Bioconductor project, focusing on its values of reproducibility, reuse, and openness.

Practical Data Science and Informatics Training

This talk compares and contrasts four formats for data science and informatics education. The discussion highlights some approaches that I have found useful for facilitating the training process. I also present some practical and simple tips that I …

The cancer data ecosystem: data and cloud resources for cancer genomic data science

In this talk, I motivate the need for cloud-based cancer data resources. I provide an overview of the NCI Genomic Data Commons and how to interact with it both interactively, through a web portal, and programmatically using the …

Create a basic Apache Spark cluster in the cloud (in 5 minutes)

Apache Spark in a few words: Apache Spark is a software and data science platform that is purpose-built for large- to massive-scale data processing. Spark supports processing data in batch mode (run as a pipeline) or in interactive mode, using either a command-line programming style or the popular notebook style of coding. While Scala is the native language for Spark, language bindings exist for Python, R, and Java as well.
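The abstract itself contains no code, but as a rough sketch of what interactive use from R can look like, the snippet below uses the sparklyr package to connect to a Spark instance and copy a small data frame into it. The local master and the mtcars example data are illustrative assumptions, not part of the talk; a cloud cluster would be reached through its own master URL.

    # Sketch only (not from the talk): driving Spark interactively from R via sparklyr.
    # master = "local" is an assumption for illustration; a cloud cluster would
    # supply its own master URL (for example spark://<host>:7077 or yarn).
    library(sparklyr)

    sc <- spark_connect(master = "local")

    # copy a small built-in data frame into Spark and preview the first rows
    mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
    head(mtcars_tbl)

    spark_disconnect(sc)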

Matched tumor/normal pairs--a use case for the GenomicDataCommons Bioconductor package

Introduction: The NCI Genomic Data Commons (GDC) is a reboot of the approach that the NCI uses to manage and expose genomic data and associated clinical and experimental metadata. I have been working on a Bioconductor package that interfaces with the GDC API to provide search and data retrieval from within R. In the first of what will likely be a set of use cases for the GenomicDataCommons, I am going to address a question that came up on Twitter from @sleight82.
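To give a flavor of the package's query interface, here is a minimal sketch (not an excerpt from the use case itself): the project id is illustrative, and the exact filter syntax and field names may differ between package and GDC API versions.

    # Minimal sketch of a GenomicDataCommons query; the project id is illustrative.
    library(GenomicDataCommons)

    # how many cases does the GDC currently expose?
    count(cases())

    # narrow the query to a single project
    q <- filter(cases(), project.project_id == "TCGA-BRCA")
    count(q)

    # retrieve the first few matching case records
    res <- results(q, size = 5)
    str(res, max.level = 1)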