Posts

Build and deploy an NCBI GEO metadata fetch API

Build, containerize, and then deploy a simple serverless web API that returns JSON-formatted metadata for any GEO accession.

Last updated on Jun 4, 2020 5 min read cloud, bioinformatics

Building R Binary Packages for Linux

Background: One of the challenges of producing a performant build environment for Linux, such as one that lets developers test software in identical environments, is the need to compile R packages from source. If, however, every machine had an identical set of installed libraries, kernel version, compiler, etc., binary packages could be used on Linux as well. Docker provides just such a shareable and identical environment for Linux.

Last updated on Jun 4, 2020 4 min read Bioconductor, R

Experimenting with Github Actions

GitHub Actions allows flexible and potentially complicated actions to be combined into workflows that respond to events on GitHub. Templates for continuous integration, messaging Slack, greeting new contributors, deploying applications, and many other tasks are ready for customization and integration into any repo.

Last updated on Oct 11, 2019 5 min read IT, programming

OmicIDX on BigQuery

Availability: This IPython notebook is available at https://github.com/seandavi/omicidx_examples. OmicIDX is a project to democratize access to omics metadata. As the sizes of omics repositories have grown into the millions of available samples, thinking of the metadata themselves as Big Data seems reasonable. Additionally, by making the metadata more fit-for-use for text mining, natural language processing, and ingestion into machine learning or search engines, OmicIDX aims to facilitate augmentation and analysis of these metadata.

Last updated on Oct 5, 2019 4 min read

Using directory-local variables to customize the emacs project experience

I use Emacs for nearly all my editing and interactive analysis. As is typical, working on more than one project at a time is the norm, not the exception. Discovering projectile for project-specific buffers and controls, combined with helm for very fast, fuzzy completions, makes Emacs a very convenient and efficient environment for most tasks. One challenge I ran into was the need to have multiple interactive Python buffers, typically one per project. However, the out-of-the-box behavior of python-mode is to have only one Python interactive buffer named “Python”.

Last updated on Jun 4, 2020 2 min read Notes

Infrastructure-as-Code: Building the Bioconductor Conference AMI With Packer

One of the main features of the annual Bioconductor Conference is the proportion of time spent working with code in the form of workshops. To support these workshops, we ask workshop presenters to supply Rmarkdown materials which we collate into workshop materials. Using literate programming approaches like Rmarkdown ensures that the workflows are self-consistent and work as expected. In addition to the Rmarkdown workshop materials, we also need a consistent computing environment that can support reasonably large computation, provide high-performance network and file system access, and is essentially unlimited in scale (we expect to have >150 participants, each with his/her own machine).

Last updated on Nov 3, 2018 4 min read IT

Extracting Clinical Information Using the Genomicdatacommons Package

This short post introduces the gdc_clinical() function recently added to the GenomicDataCommons package. The rich data model at the NCI Genomic Data Commons (GDC) includes clinical and biospecimen details. A recently added feature of the NCI GDC Data Portal is the ability to download tab-delimited files or JSON files for clinical and biospecimen details of samples. The details available in these simplified formats are also available via the GDC API.

Last updated on Nov 3, 2018 5 min read Bioconductor

Create a basic Apache Spark cluster in the cloud (in 5 minutes)

Apache Spark in a few words: Apache Spark is a software and data science platform that is purpose-built for large- to massive-scale data processing. Spark supports processing of data in batch mode (run as a pipeline) or in interactive mode, using either a command-line programming style or the popular notebook style of coding. While Scala is the native language for Spark, language bindings exist for Python, R, and Java as well.

Last updated on Nov 3, 2018 4 min read Data Science

Leveraging Bioconductor for somatic variant analysis of TCGA data

The NCI Genomic Data Commons (GDC) now contains the authoritative source of data from The Cancer Genome Atlas (TCGA) as well as several other projects of import to the cancer research community. One of the available assays produces somatic variant calls, formally identified by comparing tumor reads with normal reads to find variants relative to the human reference genome that are not present in the normal genome of the patient.

Last updated on Nov 3, 2018 4 min read Bioconductor

GenomicDataCommons Example: UUID to TCGA and TARGET Barcode Translation

One of the features of the NCI Genomic Data Commons is that everything has a unique identifier in the form of a UUID. However, because many legacy projects and much of the literature do not use UUIDs but instead use TCGA sample barcodes, one simple use case for the GenomicDataCommons package is to map from the UUID for a file or a set of files back to the associated TCGA barcode(s).

Last updated on Nov 3, 2018 3 min read