seandavi(s12)
Posts
Build and deploy an NCBI GEO metadata fetch API
Build, containerize, and then deploy a simple serverless web API that returns JSON-formatted metadata for any GEO accession.
Last updated on Jun 4, 2020
5 min read
cloud, bioinformatics
Building R Binary Packages for Linux
Background: One of the challenges of producing a performant build environment for Linux, such as one used to let developers test software in identical environments, is the need to compile R packages from source. If, however, every machine had an identical set of installed libraries, kernel version, compiler, etc., we could use binary packages on Linux as well. Docker provides just such a shareable and identical environment for Linux.
Last updated on Jun 4, 2020
4 min read
Bioconductor, R
Experimenting with Github Actions
GitHub Actions lets you combine flexible and potentially complicated actions into workflows that respond to events on GitHub. Continuous integration, messaging Slack, greeting new contributors, deploying applications, and many other templates are ready for customization and integration into any repo.
Last updated on Oct 11, 2019
5 min read
IT, programming
OmicIDX on BigQuery
Availability: This ipython notebook is available at https://github.com/seandavi/omicidx_examples. OmicIDX is a project to democratize access to omics metadata. As the sizes of omics repositories have grown into the millions of available samples, thinking of the metadata themselves as Big Data seems reasonable. Additionally, by making the metadata more fit-for-use for text mining, natural language processing, ingestion into machine learning or search engines, OmicIDX aims to facilitate augmentation and analysis of these metadata.
Last updated on Oct 5, 2019
4 min read
Using directory-local variables to customize the emacs project experience
I use emacs for nearly all my editing and interactive analysis. Working on more than one project at a time is the norm, not the exception. Discovering projectile for project-specific buffers and controls, combined with helm for very fast, fuzzy completions, makes emacs a very convenient and efficient environment for most tasks. One challenge I ran into was the need to have multiple interactive python buffers, typically one per project. However, the out-of-the-box behavior of python-mode is to have only one python interactive buffer named “Python”.
Last updated on Jun 4, 2020
2 min read
Notes
Infrastructure-as-Code: Building the Bioconductor Conference AMI With Packer
One of the main features of the annual Bioconductor Conference is the proportion of time spent working with code in the form of workshops. To support these workshops, we ask workshop presenters to supply Rmarkdown documents, which we collate into workshop materials. Using literate programming approaches like Rmarkdown ensures that the workflows are self-consistent and work as expected. In addition to the Rmarkdown workshop materials, we also need a consistent computing environment that can support reasonably large computation, provide high-performance network and file system access, and is essentially unlimited in scale (we expect to have >150 participants, each with their own machine).
Last updated on Nov 3, 2018
4 min read
IT
Extracting Clinical Information Using the Genomicdatacommons Package
This short post introduces the gdc_clinical() function recently added to the GenomicDataCommons package. The rich data model at the NCI Genomic Data Commons (GDC) includes clinical and biospecimen details. A recently added feature to the NCI GDC Data Portal is the ability to download tab-delimited files or JSON files for clinical and biospecimen details of samples. The details available in these simplified formats are also available via the GDC API.
Last updated on Nov 3, 2018
5 min read
Bioconductor
Create a basic Apache Spark cluster in the cloud (in 5 minutes)
Apache Spark in a few words: Apache Spark is a software and data science platform that is purpose-built for large- to massive-scale data processing. Spark supports processing data in batch mode (run as a pipeline) or in interactive mode, using either a command-line programming style or the popular notebook style of coding. While Scala is the native language for Spark, language bindings exist for Python, R, and Java as well.
Last updated on Nov 3, 2018
4 min read
Data Science
Leveraging Bioconductor for somatic variant analysis of TCGA data
The NCI Genomic Data Commons (GDC) now contains the authoritative source of data from The Cancer Genome Atlas (TCGA) as well as several other projects of import to the cancer research community. One of the available assays produces somatic variant calls, which are derived by comparing tumor reads with normal reads to identify variants relative to the human reference genome that are not present in the normal genome of the patient.
Last updated on Nov 3, 2018
4 min read
Bioconductor
GenomicDataCommons Example: UUID to TCGA and TARGET Barcode Translation
One of the features of the NCI Genomic Data Commons is that everything has a unique identifier in the form of a UUID. However, because many legacy projects and much of the literature do not use UUIDs but instead use TCGA sample barcodes, one simple use case for the GenomicDataCommons package is to map from the UUID for a file or a set of files back to the associated TCGA barcode(s).
Last updated on Nov 3, 2018
3 min read