Selected Publications

Molecular interrogation of a biological sample through DNA sequencing, RNA and microRNA profiling, proteomics and other assays has the potential to provide a systems-level approach to predicting treatment response and disease progression, and to developing precision therapies. Large publicly funded projects have generated extensive and freely available multi-assay data resources; however, bioinformatic and statistical methods for the analysis of such experiments are still nascent. We review multi-assay genomic data resources in clinical oncology, pharmacogenomics and other perturbation experiments, population genomics, regulatory genomics and other areas, along with tools for data acquisition. Finally, we review bioinformatic tools that are explicitly geared toward integrative genomic data visualization and analysis. This review provides starting points for accessing publicly available data and tools to support development of needed integrative methods.
Brief Bioinform

Enhancers regulate spatiotemporal gene expression and impart cell-specific transcriptional outputs that drive cell identity. Super-enhancers (SEs), also known as stretch-enhancers, are a subset of enhancers especially important for genes associated with cell identity and genetic risk of disease. CD4(+) T cells are critical for host defence and autoimmunity. Here we analysed maps of mouse T-cell SEs as a non-biased means of identifying key regulatory nodes involved in cell specification. We found that cytokines and cytokine receptors were the dominant class of genes exhibiting SE architecture in T cells. Nonetheless, the locus encoding Bach2, a key negative regulator of effector differentiation, emerged as the most prominent T-cell SE, revealing a network in which SE-associated genes critical for T-cell biology are repressed by BACH2. Disease-associated single-nucleotide polymorphisms for immune-mediated disorders, including rheumatoid arthritis, were highly enriched for T-cell SEs versus typical enhancers or SEs in other cell lineages. Intriguingly, treatment of T cells with the Janus kinase (JAK) inhibitor tofacitinib disproportionately altered the expression of rheumatoid arthritis risk genes with SE structures. Together, these results indicate that genes with SE architecture in T cells encompass a variety of cytokines and cytokine receptors but are controlled by a 'guardian' transcription factor, itself endowed with an SE. Thus, enumeration of SEs allows the unbiased determination of key regulatory nodes in T cells, which are preferentially modulated by pharmacological intervention.

Bioconductor is an open-source, open-development software project for the analysis and comprehension of high-throughput data in genomics and molecular biology. The project aims to enable interdisciplinary research, collaboration and rapid development of scientific software. Based on the statistical programming language R, Bioconductor comprises 934 interoperable packages contributed by a large, diverse community of scientists. Packages cover a range of bioinformatic and statistical applications. They undergo formal initial review and continuous automated testing. We present an overview for prospective users and contributors.
Nat Methods

BACKGROUND: Circos is a Perl-based software package for visualizing similarities and differences of genome structure and positional relationships between genomic intervals. Running Circos requires extra data processing to prepare plot data files and configuration files from datasets, which limits its ability to integrate directly with other software tools such as R. The recently published Bioconductor package ggbio provides a function to display genomic data in a circular layout, but it depends on multiple other packages, which increases its complexity of use and decreases its flexibility in integrating with other R pipelines. RESULTS: We implemented an R package, RCircos, using only R packages that come with the base R installation. The package supports Circos 2D data track plots such as scatter, line, histogram, heatmap, tile, connectors, links, and text labels. Each plot is implemented with a specific function, and the input data for all functions are data frames, which can be read from text files or generated by other R pipelines. CONCLUSION: The RCircos package provides a simple and flexible way to make Circos 2D track plots with R and can be easily integrated into other R data processing and graphics pipelines for presenting large-scale multi-sample genomic research data. It can also serve as a base tool for generating complex Circos images.
BMC Bioinformatics

Experimental data have been generated on a huge number of organisms, tissue types, treatment conditions and disease states. The Gene Expression Omnibus (GEO), developed by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health, is a repository of gene expression experiments. The BioConductor project is an open-source and open-development software project built in the R statistical programming environment for the analysis and comprehension of genomic data. The software, called GEOquery, effectively establishes a bridge between GEO and BioConductor. Facilitating analyses and meta-analyses of microarray data will increase the efficiency with which biologically important conclusions can be drawn from published genomic data.
Bioinformatics

Recent Publications

  • GenomicDataCommons: a Bioconductor Interface to the NCI Genomic Data Commons

  • RARRES2 functions as a tumor suppressor by promoting β-catenin phosphorylation/degradation and inhibiting p38 phosphorylation in adrenocortical carcinoma.

  • Upregulation of IFN-Inducible and Damage-Response Pathways in Chronic Graft-versus-Host Disease.

  • On the Selective Packaging of Genomic RNA by HIV-1.

  • Point Mutations in Exon 1B of APC Reveal Gastric Adenocarcinoma and Proximal Polyposis of the Stomach as a Familial Adenomatous Polyposis Variant.

  • caOmicsV: an R package for visualizing multidimensional cancer genomic data.

  • Whole Genome Sequencing of Newly Established Pancreatic Cancer Lines Identifies Novel Somatic Mutation (c.2587G>A) in Axon Guidance Receptor Plexin A1 as Enhancer of Proliferation and Invasion.

  • Public data and open source tools for multi-assay genomic investigation of disease.

  • A Genome-Wide Scan Identifies Variants in NFIB Associated with Metastasis in Patients with Osteosarcoma.

  • Inhibition of Survivin with YM155 Induces Durable Tumor Response in Anaplastic Thyroid Cancer.

Recent Posts

I have been attending the biannual Clinical Informatics for Cancer Centers (CI4CC) conference, and there has been a fair amount of discussion of ClinicalTrials.gov as a resource for enhancing patient engagement, trial recruitment, and results reporting. There are a number of approaches to searching and accessing the data in bulk. The ones that I address here are: The CTRP RESTful API. The API. The Access to Aggregate Content of ClinicalTrials.


The NCI Genomic Data Commons (GDC) is a reboot of the approach that NCI uses to manage and expose genomic and associated clinical and experimental metadata. I have been working on a Bioconductor package that interfaces with the GDC API to provide search and data retrieval from within R. In the first of what will likely be a set of use cases for the GenomicDataCommons package, I am going to address a question that came up on Twitter from @sleight82.



ClinicalTrialsAPI R package

The ClinicalTrialsAPI R package accesses the data in the NIH site via the CTRP REST API.

The GenomicDataCommons Package

The GenomicDataCommons Bioconductor package accesses cancer genomics data from the NCI Genomic Data Commons.

The cRomwell R package

The cRomwell R package runs Workflow Description Language (WDL) workflows on multiple platforms from within R.