20  Ranges Exercises

Published

July 26, 2024

In the following exercises, we will use the GenomicRanges package to explore range operations. We will use the AnnotationHub package to load DNAse hypersensitivity data from the ENCODE project. In practice, the ENCODE project published datasets like these as bed files. AnnotationHub has packaged these into GRanges objects that we can load and use directly. However, if you have a bed file of your own (peak calls, enhancer regions, etc.), you can load them into GRanges objects using rtracklayer::import.

20.1 Exercise 1

In this exercise, we will use DNAse hypersensitivity data to practice working with a GRanges object.

  • Use the AnnotationHub package to find the goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep1.narrowPeak.gz GRanges object. Load that into R as the variable dnase.
Show the answer
library(AnnotationHub)
ah = AnnotationHub()
query(ah, "goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep1.narrowPeak.gz")
# the thing above should have only one record, so we can 
# just grab it
dnase = query(ah, "goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep1.narrowPeak.gz")[[1]]
  • What type of object is dnase?
Show the answer
dnase
class(dnase)
  • What metadata is stored in dnase?
Show the answer
mcols(dnase)
  • How many peaks are on each chromosome?
Show the answer
  • What are the mean, min, max, and median widths of the peaks?
Show the answer
summary(width(dnase))
  • What are the sequences that were used in the analysis? Do the names have “chr” or not? Experiment with changing the seqlevelsStyle to adjust the sequence names.
Show the answer
seqlevels(dnase)
seqlevelsStyle(dnase)
seqlevelsStyle(dnase) = 'ensembl'
seqlevelsStyle(dnase)
seqlevels(dnase)
  • What is the total amount of “landscape” covered by the peaks? Assume that the peaks do not overlap. What portion of the genome does this represent?
Show the answer
sum(width(dnase))
sum(seqlengths(dnase))
sum(width(dnase))/sum(seqlengths(dnase))

20.2 Exercise 2

In this exercise, we are going to load the second DNAse hypersensitivity replicate to investigate overlaps with the first replicate.

  • Use the AnnotationHub to find the second replicate, goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep2.narrowPeak.gz. Load that as dnase2.
Show the answer
query(ah, "goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep2.narrowPeak.gz")
# the thing above should have only one record, so we can 
# just grab it
dnase2 = query(ah, "goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep2.narrowPeak.gz")[[1]]
  • How many peaks are there in dnase and dnase2? Are there are similar number?
Show the answer
length(dnase)
length(dnase2)
  • What are the peak sizes for dnase2?
Show the answer
summary(width(dnase2))
  • What proportion of the genome does dnase2 cover?
Show the answer
sum(width(dnase))/sum(seqlengths(dnase))
  • Count the number of peaks from dnase that overlap with dnase2.
Show the answer
sum(dnase %over% dnase2)
  • Assume that your peak-caller was “too specific” and that you want to expand your peaks by 50 bp on each end (so make them 100 bp larger). Use a combination of resize (and pay attention to the fix argument) and width to do this expansion to dnase and call the new GRanges object “dnase_wide”.
Show the answer
w = width(dnase)
dnase_wide = resize(dnase, width=w+100, fix='center') #make a copy
width(dnase_wide)

20.3 Exercise 3

In this exercise, we are going to look at the overlap of DNAse sites relative to genes. To get started, install and load the TxDb.Hsapiens.UCSC.hg19.knownGene txdb object.

BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
kg = TxDb.Hsapiens.UCSC.hg19.knownGene
  • Load the transcripts from the knownGene txdb into a variable. What is the class of this object?
Show the answer
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
kg = TxDb.Hsapiens.UCSC.hg19.knownGene
gx = genes(kg)
class(gx)
length(gx)
  • Read about the flank method for GRanges objects. How could you use that to get the “promoter” regions of the transcripts? Let’s assume that the promoter region is 2kb upstream of the gene.
Show the answer
flank(gx,2000)
  • Instead of using flank, could you do the same thing with the TxDb object? (See ?promoters).
Show the answer
proms = promoters(kg)
  • Do any of the regions in the promoters overlap with each other?
Show the answer
  • To find overlap of our DNAse sites with promoters, let’s collapse overlapping “promoters” to just keep the contiguous regions by using reduce.
Show the answer
# reduce takes all overlapping regions and collapses them
# into a single region that spans all of the overlapping regions
prom_regions = reduce(proms)

# now we can check for overlaps
summary(countOverlaps(prom_regions))
  • Count the number of DNAse sites that overlap with our promoter regions.
Show the answer
sum(dnase %over% prom_regions)
# if you notice no overlap, check the seqlevels
# and seqlevelsStyle
seqlevelsStyle(dnase) = "UCSC"
sum(dnase %over% prom_regions)
sum(dnase2 %over% prom_regions)
  • Is this surprising? If we were to assume that the promoter and dnase regions are “independent” of each other, what number of overlaps would we expect?
Show the answer
prop_proms = sum(width(prom_regions))/sum(seqlengths(prom_regions))
prop_dnase = sum(width(dnase))/sum(seqlengths(prom_regions))
# Iff the dnase and promoter regions are 
# not related, then we would expect this number
# of DNAse overlaps with promoters.
prop_proms * prop_dnase * length(dnase) 

20.4 Exercise 4

We’ll be using data from histone modification ChIP-seq experiments in human cells to illustrate the concepts of genomic ranges and features. The data consists of genomic intervals representing regions of the genome where specific histone modifications are enriched. These intervals are typically identified using ChIP-seq, a technique that maps protein-DNA interactions across the genome.

The ChIP-seq data is stored in a BED file format, which is a tab-delimited text file format commonly used to represent genomic intervals. Each line in the BED file corresponds to a genomic interval and contains information about the chromosome, start and end positions, and strand orientation of the interval. Additional columns may include metadata such as the signal strength or significance of the interval.

The AnnotationHub package in Bioconductor provides access to a wide range of genomic datasets, including ChIP-seq data. We can use this package to retrieve the ChIP-seq data for histone modifications in human cells and convert it into a GenomicRanges object for further analysis.

https://www.encodeproject.org/chip-seq/histone/

Let’s start by loading the AnnotationHub package and retrieving the ChIP-seq data for histone modifications in human cells. You can read more about the AnnotationHub package and how to use it in the Bioconductor documentation.

Show the answer
library(AnnotationHub)
ah <- AnnotationHub()

There are multiple ways to search the AnnotationHub database. We’ve done that for you and here are the GRanges objects for each of four histone marks, and one histone mark replicate.

Show the answer
h3k4me1 <- ah[['AH25832']]
h3k4me3 <- ah[['AH25833']]
h3k9ac <- ah[['AH25834']]
h3k27me3 <- ah[['AH25835']]
h3k4me3_2 <- ah[['AH27284']]

Each of these variables now represents the peak calls after a chip-seq experiment pulling down the histone mark of interest. In the encode project these records were bed files. The bed files have been converted to GRanges objects to allow computation within R.

Show the answer
# Grab cpg islands as well
cpg = query(ah, c('cpg','UCSC','hg19'))[[1]]

Let’s say that we don’t know the behavior of the histone methylation marks with respect to CpG islands. We could ask the question, “What is the overlap of the histone peaks with CpG islands?”

Show the answer
sum(h3k4me1 %over% cpg)

We might want to actually count the number of bases of overlap between the methyl mark and CpG islands.

Show the answer
# The intersection of two peak sets results in the 
# overlapping regions as a new set of regions
# The width of each peak is the number of overlapping bases
# And the sum of the widths is the total bases overlapping
sum(width(intersect(h3k4me1, cpg)))

But some methyl marks are known to have very broad signals, meaning that there is a higher chance of overlapping CpG islands just because there are more methylated bases. We can adjust for this by “normalizing” for all possible bases covered by either set of peaks, using union. We might think of this as a sort of “enrichment score” of one set in another set.

Show the answer
sum(width(union(h3k4me1, cpg)))
# and now "normalize" 
sum(width(intersect(h3k4me1, cpg)))/sum(width(union(h3k4me1, cpg)))

Let’s write a small function to calculate our little enrichment score.

Show the answer
range_enrichment_score <- function(r1, r2) {
  i = sum(width(intersect(r1, r2)))
  u = sum(width(union(r1,r2)))
  return(i/u)
}

And give it a try:

Show the answer
range_enrichment_score(h3k4me1, cpg)