28 Ranges Exercises

Author

Affiliation

University of Colorado
Anschutz School of Medicine

Published

June 1, 2024

Modified

June 4, 2025

In the following exercises, we will use the GenomicRanges package to explore range operations. We will use the AnnotationHub package to load DNAse hypersensitivity data from the ENCODE project. In practice, the ENCODE project published datasets like these as bed files. AnnotationHub has packaged these into GRanges objects that we can load and use directly. However, if you have a bed file of your own (peak calls, enhancer regions, etc.), you can load them into GRanges objects using rtracklayer::import.

28.1 Exercise 1

In this exercise, we will use DNAse hypersensitivity data to practice working with a GRanges object.

Use the AnnotationHub package to find the goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep1.narrowPeak.gz GRanges object. Load that into R as the variable dnase.

Show the answer

library(AnnotationHub)
ah = AnnotationHub()
query(ah, "goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep1.narrowPeak.gz")
# the thing above should have only one record, so we can 
# just grab it
dnase = query(ah, "goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep1.narrowPeak.gz")[[1]]

What type of object is dnase?

Show the answer

dnase
class(dnase)

What metadata is stored in dnase?

Show the answer

mcols(dnase)

How many peaks are on each chromosome?

Show the answer

library(GenomicFeatures)
table(seqnames(dnase))

What are the mean, min, max, and median widths of the peaks?

Show the answer

summary(width(dnase))

What are the sequences that were used in the analysis? Do the names have “chr” or not? Experiment with changing the seqlevelsStyle to adjust the sequence names.

Show the answer

seqlevels(dnase)
seqlevelsStyle(dnase)
seqlevelsStyle(dnase) = 'ensembl'
seqlevelsStyle(dnase)
seqlevels(dnase)

What is the total amount of “landscape” covered by the peaks? Assume that the peaks do not overlap. What portion of the genome does this represent?

Show the answer

sum(width(dnase))
sum(seqlengths(dnase))
sum(width(dnase))/sum(seqlengths(dnase))

28.2 Exercise 2

In this exercise, we are going to load the second DNAse hypersensitivity replicate to investigate overlaps with the first replicate.

Use the AnnotationHub to find the second replicate, goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep2.narrowPeak.gz. Load that as dnase2.

Show the answer

query(ah, "goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep2.narrowPeak.gz")
# the thing above should have only one record, so we can 
# just grab it
dnase2 = query(ah, "goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep2.narrowPeak.gz")[[1]]

How many peaks are there in dnase and dnase2? Are there are similar number?

Show the answer

length(dnase)
length(dnase2)

What are the peak sizes for dnase2?

Show the answer

summary(width(dnase2))

What proportion of the genome does dnase2 cover?

Show the answer

sum(width(dnase))/sum(seqlengths(dnase))

Count the number of peaks from dnase that overlap with dnase2.

Show the answer

sum(dnase %over% dnase2)

Assume that your peak-caller was “too specific” and that you want to expand your peaks by 50 bp on each end (so make them 100 bp larger). Use a combination of resize (and pay attention to the fix argument) and width to do this expansion to dnase and call the new GRanges object “dnase_wide”.

Show the answer

w = width(dnase)
dnase_wide = resize(dnase, width=w+100, fix='center') #make a copy
width(dnase_wide)

28.3 Exercise 3

In this exercise, we are going to look at the overlap of DNAse sites relative to genes. To get started, install and load the TxDb.Hsapiens.UCSC.hg19.knownGene txdb object.

BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
kg = TxDb.Hsapiens.UCSC.hg19.knownGene

Load the transcripts from the knownGene txdb into a variable. What is the class of this object?

Show the answer

library("TxDb.Hsapiens.UCSC.hg19.knownGene")
kg = TxDb.Hsapiens.UCSC.hg19.knownGene
gx = genes(kg)
class(gx)
length(gx)

Read about the flank method for GRanges objects. How could you use that to get the “promoter” regions of the transcripts? Let’s assume that the promoter region is 2kb upstream of the gene.

Show the answer

flank(gx,2000)

Instead of using flank, could you do the same thing with the TxDb object? (See ?promoters).

Show the answer

proms = promoters(kg)

Do any of the regions in the promoters overlap with each other?

Show the answer

summary(countOverlaps(proms))

To find overlap of our DNAse sites with promoters, let’s collapse overlapping “promoters” to just keep the contiguous regions by using reduce.

Show the answer

# reduce takes all overlapping regions and collapses them
# into a single region that spans all of the overlapping regions
prom_regions = reduce(proms)

# now we can check for overlaps
summary(countOverlaps(prom_regions))

Count the number of DNAse sites that overlap with our promoter regions.

Show the answer

sum(dnase %over% prom_regions)
# if you notice no overlap, check the seqlevels
# and seqlevelsStyle
seqlevelsStyle(dnase) = "UCSC"
sum(dnase %over% prom_regions)
sum(dnase2 %over% prom_regions)

Is this surprising? If we were to assume that the promoter and dnase regions are “independent” of each other, what number of overlaps would we expect?

Show the answer

prop_proms = sum(width(prom_regions))/sum(seqlengths(prom_regions))
prop_dnase = sum(width(dnase))/sum(seqlengths(prom_regions))
# Iff the dnase and promoter regions are 
# not related, then we would expect this number
# of DNAse overlaps with promoters.
prop_proms * prop_dnase * length(dnase)

28.4 Exercise 4

We’ll be using data from histone modification ChIP-seq experiments in human cells to illustrate the concepts of genomic ranges and features. The data consists of genomic intervals representing regions of the genome where specific histone modifications are enriched. These intervals are typically identified using ChIP-seq, a technique that maps protein-DNA interactions across the genome.

The ChIP-seq data is stored in a BED file format, which is a tab-delimited text file format commonly used to represent genomic intervals. Each line in the BED file corresponds to a genomic interval and contains information about the chromosome, start and end positions, and strand orientation of the interval. Additional columns may include metadata such as the signal strength or significance of the interval.

The AnnotationHub package in Bioconductor provides access to a wide range of genomic datasets, including ChIP-seq data. We can use this package to retrieve the ChIP-seq data for histone modifications in human cells and convert it into a GenomicRanges object for further analysis.

https://www.encodeproject.org/chip-seq/histone/

Let’s start by loading the AnnotationHub package and retrieving the ChIP-seq data for histone modifications in human cells. You can read more about the AnnotationHub package and how to use it in the Bioconductor documentation.

Show the answer

library(AnnotationHub)
ah <- AnnotationHub()

There are multiple ways to search the AnnotationHub database. We’ve done that for you and here are the GRanges objects for each of four histone marks, and one histone mark replicate.

Show the answer

h3k4me1 <- ah[['AH25832']]
h3k4me3 <- ah[['AH25833']]
h3k9ac <- ah[['AH25834']]
h3k27me3 <- ah[['AH25835']]
h3k4me3_2 <- ah[['AH27284']]

Each of these variables now represents the peak calls after a chip-seq experiment pulling down the histone mark of interest. In the encode project these records were bed files. The bed files have been converted to GRanges objects to allow computation within R.

Show the answer

# Grab cpg islands as well
cpg = query(ah, c('cpg','UCSC','hg19'))[[1]]

Let’s say that we don’t know the behavior of the histone methylation marks with respect to CpG islands. We could ask the question, “What is the overlap of the histone peaks with CpG islands?”

Show the answer

sum(h3k4me1 %over% cpg)

We might want to actually count the number of bases of overlap between the methyl mark and CpG islands.

Show the answer

# The intersection of two peak sets results in the 
# overlapping regions as a new set of regions
# The width of each peak is the number of overlapping bases
# And the sum of the widths is the total bases overlapping
sum(width(intersect(h3k4me1, cpg)))

But some methyl marks are known to have very broad signals, meaning that there is a higher chance of overlapping CpG islands just because there are more methylated bases. We can adjust for this by “normalizing” for all possible bases covered by either set of peaks, using union. We might think of this as a sort of “enrichment score” of one set in another set.

Show the answer

sum(width(union(h3k4me1, cpg)))
# and now "normalize" 
sum(width(intersect(h3k4me1, cpg)))/sum(width(union(h3k4me1, cpg)))

Let’s write a small function to calculate our little enrichment score.

Show the answer

range_enrichment_score <- function(r1, r2) {
  i = sum(width(intersect(r1, r2)))
  u = sum(width(union(r1,r2)))
  return(i/u)
}

And give it a try:

Show the answer

range_enrichment_score(h3k4me1, cpg)

--- title: "Ranges Exercises" format: html: code-fold: true code-summary: "Show the answer" code-tools: true --- ```{r init, include=FALSE} library(knitr) opts_chunk$set(cache=TRUE, message=FALSE, results='hide') ``` In the following exercises, we will use the `GenomicRanges` package to explore range operations. We will use the `AnnotationHub` package to load DNAse hypersensitivity data from the ENCODE project. In practice, the ENCODE project published datasets like these as bed files. AnnotationHub has packaged these into GRanges objects that we can load and use directly. However, if you have a bed file of your own (peak calls, enhancer regions, etc.), you can load them into GRanges objects using `rtracklayer::import`. ## Exercise 1 In this exercise, we will use DNAse hypersensitivity data to practice working with a GRanges object. - Use the `AnnotationHub` package to find the `goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep1.narrowPeak.gz` `GRanges` object. Load that into R as the variable dnase. ```{r dnase} library(AnnotationHub) ah = AnnotationHub() query(ah, "goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep1.narrowPeak.gz") # the thing above should have only one record, so we can # just grab it dnase = query(ah, "goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep1.narrowPeak.gz")[[1]] ``` - What type of object is dnase? ```{r} dnase class(dnase) ``` - What metadata is stored in dnase? ```{r} mcols(dnase) ``` - How many peaks are on each chromosome? ```{r} library(GenomicFeatures) table(seqnames(dnase)) ``` - What are the mean, min, max, and median widths of the peaks? ```{r} summary(width(dnase)) ``` - What are the sequences that were used in the analysis? Do the names have "chr" or not? Experiment with changing the `seqlevelsStyle` to adjust the sequence names. ```{r} seqlevels(dnase) seqlevelsStyle(dnase) seqlevelsStyle(dnase) = 'ensembl' seqlevelsStyle(dnase) seqlevels(dnase) ``` - What is the total amount of "landscape" covered by the peaks? Assume that the peaks do not overlap. What portion of the genome does this represent? ```{r} sum(width(dnase)) sum(seqlengths(dnase)) sum(width(dnase))/sum(seqlengths(dnase)) ``` ## Exercise 2 In this exercise, we are going to load the second DNAse hypersensitivity replicate to investigate overlaps with the first replicate. - Use the AnnotationHub to find the second replicate, `goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep2.narrowPeak.gz`. Load that as `dnase2`. ```{r} query(ah, "goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep2.narrowPeak.gz") # the thing above should have only one record, so we can # just grab it dnase2 = query(ah, "goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep2.narrowPeak.gz")[[1]] ``` - How many peaks are there in `dnase` and `dnase2`? Are there are similar number? ```{r} length(dnase) length(dnase2) ``` - What are the peak sizes for `dnase2`? ```{r} summary(width(dnase2)) ``` - What proportion of the genome does `dnase2` cover? ```{r} sum(width(dnase))/sum(seqlengths(dnase)) ``` - Count the number of peaks from `dnase` that overlap with `dnase2`. ```{r} sum(dnase %over% dnase2) ``` - Assume that your peak-caller was "too specific" and that you want to expand your peaks by 50 bp on each end (so make them 100 bp larger). Use a combination of `resize` (and pay attention to the `fix` argument) and `width` to do this expansion to dnase and call the new `GRanges` object "dnase_wide". ```{r} w = width(dnase) dnase_wide = resize(dnase, width=w+100, fix='center') #make a copy width(dnase_wide) ``` ## Exercise 3 In this exercise, we are going to look at the overlap of DNAse sites relative to genes. To get started, install and load the `TxDb.Hsapiens.UCSC.hg19.knownGene` txdb object. ``` BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene") library("TxDb.Hsapiens.UCSC.hg19.knownGene") kg = TxDb.Hsapiens.UCSC.hg19.knownGene ``` - Load the transcripts from the knownGene txdb into a variable. What is the class of this object? ```{r} library("TxDb.Hsapiens.UCSC.hg19.knownGene") kg = TxDb.Hsapiens.UCSC.hg19.knownGene gx = genes(kg) class(gx) length(gx) ``` - Read about the `flank` method for GRanges objects. How could you use that to get the "promoter" regions of the transcripts? Let's assume that the promoter region is 2kb upstream of the gene. ```{r} flank(gx,2000) ``` - Instead of using flank, could you do the same thing with the TxDb object? (See `?promoters`). ```{r} proms = promoters(kg) ``` - Do any of the regions in the promoters overlap with each other? ```{r} summary(countOverlaps(proms)) ``` - To find overlap of our DNAse sites with promoters, let's collapse overlapping "promoters" to just keep the contiguous regions by using `reduce`. ```{r} # reduce takes all overlapping regions and collapses them # into a single region that spans all of the overlapping regions prom_regions = reduce(proms) # now we can check for overlaps summary(countOverlaps(prom_regions)) ``` - Count the number of DNAse sites that overlap with our promoter regions. ```{r} sum(dnase %over% prom_regions) # if you notice no overlap, check the seqlevels # and seqlevelsStyle seqlevelsStyle(dnase) = "UCSC" sum(dnase %over% prom_regions) sum(dnase2 %over% prom_regions) ``` - Is this surprising? If we were to assume that the promoter and dnase regions are "independent" of each other, what number of overlaps would we expect? ```{r} prop_proms = sum(width(prom_regions))/sum(seqlengths(prom_regions)) prop_dnase = sum(width(dnase))/sum(seqlengths(prom_regions)) # Iff the dnase and promoter regions are # not related, then we would expect this number # of DNAse overlaps with promoters. prop_proms * prop_dnase * length(dnase) ``` ## Exercise 4 We'll be using data from histone modification ChIP-seq experiments in human cells to illustrate the concepts of genomic ranges and features. The data consists of genomic intervals representing regions of the genome where specific histone modifications are enriched. These intervals are typically identified using ChIP-seq, a technique that maps protein-DNA interactions across the genome. The ChIP-seq data is stored in a BED file format, which is a tab-delimited text file format commonly used to represent genomic intervals. Each line in the BED file corresponds to a genomic interval and contains information about the chromosome, start and end positions, and strand orientation of the interval. Additional columns may include metadata such as the signal strength or significance of the interval. The AnnotationHub package in Bioconductor provides access to a wide range of genomic datasets, including ChIP-seq data. We can use this package to retrieve the ChIP-seq data for histone modifications in human cells and convert it into a GenomicRanges object for further analysis. <https://www.encodeproject.org/chip-seq/histone/> Let's start by loading the AnnotationHub package and retrieving the ChIP-seq data for histone modifications in human cells. You can read more about the AnnotationHub package and how to use it in the Bioconductor documentation. ```{r} library(AnnotationHub) ah <- AnnotationHub() ``` There are multiple ways to search the AnnotationHub database. We've done that for you and here are the `GRanges` objects for each of four histone marks, and one histone mark replicate. ```{r} h3k4me1 <- ah[['AH25832']] h3k4me3 <- ah[['AH25833']] h3k9ac <- ah[['AH25834']] h3k27me3 <- ah[['AH25835']] h3k4me3_2 <- ah[['AH27284']] ``` Each of these variables now represents the peak calls after a chip-seq experiment pulling down the histone mark of interest. In the encode project these records were `bed` files. The bed files have been converted to `GRanges` objects to allow computation within R. ```{r} # Grab cpg islands as well cpg = query(ah, c('cpg','UCSC','hg19'))[[1]] ``` Let's say that we don't know the behavior of the histone methylation marks with respect to CpG islands. We could ask the question, "What is the overlap of the histone peaks with CpG islands?" ```{r} sum(h3k4me1 %over% cpg) ``` We might want to actually count the number of bases of overlap between the methyl mark and CpG islands. ```{r} # The intersection of two peak sets results in the # overlapping regions as a new set of regions # The width of each peak is the number of overlapping bases # And the sum of the widths is the total bases overlapping sum(width(intersect(h3k4me1, cpg))) ``` But some methyl marks are known to have very broad signals, meaning that there is a higher chance of overlapping CpG islands just because there are more methylated bases. We can adjust for this by "normalizing" for all possible bases covered by either set of peaks, using `union`. We might think of this as a sort of "enrichment score" of one set in another set. ```{r} sum(width(union(h3k4me1, cpg))) # and now "normalize" sum(width(intersect(h3k4me1, cpg)))/sum(width(union(h3k4me1, cpg))) ``` Let's write a small function to calculate our little enrichment score. ```{r} range_enrichment_score <- function(r1, r2) { i = sum(width(intersect(r1, r2))) u = sum(width(union(r1,r2))) return(i/u) } ``` And give it a try: ```{r} range_enrichment_score(h3k4me1, cpg) ```