In the following exercises, we will use the GenomicRanges package to explore range operations. We will use the AnnotationHub package to load DNAse hypersensitivity data from the ENCODE project. In practice, the ENCODE project published datasets like these as bed files. AnnotationHub has packaged these into GRanges objects that we can load and use directly. However, if you have a bed file of your own (peak calls, enhancer regions, etc.), you can load them into GRanges objects using rtracklayer::import.
20.1 Exercise 1
In this exercise, we will use DNAse hypersensitivity data to practice working with a GRanges object.
Use the AnnotationHub package to find the goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep1.narrowPeak.gzGRanges object. Load that into R as the variable dnase.
Show the answer
library(AnnotationHub)ah=AnnotationHub()query(ah, "goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep1.narrowPeak.gz")# the thing above should have only one record, so we can # just grab itdnase=query(ah, "goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep1.narrowPeak.gz")[[1]]
What are the sequences that were used in the analysis? Do the names have “chr” or not? Experiment with changing the seqlevelsStyle to adjust the sequence names.
In this exercise, we are going to load the second DNAse hypersensitivity replicate to investigate overlaps with the first replicate.
Use the AnnotationHub to find the second replicate, goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep2.narrowPeak.gz. Load that as dnase2.
Show the answer
query(ah, "goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep2.narrowPeak.gz")# the thing above should have only one record, so we can # just grab itdnase2=query(ah, "goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep2.narrowPeak.gz")[[1]]
How many peaks are there in dnase and dnase2? Are there are similar number?
Assume that your peak-caller was “too specific” and that you want to expand your peaks by 50 bp on each end (so make them 100 bp larger). Use a combination of resize (and pay attention to the fix argument) and width to do this expansion to dnase and call the new GRanges object “dnase_wide”.
Show the answer
w=width(dnase)dnase_wide=resize(dnase, width=w+100, fix='center')#make a copywidth(dnase_wide)
20.3 Exercise 3
In this exercise, we are going to look at the overlap of DNAse sites relative to genes. To get started, install and load the TxDb.Hsapiens.UCSC.hg19.knownGene txdb object.
BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
kg = TxDb.Hsapiens.UCSC.hg19.knownGene
Load the transcripts from the knownGene txdb into a variable. What is the class of this object?
Read about the flank method for GRanges objects. How could you use that to get the “promoter” regions of the transcripts? Let’s assume that the promoter region is 2kb upstream of the gene.
To find overlap of our DNAse sites with promoters, let’s collapse overlapping “promoters” to just keep the contiguous regions by using reduce.
Show the answer
# reduce takes all overlapping regions and collapses them# into a single region that spans all of the overlapping regionsprom_regions=reduce(proms)# now we can check for overlapssummary(countOverlaps(prom_regions))
Count the number of DNAse sites that overlap with our promoter regions.
Show the answer
sum(dnase%over%prom_regions)# if you notice no overlap, check the seqlevels# and seqlevelsStyleseqlevelsStyle(dnase)="UCSC"sum(dnase%over%prom_regions)sum(dnase2%over%prom_regions)
Is this surprising? If we were to assume that the promoter and dnase regions are “independent” of each other, what number of overlaps would we expect?
Show the answer
prop_proms=sum(width(prom_regions))/sum(seqlengths(prom_regions))prop_dnase=sum(width(dnase))/sum(seqlengths(prom_regions))# Iff the dnase and promoter regions are # not related, then we would expect this number# of DNAse overlaps with promoters.prop_proms*prop_dnase*length(dnase)
20.4 Exercise 4
We’ll be using data from histone modification ChIP-seq experiments in human cells to illustrate the concepts of genomic ranges and features. The data consists of genomic intervals representing regions of the genome where specific histone modifications are enriched. These intervals are typically identified using ChIP-seq, a technique that maps protein-DNA interactions across the genome.
The ChIP-seq data is stored in a BED file format, which is a tab-delimited text file format commonly used to represent genomic intervals. Each line in the BED file corresponds to a genomic interval and contains information about the chromosome, start and end positions, and strand orientation of the interval. Additional columns may include metadata such as the signal strength or significance of the interval.
The AnnotationHub package in Bioconductor provides access to a wide range of genomic datasets, including ChIP-seq data. We can use this package to retrieve the ChIP-seq data for histone modifications in human cells and convert it into a GenomicRanges object for further analysis.
Let’s start by loading the AnnotationHub package and retrieving the ChIP-seq data for histone modifications in human cells. You can read more about the AnnotationHub package and how to use it in the Bioconductor documentation.
There are multiple ways to search the AnnotationHub database. We’ve done that for you and here are the GRanges objects for each of four histone marks, and one histone mark replicate.
Each of these variables now represents the peak calls after a chip-seq experiment pulling down the histone mark of interest. In the encode project these records were bed files. The bed files have been converted to GRanges objects to allow computation within R.
Show the answer
# Grab cpg islands as wellcpg=query(ah, c('cpg','UCSC','hg19'))[[1]]
Let’s say that we don’t know the behavior of the histone methylation marks with respect to CpG islands. We could ask the question, “What is the overlap of the histone peaks with CpG islands?”
We might want to actually count the number of bases of overlap between the methyl mark and CpG islands.
Show the answer
# The intersection of two peak sets results in the # overlapping regions as a new set of regions# The width of each peak is the number of overlapping bases# And the sum of the widths is the total bases overlappingsum(width(intersect(h3k4me1, cpg)))
But some methyl marks are known to have very broad signals, meaning that there is a higher chance of overlapping CpG islands just because there are more methylated bases. We can adjust for this by “normalizing” for all possible bases covered by either set of peaks, using union. We might think of this as a sort of “enrichment score” of one set in another set.
---title: "Ranges Exercises"format: html: code-fold: true code-summary: "Show the answer" code-tools: true---```{r init, include=FALSE}library(knitr)opts_chunk$set(cache=TRUE, message=FALSE, results='hide')```In the following exercises, we will use the `GenomicRanges` package to explore range operations.We will use the `AnnotationHub` package to load DNAse hypersensitivity datafrom the ENCODE project. In practice, the ENCODE project published datasets like these as bed files. AnnotationHub has packaged these into GRanges objects that we can load and use directly.However, if you have a bed file of your own (peak calls, enhancer regions, etc.), you can load them into GRanges objects using `rtracklayer::import`.## Exercise 1In this exercise, we will use DNAse hypersensitivity data to practice working with a GRanges object.- Use the `AnnotationHub` package to find the `goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep1.narrowPeak.gz``GRanges` object. Loadthat into R as the variable dnase.```{r dnase}library(AnnotationHub)ah = AnnotationHub()query(ah, "goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep1.narrowPeak.gz")# the thing above should have only one record, so we can # just grab itdnase = query(ah, "goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep1.narrowPeak.gz")[[1]]```- What type of object is dnase?```{r}dnaseclass(dnase)```- What metadata is stored in dnase?```{r}mcols(dnase)```- How many peaks are on each chromosome?```{r}library(GenomicFeatures)table(seqnames(dnase))```- What are the mean, min, max, and median widths of the peaks?```{r}summary(width(dnase))```- What are the sequences that were used in the analysis? Do the names have "chr" or not? Experiment with changing the`seqlevelsStyle` to adjust the sequence names. ```{r}seqlevels(dnase)seqlevelsStyle(dnase)seqlevelsStyle(dnase) ='ensembl'seqlevelsStyle(dnase)seqlevels(dnase)```- What is the total amount of "landscape" covered by the peaks? Assume that the peaks do not overlap. What portionof the genome does this represent?```{r}sum(width(dnase))sum(seqlengths(dnase))sum(width(dnase))/sum(seqlengths(dnase))```## Exercise 2In this exercise, we are going to load the second DNAse hypersensitivity replicate to investigate overlaps withthe first replicate.- Use the AnnotationHub to find the second replicate, `goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep2.narrowPeak.gz`. Load that as `dnase2`.```{r}query(ah, "goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep2.narrowPeak.gz")# the thing above should have only one record, so we can # just grab itdnase2 =query(ah, "goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562PkRep2.narrowPeak.gz")[[1]]```- How many peaks are there in `dnase` and `dnase2`? Are there are similar number?```{r}length(dnase)length(dnase2)```- What are the peak sizes for `dnase2`?```{r}summary(width(dnase2))```- What proportion of the genome does `dnase2` cover?```{r}sum(width(dnase))/sum(seqlengths(dnase))```- Count the number of peaks from `dnase` that overlap with `dnase2`.```{r}sum(dnase %over% dnase2)```- Assume that your peak-caller was "too specific" and that you want to expand your peaksby 50 bp on each end (so make them 100 bp larger). Use a combination of `resize` (and payattention to the `fix` argument) and `width` to do this expansion to dnase and call thenew `GRanges` object "dnase_wide".```{r}w =width(dnase)dnase_wide =resize(dnase, width=w+100, fix='center') #make a copywidth(dnase_wide)```## Exercise 3In this exercise, we are going to look at the overlap of DNAse sites relative to genes.To get started, install and load the `TxDb.Hsapiens.UCSC.hg19.knownGene` txdb object.```BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene")library("TxDb.Hsapiens.UCSC.hg19.knownGene")kg = TxDb.Hsapiens.UCSC.hg19.knownGene```- Load the transcripts from the knownGene txdb into a variable. What is the class of this object?```{r}library("TxDb.Hsapiens.UCSC.hg19.knownGene")kg = TxDb.Hsapiens.UCSC.hg19.knownGenegx =genes(kg)class(gx)length(gx)```- Read about the `flank` method for GRanges objects. How could you use that to get the "promoter" regions of the transcripts? Let's assume that the promoter region is2kb upstream of the gene.```{r}flank(gx,2000)```- Instead of using flank, could you do the same thing with the TxDb object? (See `?promoters`).```{r}proms =promoters(kg)```- Do any of the regions in the promoters overlap with each other? ```{r}summary(countOverlaps(proms))```- To find overlap of our DNAse sites with promoters, let's collapse overlapping "promoters" to just keep the contiguous regions by using `reduce`. ```{r}# reduce takes all overlapping regions and collapses them# into a single region that spans all of the overlapping regionsprom_regions =reduce(proms)# now we can check for overlapssummary(countOverlaps(prom_regions))```- Count the number of DNAse sites that overlap with our promoter regions.```{r}sum(dnase %over% prom_regions)# if you notice no overlap, check the seqlevels# and seqlevelsStyleseqlevelsStyle(dnase) ="UCSC"sum(dnase %over% prom_regions)sum(dnase2 %over% prom_regions)```- Is this surprising? If we were to assume that the promoter and dnase regionsare "independent" of each other, what number of overlaps would we expect?```{r}prop_proms =sum(width(prom_regions))/sum(seqlengths(prom_regions))prop_dnase =sum(width(dnase))/sum(seqlengths(prom_regions))# Iff the dnase and promoter regions are # not related, then we would expect this number# of DNAse overlaps with promoters.prop_proms * prop_dnase *length(dnase) ```## Exercise 4We'll be using data from histone modification ChIP-seq experiments in human cells to illustrate the concepts of genomic ranges and features. The data consists of genomic intervals representing regions of the genome where specific histone modifications are enriched. These intervals are typically identified using ChIP-seq, a technique that maps protein-DNA interactions across the genome.The ChIP-seq data is stored in a BED file format, which is a tab-delimited text file format commonly used to represent genomic intervals. Each line in the BED file corresponds to a genomic interval and contains information about the chromosome, start and end positions, and strand orientation of the interval. Additional columns may include metadata such as the signal strength or significance of the interval.The AnnotationHub package in Bioconductor provides access to a wide range of genomic datasets, including ChIP-seq data. We can use this package to retrieve the ChIP-seq data for histone modifications in human cells and convert it into a GenomicRanges object for further analysis.<https://www.encodeproject.org/chip-seq/histone/>Let's start by loading the AnnotationHub package and retrieving the ChIP-seq data for histone modifications in human cells. You can read more about the AnnotationHub package and how to use it in the Bioconductor documentation.```{r}library(AnnotationHub)ah <-AnnotationHub()```There are multiple ways to search the AnnotationHub database. We've done that for you and here are the `GRanges` objects for each of four histone marks, and one histone mark replicate. ```{r}h3k4me1 <- ah[['AH25832']]h3k4me3 <- ah[['AH25833']]h3k9ac <- ah[['AH25834']]h3k27me3 <- ah[['AH25835']]h3k4me3_2 <- ah[['AH27284']]```Each of these variables now represents the peak calls after a chip-seq experiment pulling down the histone mark of interest. In the encode project these records were `bed` files. The bed files have been converted to `GRanges` objects to allow computation within R. ```{r}# Grab cpg islands as wellcpg =query(ah, c('cpg','UCSC','hg19'))[[1]]```Let's say that we don't know the behavior of the histone methylation marks with respect to CpG islands. We could ask the question, "What is the overlap of the histone peaks with CpG islands?"```{r}sum(h3k4me1 %over% cpg)```We might want to actually count the number of bases of overlap between the methyl mark and CpG islands.```{r}# The intersection of two peak sets results in the # overlapping regions as a new set of regions# The width of each peak is the number of overlapping bases# And the sum of the widths is the total bases overlappingsum(width(intersect(h3k4me1, cpg)))```But some methyl marks are known to have very broad signals, meaning that there is a higher chance of overlapping CpG islands just because there are more methylated bases.We can adjust for this by "normalizing" for all possible bases covered by either set of peaks, using `union`.We might think of this as a sort of "enrichment score" of one set in another set.```{r}sum(width(union(h3k4me1, cpg)))# and now "normalize" sum(width(intersect(h3k4me1, cpg)))/sum(width(union(h3k4me1, cpg)))```Let's write a small function to calculate our little enrichment score.```{r}range_enrichment_score <-function(r1, r2) { i =sum(width(intersect(r1, r2))) u =sum(width(union(r1,r2)))return(i/u)}```And give it a try:```{r}range_enrichment_score(h3k4me1, cpg)```