27 Genomic ranges and ATAC-Seq
R / Bioconductor packages used
27.1 Background
Chromatin accessibility assays measure the extent to which DNA is open and accessible. Such assays now use high throughput sequencing as a quantitative readout. DNAse assays, first using microarrays(Crawford, Davis, et al. 2006) and then DNAse-Seq (Crawford, Holt, et al. 2006), requires a larger amount of DNA and is labor-indensive and has been largely supplanted by ATAC-Seq (Buenrostro et al. 2013).
The Assay for Transposase Accessible Chromatin with high-throughput sequencing (ATAC-seq) method maps chromatin accessibility genome-wide. This method quantifies DNA accessibility with a hyperactive Tn5 transposase that cuts and inserts sequencing adapters into regions of chromatin that are accessible. High throughput sequencing of fragments produced by the process map to regions of increased accessibility, transcription factor binding sites, and nucleosome positioning. The method is both fast and sensitive and can be used as a replacement for DNAse and MNase.
An early review of chromatin accessibility assays (Tsompana and Buck 2014) compares the use cases, pros and cons, and expected signals from each of the most common approaches (Figure @ref(fig:chromatinAssays)).
The first manuscript describing ATAC-Seq protocol and findings outlined how ATAC-Seq data “line up” with other datatypes such as ChIP-seq and DNAse-seq (Figure @ref(fig:greenleaf)). They also highlight how fragment length correlates with specific genomic regions and characteristics (Buenrostro et al. 2013, fig. 3).
Buenrostro et al. provide a detailed protocol for performing ATAC-Seq and quality control of results (Buenrostro et al. 2015). Updated and modified protocols that improve on signal-to-noise and reduce input DNA requirements have been described.
27.2 Informatics overview
ATAC-Seq protocols typically utilize paired-end sequencing protocols. The reads are aligned to the respective genome using bowtie2
, BWA
, or other short-read aligner. The result, after appropriate manipulation, often using samtools
, results in a BAM file. Among other details, the BAM format includes columns for:
knitr::include_graphics('imgs/bam_shot.png')
samtools view
is the text format of the BAM file (called SAM format). Bioconductor and many other tools use BAM files for input. Note that BAM files also often include an index .bai
file that enables random access into the file; one can read just a genomic region without having to read the entire file.- sequence name (
chr1
) - start position (integer)
- a CIGAR string that describes the alignment in a compact form
- the sequence to which the pair aligns
- the position to which the pair aligns
- a bit flag field that describes multiple characteristics of the alignment
- the sequence and quality string of the read
- additional tags that tend to be aligner-specific
Duplicate fragments (those with the same start and end position of other reads) are marked and likely discarded. Reads that fail to align “properly” are also often excluded from analysis. It is worth noting that most software packages allow simple “marking” of such reads and that there is usually no need to create a special BAM file before proceeding with downstream work.
After alignment and BAM processing, the workflow can switch to Bioconductor.
27.3 Working with sequencing data in Bioconductor
The Bioconductor project includes several infrastructure packages for dealing with ranges (sequence name, start, end, +/- strand) on sequences (Lawrence et al. 2013) as well as capabilities with working with Fastq files directly (Morgan et al. 2016).
Package | Use cases |
---|---|
Rsamtools | low level access to FASTQ, VCF, SAM, BAM, BCF formats |
GenomicRanges | Container and methods for handling genomic reagions |
GenomicFeatures | Work with transcript databases, gff, gtf and BED formats |
GenomicAlignments | Reader for BAM format |
rtracklayer | import and export multiple UCSC file formats including BigWig and Bed |
As noted in the previous section, the output of an ATAC-Seq experiment is a BAM file. As paired-end sequencing is a commonly-applied approach for ATAC-Seq, the readGAlignmentPairs
function is the appropriate method to use.
28 Data import and quality control
Reading a paired-end BAM file looks a bit complicated, but the following code will:
- Read the included BAM file.
- Include read pairs only (
isPaired = TRUE
) - Include properly paired reads (
isProperPair = TRUE
) - Include reads with mapping quality >= 1
- Add a couple of additional fields,
mapq
(mapping quality) andisize
(insert size) to the default fields.
greenleaf <- readGAlignmentPairs(
"https://github.com/seandavi/RBiocBook/raw/main/atac-seq/extdata/Sorted_ATAC_21_22.bam",
param = ScanBamParam(
mapqFilter = 1,
flag = scanBamFlag(
isPaired = TRUE,
isProperPair = TRUE
),
what = c("mapq", "isize")
)
)
Exercise: What is the class of greenleaf
? Exercise: Use the GenomicAlignments::first()
accessor to get the first read of the pair as a GAlignments
object. Save the result as a variable called gl_first_read
. Use the mcols
accessor to find the “metadata columns” of gl_first_read
. Exercise: How many read pairs map to each chromosome?
We can make plot of the number of reads mapping to each chromosome.
To keep things small, the example BAM file includes only chromosomes 21 and 22.
ggplot(chromCounts, aes(x = chromosome, y = count)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Normalizing by the chromosome length can yield the reads per megabase which should crudely be similar across all chromosomes.
chromCounts <- chromCounts %>%
dplyr::mutate(readsPerMb = (count / (seqlengths(greenleaf) / 1e6)))
And show a plot. For two chromosomes, this is a little underwhelming.
ggplot(chromCounts, aes(x = chromosome, y = readsPerMb)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme_bw()
28.1 Coverage
The coverage
method for genomic ranges calculates, for each base, the number of overlapping features. In the case of a BAM file from ATAC-Seq converted to a GAlignmentPairs object, the coverage gives us an idea of the extent to which reads pile up to form peaks.
The coverage is returned as a SimpleRleList
object. Using names
can get us the names of the elements of the list.
names(cvg)
[1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9"
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21" "chr22" "chrX" "chrY" "chrM"
There is a name for each chromosome. Looking at the chr21
entry:
cvg$chr21
integer-Rle of length 48129895 with 397462 runs
Lengths: 9411376 50 11 50 ... 36 14 28 10806
Values : 0 2 0 2 ... 1 2 1 0
we see that each chromosome is represented as an Rle
, short for run-length-encoding. Simply put, since along the chromosome there are many repeated values, we can recode the long vector as a set of (length: value) pairs. For example, if the first 9,410,000 base pairs have 0
coverage, we encode that as (9,410,000: 0). Doing that across the chromosome can very significantly reduce the memory use for genomic coverage.
The following little function, plotCvgHistByChrom
can plot a histogram of the coverage for a chromosome.
plotCvgHistByChrom <- function(cvg, chromosome) {
library(ggplot2)
cvgcounts <- as.data.frame(table(cvg[[chromosome]]))
cvgcounts[, 1] <- as.numeric(as.character(cvgcounts[, 1]))
colnames(cvgcounts) <- c("Coverage", "Count")
ggplot(cvgcounts, aes(x = Coverage, y = Count)) +
ggtitle(paste("Chromosome", chromosome)) +
geom_point(alpha = 0.5) +
geom_smooth(span = 0.2) +
scale_y_log10() +
theme_bw()
}
for (i in c("chr21", "chr22")) {
print(plotCvgHistByChrom(cvg, i))
}
28.2 Fragment Lengths
The first ATAC-Seq manuscript (Buenrostro et al. 2013) highlighted the relationship between fragment length and nucleosomes (see Figure @ref{fig:flgreenleaf}).
knitr::include_graphics("https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0ad9/3959825/fde39a9fb288/nihms554473f2.jpg")
Remember that we loaded the example BAM file with insert sizes (isize
). We can use that “column” to examine the fragment lengths (another name for insert size). Also, note that the insert size for the first
read and the second
are the same (absolute value). Here, we will use first
.
We can plot the fragment length density (histogram) using the density
function.
And for fun, the ggplot2 version:
library(dplyr)
library(ggplot2)
fragLenPlot <- table(fraglengths) %>%
data.frame() |>
rename(
InsertSize = fraglengths,
Count = Freq
) |>
mutate(
InsertSize = as.numeric(as.vector(InsertSize)),
Count = as.numeric(as.vector(Count))
) |>
ggplot(aes(x = InsertSize, y = Count)) +
geom_line()
print(fragLenPlot + theme_bw() + lims(x = c(-1, 500)))
Knowing that the nucleosome-free regions will have insert sizes shorter than one nucleosome, we can isolate the read pairs that have that characteristic.
And the mononucleosome reads will be between 147 and 294 base pairs for insert size/fragment length.
Finally, we expect nucleosome-free reads to be enriched near the TSS while mononucleosome reads should not be. We will use the heatmaps package to take a look at these two sets of reads with respect to the tss of the human genome.
Take a look at the heatmaps package vignette to learn more about the heatmaps package capabilities.
gl_mn_hm <- CoverageHeatmap(proms, coverage(gl_mn), coords = c(-250, 250))
label(gl_mn_hm) <- "MonoNuc"
scale(gl_mn_hm) <- c(0, 10)
plotHeatmapMeta(list(gl_nf_hm, gl_mn_hm))
plotHeatmapList(list(gl_mn_hm, gl_nf_hm))
28.3 Viewing data in IGV
Install IGV from here.
We export the greenleaf data as a BigWig file.
-
Exercise: In IGV, choose
hg19
. Then, load thegreenleaf.bw
file and explore chromosomes 21 and 22. - Exercise: Export the nucleosome-free portion of the data as a BigWig file and examine that in IGV. Where do you expect to see the strongest signals?
28.4 Additional work
For those working extensively on ATAC-Seq, there is a great workflow/tutorial available from Thomas Carrol:
Feel free to work through it. In addition to the work above, there is also the ATACseqQC package vignette that offers more than just QC. At least a couple more packages are available in Bioconductor.
MACS2
The MACS2 package is a commonly-used package for calling peaks. Installation and other details are available1.
1 https://github.com/taoliu/MACS
pip install macs2