21 ATAC-Seq with Bioconductor
Overview
Pre-requisites
This workshop assumes:
- A working and up-to-date version of R
- Basic knowledge of R syntax
- Familiarity with the GenomicRanges package and range manipulations
- Familiarity with BAM files and their contents
Participation
After a very brief review of ATAC-Seq and chromatin accessibility, students will work independently to follow this workflow. Additional materials are provided as links at the end of the workshop for those wanting deeper exposure. Additional materials include alignment from FASTQ files and peak calling.
R / Bioconductor packages used
Time outline
An example for a 45-minute workshop:
Activity | Time |
---|---|
Introduction | 15m |
Independent work | 2-3hr |
Additional exercises (optional, external) | up to 12 hours |
Learning goals
- Describe how to import sequence alignments in BAM format into R
- Relate fragment size to genomic characteristics such as nucleosome occupancy and open chromatin.
- Perform basic alignment manipulations in R to enrich ATAC-seq data for chromatin characteristics.
- Gain familiarity with the IGV genome browser and examining data in genomic context.
- Visualize summaries of genomic signal using profile plots and heatmaps.
Learning objectives
- Load and save genomic data in BAM and BigWig formats [GenomicAlignments and rtracklayer].
- Perform basic QC plots from ATAC-Seq data.
- Isolate nucleosome-free and mononucleosome regions from ATAC-seq data.
- Install and use IGV to visualize data in genomic context.
- Create profile plots using the heatmaps package.
22 Background
Chromatin accessibility assays measure the extent to which DNA is open and accessible. Such assays now use high throughput sequencing as a quantitative readout. DNAse assays, first using microarrays(Crawford, Davis, et al. 2006) and then DNAse-Seq (Crawford, Holt, et al. 2006), requires a larger amount of DNA and is labor-indensive and has been largely supplanted by ATAC-Seq (Buenrostro et al. 2013).
The Assay for Transposase Accessible Chromatin with high-throughput sequencing (ATAC-seq) method maps chromatin accessibility genome-wide. This method quantifies DNA accessibility with a hyperactive Tn5 transposase that cuts and inserts sequencing adapters into regions of chromatin that are accessible. High throughput sequencing of fragments produced by the process map to regions of increased accessibility, transcription factor binding sites, and nucleosome positioning. The method is both fast and sensitive and can be used as a replacement for DNAse and MNase.
An early review of chromatin accessibility assays (Tsompana and Buck 2014) compares the use cases, pros and cons, and expected signals from each of the most common approaches (Figure @ref(fig:chromatinAssays)).
The first manuscript describing ATAC-Seq protocol and findings outlined how ATAC-Seq data “line up” with other datatypes such as ChIP-seq and DNAse-seq (Figure @ref(fig:greenleaf)). They also highlight how fragment length correlates with specific genomic regions and characteristics (Buenrostro et al. 2013, fig. 3).
Buenrostro et al. provide a detailed protocol for performing ATAC-Seq and quality control of results (Buenrostro et al. 2015). Updated and modified protocols that improve on signal-to-noise and reduce input DNA requirements have been described.
22.1 Informatics overview
ATAC-Seq protocols typically utilize paired-end sequencing protocols. The reads are aligned to the respective genome using bowtie2
, BWA
, or other short-read aligner. The result, after appropriate manipulation, often using samtools
, results in a BAM file. Among other details, the BAM format includes columns for:
knitr::include_graphics('imgs/bam_shot.png')
- sequence name (
chr1
) - start position (integer)
- a CIGAR string that describes the alignment in a compact form
- the sequence to which the pair aligns
- the position to which the pair aligns
- a bit flag field that describes multiple characteristics of the alignment
- the sequence and quality string of the read
- additional tags that tend to be aligner-specific
Duplicate fragments (those with the same start and end position of other reads) are marked and likely discarded. Reads that fail to align “properly” are also often excluded from analysis. It is worth noting that most software packages allow simple “marking” of such reads and that there is usually no need to create a special BAM file before proceeding with downstream work.
After alignment and BAM processing, the workflow can switch to Bioconductor.
22.2 Working with sequencing data in Bioconductor
The Bioconductor project includes several infrastructure packages for dealing with ranges (sequence name, start, end, +/- strand) on sequences (Lawrence et al. 2013) as well as capabilities with working with Fastq files directly (Morgan et al. 2016).
Package | Use cases |
---|---|
Rsamtools | low level access to FASTQ, VCF, SAM, BAM, BCF formats |
GenomicRanges | Container and methods for handling genomic reagions |
GenomicFeatures | Work with transcript databases, gff, gtf and BED formats |
GenomicAlignments | Reader for BAM format |
rtracklayer | import and export multiple UCSC file formats including BigWig and Bed |
As noted in the previous section, the output of an ATAC-Seq experiment is a BAM file. As paired-end sequencing is a commonly-applied approach for ATAC-Seq, the readGAlignmentPairs
function is the appropriate method to use.
23 Data import and quality control
Reading a paired-end BAM file looks a bit complicated, but the following code will:
- Read the included BAM file.
- Include read pairs only (
isPaired = TRUE
) - Include properly paired reads (
isProperPair = TRUE
) - Include reads with mapping quality >= 1
- Add a couple of additional fields,
mapq
(mapping quality) andisize
(insert size) to the default fields.
greenleaf <- readGAlignmentPairs(
"https://github.com/seandavi/RBiocBook/raw/main/atac-seq/extdata/Sorted_ATAC_21_22.bam",
param = ScanBamParam(
mapqFilter = 1,
flag = scanBamFlag(
isPaired = TRUE,
isProperPair = TRUE
),
what = c("mapq", "isize")
)
)
Exercise: What is the class of greenleaf
? Exercise: Use the GenomicAlignments::first()
accessor to get the first read of the pair as a GAlignments
object. Save the result as a variable called gl_first_read
. Use the mcols
accessor to find the “metadata columns” of gl_first_read
. Exercise: How many read pairs map to each chromosome?
We can make plot of the number of reads mapping to each chromosome.
To keep things small, the example BAM file includes only chromosomes 21 and 22.
ggplot(chromCounts, aes(x = chromosome, y = count)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Normalizing by the chromosome length can yield the reads per megabase which should crudely be similar across all chromosomes.
chromCounts <- chromCounts %>%
dplyr::mutate(readsPerMb = (count / (seqlengths(greenleaf) / 1e6)))
And show a plot. For two chromosomes, this is a little underwhelming.
ggplot(chromCounts, aes(x = chromosome, y = readsPerMb)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme_bw()
23.1 Coverage
The coverage
method for genomic ranges calculates, for each base, the number of overlapping features. In the case of a BAM file from ATAC-Seq converted to a GAlignmentPairs object, the coverage gives us an idea of the extent to which reads pile up to form peaks.
The coverage is returned as a SimpleRleList
object. Using names
can get us the names of the elements of the list.
names(cvg)
[1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9"
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21" "chr22" "chrX" "chrY" "chrM"
There is a name for each chromosome. Looking at the chr21
entry:
cvg$chr21
integer-Rle of length 48129895 with 397462 runs
Lengths: 9411376 50 11 50 ... 36 14 28 10806
Values : 0 2 0 2 ... 1 2 1 0
we see that each chromosome is represented as an Rle
, short for run-length-encoding. Simply put, since along the chromosome there are many repeated values, we can recode the long vector as a set of (length: value) pairs. For example, if the first 9,410,000 base pairs have 0
coverage, we encode that as (9,410,000: 0). Doing that across the chromosome can very significantly reduce the memory use for genomic coverage.
The following little function, plotCvgHistByChrom
can plot a histogram of the coverage for a chromosome.
plotCvgHistByChrom <- function(cvg, chromosome) {
library(ggplot2)
cvgcounts <- as.data.frame(table(cvg[[chromosome]]))
cvgcounts[, 1] <- as.numeric(as.character(cvgcounts[, 1]))
colnames(cvgcounts) <- c("Coverage", "Count")
ggplot(cvgcounts, aes(x = Coverage, y = Count)) +
ggtitle(paste("Chromosome", chromosome)) +
geom_point(alpha = 0.5) +
geom_smooth(span = 0.2) +
scale_y_log10() +
theme_bw()
}
for (i in c("chr21", "chr22")) {
print(plotCvgHistByChrom(cvg, i))
}
23.2 Fragment Lengths
The first ATAC-Seq manuscript (Buenrostro et al. 2013) highlighted the relationship between fragment length and nucleosomes (see Figure @ref{fig:flgreenleaf}).
knitr::include_graphics("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3959825/bin/nihms554473f2.jpg")
Remember that we loaded the example BAM file with insert sizes (isize
). We can use that “column” to examine the fragment lengths (another name for insert size). Also, note that the insert size for the first
read and the second
are the same (absolute value). Here, we will use first
.
We can plot the fragment length density (histogram) using the density
function.
Exercise: Adjust the xlim
, bw
, and try log="y"
in the plot to highlight features present in figure \(\ref{fig:flgreenleaf}\).
And for fun, the ggplot2 version:
library(dplyr)
library(ggplot2)
fragLenPlot <- table(fraglengths) %>%
data.frame() %>%
rename(
InsertSize = fraglengths,
Count = Freq
) %>%
mutate(
InsertSize = as.numeric(as.vector(InsertSize)),
Count = as.numeric(as.vector(Count))
) %>%
ggplot(aes(x = InsertSize, y = Count)) +
geom_line()
print(fragLenPlot + theme_bw() + lims(x = c(-1, 250)))
Knowing that the nucleosome-free regions will have insert sizes shorter than one nucleosome, we can isolate the read pairs that have that characteristic.
And the mononucleosome reads will be between 187 and 250 base pairs for insert size/fragment length.
Finally, we expect nucleosome-free reads to be enriched near the TSS while mononucleosome reads should not be. We will use the heatmaps package to take a look at these two sets of reads with respect to the tss of the human genome.
Take a look at the heatmaps package vignette to learn more about the heatmaps package capabilities.
library(heatmaps)
gl_nf_hm <- CoverageHeatmap(proms, coverage(gl_nf), coords = c(-250, 250))
label(gl_nf_hm) <- "NucFree"
scale(gl_nf_hm) <- c(0, 10)
plotHeatmapMeta(gl_nf_hm)
gl_mn_hm <- CoverageHeatmap(proms, coverage(gl_mn), coords = c(-250, 250))
label(gl_mn_hm) <- "MonoNuc"
scale(gl_mn_hm) <- c(0, 10)
plotHeatmapMeta(gl_mn_hm)
plotHeatmapList(list(gl_mn_hm, gl_nf_hm))
24 Viewing data in IGV
Install IGV from here.
We export the greenleaf data as a BigWig file.
Exercise: In IGV, choose hg19
. Then, load the greenleaf.bw
file and explore chromosomes 21 and 22. Exercise: Export the nucleosome-free portion of the data as a BigWig file and examine that in IGV. Where do you expect to see the strongest signals?
25 Additional work
For those working extensively on ATAC-Seq, there is a great workflow/tutorial available from Thomas Carrol:
https://rockefelleruniversity.github.io/RU_ATAC_Workshop.html
Feel free to work through it. In addition to the work above, there is also the ATACseqQC package vignette that offers more than just QC. At least a couple more packages are available in Bioconductor.
Appendix
Session info
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.2.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] rtracklayer_1.64.0
[2] heatmaps_1.28.0
[3] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[4] GenomicFeatures_1.56.0
[5] AnnotationDbi_1.66.0
[6] dplyr_1.1.4
[7] ggplot2_3.5.1
[8] GenomicAlignments_1.40.0
[9] Rsamtools_2.20.0
[10] Biostrings_2.72.1
[11] XVector_0.44.0
[12] SummarizedExperiment_1.34.0
[13] Biobase_2.64.0
[14] MatrixGenerics_1.16.0
[15] matrixStats_1.3.0
[16] GenomicRanges_1.56.0
[17] GenomeInfoDb_1.40.1
[18] IRanges_2.38.0
[19] S4Vectors_0.42.0
[20] BiocGenerics_0.50.0
[21] BiocStyle_2.32.0
[22] knitr_1.47
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 EBImage_4.46.0 blob_1.2.4
[4] bitops_1.0-7 fastmap_1.2.0 RCurl_1.98-1.14
[7] XML_3.99-0.16.1 digest_0.6.35 lifecycle_1.0.4
[10] KEGGREST_1.44.0 RSQLite_2.3.7 magrittr_2.0.3
[13] compiler_4.4.0 rlang_1.1.4 tools_4.4.0
[16] plotrix_3.8-4 utf8_1.2.4 yaml_2.3.8
[19] S4Arrays_1.4.1 htmlwidgets_1.6.4 bit_4.0.5
[22] curl_5.2.1 DelayedArray_0.30.1 RColorBrewer_1.1-3
[25] KernSmooth_2.23-24 abind_1.4-5 BiocParallel_1.38.0
[28] withr_3.0.0 grid_4.4.0 fansi_1.0.6
[31] colorspace_2.1-0 scales_1.3.0 cli_3.6.2
[34] rmarkdown_2.27 crayon_1.5.2 generics_0.1.3
[37] httr_1.4.7 rjson_0.2.21 DBI_1.2.3
[40] cachem_1.1.0 zlibbioc_1.50.0 parallel_4.4.0
[43] tiff_0.1-12 BiocManager_1.30.23 restfulr_0.0.15
[46] vctrs_0.6.5 Matrix_1.7-0 jsonlite_1.8.8
[49] fftwtools_0.9-11 bit64_4.0.5 jpeg_0.1-10
[52] locfit_1.5-9.9 glue_1.7.0 codetools_0.2-20
[55] gtable_0.3.5 BiocIO_1.14.0 UCSC.utils_1.0.0
[58] munsell_0.5.1 tibble_3.2.1 pillar_1.9.0
[61] htmltools_0.5.8.1 GenomeInfoDbData_1.2.12 R6_2.5.1
[64] evaluate_0.23 lattice_0.22-6 png_0.1-8
[67] memoise_2.0.1 SparseArray_1.4.8 xfun_0.44
[70] pkgconfig_2.0.3
MACS2
The MACS2 package is a commonly-used package for calling peaks. Installation and other details are available1.
1 https://github.com/taoliu/MACS
pip install macs2