22 Genomic ranges and features
22.1 Introduction
Genomic ranges are essential components in the field of genomics, representing intervals on the genome that specify the start and end positions of DNA segments. These ranges can denote various genomic features, such as genes, exons, regulatory elements, and regions identified in genomic studies like ChIP-seq peaks. They play a pivotal role in the annotation, comparison, and interpretation of genomic features and experimental data, making them indispensable in biological data analysis.
Understanding genomic ranges begins with the concept of coordinate systems. Different databases and tools adopt different conventions for indexing genomic coordinates. For instance, the UCSC Genome Browser uses a 0-based coordinate system, while Ensembl employs a 1-based system. Moreover, genomic ranges often include strand information, indicating whether the feature is on the positive or negative DNA strand, which is crucial for correctly interpreting gene expression and other genomic functions.
Genomic ranges come in various forms, from single ranges defined by a simple start and end position (such as a single exon) to complex multi-range sets encompassing collections of ranges like all exons of a gene. Manipulating these ranges involves several fundamental operations. Intersection allows researchers to find overlapping regions between two sets of ranges, such as identifying ChIP-seq peaks that overlap with promoter regions. Union operations combine multiple ranges into a single contiguous range, while set difference identifies regions in one set that do not overlap with another set.
Several tools and libraries have been developed to facilitate the manipulation of genomic ranges. In the R programming environment, the Bioconductor project provides the GenomicRanges package, which is specifically designed for representing and manipulating genomic ranges. This package offers a variety of functions for range arithmetic and efficient overlap queries. Another useful R package is rtracklayer, which enables the import and export of genomic data in various formats, including BED and GFF files.
For those who prefer a command-line interface, BEDTools offers a suite of utilities for performing a wide range of operations on genomic intervals. This toolset is highly versatile, supporting tasks like intersecting, merging, and complementing genomic intervals. In the Python ecosystem, PyRanges provides a fast and flexible library for manipulating genomic intervals, offering similar functionality to Bioconductor’s GenomicRanges.
The applications of genomic ranges are diverse and far-reaching. In gene annotation, for instance, RNA-seq reads are mapped to known gene models to quantify gene expression levels. Variant annotation involves mapping variants identified from sequencing data to their genomic context, predicting functional consequences based on their location within genes or intergenic regions. Comparative genomics leverages genomic ranges to compare intervals between species, identifying conserved regions that might indicate essential functional elements. Epigenomic studies utilize genomic ranges to intersect DNA methylation data or histone modification peaks with genomic features, providing insights into regulatory mechanisms.
Despite their utility, working with genomic ranges presents several challenges. Converting coordinates between different reference genomes or different versions of the same genome can be complex and prone to errors. Integrating diverse types of genomic data, such as DNA sequences, epigenetic marks, and RNA-seq data, requires meticulous handling of genomic coordinates and ranges to ensure accurate analyses. Moreover, the sheer scale of genomic data necessitates optimized algorithms and data structures to handle large datasets efficiently.
Interpreting genomic ranges within their biological context is crucial for drawing meaningful conclusions. For instance, a range within a gene’s promoter region might indicate potential regulatory activity. Understanding the functional implications of genomic ranges often involves overlapping these ranges with known functional elements, such as enhancers or silencers, to infer gene regulation mechanisms and their phenotypic consequences. Tools like the UCSC Genome Browser and the Integrative Genomics Viewer (IGV) are invaluable for visualizing genomic ranges alongside other genomic annotations, aiding in the interpretation and exploration of genomic data.
22.2 Bioconductor and GenomicRanges
The Bioconductor GenomicRanges
package is a comprehensive toolkit designed to handle and manipulate genomic intervals and variables systematized on these intervals (Lawrence et al. 2013). Developed by Bioconductor, this package simplifies the complexity of managing genomic data, facilitating the efficient exploration, manipulation, and visualization of such data. GenomicRanges aids in dealing with the challenges of genomic data, including its massive size, intricate relationships, and high dimensionality.
The GenomicRanges package in Bioconductor covers a wide range of use cases related to the management and analysis of genomic data. Here are some key examples:
- Genomic Feature Manipulation
- The GenomicRanges and GRanges classes can be used to represent and manipulate various genomic features such as genes, transcripts, exons, or single-nucleotide polymorphisms (SNPs). Users can query, subset, resize, shift, or sort these features based on their genomic coordinates.
- Genomic Interval Operations
- The GenomicRanges package provides functions for performing operations on genomic intervals, such as finding overlaps, nearest neighbors, or disjoint intervals. These operations are fundamental to many types of genomic data analyses, such as identifying genes that overlap with ChIP-seq peaks, or finding variants that are in close proximity to each other.
- Alignments and Coverage
- The GAlignments and GAlignmentPairs classes can be used to represent alignments of sequencing reads to a reference genome, such as those produced by a read aligner. Users can then compute coverage of these alignments over genomic ranges of interest, which is a common task in RNA-seq or ChIP-seq analysis.
- Annotation and Metadata Handling
- The metadata column of a GRanges object can be used to store various types of annotation data associated with genomic ranges, such as gene names, gene biotypes, or experimental scores. This makes it easy to perform analyses that integrate genomic coordinates with other types of biological information.
- Genome Arithmetic
- The GenomicRanges package supports a version of “genome arithmetic”, which includes set operations (union, intersection, set difference) as well as other operations (like coverage, complement, or reduction) that are adapted to the specificities of genomic data.
- Efficient Data Handling
- The CompressedGRangesList class provides a space-efficient way to represent a large list of GRanges objects, which is particularly useful when working with large genomic datasets, such as whole-genome sequencing data.
The GenomicRanges package in Bioconductor uses the S4 class system (see Table 22.1), which is a part of the methods package in R. The S4 system is a more rigorous and formal approach to object-oriented programming in R, providing enhanced capabilities for object design and function dispatch.
Class Name | Description | Potential Use |
---|---|---|
GRanges | Represents a collection of genomic ranges and associated variables. | ChipSeq peaks, CpG islands, etc. |
GRangesList | Represents a list of GenomicRanges objects. | transcript models (exons, introns) |
RangesList | Represents a list of Ranges objects. | |
IRanges | Represents a collection of integer ranges. | used mainly to build GRanges, etc. |
GPos | Represents genomic positions. | SNPs or other single nicleotide locations |
GAlignments | Represents alignments against a reference genome. | Sequence read locations from a BAM file |
GAlignmentPairs | Represents pairs of alignments, typically representing a single fragment of DNA. | Paired-end sequence alignments |
In the context of the GenomicRanges package, the S4 class system allows for the creation of complex, structured data objects that can effectively encapsulate genomic intervals and associated data. This system enables the package to handle the complexity and intricacy of genomic data.
For example, the GenomicRanges class in the package is an S4 class that combines several basic data types into a composite object. It includes slots for sequence names (seqnames), ranges (start and end positions), strand information, and metadata. Each slot in the S4 class corresponds to a specific component of the genomic data, and methods (see Table 22.2 and Table 22.3) can be defined to interact with these slots in a structured and predictable way.
Method | Description |
---|---|
length | Returns the number of ranges in the GRanges object. |
seqnames | Retrieves the sequence names of the ranges. |
ranges | Retrieves the start and end positions of the ranges. |
strand | Retrieves the strand information of the ranges. |
elementMetadata | Retrieves the metadata associated with the ranges. |
seqlevels | Returns the levels of the factor that the seqnames slot is derived from. |
seqinfo | Retrieves the Seqinfo (sequence information) object associated with the GRanges object. |
start, end, width | Retrieve or set the start or end positions, or the width of the ranges. |
resize | Resizes the ranges. |
subset, [, [[, $ | Subset or extract elements from the GRanges object. |
sort | Sorts the GRanges object. |
shift | Shifts the ranges by a certain number of base pairs. |
The S4 class system also supports inheritance, which allows for the creation of specialized subclasses that share certain characteristics with a parent class but have additional features or behaviors.
The S4 system’s formalism and rigor make it well-suited to the complexities of bioinformatics and genomic data analysis. It allows for the creation of robust, reliable code that can handle complex data structures and operations, making it a key part of the GenomicRanges package and other Bioconductor packages.
Method | Description |
---|---|
findOverlaps | Finds overlaps between different sets of ranges. |
countOverlaps | Counts overlaps between different sets of ranges. |
subsetByOverlaps | Subsets the ranges based on overlaps. |
distanceToNearest | Computes the distance to the nearest range in another set of ranges. |
To get going, we can construct a GRanges
object by hand as an example.
The GRanges
class represents a collection of genomic ranges that each have a single start and end location on the genome. It can be used to store the location of genomic features such as contiguous binding sites, transcripts, and exons. These objects can be created by using the GRanges constructor function. The following code just creates a GRanges
object from scratch.
gr <- GRanges(
seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
ranges = IRanges(101:110, end = 111:120, names = head(letters, 10)),
strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
score = 1:10,
GC = seq(1, 0, length=10))
gr
GRanges object with 10 ranges and 2 metadata columns:
seqnames ranges strand | score GC
<Rle> <IRanges> <Rle> | <integer> <numeric>
a chr1 101-111 - | 1 1.000000
b chr2 102-112 + | 2 0.888889
c chr2 103-113 + | 3 0.777778
d chr2 104-114 * | 4 0.666667
e chr1 105-115 * | 5 0.555556
f chr1 106-116 + | 6 0.444444
g chr3 107-117 + | 7 0.333333
h chr3 108-118 + | 8 0.222222
i chr3 109-119 - | 9 0.111111
j chr3 110-120 - | 10 0.000000
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
This creates a GRanges
object with 10 genomic ranges. The output of the GRanges
show()
method separates the information into a left and right hand region that are separated by |
symbols (see Figure 22.1). The genomic coordinates (seqnames, ranges, and strand) are located on the left-hand side and the metadata columns are located on the right. For this example, the metadata is comprised of score and GC information, but almost anything can be stored in the metadata portion of a GRanges object.

GRanges
object, which behaves a bit like a vector of ranges, although the analogy is not perfect. A GRanges
object is composed of the “Ranges” part the lefthand box, the “metadata” columns (the righthand box), and a “seqinfo” part that describes the names and lengths of associated sequences. Only the “Ranges” part is required. The figure also shows a few of the “accessors” and approaches to subsetting a GRanges
object.
The components of the genomic coordinates within a GRanges object can be extracted using the seqnames, ranges, and strand accessor functions.
seqnames(gr)
factor-Rle of length 10 with 4 runs
Lengths: 1 3 2 4
Values : chr1 chr2 chr1 chr3
Levels(3): chr1 chr2 chr3
ranges(gr)
IRanges object with 10 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
a 101 111 11
b 102 112 11
c 103 113 11
d 104 114 11
e 105 115 11
f 106 116 11
g 107 117 11
h 108 118 11
i 109 119 11
j 110 120 11
strand(gr)
factor-Rle of length 10 with 5 runs
Lengths: 1 2 2 3 2
Values : - + * + -
Levels(3): + - *
Note that the GRanges
object has information to the “left” side of the |
that has special accessors. The information to the right side of the |
, when it is present, is the metadata and is accessed using mcols()
, for “metadata columns”.
[1] "DFrame"
attr(,"package")
[1] "S4Vectors"
mcols(gr)
DataFrame with 10 rows and 2 columns
score GC
<integer> <numeric>
a 1 1.000000
b 2 0.888889
c 3 0.777778
d 4 0.666667
e 5 0.555556
f 6 0.444444
g 7 0.333333
h 8 0.222222
i 9 0.111111
j 10 0.000000
Since the class
of mcols(gr)
is DFrame, we can use our DataFrame
approaches to work with the data.
mcols(gr)$score
[1] 1 2 3 4 5 6 7 8 9 10
We can even assign a new column.
GRanges object with 10 ranges and 3 metadata columns:
seqnames ranges strand | score GC AT
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric>
a chr1 101-111 - | 1 1.000000 0.000000
b chr2 102-112 + | 2 0.888889 0.111111
c chr2 103-113 + | 3 0.777778 0.222222
d chr2 104-114 * | 4 0.666667 0.333333
e chr1 105-115 * | 5 0.555556 0.444444
f chr1 106-116 + | 6 0.444444 0.555556
g chr3 107-117 + | 7 0.333333 0.666667
h chr3 108-118 + | 8 0.222222 0.777778
i chr3 109-119 - | 9 0.111111 0.888889
j chr3 110-120 - | 10 0.000000 1.000000
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
Another common way to create a GRanges
object is to start with a data.frame
, perhaps created by hand like below or read in using read.csv
or read.table
. We can convert from a data.frame
, when columns are named appropriately, to a GRanges
object.
df_regions = data.frame(chromosome = rep("chr1",10),
start=seq(1000,10000,1000),
end=seq(1100, 10100, 1000))
as(df_regions,'GRanges') # note that names have to match with GRanges slots
GRanges object with 10 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 1000-1100 *
[2] chr1 2000-2100 *
[3] chr1 3000-3100 *
[4] chr1 4000-4100 *
[5] chr1 5000-5100 *
[6] chr1 6000-6100 *
[7] chr1 7000-7100 *
[8] chr1 8000-8100 *
[9] chr1 9000-9100 *
[10] chr1 10000-10100 *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
GRanges object with 10 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 1000-1100 *
[2] chr1 2000-2100 *
[3] chr1 3000-3100 *
[4] chr1 4000-4100 *
[5] chr1 5000-5100 *
[6] chr1 6000-6100 *
[7] chr1 7000-7100 *
[8] chr1 8000-8100 *
[9] chr1 9000-9100 *
[10] chr1 10000-10100 *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
GRanges
have one-dimensional-like behavior. For instance, we can check the length
and even give GRanges
names.
22.2.1 Subsetting GRanges objects
While GRanges
objects look a bit like a data.frame
, they can be thought of as vectors with associated ranges. Subsetting, then, works very similarly to vectors. To subset a GRanges
object to include only second and third regions:
gr[2:3]
GRanges object with 2 ranges and 3 metadata columns:
seqnames ranges strand | score GC AT
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric>
b chr2 102-112 + | 2 0.888889 0.111111
c chr2 103-113 + | 3 0.777778 0.222222
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
That said, if the GRanges
object has metadata columns, we can also treat it like a two-dimensional object kind of like a data.frame. Note that the information to the left of the |
is not like a data.frame
, so we cannot do something like gr$seqnames
. Here is an example of subsetting with the subset of one metadata column.
gr[2:3, "GC"]
GRanges object with 2 ranges and 1 metadata column:
seqnames ranges strand | GC
<Rle> <IRanges> <Rle> | <numeric>
b chr2 102-112 + | 0.888889
c chr2 103-113 + | 0.777778
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
The usual head()
and tail()
also work just fine.
head(gr,n=2)
GRanges object with 2 ranges and 3 metadata columns:
seqnames ranges strand | score GC AT
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric>
a chr1 101-111 - | 1 1.000000 0.000000
b chr2 102-112 + | 2 0.888889 0.111111
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
tail(gr,n=2)
GRanges object with 2 ranges and 3 metadata columns:
seqnames ranges strand | score GC AT
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric>
i chr3 109-119 - | 9 0.111111 0.888889
j chr3 110-120 - | 10 0.000000 1.000000
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
22.2.2 Interval operations on one GRanges object
22.2.2.1 Intra-range methods
The methods described in this section work one-region-at-a-time and are, therefore, called “intra-region” methods. Methods that work across all regions are described below in the Inter-range methods section.
The GRanges
class has accessors for the “ranges” part of the data. For example:
## Make a smaller GRanges subset
g <- gr[1:3]
start(g) # to get start locations
[1] 101 102 103
end(g) # to get end locations
[1] 111 112 113
width(g) # to get the "widths" of each range
[1] 11 11 11
range(g) # to get the "range" for each sequence (min(start) through max(end))
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 101-111 -
[2] chr2 102-113 +
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
The GRanges class also has many methods for manipulating the ranges. The methods can be classified as intra-range methods, inter-range methods, and between-range methods. Intra-range methods operate on each element of a GRanges object independent of the other ranges in the object. For example, the flank method can be used to recover regions flanking the set of ranges represented by the GRanges object. So to get a GRanges object containing the ranges that include the 10 bases upstream of the ranges:
flank(g, 10)
GRanges object with 3 ranges and 3 metadata columns:
seqnames ranges strand | score GC AT
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric>
a chr1 112-121 - | 1 1.000000 0.000000
b chr2 92-101 + | 2 0.888889 0.111111
c chr2 93-102 + | 3 0.777778 0.222222
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
Note how flank
pays attention to “strand”. To get the flanking regions downstream of the ranges, we can do:
flank(g, 10, start=FALSE)
GRanges object with 3 ranges and 3 metadata columns:
seqnames ranges strand | score GC AT
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric>
a chr1 91-100 - | 1 1.000000 0.000000
b chr2 113-122 + | 2 0.888889 0.111111
c chr2 114-123 + | 3 0.777778 0.222222
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
Other examples of intra-range methods include resize
and shift
. The shift method will move the ranges by a specific number of base pairs, and the resize method will extend the ranges by a specified width.
shift(g, 5)
GRanges object with 3 ranges and 3 metadata columns:
seqnames ranges strand | score GC AT
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric>
a chr1 106-116 - | 1 1.000000 0.000000
b chr2 107-117 + | 2 0.888889 0.111111
c chr2 108-118 + | 3 0.777778 0.222222
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
resize(g, 30)
GRanges object with 3 ranges and 3 metadata columns:
seqnames ranges strand | score GC AT
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric>
a chr1 82-111 - | 1 1.000000 0.000000
b chr2 102-131 + | 2 0.888889 0.111111
c chr2 103-132 + | 3 0.777778 0.222222
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
The GenomicRanges
help page ?"intra-range-methods"
summarizes these methods.
22.2.2.2 Inter-range methods
Inter-range methods involve comparisons between ranges in a single GRanges object. For instance, the reduce method will align the ranges and merge overlapping ranges to produce a simplified set.
reduce(g)
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 101-111 -
[2] chr2 102-113 +
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
The reduce method could, for example, be used to collapse individual overlapping coding exons into a single set of coding regions.
Sometimes one is interested in the gaps or the qualities of the gaps between the ranges represented by your GRanges object. The gaps method provides this information:
gaps(g)
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 1-100 -
[2] chr2 1-101 +
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
In this case, we have not specified the lengths of the chromosomes, so Bioconductor is making the assumption (incorrectly) that the chromosomes end at the largest location on each chromosome. We can correct this by setting the seqlengths
correctly, but we can ignore that detail for now.
The disjoin method represents a GRanges object as a collection of non-overlapping ranges:
disjoin(g)
GRanges object with 4 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 101-111 -
[2] chr2 102 +
[3] chr2 103-112 +
[4] chr2 113 +
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
The coverage
method quantifies the degree of overlap for all the ranges in a GRanges object.
coverage(g)
RleList of length 3
$chr1
integer-Rle of length 111 with 2 runs
Lengths: 100 11
Values : 0 1
$chr2
integer-Rle of length 113 with 4 runs
Lengths: 101 1 10 1
Values : 0 1 2 1
$chr3
integer-Rle of length 0 with 0 runs
Lengths:
Values :
The coverage is summarized as a list
of coverages, one for each chromosome. The Rle
class is used to store the values. Sometimes, one must convert these values to numeric
using as.numeric
. In many cases, this will happen automatically, though.
See the GenomicRanges
help page ?"intra-range-methods"
for more details.
22.2.3 Set operations for GRanges objects
Between-range methods calculate relationships between different GRanges objects. Of central importance are findOverlaps and related operations; these are discussed below. Additional operations treat GRanges as mathematical sets of coordinates; union(g, g2) is the union of the coordinates in g and g2. Here are examples for calculating the union, the intersect and the asymmetric difference (using setdiff).
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 101-111 -
[2] chr2 102-113 +
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
GenomicRanges::intersect(g, g2)
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 101-111 -
[2] chr2 102-112 +
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
GenomicRanges::setdiff(g, g2)
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr2 113 +
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
There is extensive additional help available or by looking at the vignettes in at the GenomicRanges
pages.
?GRanges
There are also many possible methods
that work with GRanges
objects. To see a complete list (long), try:
methods(class="GRanges")
22.3 GRangesList
Some important genomic features, such as spliced transcripts that are are comprised of exons, are inherently compound structures. Such a feature makes much more sense when expressed as a compound object such as a GRangesList. If we thing of each transcript as a set of exons, each transcript would be summarized as a GRanges
object. However, if we have multiple transcripts, we want to somehow keep them separate, with each transcript having its own exons. The GRangesList
is then a list of GRanges
objects that. Continuing with the transcripts thought, a GRangesList
can contain all the transcripts and their exons; each transcript is an element in the list.

list
of GRanges objects. While the analogy is not perfect, a GRangesList
behaves a bit like a list. Each element in the GRangesList
is a Granges
object. A common use case for a GRangesList
is to store a list of transcripts, each of which have exons as the regions in the GRanges
.
Whenever genomic features consist of multiple ranges that are grouped by a parent feature, they can be represented as a GRangesList object. Consider the simple example of the two transcript GRangesList below created using the GRangesList constructor.
The gr1
and gr2
are each GRanges
objects. We can combine them into a “named” GRangesList
like so:
grl <- GRangesList("txA" = gr1, "txB" = gr2)
grl
GRangesList object of length 2:
$txA
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | score GC
<Rle> <IRanges> <Rle> | <integer> <numeric>
[1] chr1 103-106 + | 5 0.45
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
$txB
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | score GC
<Rle> <IRanges> <Rle> | <integer> <numeric>
[1] chr1 107-109 + | 3 0.3
[2] chr1 113-115 - | 4 0.5
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
The show
method for a GRangesList
object displays it as a named list of GRanges
objects, where the names of this list are considered to be the names of the grouping feature. In the example above, the groups of individual exon ranges are represented as separate GRanges
objects which are further organized into a list structure where each element name is a transcript name. Many other combinations of grouped and labeled GRanges
objects are possible of course, but this example is a common arrangement.
In some cases, GRangesList
s behave quite similarly to GRanges
objects.
22.3.1 Basic GRangesList accessors
Just as with GRanges object, the components of the genomic coordinates within a GRangesList object can be extracted using simple accessor methods. Not surprisingly, the GRangesList objects have many of the same accessors as GRanges objects. The difference is that many of these methods return a list since the input is now essentially a list of GRanges objects. Here are a few examples:
seqnames(grl)
RleList of length 2
$txA
factor-Rle of length 1 with 1 run
Lengths: 1
Values : chr1
Levels(1): chr1
$txB
factor-Rle of length 2 with 1 run
Lengths: 2
Values : chr1
Levels(1): chr1
ranges(grl)
IRangesList object of length 2:
$txA
IRanges object with 1 range and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 103 106 4
$txB
IRanges object with 2 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 107 109 3
[2] 113 115 3
strand(grl)
RleList of length 2
$txA
factor-Rle of length 1 with 1 run
Lengths: 1
Values : +
Levels(3): + - *
$txB
factor-Rle of length 2 with 2 runs
Lengths: 1 1
Values : + -
Levels(3): + - *
The length and names methods will return the length or names of the list and the seqlengths method will return the set of sequence lengths.
22.4 Relationships between region sets
One of the more powerful approaches to genomic data integration is to ask about the relationship between sets of genomic ranges. The key features of this process are to look at overlaps and distances to the nearest feature. These functionalities, combined with the operations like flank
and resize
, for instance, allow pretty useful analyses with relatively little code. In general, these operations are very fast, even on thousands to millions of regions.
22.4.1 Overlaps
The findOverlaps method in the GenomicRanges package is a very useful function that allows users to identify overlaps between two sets of genomic ranges.
Here’s how it works:
- Inputs
- The function requires two GRanges objects, referred to as query and subject.
- Processing
- The function then compares every range in the query object with every range in the subject object, looking for overlaps. An overlap is defined as any instance where the range in the query object intersects with a range in the subject object.
- Output
-
The function returns a Hits (see
?Hits
) object, which is a compact representation of the matrix of overlaps. Each entry in the Hits object corresponds to a pair of overlapping ranges, with the query index and the subject index.
Here is an example of how you might use the findOverlaps function:
# Create two GRanges objects
gr1 <- gr[1:4]
gr2 <- gr[3:8]
gr1
GRanges object with 4 ranges and 3 metadata columns:
seqnames ranges strand | score GC AT
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric>
a chr1 101-111 - | 1 1.000000 0.000000
b chr2 102-112 + | 2 0.888889 0.111111
c chr2 103-113 + | 3 0.777778 0.222222
d chr2 104-114 * | 4 0.666667 0.333333
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
gr2
GRanges object with 6 ranges and 3 metadata columns:
seqnames ranges strand | score GC AT
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric>
c chr2 103-113 + | 3 0.777778 0.222222
d chr2 104-114 * | 4 0.666667 0.333333
e chr1 105-115 * | 5 0.555556 0.444444
f chr1 106-116 + | 6 0.444444 0.555556
g chr3 107-117 + | 7 0.333333 0.666667
h chr3 108-118 + | 8 0.222222 0.777778
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
# Find overlaps
overlaps <- findOverlaps(query = gr1, subject = gr2)
overlaps
Hits object with 7 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 3
[2] 2 1
[3] 2 2
[4] 3 1
[5] 3 2
[6] 4 1
[7] 4 2
-------
queryLength: 4 / subjectLength: 6
In the resulting overlaps object, each row corresponds to an overlapping pair of ranges, with the first column giving the index of the range in gr1 and the second column giving the index of the overlapping range in gr2.
If you are interested in only the queryHits
or the subjectHits
, there are accessors for those (ie., queryHits(overlaps)
). To get the actual ranges that overlap, you can use the subjectHits
or queryHits
as an index into the original GRanges
object.
Spend some time looking at these results. Note how the strand comes into play when determining overlaps.
gr1[queryHits(overlaps)]
GRanges object with 7 ranges and 3 metadata columns:
seqnames ranges strand | score GC AT
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric>
a chr1 101-111 - | 1 1.000000 0.000000
b chr2 102-112 + | 2 0.888889 0.111111
b chr2 102-112 + | 2 0.888889 0.111111
c chr2 103-113 + | 3 0.777778 0.222222
c chr2 103-113 + | 3 0.777778 0.222222
d chr2 104-114 * | 4 0.666667 0.333333
d chr2 104-114 * | 4 0.666667 0.333333
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
gr2[subjectHits(overlaps)]
GRanges object with 7 ranges and 3 metadata columns:
seqnames ranges strand | score GC AT
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric>
e chr1 105-115 * | 5 0.555556 0.444444
c chr2 103-113 + | 3 0.777778 0.222222
d chr2 104-114 * | 4 0.666667 0.333333
c chr2 103-113 + | 3 0.777778 0.222222
d chr2 104-114 * | 4 0.666667 0.333333
c chr2 103-113 + | 3 0.777778 0.222222
d chr2 104-114 * | 4 0.666667 0.333333
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
As you might expect, the countOverlaps
method counts the regions in the second GRanges
that overlap with those that overlap with each element of the first.
countOverlaps(gr1, gr2)
a b c d
1 2 2 2
The subsetByOverlaps
method simply subsets the query GRanges
object to include only those that overlap the subject.
subsetByOverlaps(gr1, gr2)
GRanges object with 4 ranges and 3 metadata columns:
seqnames ranges strand | score GC AT
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric>
a chr1 101-111 - | 1 1.000000 0.000000
b chr2 102-112 + | 2 0.888889 0.111111
c chr2 103-113 + | 3 0.777778 0.222222
d chr2 104-114 * | 4 0.666667 0.333333
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
In some cases, you may be interested in only one hit when doing overlaps. Note the select
parameter. See the help for findOverlaps
findOverlaps(gr1, gr2, select="first")
[1] 3 1 1 1
findOverlaps(gr1, gr2, select="first")
[1] 3 1 1 1
The %over%
logical operator allows us to do similar things to findOverlaps
and subsetByOverlaps
.
gr2 %over% gr1
[1] TRUE TRUE TRUE FALSE FALSE FALSE
gr1[gr1 %over% gr2]
GRanges object with 4 ranges and 3 metadata columns:
seqnames ranges strand | score GC AT
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric>
a chr1 101-111 - | 1 1.000000 0.000000
b chr2 102-112 + | 2 0.888889 0.111111
c chr2 103-113 + | 3 0.777778 0.222222
d chr2 104-114 * | 4 0.666667 0.333333
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
22.4.2 Nearest feature
There are a number of useful methods that find the nearest feature (region) in a second set for each element in the first set.
We can review our two GRanges
toy objects:
g
GRanges object with 3 ranges and 3 metadata columns:
seqnames ranges strand | score GC AT
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric>
a chr1 101-111 - | 1 1.000000 0.000000
b chr2 102-112 + | 2 0.888889 0.111111
c chr2 103-113 + | 3 0.777778 0.222222
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
gr
GRanges object with 10 ranges and 3 metadata columns:
seqnames ranges strand | score GC AT
<Rle> <IRanges> <Rle> | <integer> <numeric> <numeric>
a chr1 101-111 - | 1 1.000000 0.000000
b chr2 102-112 + | 2 0.888889 0.111111
c chr2 103-113 + | 3 0.777778 0.222222
d chr2 104-114 * | 4 0.666667 0.333333
e chr1 105-115 * | 5 0.555556 0.444444
f chr1 106-116 + | 6 0.444444 0.555556
g chr3 107-117 + | 7 0.333333 0.666667
h chr3 108-118 + | 8 0.222222 0.777778
i chr3 109-119 - | 9 0.111111 0.888889
j chr3 110-120 - | 10 0.000000 1.000000
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
nearest: Performs conventional nearest neighbor finding. Returns an integer vector containing the index of the nearest neighbor range in subject for each range in x. If there is no nearest neighbor NA is returned. For details of the algorithm see the man page in the IRanges package (?nearest).
precede: For each range in x, precede returns the index of the range in subject that is directly preceded by the range in x. Overlapping ranges are excluded. NA is returned when there are no qualifying ranges in subject.
follow: The opposite of precede, follow returns the index of the range in subject that is directly followed by the range in x. Overlapping ranges are excluded. NA is returned when there are no qualifying ranges in subject.
Orientation and strand for precede and follow: Orientation is 5’ to 3’, consistent with the direction of translation. Because positional numbering along a chromosome is from left to right and transcription takes place from 5’ to 3’, precede and follow can appear to have ‘opposite’ behavior on the + and - strand. Using positions 5 and 6 as an example, 5 precedes 6 on the + strand but follows 6 on the - strand.
The table below outlines the orientation when ranges on different strands are compared. In general, a feature on * is considered to belong to both strands. The single exception is when both x and subject are * in which case both are treated as +.
x | subject | orientation
-----+-----------+----------------
a) + | + | --->
b) + | - | NA
c) + | * | --->
d) - | + | NA
e) - | - | <---
f) - | * | <---
g) * | + | --->
h) * | - | <---
i) * | * | ---> (the only situation where * arbitrarily means +)
res = nearest(g, gr)
res
[1] 5 4 4
While nearest
and friends give the index of the nearest feature, the distance to the nearest is sometimes also useful to have. The distanceToNearest
method calculates the nearest feature as well as reporting the distance.
res = distanceToNearest(g, gr)
res
Hits object with 3 hits and 1 metadata column:
queryHits subjectHits | distance
<integer> <integer> | <integer>
[1] 1 5 | 0
[2] 2 4 | 0
[3] 3 4 | 0
-------
queryLength: 3 / subjectLength: 10
22.5 Gene models
The TxDb
package provides a convenient interface to gene models from a variety of sources. The TxDb.Hsapiens.UCSC.hg38.knownGene
package provides access to the UCSC knownGene gene model for the hg19 build of the human genome.

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
The transcripts
function returns a GRanges
object with the transcripts for all genes in the database.
tx <- transcripts(txdb)
The exons
function returns a GRanges
object with the exons.
ex <- exons(txdb)
The genes
function returns a GRanges
object with the genes.
gn <- genes(txdb)