29  Genomic ranges and features

Published

June 1, 2024

Modified

June 2, 2026

Almost everything you study in genomics lives at a place on the genome. A gene sits between two coordinates on a chromosome. An exon is a stretch within that gene. A ChIP-seq peak marks where a protein bound. A SNP is a single position. As soon as you have two such sets of places, the most interesting biological questions become questions about location: Which peaks fall inside a promoter? What is the nearest gene to this variant? Where do these two experiments agree?

In this chapter you’ll learn to represent those places as genomic ranges and to ask how they relate to one another. The workhorse is Bioconductor’s GenomicRanges package (Lawrence et al. 2013), which gives you a compact object to hold ranges plus any data attached to them, and a vocabulary of fast operations — overlap, nearest-neighbor, flank, reduce — that scale to millions of regions. We’ll build small example objects by hand so you can see exactly what each operation does, then finish with real gene models pulled from a genome annotation.

NoteA note on coordinate systems

Different tools count positions differently. The UCSC Genome Browser uses 0-based, half-open coordinates internally, while Ensembl and Bioconductor use 1-based, inclusive coordinates (position 1 is the first base, and both endpoints are included). GRanges objects are always 1-based and inclusive, which is one fewer thing to track once you stay inside Bioconductor.

29.1 What you’ll learn

29.2 Bioconductor and GenomicRanges

The Bioconductor GenomicRanges package is a comprehensive toolkit designed to handle and manipulate genomic intervals and variables systematized on these intervals (Lawrence et al. 2013). Developed by Bioconductor, this package simplifies the complexity of managing genomic data, facilitating the efficient exploration, manipulation, and visualization of such data. GenomicRanges aids in dealing with the challenges of genomic data, including its massive size, intricate relationships, and high dimensionality.

The GenomicRanges package in Bioconductor covers a wide range of use cases related to the management and analysis of genomic data. Here are some key examples:

  • Genomic Feature Manipulation
    The GenomicRanges and GRanges classes can be used to represent and manipulate various genomic features such as genes, transcripts, exons, or single-nucleotide polymorphisms (SNPs). Users can query, subset, resize, shift, or sort these features based on their genomic coordinates.
  • Genomic Interval Operations
    The GenomicRanges package provides functions for performing operations on genomic intervals, such as finding overlaps, nearest neighbors, or disjoint intervals. These operations are fundamental to many types of genomic data analyses, such as identifying genes that overlap with ChIP-seq peaks, or finding variants that are in close proximity to each other.
  • Alignments and Coverage
    The GAlignments and GAlignmentPairs classes can be used to represent alignments of sequencing reads to a reference genome, such as those produced by a read aligner. Users can then compute coverage of these alignments over genomic ranges of interest, which is a common task in RNA-seq or ChIP-seq analysis.
  • Annotation and Metadata Handling
    The metadata column of a GRanges object can be used to store various types of annotation data associated with genomic ranges, such as gene names, gene biotypes, or experimental scores. This makes it easy to perform analyses that integrate genomic coordinates with other types of biological information.
  • Genome Arithmetic
    The GenomicRanges package supports a version of “genome arithmetic”, which includes set operations (union, intersection, set difference) as well as other operations (like coverage, complement, or reduction) that are adapted to the specificities of genomic data.
  • Efficient Data Handling
    The CompressedGRangesList class provides a space-efficient way to represent a large list of GRanges objects, which is particularly useful when working with large genomic datasets, such as whole-genome sequencing data.

The GenomicRanges package in Bioconductor uses the S4 class system (see Table 29.1), which is a part of the methods package in R. The S4 system is a more rigorous and formal approach to object-oriented programming in R, providing enhanced capabilities for object design and function dispatch.

Class Name Description Potential Use
GRanges Represents a collection of genomic ranges and associated variables. ChipSeq peaks, CpG islands, etc.
GRangesList Represents a list of GenomicRanges objects. transcript models (exons, introns)
RangesList Represents a list of Ranges objects.
IRanges Represents a collection of integer ranges. used mainly to build GRanges, etc.
GPos Represents genomic positions. SNPs or other single-nucleotide locations
GAlignments Represents alignments against a reference genome. Sequence read locations from a BAM file
GAlignmentPairs Represents pairs of alignments, typically representing a single fragment of DNA. Paired-end sequence alignments
Table 29.1: Classes within the GenomicRanges package. Each class has a slightly different use case.

In the context of the GenomicRanges package, the S4 class system allows for the creation of complex, structured data objects that can effectively encapsulate genomic intervals and associated data. This system enables the package to handle the complexity and intricacy of genomic data.

For example, the GenomicRanges class in the package is an S4 class that combines several basic data types into a composite object. It includes slots for sequence names (seqnames), ranges (start and end positions), strand information, and metadata. Each slot in the S4 class corresponds to a specific component of the genomic data, and methods (see Table 29.2 and Table 29.3) can be defined to interact with these slots in a structured and predictable way.

Method Description
length Returns the number of ranges in the GRanges object.
seqnames Retrieves the sequence names of the ranges.
ranges Retrieves the start and end positions of the ranges.
strand Retrieves the strand information of the ranges.
elementMetadata Retrieves the metadata associated with the ranges.
seqlevels Returns the levels of the factor that the seqnames slot is derived from.
seqinfo Retrieves the Seqinfo (sequence information) object associated with the GRanges object.
start, end, width Retrieve or set the start or end positions, or the width of the ranges.
resize Resizes the ranges.
subset, [, [[, $ Subset or extract elements from the GRanges object.
sort Sorts the GRanges object.
shift Shifts the ranges by a certain number of base pairs.
Table 29.2: Methods for accessing, manipulating single objects

The S4 class system also supports inheritance, which allows for the creation of specialized subclasses that share certain characteristics with a parent class but have additional features or behaviors.

The S4 system’s formalism and rigor make it well-suited to the complexities of bioinformatics and genomic data analysis. It allows for the creation of robust, reliable code that can handle complex data structures and operations, making it a key part of the GenomicRanges package and other Bioconductor packages.

Method Description
findOverlaps Finds overlaps between different sets of ranges.
countOverlaps Counts overlaps between different sets of ranges.
subsetByOverlaps Subsets the ranges based on overlaps.
distanceToNearest Computes the distance to the nearest range in another set of ranges.
Table 29.3: Methods for comparing and combining multiple GenomicRanges-class objects

To get going, we can construct a GRanges object by hand as an example.

The GRanges class represents a collection of genomic ranges that each have a single start and end location on the genome. It can be used to store the location of genomic features such as contiguous binding sites, transcripts, and exons. These objects can be created by using the GRanges constructor function. The following code just creates a GRanges object from scratch.

gr <- GRanges(
    seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
    ranges = IRanges(101:110, end = 111:120, names = head(letters, 10)),
    strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
    score = 1:10,
    GC = seq(1, 0, length=10))
gr
GRanges object with 10 ranges and 2 metadata columns:
    seqnames    ranges strand |     score        GC
       <Rle> <IRanges>  <Rle> | <integer> <numeric>
  a     chr1   101-111      - |         1  1.000000
  b     chr2   102-112      + |         2  0.888889
  c     chr2   103-113      + |         3  0.777778
  d     chr2   104-114      * |         4  0.666667
  e     chr1   105-115      * |         5  0.555556
  f     chr1   106-116      + |         6  0.444444
  g     chr3   107-117      + |         7  0.333333
  h     chr3   108-118      + |         8  0.222222
  i     chr3   109-119      - |         9  0.111111
  j     chr3   110-120      - |        10  0.000000
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

This creates a GRanges object with 10 genomic ranges. The output of the GRanges show() method separates the information into a left and right hand region that are separated by | symbols (see Figure 29.1). The genomic coordinates (seqnames, ranges, and strand) are located on the left-hand side and the metadata columns are located on the right. For this example, the metadata is comprised of score and GC information, but almost anything can be stored in the metadata portion of a GRanges object.

NoteReading a GRanges: the | is the dividing line

Think of a GRanges as a spreadsheet with a heavy vertical rule drawn down the middle. Everything to the left of the | answers “where?” — the chromosome (seqnames), the start and end (ranges), and the strand. These columns are mandatory and have their own special accessors. Everything to the right of the | answers “what do we know about it?” — the metadata: scores, GC content, gene names, p-values, whatever you attach. These are optional and you reach them all at once with mcols(). Keeping that line in mind makes every later operation easier: range operations rearrange the left side; your annotations ride along on the right.

Figure 29.1: The structure of a GRanges object, which behaves a bit like a vector of ranges, although the analogy is not perfect. A GRanges object is composed of the “Ranges” part the lefthand box, the “metadata” columns (the righthand box), and a “seqinfo” part that describes the names and lengths of associated sequences. Only the “Ranges” part is required. The figure also shows a few of the “accessors” and approaches to subsetting a GRanges object.

The components of the genomic coordinates within a GRanges object can be extracted using the seqnames, ranges, and strand accessor functions.

factor-Rle of length 10 with 4 runs
  Lengths:    1    3    2    4
  Values : chr1 chr2 chr1 chr3
Levels(3): chr1 chr2 chr3
ranges(gr)
IRanges object with 10 ranges and 0 metadata columns:
        start       end     width
    <integer> <integer> <integer>
  a       101       111        11
  b       102       112        11
  c       103       113        11
  d       104       114        11
  e       105       115        11
  f       106       116        11
  g       107       117        11
  h       108       118        11
  i       109       119        11
  j       110       120        11
strand(gr)
factor-Rle of length 10 with 5 runs
  Lengths: 1 2 2 3 2
  Values : - + * + -
Levels(3): + - *

Note that the GRanges object has information to the “left” side of the | that has special accessors. The information to the right side of the |, when it is present, is the metadata and is accessed using mcols(), for “metadata columns”.

[1] "DFrame"
attr(,"package")
[1] "S4Vectors"
mcols(gr)
DataFrame with 10 rows and 2 columns
      score        GC
  <integer> <numeric>
a         1  1.000000
b         2  0.888889
c         3  0.777778
d         4  0.666667
e         5  0.555556
f         6  0.444444
g         7  0.333333
h         8  0.222222
i         9  0.111111
j        10  0.000000

Since the class of mcols(gr) is DFrame, we can use our DataFrame approaches to work with the data.

mcols(gr)$score
 [1]  1  2  3  4  5  6  7  8  9 10

We can even assign a new column. Here we add an AT column derived from GC (since the fractions sum to 1). Notice that the new column appears on the right of the |, alongside score and GC.

mcols(gr)$AT <- 1 - mcols(gr)$GC
gr
GRanges object with 10 ranges and 3 metadata columns:
    seqnames    ranges strand |     score        GC        AT
       <Rle> <IRanges>  <Rle> | <integer> <numeric> <numeric>
  a     chr1   101-111      - |         1  1.000000  0.000000
  b     chr2   102-112      + |         2  0.888889  0.111111
  c     chr2   103-113      + |         3  0.777778  0.222222
  d     chr2   104-114      * |         4  0.666667  0.333333
  e     chr1   105-115      * |         5  0.555556  0.444444
  f     chr1   106-116      + |         6  0.444444  0.555556
  g     chr3   107-117      + |         7  0.333333  0.666667
  h     chr3   108-118      + |         8  0.222222  0.777778
  i     chr3   109-119      - |         9  0.111111  0.888889
  j     chr3   110-120      - |        10  0.000000  1.000000
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

Another common way to create a GRanges object is to start with a data.frame, perhaps created by hand like below or read in using read.csv or read.table. We can convert from a data.frame, when columns are named appropriately, to a GRanges object.

df_regions <- data.frame(chromosome = rep("chr1", 10),
                         start = seq(1000, 10000, 1000),
                         end = seq(1100, 10100, 1000))
as(df_regions, 'GRanges') # note that names have to match with GRanges slots
GRanges object with 10 ranges and 0 metadata columns:
       seqnames      ranges strand
          <Rle>   <IRanges>  <Rle>
   [1]     chr1   1000-1100      *
   [2]     chr1   2000-2100      *
   [3]     chr1   3000-3100      *
   [4]     chr1   4000-4100      *
   [5]     chr1   5000-5100      *
   [6]     chr1   6000-6100      *
   [7]     chr1   7000-7100      *
   [8]     chr1   8000-8100      *
   [9]     chr1   9000-9100      *
  [10]     chr1 10000-10100      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
## fix column name
colnames(df_regions)[1] <- 'seqnames'
gr2 <- as(df_regions, 'GRanges')
gr2
GRanges object with 10 ranges and 0 metadata columns:
       seqnames      ranges strand
          <Rle>   <IRanges>  <Rle>
   [1]     chr1   1000-1100      *
   [2]     chr1   2000-2100      *
   [3]     chr1   3000-3100      *
   [4]     chr1   4000-4100      *
   [5]     chr1   5000-5100      *
   [6]     chr1   6000-6100      *
   [7]     chr1   7000-7100      *
   [8]     chr1   8000-8100      *
   [9]     chr1   9000-9100      *
  [10]     chr1 10000-10100      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

The first as() call worked even though the chromosome column was named chromosomeas() recognized start and end and treated the rest as metadata. After we rename that column to seqnames, the conversion places it correctly on the left of the | as the sequence name.

GRanges have one-dimensional-like behavior. For instance, we can check the length and even give GRanges names.

names(gr)
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
length(gr)
[1] 10

29.2.1 Subsetting GRanges objects

While GRanges objects look a bit like a data.frame, they can be thought of as vectors with associated ranges. Subsetting, then, works very similarly to vectors. To subset a GRanges object to include only second and third regions:

gr[2:3]
GRanges object with 2 ranges and 3 metadata columns:
    seqnames    ranges strand |     score        GC        AT
       <Rle> <IRanges>  <Rle> | <integer> <numeric> <numeric>
  b     chr2   102-112      + |         2  0.888889  0.111111
  c     chr2   103-113      + |         3  0.777778  0.222222
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

That said, if the GRanges object has metadata columns, we can also treat it like a two-dimensional object kind of like a data.frame. Note that the information to the left of the | is not like a data.frame, so we cannot do something like gr$seqnames. Here is an example of subsetting with the subset of one metadata column.

gr[2:3, "GC"]
GRanges object with 2 ranges and 1 metadata column:
    seqnames    ranges strand |        GC
       <Rle> <IRanges>  <Rle> | <numeric>
  b     chr2   102-112      + |  0.888889
  c     chr2   103-113      + |  0.777778
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

The usual head() and tail() also work just fine.

head(gr,n=2)
GRanges object with 2 ranges and 3 metadata columns:
    seqnames    ranges strand |     score        GC        AT
       <Rle> <IRanges>  <Rle> | <integer> <numeric> <numeric>
  a     chr1   101-111      - |         1  1.000000  0.000000
  b     chr2   102-112      + |         2  0.888889  0.111111
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
tail(gr,n=2)
GRanges object with 2 ranges and 3 metadata columns:
    seqnames    ranges strand |     score        GC        AT
       <Rle> <IRanges>  <Rle> | <integer> <numeric> <numeric>
  i     chr3   109-119      - |         9  0.111111  0.888889
  j     chr3   110-120      - |        10  0.000000  1.000000
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

29.2.2 Interval operations on one GRanges object

29.2.2.1 Intra-range methods

The methods described in this section work one-region-at-a-time and are, therefore, called “intra-region” methods. Methods that work across all regions are described below in the Inter-range methods section.

The GRanges class has accessors for the “ranges” part of the data. For example:

## Make a smaller GRanges subset
g <- gr[1:3]
start(g) # to get start locations
[1] 101 102 103
end(g)   # to get end locations
[1] 111 112 113
width(g) # to get the "widths" of each range
[1] 11 11 11
range(g) # to get the "range" for each sequence (min(start) through max(end))
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1   101-111      -
  [2]     chr2   102-113      +
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

The GRanges class also has many methods for manipulating the ranges. The methods can be classified as intra-range methods, inter-range methods, and between-range methods. Intra-range methods operate on each element of a GRanges object independent of the other ranges in the object. For example, the flank method can be used to recover regions flanking the set of ranges represented by the GRanges object. So to get a GRanges object containing the ranges that include the 10 bases upstream of the ranges:

flank(g, 10)
GRanges object with 3 ranges and 3 metadata columns:
    seqnames    ranges strand |     score        GC        AT
       <Rle> <IRanges>  <Rle> | <integer> <numeric> <numeric>
  a     chr1   112-121      - |         1  1.000000  0.000000
  b     chr2    92-101      + |         2  0.888889  0.111111
  c     chr2    93-102      + |         3  0.777778  0.222222
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

Note how flank pays attention to “strand”. To get the flanking regions downstream of the ranges, we can do:

flank(g, 10, start=FALSE)
GRanges object with 3 ranges and 3 metadata columns:
    seqnames    ranges strand |     score        GC        AT
       <Rle> <IRanges>  <Rle> | <integer> <numeric> <numeric>
  a     chr1    91-100      - |         1  1.000000  0.000000
  b     chr2   113-122      + |         2  0.888889  0.111111
  c     chr2   114-123      + |         3  0.777778  0.222222
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
WarningStrand changes what “upstream” means

For flank(), resize(), precede(), and follow(), “upstream” and “downstream” are interpreted relative to the strand. On the + strand, upstream is toward lower coordinates; on the - strand it is toward higher coordinates. A * (unstranded) range is treated as +. If your ranges have the wrong strand — or you forgot to set it — these operations will silently flank the wrong side. When a result looks backwards, check strand() first.

Other examples of intra-range methods include resize and shift. The shift method will move the ranges by a specific number of base pairs, and the resize method will extend the ranges by a specified width.

shift(g, 5)
GRanges object with 3 ranges and 3 metadata columns:
    seqnames    ranges strand |     score        GC        AT
       <Rle> <IRanges>  <Rle> | <integer> <numeric> <numeric>
  a     chr1   106-116      - |         1  1.000000  0.000000
  b     chr2   107-117      + |         2  0.888889  0.111111
  c     chr2   108-118      + |         3  0.777778  0.222222
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
resize(g, 30)
GRanges object with 3 ranges and 3 metadata columns:
    seqnames    ranges strand |     score        GC        AT
       <Rle> <IRanges>  <Rle> | <integer> <numeric> <numeric>
  a     chr1    82-111      - |         1  1.000000  0.000000
  b     chr2   102-131      + |         2  0.888889  0.111111
  c     chr2   103-132      + |         3  0.777778  0.222222
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

The GenomicRanges help page ?"intra-range-methods" summarizes these methods.

29.2.2.2 Inter-range methods

Inter-range methods involve comparisons between ranges in a single GRanges object. For instance, the reduce method will align the ranges and merge overlapping ranges to produce a simplified set.

GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1   101-111      -
  [2]     chr2   102-113      +
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

The reduce method could, for example, be used to collapse individual overlapping coding exons into a single set of coding regions.

Sometimes one is interested in the gaps or the qualities of the gaps between the ranges represented by your GRanges object. The gaps method provides this information:

gaps(g)
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     1-100      -
  [2]     chr2     1-101      +
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
Warninggaps() needs to know how long each chromosome is

We never told this object how long the chromosomes are, so Bioconductor assumes each chromosome ends at the largest coordinate it has seen. That means the final gap (from the last range to the true end of the chromosome) is missing. In real work you set the chromosome lengths with seqlengths() so that gaps(), coverage(), and friends know the full extent of each sequence. We can ignore that detail for these toy examples.

The disjoin method represents a GRanges object as a collection of non-overlapping ranges:

GRanges object with 4 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1   101-111      -
  [2]     chr2       102      +
  [3]     chr2   103-112      +
  [4]     chr2       113      +
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

The coverage method quantifies the degree of overlap for all the ranges in a GRanges object.

RleList of length 3
$chr1
integer-Rle of length 111 with 2 runs
  Lengths: 100  11
  Values :   0   1

$chr2
integer-Rle of length 113 with 4 runs
  Lengths: 101   1  10   1
  Values :   0   1   2   1

$chr3
integer-Rle of length 0 with 0 runs
  Lengths: 
  Values : 

The coverage is summarized as a list of coverages, one for each chromosome. The Rle class is used to store the values. Sometimes, one must convert these values to numeric using as.numeric. In many cases, this will happen automatically, though.

covg <- coverage(g)
covg_chr2 <- covg[['chr2']]
plot(covg_chr2, type='l')

The plot shows how many ranges cover each base along chr2: the line steps up to 2 where two ranges overlap and drops back to 0 in the gaps. This is exactly the kind of per-base signal you compute from aligned sequencing reads in RNA-seq or ChIP-seq.

See the GenomicRanges help page ?"intra-range-methods" for more details.

29.2.3 Set operations for GRanges objects

Between-range methods calculate relationships between different GRanges objects. Of central importance are findOverlaps and related operations; these are discussed below. Additional operations treat GRanges as mathematical sets of coordinates; union(g, g2) is the union of the coordinates in g and g2. Here are examples for calculating the union, the intersect and the asymmetric difference (using setdiff).

g2 <- head(gr, n=2)
GenomicRanges::union(g, g2)
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1   101-111      -
  [2]     chr2   102-113      +
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
GenomicRanges::intersect(g, g2)
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1   101-111      -
  [2]     chr2   102-112      +
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
GenomicRanges::setdiff(g, g2)
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr2       113      +
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

There is extensive additional help available or by looking at the vignettes in at the GenomicRanges pages.

?GRanges

There are also many possible methods that work with GRanges objects. To see a complete list (long), try:

methods(class="GRanges")

29.3 GRangesList

Some important genomic features, such as spliced transcripts that are are comprised of exons, are inherently compound structures. Such a feature makes much more sense when expressed as a compound object such as a GRangesList. If we think of each transcript as a set of exons, each transcript would be summarized as a GRanges object. However, if we have multiple transcripts, we want to keep them separate, with each transcript holding its own exons. A GRangesList is a list of GRanges objects. Continuing with the transcripts idea, a GRangesList can contain all the transcripts and their exons; each transcript is one element in the list.

Figure 29.2: The structure of a GRangesList, which is a list of GRanges objects. While the analogy is not perfect, a GRangesList behaves a bit like a list. Each element in the GRangesList is a Granges object. A common use case for a GRangesList is to store a list of transcripts, each of which have exons as the regions in the GRanges.

Whenever genomic features consist of multiple ranges that are grouped by a parent feature, they can be represented as a GRangesList object. Consider the simple example of the two transcript GRangesList below created using the GRangesList constructor.

gr1 <- GRanges(
    seqnames = "chr1", 
    ranges = IRanges(103, 106),
    strand = "+", 
    score = 5L, GC = 0.45)
gr2 <- GRanges(
    seqnames = c("chr1", "chr1"),
    ranges = IRanges(c(107, 113), width = 3),
    strand = c("+", "-"), 
    score = 3:4, GC = c(0.3, 0.5))

The gr1 and gr2 are each GRanges objects. We can combine them into a “named” GRangesList like so:

grl <- GRangesList("txA" = gr1, "txB" = gr2)
grl
GRangesList object of length 2:
$txA
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |     score        GC
         <Rle> <IRanges>  <Rle> | <integer> <numeric>
  [1]     chr1   103-106      + |         5      0.45
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

$txB
GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |     score        GC
         <Rle> <IRanges>  <Rle> | <integer> <numeric>
  [1]     chr1   107-109      + |         3       0.3
  [2]     chr1   113-115      - |         4       0.5
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

The show method for a GRangesList object displays it as a named list of GRanges objects, where the names of this list are considered to be the names of the grouping feature. In the example above, the groups of individual exon ranges are represented as separate GRanges objects which are further organized into a list structure where each element name is a transcript name. Many other combinations of grouped and labeled GRanges objects are possible of course, but this example is a common arrangement.

In some cases, GRangesLists behave quite similarly to GRanges objects.

29.3.1 Basic GRangesList accessors

Just as with GRanges object, the components of the genomic coordinates within a GRangesList object can be extracted using simple accessor methods. Not surprisingly, the GRangesList objects have many of the same accessors as GRanges objects. The difference is that many of these methods return a list since the input is now essentially a list of GRanges objects. Here are a few examples:

RleList of length 2
$txA
factor-Rle of length 1 with 1 run
  Lengths:    1
  Values : chr1
Levels(1): chr1

$txB
factor-Rle of length 2 with 1 run
  Lengths:    2
  Values : chr1
Levels(1): chr1
ranges(grl)
IRangesList object of length 2:
$txA
IRanges object with 1 range and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]       103       106         4

$txB
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]       107       109         3
  [2]       113       115         3
strand(grl)
RleList of length 2
$txA
factor-Rle of length 1 with 1 run
  Lengths: 1
  Values : +
Levels(3): + - *

$txB
factor-Rle of length 2 with 2 runs
  Lengths: 1 1
  Values : + -
Levels(3): + - *

The length and names methods will return the length or names of the list and the seqlengths method will return the set of sequence lengths.

length(grl)
[1] 2
names(grl)
[1] "txA" "txB"
chr1 
  NA 

29.4 Relationships between region sets

One of the more powerful approaches to genomic data integration is to ask about the relationship between sets of genomic ranges. The key features of this process are to look at overlaps and distances to the nearest feature. These functionalities, combined with the operations like flank and resize, for instance, allow pretty useful analyses with relatively little code. In general, these operations are very fast, even on thousands to millions of regions.

29.4.1 Overlaps

The findOverlaps method in the GenomicRanges package is a very useful function that allows users to identify overlaps between two sets of genomic ranges.

Here’s how it works:

  • Inputs
    The function requires two GRanges objects, referred to as query and subject.
  • Processing
    The function then compares every range in the query object with every range in the subject object, looking for overlaps. An overlap is defined as any instance where the range in the query object intersects with a range in the subject object.
  • Output
    The function returns a Hits (see ?Hits) object, which is a compact representation of the matrix of overlaps. Each entry in the Hits object corresponds to a pair of overlapping ranges, with the query index and the subject index.

Here is an example of how you might use the findOverlaps function:

# Create two GRanges objects
gr1 <- gr[1:4]
gr2 <- gr[3:8]
gr1
GRanges object with 4 ranges and 3 metadata columns:
    seqnames    ranges strand |     score        GC        AT
       <Rle> <IRanges>  <Rle> | <integer> <numeric> <numeric>
  a     chr1   101-111      - |         1  1.000000  0.000000
  b     chr2   102-112      + |         2  0.888889  0.111111
  c     chr2   103-113      + |         3  0.777778  0.222222
  d     chr2   104-114      * |         4  0.666667  0.333333
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
gr2
GRanges object with 6 ranges and 3 metadata columns:
    seqnames    ranges strand |     score        GC        AT
       <Rle> <IRanges>  <Rle> | <integer> <numeric> <numeric>
  c     chr2   103-113      + |         3  0.777778  0.222222
  d     chr2   104-114      * |         4  0.666667  0.333333
  e     chr1   105-115      * |         5  0.555556  0.444444
  f     chr1   106-116      + |         6  0.444444  0.555556
  g     chr3   107-117      + |         7  0.333333  0.666667
  h     chr3   108-118      + |         8  0.222222  0.777778
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
# Find overlaps  
overlaps <- findOverlaps(query = gr1, subject = gr2)  
overlaps
Hits object with 7 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           3
  [2]         2           1
  [3]         2           2
  [4]         3           1
  [5]         3           2
  [6]         4           1
  [7]         4           2
  -------
  queryLength: 4 / subjectLength: 6

In the resulting overlaps object, each row corresponds to an overlapping pair of ranges, with the first column giving the index of the range in gr1 and the second column giving the index of the overlapping range in gr2.

If you are interested in only the queryHits or the subjectHits, there are accessors for those (ie., queryHits(overlaps)). To get the actual ranges that overlap, you can use the subjectHits or queryHits as an index into the original GRanges object.

Spend some time looking at these results. Note how the strand comes into play when determining overlaps.

gr1[queryHits(overlaps)]
GRanges object with 7 ranges and 3 metadata columns:
    seqnames    ranges strand |     score        GC        AT
       <Rle> <IRanges>  <Rle> | <integer> <numeric> <numeric>
  a     chr1   101-111      - |         1  1.000000  0.000000
  b     chr2   102-112      + |         2  0.888889  0.111111
  b     chr2   102-112      + |         2  0.888889  0.111111
  c     chr2   103-113      + |         3  0.777778  0.222222
  c     chr2   103-113      + |         3  0.777778  0.222222
  d     chr2   104-114      * |         4  0.666667  0.333333
  d     chr2   104-114      * |         4  0.666667  0.333333
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
gr2[subjectHits(overlaps)]
GRanges object with 7 ranges and 3 metadata columns:
    seqnames    ranges strand |     score        GC        AT
       <Rle> <IRanges>  <Rle> | <integer> <numeric> <numeric>
  e     chr1   105-115      * |         5  0.555556  0.444444
  c     chr2   103-113      + |         3  0.777778  0.222222
  d     chr2   104-114      * |         4  0.666667  0.333333
  c     chr2   103-113      + |         3  0.777778  0.222222
  d     chr2   104-114      * |         4  0.666667  0.333333
  c     chr2   103-113      + |         3  0.777778  0.222222
  d     chr2   104-114      * |         4  0.666667  0.333333
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

As you might expect, the countOverlaps method counts the regions in the second GRanges that overlap with those that overlap with each element of the first.

countOverlaps(gr1, gr2)
a b c d 
1 2 2 2 

The subsetByOverlaps method simply subsets the query GRanges object to include only those that overlap the subject.

GRanges object with 4 ranges and 3 metadata columns:
    seqnames    ranges strand |     score        GC        AT
       <Rle> <IRanges>  <Rle> | <integer> <numeric> <numeric>
  a     chr1   101-111      - |         1  1.000000  0.000000
  b     chr2   102-112      + |         2  0.888889  0.111111
  c     chr2   103-113      + |         3  0.777778  0.222222
  d     chr2   104-114      * |         4  0.666667  0.333333
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

In some cases, you may be interested in only one hit per query range. The select parameter handles this: with select="first", findOverlaps() returns a plain integer vector with one entry per query range (the index of its first overlapping subject, or NA if there is none) instead of a full Hits object.

findOverlaps(gr1, gr2, select="first")
[1] 3 1 1 1

The %over% logical operator allows us to do similar things to findOverlaps and subsetByOverlaps. It returns a TRUE/FALSE for each range, which is handy for subsetting.

gr2 %over% gr1
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
gr1[gr1 %over% gr2]
GRanges object with 4 ranges and 3 metadata columns:
    seqnames    ranges strand |     score        GC        AT
       <Rle> <IRanges>  <Rle> | <integer> <numeric> <numeric>
  a     chr1   101-111      - |         1  1.000000  0.000000
  b     chr2   102-112      + |         2  0.888889  0.111111
  c     chr2   103-113      + |         3  0.777778  0.222222
  d     chr2   104-114      * |         4  0.666667  0.333333
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

29.4.2 Nearest feature

There are a number of useful methods that find the nearest feature (region) in a second set for each element in the first set.

We can review our two GRanges toy objects:

g
GRanges object with 3 ranges and 3 metadata columns:
    seqnames    ranges strand |     score        GC        AT
       <Rle> <IRanges>  <Rle> | <integer> <numeric> <numeric>
  a     chr1   101-111      - |         1  1.000000  0.000000
  b     chr2   102-112      + |         2  0.888889  0.111111
  c     chr2   103-113      + |         3  0.777778  0.222222
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
gr
GRanges object with 10 ranges and 3 metadata columns:
    seqnames    ranges strand |     score        GC        AT
       <Rle> <IRanges>  <Rle> | <integer> <numeric> <numeric>
  a     chr1   101-111      - |         1  1.000000  0.000000
  b     chr2   102-112      + |         2  0.888889  0.111111
  c     chr2   103-113      + |         3  0.777778  0.222222
  d     chr2   104-114      * |         4  0.666667  0.333333
  e     chr1   105-115      * |         5  0.555556  0.444444
  f     chr1   106-116      + |         6  0.444444  0.555556
  g     chr3   107-117      + |         7  0.333333  0.666667
  h     chr3   108-118      + |         8  0.222222  0.777778
  i     chr3   109-119      - |         9  0.111111  0.888889
  j     chr3   110-120      - |        10  0.000000  1.000000
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
  • nearest: Performs conventional nearest neighbor finding. Returns an integer vector containing the index of the nearest neighbor range in subject for each range in x. If there is no nearest neighbor NA is returned. For details of the algorithm see the man page in the IRanges package (?nearest).

  • precede: For each range in x, precede returns the index of the range in subject that is directly preceded by the range in x. Overlapping ranges are excluded. NA is returned when there are no qualifying ranges in subject.

  • follow: The opposite of precede, follow returns the index of the range in subject that is directly followed by the range in x. Overlapping ranges are excluded. NA is returned when there are no qualifying ranges in subject.

Orientation and strand for precede and follow: Orientation is 5’ to 3’, consistent with the direction of translation. Because positional numbering along a chromosome is from left to right and transcription takes place from 5’ to 3’, precede and follow can appear to have ‘opposite’ behavior on the + and - strand. Using positions 5 and 6 as an example, 5 precedes 6 on the + strand but follows 6 on the - strand.

The table below outlines the orientation when ranges on different strands are compared. In general, a feature on * is considered to belong to both strands. The single exception is when both x and subject are * in which case both are treated as +.

       x  |  subject  |  orientation 
     -----+-----------+----------------
a)     +  |  +        |  ---> 
b)     +  |  -        |  NA
c)     +  |  *        |  --->
d)     -  |  +        |  NA
e)     -  |  -        |  <---
f)     -  |  *        |  <---
g)     *  |  +        |  --->
h)     *  |  -        |  <---
i)     *  |  *        |  --->  (the only situation where * arbitrarily means +)
res <- nearest(g, gr)
res
[1] 5 4 4

Each element of res is the index into gr of the nearest range to the corresponding range in g. So gr[res] would give you the actual nearest ranges.

While nearest and friends give the index of the nearest feature, the distance to the nearest is sometimes also useful to have. The distanceToNearest method calculates the nearest feature as well as reporting the distance.

res <- distanceToNearest(g, gr)
res
Hits object with 3 hits and 1 metadata column:
      queryHits subjectHits |  distance
      <integer>   <integer> | <integer>
  [1]         1           5 |         0
  [2]         2           4 |         0
  [3]         3           4 |         0
  -------
  queryLength: 3 / subjectLength: 10

This returns a Hits object with an extra distance column. A distance of 0 means the two ranges overlap or touch; larger values are the gap in base pairs between them.

29.5 Gene models

So far every range has been one we typed by hand. In practice you usually pull ranges from an annotation. A TxDb (“transcript database”) package provides a convenient interface to gene models from a variety of sources. The TxDb.Hsapiens.UCSC.hg38.knownGene package provides access to the UCSC knownGene gene model for the hg38 build of the human genome. Everything it returns is a GRanges (or GRangesList), so all the operations from this chapter apply directly.

Figure 29.3: A graphical representation of range operations demonstrated on a gene model.
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

The transcripts function returns a GRanges object with the transcripts for all genes in the database.

tx <- transcripts(txdb)
tx
GRanges object with 412044 ranges and 2 metadata columns:
                      seqnames        ranges strand |     tx_id
                         <Rle>     <IRanges>  <Rle> | <integer>
       [1]                chr1   11121-14413      + |         1
       [2]                chr1   11125-14405      + |         2
       [3]                chr1   11410-14413      + |         3
       [4]                chr1   11411-14413      + |         4
       [5]                chr1   11426-14409      + |         5
       ...                 ...           ...    ... .       ...
  [412040] chrX_MU273397v1_alt 239036-260095      - |    412040
  [412041] chrX_MU273397v1_alt 272358-282686      - |    412041
  [412042] chrX_MU273397v1_alt 314193-316302      - |    412042
  [412043] chrX_MU273397v1_alt 314813-315236      - |    412043
  [412044] chrX_MU273397v1_alt 324527-324923      - |    412044
                     tx_name
                 <character>
       [1] ENST00000832824.1
       [2] ENST00000832825.1
       [3] ENST00000832826.1
       [4] ENST00000832827.1
       [5] ENST00000832828.1
       ...               ...
  [412040] ENST00000710260.1
  [412041] ENST00000710028.1
  [412042] ENST00000710030.1
  [412043] ENST00000710216.1
  [412044] ENST00000710031.1
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome

The exons function returns a GRanges object with the exons.

ex <- exons(txdb)
ex
GRanges object with 966596 ranges and 1 metadata column:
                      seqnames        ranges strand |   exon_id
                         <Rle>     <IRanges>  <Rle> | <integer>
       [1]                chr1   11121-11211      + |         1
       [2]                chr1   11125-11211      + |         2
       [3]                chr1   11410-11671      + |         3
       [4]                chr1   11411-11671      + |         4
       [5]                chr1   11426-11671      + |         5
       ...                 ...           ...    ... .       ...
  [966592] chrX_MU273397v1_alt 314193-314248      - |    966592
  [966593] chrX_MU273397v1_alt 314813-315236      - |    966593
  [966594] chrX_MU273397v1_alt 315258-315407      - |    966594
  [966595] chrX_MU273397v1_alt 316254-316302      - |    966595
  [966596] chrX_MU273397v1_alt 324527-324923      - |    966596
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome

The genes function returns a GRanges object with one range per gene.

gn <- genes(txdb)
gn
GRanges object with 35356 ranges and 1 metadata column:
            seqnames              ranges strand |     gene_id
               <Rle>           <IRanges>  <Rle> | <character>
          1    chr19   58345178-58362751      - |           1
         10     chr8   18386273-18401218      + |          10
        100    chr20   44584896-44652252      - |         100
       1000    chr18   27932879-28177946      - |        1000
  100008586     chrX   49551278-49568218      + |   100008586
        ...      ...                 ...    ... .         ...
       9990    chr15   34229784-34338060      - |        9990
       9991     chr9 112215071-112333653      - |        9991
       9992    chr21   34180729-34371381      + |        9992
       9993    chr22   19036282-19122454      - |        9993
       9997    chr22   50523568-50526461      - |        9997
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome

Notice the scale: these are real GRanges objects with tens of thousands of ranges, yet they print and subset instantly. That is the payoff of the compact representation — the same findOverlaps(), nearest(), and subsetByOverlaps() calls you practiced on the toy gr work unchanged here. For example, subsetByOverlaps(gn, peaks) would give you the genes hit by a set of ChIP-seq peaks, and nearest(peaks, tx) would assign each peak to its closest transcript.

TipGrouping exons by transcript

exons(txdb) gives you a flat GRanges of every exon. When you instead want the exons grouped by the transcript they belong to — a GRangesList, one element per transcript — use exonsBy(txdb, by = "tx"). That is the natural object for counting reads per transcript and is exactly the GRangesList structure we built by hand earlier.

29.6 Summary

You now have a working vocabulary for genomic locations.

  • A GRanges object holds a set of ranges. Coordinates (seqnames, ranges, strand) sit to the left of the |; optional metadata reached with mcols() sits to the right. You can build one from scratch or coerce a data.frame with as(df, "GRanges").
  • Intra-range operations (flank(), shift(), resize()) reshape each range on its own and respect strand. Inter-range operations (reduce(), disjoin(), gaps(), coverage()) summarize a set of ranges together.
  • Between-set operations answer the integration questions: findOverlaps() and subsetByOverlaps() for overlap, nearest() and distanceToNearest() for proximity, plus union()/intersect()/setdiff() for set arithmetic.
  • A GRangesList groups ranges by a parent feature, such as exons within a transcript.
  • A TxDb package (here TxDb.Hsapiens.UCSC.hg38.knownGene) hands you real gene, transcript, and exon models as GRanges, so every operation above applies to genome-scale annotation with no new syntax.

29.7 Exercises

These reuse the objects built earlier in the chapter: the 10-range gr, its 3-range subset g, and the grl GRangesList. Run the chapter’s chunks first so those objects exist.

1. Read the structure. Without running any code, which of these belong to the left of the | in a GRanges (the coordinates) and which to the right (the metadata): strand, score, seqnames, GC, width?

Left of the | (coordinates): seqnames, strand, and width (part of ranges). Right of the | (metadata): score and GC. The coordinate columns have dedicated accessors; the metadata columns are reached together with mcols().

2. Promoters by hand. A common task is to grab the region just upstream of a feature’s start — a rough promoter. Use flank() to get the 50 bases upstream of each range in g, then confirm with width() that each result is 50 bases wide.

prom <- flank(g, 50)
prom
GRanges object with 3 ranges and 3 metadata columns:
    seqnames    ranges strand |     score        GC        AT
       <Rle> <IRanges>  <Rle> | <integer> <numeric> <numeric>
  a     chr1   112-161      - |         1  1.000000  0.000000
  b     chr2    52-101      + |         2  0.888889  0.111111
  c     chr2    53-102      + |         3  0.777778  0.222222
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
width(prom)
[1] 50 50 50

flank() defaults to the upstream side (start = TRUE), measured relative to strand. Every flank is exactly 50 bases wide.

3. Collapse overlaps. Use reduce() on gr to merge overlapping or adjacent ranges into a minimal set, then report how many ranges remain with length().

reduced <- reduce(gr)
reduced
GRanges object with 7 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1   106-116      +
  [2]     chr1   101-111      -
  [3]     chr1   105-115      *
  [4]     chr2   102-113      +
  [5]     chr2   104-114      *
  [6]     chr3   107-118      +
  [7]     chr3   109-120      -
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
length(reduced)
[1] 7

reduce() works per chromosome and per strand, merging ranges that overlap or abut into single ranges — the standard way to collapse overlapping exons into non-redundant regions.

4. Overlap counting. Build q <- gr[1:4] and s <- gr[3:8], then use countOverlaps(q, s) to find how many ranges in s each range in q overlaps. Which range in q has the most overlaps?

q <- gr[1:4]
s <- gr[3:8]
counts <- countOverlaps(q, s)
counts
a b c d 
1 2 2 2 
which.max(counts)
b 
2 

countOverlaps() returns one integer per range in the query. Remember that strand participates in the overlap test, so a + query range will not overlap a - subject range unless you set ignore.strand = TRUE.

5. Nearest with distance. Use distanceToNearest(g, gr) and pull out the distance metadata column with mcols(...)$distance. What does a distance of 0 tell you?

hits <- distanceToNearest(g, gr)
mcols(hits)$distance
[1] 0 0 0

A distance of 0 means the nearest range overlaps or directly touches the query range — here every range in g is also present in gr, so each finds itself at distance 0.