library(SingleCellMultiModal) # the curated 10x PBMC multiome dataset
library(MultiAssayExperiment) # the multi-modal container
library(SingleCellExperiment) # each modality is an SCE
library(Seurat) # WNN joint embedding lives here
library(Signac) # ATAC processing (TF-IDF / LSI) and ChromatinAssay
library(ggplot2) # ggtitle() to label the comparison panels
library(patchwork) # compose plots side by side with `+`39 Integrating single-cell ATAC and RNA
A single-cell RNA experiment tells you what each cell is doing: which genes it is actively transcribing, right now, in the moment it was captured. A single-cell ATAC experiment tells you something different and complementary — which stretches of the genome are open for business. ATAC-seq finds the regions of chromatin a cell has unpacked from its tightly wound default state: promoters it has switched on, enhancers it is using, the regulatory switches it has flipped. Expression is the output; accessibility is the wiring that produces it. Measure only one and you are reading half the circuit.
Why does the wiring matter if you can already see the output? Because a cell often opens a regulatory region before, or without, transcribing the gene it controls. A stem cell on the verge of becoming a blood cell can unlock the enhancers of blood-cell genes while those genes are still essentially silent — the chromatin is poised, the decision half-made, but the RNA hasn’t caught up yet. Accessibility therefore carries information about a cell’s trajectory and regulatory state that expression alone misses: which transcription factors are active, which enhancers drive which genes, where a cell is heading rather than only where it is.
The technology that makes this concrete is the multiome assay (here, 10x Genomics’ Single Cell Multiome ATAC + Gene Expression), which measures both RNA and open chromatin in the same nucleus. That “same nucleus” is the crucial part. With separate experiments you can only compare populations — the average RNA of one batch of cells against the average accessibility of another. With a multiome you can ask a cause-and-effect question inside a single cell: this enhancer is open, and this gene, which it controls, is expressed, in this very cell. That is what lets you move from cataloguing cell types to reconstructing the regulatory logic that defines them.
This chapter is about integration: combining the two modalities into one coherent analysis instead of two parallel ones. We will build the intuition for what integration means, why it is harder than it sounds, and then do it hands-on with a public 10x PBMC multiome dataset that ships ready-to-use as a Bioconductor MultiAssayExperiment. For the wider landscape of single-cell multi-omic technologies and the computational methods that pair with them, two reviews are good companions to this chapter (Argelaguet et al. 2021; Vandereyken et al. 2023).
39.1 What you’ll learn
- Why chromatin accessibility (ATAC) and gene expression (RNA) are complementary readouts of the same cell, and what each one uniquely tells you.
- The three shapes of single-cell integration — vertical, diagonal, and mosaic — and how knowing which one you have decides which methods apply.
- The feature-space problem: RNA is measured over genes, ATAC over genomic peaks, and why that mismatch is the central obstacle to integration.
- The bridges that connect the two spaces: gene-activity scores, peak-to-gene links, and transcription-factor motif activity.
- How to load and explore a real multiome dataset as a
MultiAssayExperimentof twoSingleCellExperiments, and process each modality on its own terms. - How to build a joint embedding that uses both modalities at once, and what it buys you over either modality alone.
This chapter builds directly on the single-cell RNA-seq chapter (the SingleCellExperiment, normalization, HVGs, PCA/UMAP, clustering) and on the bulk ATAC-seq chapter (peaks, fragments, what accessibility is). If either is hazy, skim it first — we lean on both.
39.2 Two views of the same cell
It helps to be precise about what each assay actually counts.
- scRNA-seq counts transcripts and assigns them to genes. A cell becomes a vector of ~20,000–30,000 gene counts. High counts for a gene mean that gene is being transcribed.
- scATAC-seq counts sequencing fragments that fall in accessible chromatin and assigns them to peaks — genomic intervals (typically a few hundred base pairs) that are open across many cells. A cell becomes a vector of counts over ~100,000+ peaks. A nonzero count for a peak means that region was accessible in that cell.
Two consequences follow immediately, and they drive everything else in the chapter:
- The two modalities live in different feature spaces. One is indexed by genes, the other by genomic coordinates. You cannot simply stack a gene-by-cell matrix on top of a peak-by-cell matrix and call it integrated — the rows mean different things.
- ATAC data is sparse and nearly binary. Each cell has at most two copies of any locus, so a peak is essentially “seen” or “not seen” in a given cell. The peak matrix is far sparser than the RNA matrix, which changes how we normalize and reduce it (more on this below — it is why ATAC uses TF-IDF and LSI rather than the log-normalization + PCA you used for RNA).
39.3 What “integration” means: vertical, diagonal, mosaic
“Integrate the ATAC with the RNA” can mean several genuinely different problems, and the single most useful thing you can do before reaching for a tool is to figure out which one you have. The cleanest way to organize them (after Argelaguet et al. (2021)) is by what the two datasets share:
| Shape | What’s shared | Typical source | The hard part |
|---|---|---|---|
| Vertical | the same cells, different modalities | 10x Multiome (RNA + ATAC per nucleus) | nothing to match — just combine the two views well |
| Diagonal | different cells, different modalities | a scRNA run and a separate scATAC run | you must infer which cells correspond |
| Mosaic | partial overlap | some cells multiome, some single-modality | bridge the measured and the missing |
The distinction is not academic — it changes the method entirely:
- In vertical integration the correspondence is handed to you. Cell 5’s RNA and cell 5’s ATAC are the same cell, by construction. The task is to learn a single representation that respects both views — a joint embedding. This is the easier, intuition-building case, and where we start.
- In diagonal integration there is no correspondence at all. You have RNA for one set of cells and ATAC for a different set, and nothing tells you that an RNA cell and an ATAC cell are “the same kind.” You must infer the alignment, almost always by projecting both into a shared feature space (the bridges below) and matching. This is the case that forces you to understand why integration is hard, and we return to it in the next chapter using label transfer.
Your collaborators will bring you both. Someone with a 10x Multiome run has a vertical problem. Someone who ran scRNA last year and scATAC this year — same tissue, different cells — has a diagonal one. Same words, “integrate ATAC with RNA,” completely different machinery.
39.4 The feature-space problem and the bridges
Because RNA lives in gene space and ATAC in peak space, every integration method needs a bridge that puts the two modalities into a common frame. There are three bridges worth knowing, and they are not interchangeable — each answers a different question:
- Gene-activity scores. Summarize a gene’s accessibility by summing the ATAC signal over its body and promoter (or, better, distance-weighted nearby peaks). This casts ATAC into gene space, so an ATAC cell can be compared to an RNA cell gene-for-gene. It is crude — accessibility is not expression — but it is the workhorse bridge for aligning unpaired datasets and for annotating ATAC cells from an RNA reference.
- Peak-to-gene links. Correlate a peak’s accessibility against a gene’s expression across cells. A peak whose opening tracks a gene’s expression is a candidate regulatory element for that gene. This is less a bridge for alignment than a biological result — it is how you discover enhancer→gene wiring, and it needs paired (or pseudobulked) data to compute.
-
Motif / TF activity. Collapse the tens of thousands of peaks into the handful of transcription factors whose binding motifs they contain (e.g. with
chromVAR(Schep et al. 2017)). This turns the unwieldy peak space into a small, interpretable space of TF activities — and TFs are the natural currency for connecting open chromatin back to the genes it regulates.
Keep these three straight and most of the integration literature snaps into focus: methods differ mainly in which bridge they use and whether they need the cells to be paired.
39.5 The data: 10x PBMC multiome
We’ll use a public 10x Genomics dataset of roughly 10,000 peripheral blood mononuclear cells (PBMCs) from a healthy donor, profiled with the Single Cell Multiome assay so that every nucleus has both an RNA and an ATAC measurement. It is a good teaching set: PBMCs contain well-characterized, easily recognized immune cell types (T cells, B cells, monocytes, NK cells), so we have strong priors about what the answer should look like.
Rather than download and reprocess the raw 10x files, we use the curated copy from the Bioconductor SingleCellMultiModal package (Eckenrode et al. 2023), which delivers the dataset as a MultiAssayExperiment — the natural container for “several assays measured on the same cells,” and the multi-modal generalization of the SummarizedExperiment/SingleCellExperiment classes you have already met.
A single call fetches the data (the first time, it downloads from Bioconductor’s ExperimentHub and caches locally; afterwards it loads from the cache):
mae <- scMultiome("pbmc_10x", modes = "*", dry.run = FALSE, format = "MTX")Working on: pbmc_atac_se.rds
Working on: pbmc_atac.mtx.gz
Working on: pbmc_rna_se.rds
Working on: pbmc_rna.mtx.gz
Working on: pbmc_atac,
pbmc_rna
see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation
loading from cache
Loading required namespace: HDF5Array
see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation
loading from cache
Working on: pbmc_atac,
pbmc_rna
see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation
loading from cache
see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation
loading from cache
Working on: pbmc_colData
Working on: pbmc_sampleMap
see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation
loading from cache
see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation
loading from cache
maeA MultiAssayExperiment object of 2 listed
experiments with user-defined names and respective classes.
Containing an ExperimentList class object of length 2:
[1] atac: SingleCellExperiment with 108344 rows and 10032 columns
[2] rna: SingleCellExperiment with 36549 rows and 10032 columns
Functionality:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample coordination DataFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
exportClass() - save data to flat files
The object bundles two experiments measured on the same 10,032 cells:
-
rna— aSingleCellExperimentof gene expression counts (~36,500 genes). -
atac— aSingleCellExperimentof peak accessibility counts (~108,000 peaks).
Because they share the same cells, the MultiAssayExperiment keeps their column (cell) metadata aligned for us — this is exactly the bookkeeping that makes vertical integration tractable. We pull out the two modalities to work with each on its own terms:
sce.rna <- experiments(mae)[["rna"]]
sce.atac <- experiments(mae)[["atac"]]
sce.rnaclass: SingleCellExperiment
dim: 36549 10032
metadata(0):
assays(1): counts
rownames(36549): MIR1302-2HG FAM138A ... CDY1 TTTY3
rowData names(2): id modality
colnames(10032): AAACAGCCAAGGAATC AAACAGCCAATCCCTT ... TTTGTTGGTTGGTTAG
TTTGTTGGTTTGCAGA
colData names(6): nCount_RNA nFeature_RNA ... celltype broad_celltype
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
sce.atacclass: SingleCellExperiment
dim: 108344 10032
metadata(0):
assays(1): counts
rownames(108344): chr1:10109-10357 chr1:180730-181630 ...
chrY:11332988-11334144 chrY:56836663-56837005
rowData names(2): id modality
colnames(10032): AAACAGCCAAGGAATC AAACAGCCAATCCCTT ... TTTGTTGGTTGGTTAG
TTTGTTGGTTTGCAGA
colData names(6): nCount_RNA nFeature_RNA ... celltype broad_celltype
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
39.6 From Bioconductor to Seurat: building the joint object
The joint-embedding method we’ll use — weighted nearest neighbors (WNN) — lives in the Seurat/Signac ecosystem (Stuart et al. 2021), which is the de-facto standard for paired multiome. So our first practical step is to bridge our MultiAssayExperiment into a Seurat object: the RNA counts become a standard expression assay, and the ATAC peak counts become a Signac ChromatinAssay. This Bioconductor→Seurat hand-off is worth learning in its own right — you will reach for it constantly, because single-cell data routinely arrives in one ecosystem and the method you want lives in the other.
# RNA becomes a standard expression assay.
pbmc <- CreateSeuratObject(counts = counts(sce.rna), assay = "RNA")
# ATAC becomes a ChromatinAssay, which identifies each peak by its genomic
# coordinates. Peaks can reach us two ways: as a GRanges in the SCE's rowRanges,
# or encoded in the rownames as coordinate strings like "chr1:9776-10668". We
# prefer the GRanges when present and fall back to parsing the rownames.
atac.counts <- counts(sce.atac)
peak.ranges <- tryCatch(rowRanges(sce.atac), error = function(e) NULL)
pbmc[["ATAC"]] <- if (is(peak.ranges, "GRanges") && length(peak.ranges) == nrow(atac.counts)) {
CreateChromatinAssay(counts = atac.counts, ranges = peak.ranges)
} else {
CreateChromatinAssay(counts = atac.counts, sep = c(":", "-"))
}
pbmcAn object of class Seurat
144893 features across 10032 samples within 2 assays
Active assay: RNA (36549 features, 0 variable features)
1 layer present: counts
1 other assay present: ATAC
The raw 10x multiome ships a large fragments file (every ATAC fragment, genome-wide). The curated dataset gives us the peak matrix instead, so we build the ChromatinAssay without fragments. WNN needs only the per-modality reductions, so this costs us nothing here — but it does mean genome-browser coverage tracks and fragment-based gene-activity scores aren’t available in this chapter. We don’t need them for the joint embedding.
39.6.1 RNA reduction: the familiar recipe
The expression side is exactly the workflow from the scRNA-seq chapter — normalize, pick highly variable genes, scale, and reduce with PCA:
DefaultAssay(pbmc) <- "RNA"
pbmc <- NormalizeData(pbmc)Normalizing layer: counts
pbmc <- FindVariableFeatures(pbmc, nfeatures = 3000)Finding variable features for layer counts
pbmc <- ScaleData(pbmc)Centering and scaling data matrix
pbmc <- RunPCA(pbmc, reduction.name = "pca")PC_ 1
Positive: SKAP1, BCL11B, CD247, IL32, EEF1A1, CAMK4, TC2N, IL7R, LEF1, BACH2
INPP4B, TRAC, LTB, BCL2, ITK, SYNE2, CD96, THEMIS, TRBC2, RORA
SLFN12L, TXK, PRKCQ, TNIK, ANK3, STAT4, GRAP2, CCR7, CARD11, TESPA1
Negative: PLXDC2, SLC8A1, LRMDA, FCN1, MCTP1, RBM47, TYMP, IRAK3, JAK2, DMXL2
NAMPT, TBXAS1, LYN, ZEB2, LRRK2, CYBB, TNFAIP2, SAT1, GAB2, HCK
CLEC7A, TLR2, CSF3R, LYST, VCAN, FGD4, GRK3, FAM49A, CD36, MARCH1
PC_ 2
Positive: CD247, BCL11B, IL32, DPYD, TC2N, AOAH, IL7R, TNIK, INPP4B, CAMK4
PDE3B, ITK, SLFN12L, TRAC, THEMIS, PRKCQ, RORA, SLCO3A1, LEF1, TXK
STAT4, ANXA1, NEAT1, FNDC3B, ARHGAP26, SRGN, CD7, ADGRE5, PLCB1, TRBC1
Negative: BANK1, MS4A1, PAX5, NIBAN3, FCRL1, IGHM, AFF3, EBF1, LINC00926, OSBPL10
CD79A, RALGPS2, COBLL1, CD22, BLNK, BLK, AP002075.1, IGHD, ADAM28, COL19A1
BCL11A, CD79B, PLEKHG1, GNG7, DENND5B, WDFY4, SPIB, AC120193.1, RUBCNL, TCF4
PC_ 3
Positive: MAML2, LEF1, PDE3B, BACH2, CCR7, CAMK4, IL7R, FHIT, NELL2, LTB
NR3C2, RASGRF2, INPP4B, MAL, ANK3, IGF1R, MLLT3, CSGALNACT1, TESPA1, PLCL1
ITGA6, LINC01550, AC139720.1, BCL2, SLC2A3, PRKN, PATJ, TSHZ2, LEF1-AS1, SESN3
Negative: NKG7, GNLY, PRF1, KLRD1, GZMB, GZMA, CST7, CCL5, FGFBP2, SPON2
GZMH, ADGRG1, KLRF1, TGFBR3, CCL4, CLIC3, PDGFD, PTGDR, FCRL6, C1orf21
IL2RB, PPP2R2B, BNC2, CTSW, TBX21, HOPX, FCGR3A, PYHIN1, CX3CR1, MYBL1
PC_ 4
Positive: PAX5, FCRL1, MS4A1, LINC00926, EBF1, OSBPL10, BANK1, CD79A, CD22, COL19A1
CD79B, PLEKHG1, ADAM28, AP002075.1, IGHD, MTSS1, FCER2, MCTP2, IGHM, SWAP70
AC120193.1, LARGE1, NKG7, GNLY, RALGPS2, FCRL2, FCRL3, PRF1, KLRD1, STEAP1B
Negative: LINC01478, CLEC4C, CUX2, EPHB1, LILRA4, COL26A1, PTPRS, SCN9A, LINC01374, LINC00996
AC023590.1, COL24A1, NRP1, PHEX, FAM160A1, TNFRSF21, PACSIN1, SLC35F3, PLXNA4, SCAMP5
PLD4, DNASE1L3, LRRC26, PLEKHD1, P3H2, SCN1A-AS1, ITM2C, SCT, PTCRA, SERPINF1
PC_ 5
Positive: CDKN1C, IFITM3, HES4, CSF1R, FCGR3A, SMIM25, MS4A7, LST1, AC104809.2, LRRC25
AIF1, AC020651.2, MAFB, CALHM6, CFD, GPBAR1, FMNL2, CTSL, SERPINA1, SIGLEC10
TCF7L2, HMOX1, MTSS1, CST3, HLA-DPA1, CKB, TMSB4X, COTL1, SPRED1, CASP5
Negative: PLCB1, VCAN, ARHGAP24, CSF3R, DYSF, LINC00937, VCAN-AS1, GLT1D1, ITGAM, FNDC3B
PADI4, EIF4G3, PDE4D, AC020916.1, CREB5, TREM1, MEGF9, CD36, ARHGAP26, FCAR
VAV3, MICAL2, CLMN, ACSL1, KIF13A, JUN, ANXA1, CRISPLD2, MCTP2, RAB11FIP1
39.6.2 ATAC reduction: why not PCA, and what TF-IDF + LSI do instead
The accessibility side needs a different reduction, and the reason is the structure of the data. Recall that the peak matrix is sparse and nearly binary: a cell has at most a couple of copies of any locus, so most entries are 0 and the rest are small. Log-normalization-then-PCA, which works well for the graded counts of RNA, is a poor fit for this on/off signal. The fix, worked out for the earliest single-cell ATAC experiments (Cusanovich et al. 2015), borrows two ideas from how search engines compare text documents: TF-IDF, then LSI.
39.6.2.1 TF-IDF, in miniature
TF-IDF stands for term frequency–inverse document frequency. The analogy is to text: if cells are “documents” and peaks are “words,” then a peak open in nearly every cell is like the word the — common, and therefore almost useless for telling documents apart — while a peak open in just a handful of cells is like a rare, distinctive word that pins down which documents resemble each other. TF-IDF rewrites each entry so that rare, informative peaks count for more and ubiquitous ones count for less.
The quickest way to feel this is to do it by hand on a tiny matrix — four cells, three peaks, 1 meaning “accessible”:
# rows = peaks, columns = cells; 1 = accessible in that cell
toy <- matrix(c(1, 1, 1, 1, # peak A: open everywhere (an uninformative "the")
1, 1, 0, 0, # peak B: open in half the cells
0, 0, 0, 1), # peak C: open in one cell (rare, distinctive)
nrow = 3, byrow = TRUE,
dimnames = list(c("peakA", "peakB", "peakC"),
c("c1", "c2", "c3", "c4")))
n_cells <- ncol(toy)
peak_freq <- rowSums(toy) # in how many cells is each peak open?
idf <- log(1 + n_cells / peak_freq) # inverse document frequency = the weight
round(idf, 2)peakA peakB peakC
0.69 1.10 1.61
The ubiquitous peak A gets the smallest weight and the rare peak C the largest — precisely the re-weighting we want before we go looking for structure. (Signac’s RunTFIDF() uses a refined version of this same idea; the toy captures the intuition, not the exact formula.)
39.6.2.2 LSI: a PCA for the re-weighted matrix
Once the peak matrix is re-weighted, LSI (latent semantic indexing) reduces it to a handful of components using a singular value decomposition (SVD) — mathematically the same machinery as PCA, applied to the TF-IDF matrix instead of a scaled expression matrix. Each component is a latent pattern of co-accessible peaks: a set of regulatory regions that tend to open together across cells. Where PCA on RNA finds axes of co-expressed genes, LSI on ATAC finds axes of co-accessible peaks — and those axes are what separate one cell state from another.
DefaultAssay(pbmc) <- "ATAC"
pbmc <- FindTopFeatures(pbmc, min.cutoff = 5) # keep peaks seen in >= 5 cells
pbmc <- RunTFIDF(pbmc) # re-weight (TF-IDF)Performing TF-IDF normalization
Warning in RunTFIDF.default(object = GetAssayData(object = object, layer =
"counts"), : Some features contain 0 total counts
pbmc <- RunSVD(pbmc, reduction.name = "lsi") # the SVD that makes it "LSI"Running SVD
Scaling cell embeddings
There is one wrinkle worth teaching because it trips people up: the first LSI component usually captures sequencing depth, not biology. The reason is mechanical — a cell sequenced more deeply has more fragments and therefore more nonzero peaks across the board, so the single largest axis of variation in the data is often just “how much was this cell sequenced,” not “what kind of cell is it.” We check for this with DepthCor(), which correlates each component against each cell’s total counts, and routinely drop component 1 (using dimensions 2:n instead of 1:n):
DepthCor(pbmc, reduction = "lsi")39.7 A joint embedding with weighted nearest neighbors
Here is the heart of vertical integration. We have two reductions of the same cells — a PCA from RNA and an LSI from ATAC — and we want one representation that uses both. The naive approach (just glue the two sets of components together) treats every cell as trusting both modalities equally. But that is rarely true: for some cells the RNA cleanly says “T cell” while the ATAC is murky, and for others the reverse.
Weighted nearest neighbors (WNN) (Hao et al. 2021) handles exactly this by letting every cell weight the two modalities for itself. The intuition is a panel of two advisors. Suppose you must sort people into groups by similarity, and you have two advisors: one judges by what people say (RNA), the other by what they are poised to do (ATAC). For some people the first advisor is far more reliable; for others, the second. The sensible strategy is to ask, for each person individually, which advisor’s notion of “similar people” better matches that person — and to trust that advisor more for that person.
That is essentially WNN’s algorithm. For each cell it asks: using this cell’s RNA neighbors, how well can I reconstruct the cell’s own RNA profile — and using its ATAC neighbors, how well can I reconstruct its own ATAC profile? Whichever modality predicts the cell better earns a higher weight for that cell. These per-cell weights (a number between 0 and 1 for each modality, the two summing to 1) then merge the two neighbor graphs into a single weighted nearest-neighbor graph, which we embed with UMAP and cluster exactly as we would a single-modality graph.
pbmc <- FindMultiModalNeighbors(
pbmc,
reduction.list = list("pca", "lsi"),
dims.list = list(1:50, 2:40), # RNA PCA 1:50; ATAC LSI drops component 1
modality.weight.name = "RNA.weight"
)Calculating cell-specific modality weights
Finding 20 nearest neighbors for each modality.
Calculating kernel bandwidths
Warning in FindMultiModalNeighbors(pbmc, reduction.list = list("pca", "lsi"), :
The number of provided modality.weight.name is not equal to the number of
modalities. RNA.weight ATAC.weight are used to store the modality weights
Finding multimodal neighbors
Constructing multimodal KNN graph
Constructing multimodal SNN graph
pbmc <- RunUMAP(pbmc, nn.name = "weighted.nn", assay = "RNA", reduction.name = "wnn.umap")Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
15:46:33 UMAP embedding parameters a = 0.9922 b = 1.112
15:46:34 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 20
15:46:36 Found 3 connected components, falling back to random initialization
15:46:36 Initializing from uniform random
15:46:36 Commencing optimization for 200 epochs, with 292842 positive edges
15:46:36 Using rng type: pcg
15:46:39 Optimization finished
pbmc <- FindClusters(pbmc, graph.name = "wsnn", algorithm = 3, resolution = 0.8)Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 10032
Number of edges: 374830
Running smart local moving algorithm...
Maximum modularity in 10 random starts: 0.9056
Number of communities: 20
Elapsed time: 4 seconds
39.7.1 What did the joint view buy us?
The payoff is clearest when we put all three embeddings side by side — RNA alone, ATAC alone, and the WNN joint view — and look for cell populations that one modality blurs but the other (and so the joint view) resolves:
pbmc <- RunUMAP(pbmc, reduction = "pca", dims = 1:50, reduction.name = "rna.umap")15:46:44 UMAP embedding parameters a = 0.9922 b = 1.112
15:46:44 Read 10032 rows and found 50 numeric columns
15:46:44 Using Annoy for neighbor search, n_neighbors = 30
15:46:44 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
15:46:45 Writing NN index file to temp file /tmp/RtmpXrGGdm/file5760f50440
15:46:45 Searching Annoy index using 1 thread, search_k = 3000
15:46:48 Annoy recall = 100%
15:46:48 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
15:46:50 Found 2 connected components, falling back to 'spca' initialization with init_sdev = 1
15:46:50 Using 'irlba' for PCA
15:46:50 PCA: 2 components explained 51.63% variance
15:46:50 Scaling init to sdev = 1
15:46:50 Commencing optimization for 200 epochs, with 447124 positive edges
15:46:50 Using rng type: pcg
15:46:54 Optimization finished
pbmc <- RunUMAP(pbmc, reduction = "lsi", dims = 2:40, reduction.name = "atac.umap")15:46:54 UMAP embedding parameters a = 0.9922 b = 1.112
15:46:54 Read 10032 rows and found 39 numeric columns
15:46:54 Using Annoy for neighbor search, n_neighbors = 30
15:46:54 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
15:46:55 Writing NN index file to temp file /tmp/RtmpXrGGdm/file571052de
15:46:55 Searching Annoy index using 1 thread, search_k = 3000
15:46:57 Annoy recall = 100%
15:46:58 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
15:47:00 Initializing from normalized Laplacian + noise (using RSpectra)
15:47:00 Commencing optimization for 200 epochs, with 393226 positive edges
15:47:00 Using rng type: pcg
15:47:04 Optimization finished
We can also look directly at what WNN learned — the per-cell modality weights. Each cell received an RNA weight (and an ATAC weight equal to 1 − the RNA weight); plotting the RNA weight by cluster shows which cell types lean on expression and which lean on accessibility to define themselves:
VlnPlot(pbmc, features = "RNA.weight", group.by = "seurat_clusters", pt.size = 0)39.8 Downstream: linking a peak to the gene it regulates
The unique payoff of paired data is that we can connect a specific open region to a specific gene’s expression in the same cells. The honest, transparent way to see the idea is to compute it by hand for one locus: take a cell-type marker gene, find the accessible peaks near it, and ask whether a peak’s accessibility tracks the gene’s expression across cells. A peak that does is a candidate regulatory element for that gene.
We’ll use MS4A1 (CD20), a canonical B-cell marker. The logic is three steps: find the gene, find the peaks near it, then correlate.
Step 1 — locate the gene. We pull MS4A1’s coordinates from an Ensembl annotation and relabel its chromosome in the chr11 (UCSC) style the ATAC peaks use, then widen to a ±100 kb window — wide enough to catch enhancers, which often sit well outside the gene body. One wrinkle: many gene symbols match two rows in Ensembl — the real protein_coding gene plus an LRG_gene curated-reference duplicate at the same locus. We keep only the protein-coding row so every peak is counted once:
library(EnsDb.Hsapiens.v86)
gene <- "MS4A1"
all.genes <- genes(EnsDb.Hsapiens.v86) # every gene, as a GRanges
gene.gr <- all.genes[all.genes$gene_name == gene &
all.genes$gene_biotype == "protein_coding"] # one row
seqlevelsStyle(gene.gr) <- "UCSC" # "11" -> "chr11"Warning in (function (seqlevels, genome, new_style) : cannot switch some
GRCh38's seqlevels from NCBI to UCSC style
window <- resize(gene.gr, width = width(gene.gr) + 2e5, fix = "center")
windowGRanges object with 1 range and 6 metadata columns:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
ENSG00000156738 chr11 60355752-60570760 + | ENSG00000156738
gene_name gene_biotype seq_coord_system symbol
<character> <character> <character> <character>
ENSG00000156738 MS4A1 protein_coding chromosome MS4A1
entrezid
<list>
ENSG00000156738 931
-------
seqinfo: 357 sequences (1 circular) from 2 genomes (hg38, GRCh38)
Step 2 — find nearby peaks. The ChromatinAssay carries each peak’s coordinates, so we ask which peaks fall inside the window. We index back into the assay’s own peak names so the identifiers match exactly:
peak.gr <- granges(pbmc[["ATAC"]])
idx <- queryHits(findOverlaps(peak.gr, window))
peak.ids <- rownames(pbmc[["ATAC"]])[idx]
length(peak.ids) # how many peaks sit near MS4A1?[1] 23
Step 3 — correlate accessibility with expression. For each nearby peak we compute the Spearman correlation, across all 10,032 cells, between that peak’s accessibility and MS4A1’s expression. A peak whose opening rises and falls with the gene’s expression is a candidate regulatory element for it:
acc <- GetAssayData(pbmc, assay = "ATAC", layer = "data")[peak.ids, , drop = FALSE]
expr <- GetAssayData(pbmc, assay = "RNA", layer = "data")[gene, ]
peak.cor <- apply(acc, 1, function(a) cor(a, expr, method = "spearman"))
sort(peak.cor, decreasing = TRUE)[1:10] # the top candidate linkschr11-60455290-60456088 chr11-60498212-60499391 chr11-60476867-60477719
0.5861696 0.3864819 0.3556311
chr11-60457557-60458006 chr11-60485853-60486383 chr11-60357719-60358258
0.2812676 0.2474239 0.1842331
chr11-60396248-60397424 chr11-60459521-60459691 chr11-60376380-60382255
0.1827865 0.1327127 0.1218045
chr11-60366868-60368023
0.1187863
The most strongly correlated peak is typically the gene’s own promoter; the others are candidate enhancers that open in the same cells that express MS4A1 — precisely the sort of regulatory link that only paired measurement, in the same cells, can reveal.
A correlation coefficient is a single number; it helps to see what it summarizes. Below we plot, cell by cell, MS4A1 expression against accessibility — for the promoter peak (the strongest link) and, for contrast, a nearby peak whose accessibility is essentially uncorrelated with expression. Single-cell data are sparse, so most cells register neither signal and pile up at the origin; rather than overplot ~10,000 semi-transparent points we bin them into a 2-D histogram with a log-scaled colour:
library(ggplot2)
library(patchwork)
top.peak <- names(sort(peak.cor, decreasing = TRUE))[1] # strongest link (promoter)
null.peak <- names(peak.cor)[which.min(abs(peak.cor))] # nearby peak, no link
link_panel <- function(peak, tag) {
df <- data.frame(accessibility = acc[peak, ], expression = expr)
ggplot(df, aes(accessibility, expression)) +
geom_bin2d(bins = 40) +
scale_fill_viridis_c(trans = "log10", name = "cells") +
labs(title = tag,
subtitle = sprintf("%s (Spearman ρ = %.2f)", peak, peak.cor[peak]),
x = "ATAC accessibility (TF-IDF)",
y = "MS4A1 expression (log-normalised)") +
theme_minimal(base_size = 10)
}
link_panel(top.peak, "Promoter: accessibility tracks expression") +
link_panel(null.peak, "Nearby peak: no link")Production tools formalize this step: Signac::LinkPeaks() adds GC-content and background correction (at the cost of a BSgenome and the fragment file), and full gene-regulatory-network methods (SCENIC+, Pando) chain peak→gene links with transcription-factor motifs to reconstruct regulatory circuits. We compute the bare correlation here so the concept is unmistakable.
39.9 Summary
- RNA and ATAC are complementary: expression is the output, accessibility is the regulatory wiring — and a cell can open a locus before it transcribes the gene.
- Know your integration shape first. This was a vertical problem: a multiome assay handed us the cell correspondence, so the task was to combine two views, not to match cells.
- The two modalities live in different feature spaces (genes vs. peaks), and the sparse, near-binary ATAC matrix needs TF-IDF + LSI, not the log-PCA recipe used for RNA — and its first LSI component usually encodes depth, not biology.
- WNN builds a single embedding by learning, per cell, how much to trust each modality — recovering structure that either modality alone can blur.
- Paired data uniquely lets you link peaks to genes in the same cells, the first step toward reconstructing regulatory logic.
39.10 What’s next: the unpaired case
This chapter handled the vertical problem, where the multiome assay handed us the cell correspondence for free. The harder and very common case is diagonal integration — a separate scRNA dataset and scATAC dataset with no shared cells — where we must infer the correspondence ourselves. The next chapter tackles it with anchor-based label transfer, and uses a clever trick: we take this very multiome dataset, throw away the pairing, and try to recover it — giving us a known ground truth to grade the integration against.



