BiocManager::install("GEOquery")25 Accessing public omics data with GEOquery
Someone has already run the experiment you need. Thousands of expression studies — microarray and RNA-seq, across nearly every tissue and disease — sit in a public archive called the Gene Expression Omnibus (GEO), free to download and reanalyze. Suppose you read a paper comparing tumor and normal tissue across several cancers and you want to see the data for yourself: pull it down, check how the samples group, and ask whether tumor and normal really separate. That is exactly the kind of reuse GEO exists for.
In this chapter you’ll do that end to end. You’ll fetch a real published dataset with the GEOquery package, load it into a SummarizedExperiment (the container from the previous chapter), keep the most variable genes, and run a principal component analysis (PCA) to visualize how the samples relate. The data come from a study of immune gene expression in matched normal and tumor tissue (Brouwer-Visser et al. 2018), deposited as accession GSE103512.
25.1 What you’ll learn
By the end of this chapter you will be able to:
- Download a published dataset from GEO with
getGEO(). - Convert the result into a
SummarizedExperimentand inspect its parts. - Summarize sample annotations with
table(). - Filter an expression matrix down to its most variable genes.
- Run PCA with
prcomp()and plot the samples to see how they group.
This chapter builds directly on the SummarizedExperiment chapter. If the words assay(), colData(), or “rows are genes, columns are samples” don’t feel familiar yet, read that chapter first — here we’ll lean on it rather than re-explain the object’s anatomy.
GEOquery is a Bioconductor package. Install it once with BiocManager; you won’t need to run this again:
25.2 Fetching a dataset from GEO
GEOquery has essentially one function you’ll reach for: getGEO(). You hand it a GEO accession number — the unique ID for a dataset, here "GSE103512" — and it downloads the data over the network.
[[1]] and the conversion later?
Two small historical quirks are worth a moment, because they trip up everyone the first time.
First, getGEO() returns a list of datasets, not a single one. In the early days a single accession could bundle several datasets, so the function always returns a list. That’s uncommon now, but the convention stayed because so much older data is still useful — hence the [[1]] to grab the first (and here, only) element.
Second, GEOquery was written in 2007 and hands you data in the older Bioconductor container, the ExpressionSet. We’ll convert it to the modern SummarizedExperiment in the next step so we can use the accessors you already know.
Now convert the ExpressionSet into a SummarizedExperiment with as():
library(SummarizedExperiment)
se <- as(gse, "SummarizedExperiment")
seclass: SummarizedExperiment
dim: 54715 280
metadata(3): experimentData annotation protocolData
assays(1): exprs
rownames(54715): 1007_PM_s_at 1053_PM_at ... AFFX-TrpnX-5_at
AFFX-TrpnX-M_at
rowData names(16): ID GB_ACC ... Gene.Ontology.Cellular.Component
Gene.Ontology.Molecular.Function
colnames(280): GSM2772660 GSM2772661 ... GSM2772938 GSM2772939
colData names(72): title geo_accession ... tumorsize.ch1 weight.ch1
Printing se confirms what you have: a SummarizedExperiment whose rows are genes (features) and whose columns are samples. The assay() holds the expression matrix, and colData() holds the per-sample annotations — the same three-part structure from the previous chapter.
dim(se)[1] 54715 280
So this experiment has 54715 genes measured across 280 samples. The expression values live in the assay named "exprs" (a leftover name from the ExpressionSet days), and you reach them with assay(se, "exprs").
25.3 Summarizing the sample annotations
Before any analysis, look at what the samples are. The interesting variables here are the cancer type and whether each sample is tumor or normal tissue. They live in colData(se) under the names cancer.type.ch1 and normal.ch1.
The table() function counts unique values, and with() lets you refer to those column names directly instead of typing colData(se)$... for each one:
normal.ch1
cancer.type.ch1 no yes
BC 65 10
CRC 57 12
NSCLC 60 9
PCA 60 7
Read this as a grid: each row is a cancer type, and the two columns split it into tumor (no, i.e. not normal) and normal samples. Most cancer types here have both tumor and matched normal tissue, with tumor samples outnumbering normals — exactly the design you’d want for a tumor-vs-normal comparison. A quick table like this is the cheapest possible sanity check that the dataset contains what the paper claimed.
25.4 Keeping the most variable genes
We want to see how the samples group. Most of the genes barely change from sample to sample, and those flat genes only add noise. A standard move before dimensionality reduction is to keep just the most variable genes — the ones actually carrying the signal that distinguishes samples.
PCA looks for the directions in which your samples differ most. A gene whose expression is nearly constant everywhere contributes almost nothing to those differences — it’s dead weight that dilutes the signal and slows the computation. Ranking genes by their standard deviation and keeping the top slice focuses PCA on the genes that vary, which usually makes the biological structure (tumor vs. normal, one cancer type vs. another) far easier to see. There’s no single correct cutoff; 500 genes is a reasonable starting point, and you’re encouraged to try others.
Compute each gene’s standard deviation with apply() (which applies a function across the rows or columns of a matrix — here, sd to each row, 1 meaning “by row”), then use order() to rank genes from most to least variable and keep the top 500:
order(sds, decreasing = TRUE) returns the row positions sorted by standard deviation, largest first; [1:500] takes the top 500; and we use those to subset the expression matrix. dat is now a 500-gene by 280-sample matrix.
25.5 Running and interpreting the PCA
PCA finds new axes (principal components) that capture the largest patterns of variation in the data. prcomp() expects samples as rows, so we transpose the matrix with t() before handing it over:
pca_results$x holds each sample’s coordinates along every principal component. We pull those into a data frame and attach the cancer type and tumor/normal label so we can color and shape the plot.
How much of the variation does each component capture? Each component’s variance is its squared standard deviation; divide by the total to get a fraction:

The curve drops off steeply: the first principal component (PC1) explains far more of the variation than any other, and after the first few components each one adds little. That steep drop is good news — it means a two-dimensional plot of PC1 and PC2 captures most of what’s going on, so it’s worth plotting.
library(ggplot2)
ggplot(pca_df, aes(x = PC1, y = PC2, shape = Normal, color = Type)) +
geom_point(alpha = 0.6) +
theme(text = element_text(size = 18))
In Figure 25.1 the x-axis is PC1 and the y-axis is PC2, each point is one sample, color marks the cancer type, and the shape marks tumor versus normal. The dominant pattern is cancer type: samples of the same color cluster together, so PC1 and PC2 are mostly separating tissues of origin from one another. Within each cancer type, the tumor and normal shapes tend to occupy slightly different regions, reflecting the tumor-vs-normal expression differences the original study set out to characterize. In other words, a handful of unsupervised components, computed from just 500 genes, already recover biology you’d expect: different cancers look different, and within a cancer, tumor looks different from normal.
25.6 Summary
You just reused a published dataset from start to finish. You:
- Downloaded GSE103512 from GEO with
getGEO()and grabbed the dataset with[[1]]. - Converted the legacy
ExpressionSetinto aSummarizedExperimentand inspected its dimensions and sample annotations. - Summarized the cancer-type-by-tumor/normal design with
table(). - Filtered to the 500 most variable genes, ran PCA with
prcomp(), and read the variance-explained plot. - Plotted PC1 vs. PC2 and saw the samples separate by cancer type, with tumor and normal differing within each type.
The same recipe — fetch, load into a SummarizedExperiment, filter, reduce, plot — works for thousands of other GEO datasets.
25.7 Exercises
What is the class of se, and what are the dimensions of its assay and colData? Which variables are stored in colData?
se is a SummarizedExperiment. The assay has one row per gene and one column per sample; colData has one row per sample and one column per annotation variable, including cancer.type.ch1 and normal.ch1.
Re-run the analysis keeping the top 100 genes, and again with the top 2000. Does the PC1-vs-PC2 plot change much?
The broad structure — samples separating by cancer type — is usually robust to the exact cutoff. Too few genes can lose finer distinctions; too many let low-variance noise creep back in. The stability of the big picture across cutoffs is itself reassuring.
ggpairs()
Use the GGally package’s ggpairs() to view the first few principal components together, colored by cancer type.
ggpairs() draws every pairwise combination of the chosen columns (here PC1–PC4) in a grid, so you can spot structure that PC1-vs-PC2 alone might hide — for example, a cancer type that only separates along PC3 or PC4.