23 Accessing and working with public omics data

Author

Affiliation

Sean Davis

University of Colorado
Anschutz School of Medicine

Published

June 1, 2024

Modified

June 12, 2025

23.1 Background

The data we are going to access are from this paper.

Background: The tumor microenvironment is an important factor in cancer immunotherapy response. To further understand how a tumor affects the local immune system, we analyzed immune gene expression differences between matching normal and tumor tissue.Methods: We analyzed public and new gene expression data from solid cancers and isolated immune cell populations. We also determined the correlation between CD8, FoxP3 IHC, and our gene signatures.Results: We observed that regulatory T cells (Tregs) were one of the main drivers of immune gene expression differences between normal and tumor tissue. A tumor-specific CD8 signature was slightly lower in tumor tissue compared with normal of most (12 of 16) cancers, whereas a Treg signature was higher in tumor tissue of all cancers except liver. Clustering by Treg signature found two groups in colorectal cancer datasets. The high Treg cluster had more samples that were consensus molecular subtype 1/4, right-sided, and microsatellite-instable, compared with the low Treg cluster. Finally, we found that the correlation between signature and IHC was low in our small dataset, but samples in the high Treg cluster had significantly more CD8+ and FoxP3+ cells compared with the low Treg cluster.Conclusions: Treg gene expression is highly indicative of the overall tumor immune environment.Impact: In comparison with the consensus molecular subtype and microsatellite status, the Treg signature identifies more colorectal tumors with high immune activation that may benefit from cancer immunotherapy.

In this little exercise, we will:

Access public omics data using the GEOquery package
Get an opportunity to work with another SummarizedExperiment object.
Perform a simple unsupervised analysis to visualize these public data.

23.2 GEOquery to PCA

The first step is to install the R package GEOquery. This package allows us to access data from the Gene Expression Omnibus (GEO) database. GEO is a public repository of omics data.

BiocManager::install("GEOquery")

GEOquery has only one commonly used function, getGEO() which takes a GEO accession number as an argument. The GEO accession number is a unique identifier for a dataset.

Use the GEOquery package to fetch data about GSE103512.

library(GEOquery)
gse = getGEO("GSE103512")[[1]]

You might ask why we are using [[1]] at the end of the getGEO() function. The reason is that getGEO() returns a list of GSE objects. We are only interested in the first one (and in this case, the only one). We return a list of GSE objects because in the early days, it was not unusual to have a single GEO accession number represent multiple datasets. While uncommon now, we’ve kept the convention since lots of “older” data is still quite useful.

Again, a historically-derived detail, is to convert from the older Bioconductor data structure (GEOquery was written in 2007), the ExpressionSet, to the newer SummarizedExperiment.

library(SummarizedExperiment)
se = as(gse, "SummarizedExperiment")

Use some code to determine the answers to the following:

What is the class of se?
What are the dimensions of se?
What are the dimensions of the assay slot of se?
What are the dimensions of the colData slot of se?
What variables are in the colData slot of se?

Examine two variables of interest, cancer type and tumor/normal status. The with function is a convenience to allow us to access variables in a data frame by name (rather than having to do dataframe$variable_name. Recalling that the table function is a convenient way to summarize the counts of unique values in a vector, we can use with to access the variables of interest and table to summarize the counts of unique values.

with(colData(se),table(`cancer.type.ch1`,`normal.ch1`))

               normal.ch1
cancer.type.ch1 no yes
          BC    65  10
          CRC   57  12
          NSCLC 60   9
          PCA   60   7

How many samples are there of each cancer type?
How many samples are there of each tumor/normal status?

When performing unsupervised analysis, it is common to filter genes by variance to find the most informative genes. It is common practice to filter genes by standard deviation or some other measure of variability and keep the top X percent of them when performing dimensionality reduction. There is not a single right answer to what percentage to use, so try a few to see what happens. In the example code, I chose to use the top 500 genes by standard deviation, but you can play with the threshold to see what happens.

Recall that the assay function is used to access the data matrix of the SummarizedExperiment object.

Think through the code below and then run it.

sds = apply(assay(se, 'exprs'),1,sd)
dat = assay(se, 'exprs')[order(sds,decreasing = TRUE)[1:500],]

If you don’t recognize the function apply, it is a function that applies a function to each row or column of a matrix. In this case, we are applying the sd function to each row of the data matrix. The order function is used to sort the standard deviations in decreasing order (when decreasing=TRUE). And the [1:500] is used to subset the data matrix to the top 500 genes by standard deviation.

Perform PCA and prepare for plotting. We will be using ggplot2, so we need to make a data.frame before plotting.

pca_results <- prcomp(t(dat))
pca_df = as.data.frame(pca_results$x)
pca_df$Type=factor(colData(se)[,'cancer.type.ch1'])
pca_df$Normal = factor(colData(se)[,'normal.ch1'])

We can take a look at the variance explained by each principal component as a plot.

library(ggplot2)
pca_var <- pca_results$sdev^2 / sum(pca_results $sdev^2)
plot(pca_var, xlab = "Principal Component", ylab = "Variance Explained", 
     main = "Variance Explained by Principal Components", type = "b")

Now, we are going to plot the results of the PCA, coloring the points by cancer type and using different shapes for normal and tumor samples.

library(ggplot2)
ggplot(pca_df, aes(x=PC1,y=PC2,shape=Normal,color=Type)) + 
    geom_point( alpha=0.6) + theme(text=element_text(size = 18))

In this case, the x-axis is the first principal component and the y-axis is the second principal component.

What do you see?
What about additional principal components?
Bonus: Try using the GGally package to plot principal components (using the ggpairs function).
Bonus: Calculate the variance explained by each principal component and plot the results.