
A major barrier to reusing public RNA-seq data has been that every submitter processes raw reads differently — different aligners, references, and counting rules — so counts from two studies are rarely directly comparable. To address this, NCBI computes a uniform set of RNA-seq quantifications for many GEO Series using a single consistent pipeline. This article explains what those are and how GEOquery retrieves them.

NCBI-computed quantifications vs. submitter files

There are two distinct things you might mean by “the counts” for a GEO RNA-seq study:

  1. Submitter-provided processed files — whatever the authors uploaded as supplementary files. Formats and gene models vary by study; GEOquery lists and downloads them (getGEOSuppFiles()) but does not interpret them.
  2. NCBI-computed quantifications — a raw-count matrix (and accompanying annotation) that NCBI generated uniformly from the study’s SRA runs. These are comparable across studies and are what the GEOquery RNA-seq functions target.

Not every Series has NCBI quantifications; you can check first.

Checking availability and loading

library(GEOquery)

# Does this Series have NCBI-computed quantifications?
hasRNASeqQuantifications("GSE164073")

# Load them as a SummarizedExperiment
se <- getRNASeqData("GSE164073")
se

The result is a SummarizedExperiment: assay(se) is the raw count matrix, rowData(se) carries gene annotation, and colData(se) the sample metadata. Because it is a SummarizedExperiment, it drops straight into the differential expression workflows described in the downstream analysis article.
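The three accessors named above can be tried without a network connection on a small hand-built object. The sketch below is illustrative only: the gene names, sample names, and the `group` column are invented, and a real object returned by getRNASeqData() will carry different annotation and metadata columns.

```r
library(SummarizedExperiment)

# Toy raw counts: 3 genes x 2 samples (values invented for illustration).
counts <- matrix(
  c(10L, 0L, 25L,   # sample s1
    3L,  0L, 12L),  # sample s2
  nrow = 3,
  dimnames = list(c("g1", "g2", "g3"), c("s1", "s2"))
)

toy <- SummarizedExperiment(
  assays  = list(counts = counts),
  rowData = S4Vectors::DataFrame(symbol = c("A", "B", "C")),
  colData = S4Vectors::DataFrame(group = c("ctl", "trt"),
                                 row.names = c("s1", "s2"))
)

# assay() returns the count matrix; colSums() gives per-sample library sizes.
lib_sizes <- colSums(assay(toy))

# Genes and samples are subset together, so counts and metadata stay in sync.
keep_genes <- rowSums(assay(toy)) > 0   # drop genes with no reads at all
sub <- toy[keep_genes, toy$group == "trt"]

dim(sub)
rowData(sub)
colData(sub)
```

Subsetting the object rather than the bare matrix is the main reason to keep the data in this container: the annotation and sample metadata follow the counts automatically.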

Why raw counts (and what to do with them)

The NCBI quantifications are deliberately raw counts, not normalized values, because the correct normalization depends on your downstream method. Count-based differential expression tools expect raw counts and model the counting noise themselves, so normalizing beforehand would discard information they rely on.

See downstream analysis for a worked sketch.
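One way to see why raw counts are the safer starting point: a library-size normalization such as counts per million (CPM) can always be derived from raw counts, but it erases sequencing depth, which count-based models need to judge how noisy each value is. The base-R sketch below uses a made-up matrix (gene and sample names are invented) in which one sample is sequenced twice as deeply as the other.

```r
# Toy raw counts: sample s1 has twice the sequencing depth of s2.
counts <- matrix(
  c(100, 300, 600,   # s1, library size 1000
    50,  150, 300),  # s2, library size 500
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2"))
)

# Library size is the total number of counted reads per sample.
lib_size <- colSums(counts)

# Counts per million: divide each column by its library size, scale by 1e6.
cpm <- sweep(counts, 2, lib_size, "/") * 1e6

cpm
lib_size
```

After normalization the two columns are indistinguishable, even though the s2 values rest on half as many reads. Starting from raw counts keeps that depth information available to whichever method you choose.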

When you need the submitter’s files instead

If a study has no NCBI quantifications, or you specifically need the authors’ processed output (e.g. TPM, or a custom gene model), fall back to the supplementary files and read them yourself:

getGEOSuppFiles("GSE164073", fetch_files = FALSE)   # inspect first
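Once a supplementary file is on disk, reading it is ordinary R rather than anything GEOquery-specific. As a minimal sketch, assuming the file is a tab-delimited table with gene identifiers in the first column and one column per sample (many submitter files are not laid out this way, so inspect yours first), a small helper might look like the following. `read_count_table` is a hypothetical name, not a GEOquery function, and the demo file stands in for a real download.

```r
# Hypothetical helper (not part of GEOquery): read a tab-delimited table whose
# first column holds gene identifiers and return a numeric matrix.
read_count_table <- function(path) {
  tab <- read.delim(path, check.names = FALSE, stringsAsFactors = FALSE)
  mat <- as.matrix(tab[, -1, drop = FALSE])  # every column except the IDs
  rownames(mat) <- tab[[1]]                  # gene identifiers as row names
  storage.mode(mat) <- "numeric"
  mat
}

# Stand-in for a downloaded supplementary file (contents invented).
demo_file <- tempfile(fileext = ".tsv")
writeLines(
  c("gene_id\tGSM1\tGSM2",
    "geneA\t10\t3",
    "geneB\t0\t7"),
  demo_file
)

mat <- read_count_table(demo_file)
dim(mat)
mat
```

Whether the values in such a matrix are raw counts, TPM, or something else depends entirely on what the authors uploaded, so check the Series description before feeding them to a count-based method.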