Inventory the single-cell supplementary files of a GEO Series or Sample
Source: R/singlecell.R
geoSingleCellManifest.Rd
Lists the supplementary files attached to a GSE (or a single GSM) and classifies each by single-cell format (10x Matrix Market triplet, 10x HDF5, AnnData h5ad, loom, Seurat rds, tar archive, or other), extracting the GSM sample id where present. This lets you see what a single-cell study contains – and how 10x triplets group by sample – before downloading potentially many gigabytes.
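To make the classification step concrete, here is a minimal base-R sketch of how file names could be mapped to format labels and GSM ids. The helpers `classify_sc_file()` and `extract_gsm()`, the label strings, and the file names are all hypothetical; this is not the package's implementation, only an illustration of the idea.

```r
# Hypothetical helper (not part of the package): guess a single-cell format
# label from a supplementary file name. The label strings are made up.
classify_sc_file <- function(fname) {
  f <- tolower(fname)
  if (grepl("(matrix\\.mtx|barcodes\\.tsv|(features|genes)\\.tsv)(\\.gz)?$", f)) {
    "10x_mtx"
  } else if (grepl("\\.h5ad(\\.gz)?$", f)) {
    "h5ad"
  } else if (grepl("\\.(h5|hdf5)(\\.gz)?$", f)) {
    "10x_h5"
  } else if (grepl("\\.loom(\\.gz)?$", f)) {
    "loom"
  } else if (grepl("\\.rds(\\.gz)?$", f)) {
    "rds"
  } else if (grepl("\\.tar(\\.gz)?$", f)) {
    "tar"
  } else {
    "other"
  }
}

# Hypothetical helper: return the first GSM id found in a file name, or NA.
extract_gsm <- function(fname) {
  m <- regmatches(fname, regexpr("GSM[0-9]+", fname))
  if (length(m) == 0) NA_character_ else m
}

# Placeholder file names, not real GEO content.
files <- c(
  "GSM0000001_sampleA_matrix.mtx.gz",
  "GSM0000001_sampleA_barcodes.tsv.gz",
  "GSE0000000_RAW.tar",
  "GSE0000000_counts.h5ad"
)
formats <- vapply(files, classify_sc_file, character(1), USE.NAMES = FALSE)
gsm_ids <- vapply(files, extract_gsm, character(1), USE.NAMES = FALSE)
```

Series-level files such as the tar archive carry no GSM id, so their sample entry is NA.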
Arguments
- GEO
A GEO Series ("GSE...") or Sample ("GSM...") accession, e.g. "GSE132771" or "GSM3891612".
- samples
Optional character vector of GSM ids. For a GSE, restricts the inventory to these samples; when the series level has no loadable units this also avoids enumerating the whole series. Ignored when GEO is a GSM.
Value
A data.frame with columns fname, sample (GSM id or NA),
platform (GPL accession or NA), format, role, and
url. Zero rows if nothing is found.
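As an illustration of working with the returned data.frame, the sketch below builds a made-up manifest with the documented columns and checks which samples have a complete 10x triplet. The file names, format labels, role labels, and accessions are placeholders; the actual values produced by the function may differ.

```r
# Made-up manifest with the documented columns. File names, format labels,
# role labels, and accessions are placeholders, not real GEO content.
manifest <- data.frame(
  fname = c("GSM0000001_matrix.mtx.gz", "GSM0000001_barcodes.tsv.gz",
            "GSM0000001_features.tsv.gz", "GSM0000002_matrix.mtx.gz",
            "GSM0000002_barcodes.tsv.gz", "GSE0000000_RAW.tar"),
  sample = c(rep("GSM0000001", 3), rep("GSM0000002", 2), NA),
  platform = c(rep("GPL00001", 5), NA),
  format = c(rep("10x_mtx", 5), "tar"),
  role = c("matrix", "barcodes", "features", "matrix", "barcodes", "archive"),
  url = NA_character_,
  stringsAsFactors = FALSE
)

# Keep the 10x triplet files and list the roles present for each sample.
triplets <- manifest[manifest$format == "10x_mtx" & !is.na(manifest$sample), ]
roles_by_sample <- split(triplets$role, triplets$sample)

# Treat a sample as a complete triplet only if all three parts are present.
needed <- c("matrix", "barcodes", "features")
complete <- vapply(roles_by_sample, function(r) all(needed %in% r), logical(1))
```

Here `complete` is a named logical vector keyed by GSM id, which makes it easy to spot samples with missing triplet parts before downloading anything.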
Details
For a GSE, the series-level suppl directory is inventoried first. Many
single-cell studies ship only a GSE..._RAW.tar there, with the
loadable per-sample files (10x triplets, h5, h5ad) living in each sample's
own GSM suppl directory. When the series level yields no loadable units, the
manifest falls back to enumerating the series' samples (via
getGEO) and inventorying each GSM suppl directory. Pass a GSM
accession to inventory just that one sample.
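The fallback described above can be sketched as a small piece of control flow. Everything here is an assumption made for illustration: `build_manifest()`, its function arguments, the set of format labels treated as loadable, and the choice to keep the series-level rows alongside the per-sample rows are hypothetical and may not match the package's behaviour.

```r
# Hypothetical control flow. `inventory_suppl` and `list_samples` stand in for
# whatever performs the real directory listing and sample enumeration.
build_manifest <- function(gse, inventory_suppl, list_samples,
                           loadable = c("10x_mtx", "10x_h5", "h5ad")) {
  series <- inventory_suppl(gse)
  # If the series level already has loadable files, stop here.
  if (any(series$format %in% loadable)) {
    return(series)
  }
  # Otherwise inventory each sample's own suppl directory.
  per_sample <- lapply(list_samples(gse), inventory_suppl)
  do.call(rbind, c(list(series), per_sample))
}

# Stubbed directory listings with placeholder accessions and file names.
fake_dirs <- list(
  GSE0000000 = data.frame(fname = "GSE0000000_RAW.tar", format = "tar",
                          stringsAsFactors = FALSE),
  GSM0000001 = data.frame(fname = "GSM0000001_filtered.h5", format = "10x_h5",
                          stringsAsFactors = FALSE),
  GSM0000002 = data.frame(fname = "GSM0000002_filtered.h5", format = "10x_h5",
                          stringsAsFactors = FALSE)
)

result <- build_manifest(
  "GSE0000000",
  inventory_suppl = function(acc) fake_dirs[[acc]],
  list_samples = function(gse) c("GSM0000001", "GSM0000002")
)
```

With these stubs, the series level holds only a tar archive, so the per-sample directories are inventoried as well.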
No files are downloaded. The result feeds the single-cell readers (see ADR-0004); reading itself uses Bioconductor importers (TENxIO, anndataR) that are optional dependencies.
The platform column (GPL accession per sample) is populated when the
manifest is built from the GSM level – the common single-cell case, and the
one that matters, since a GSE can span multiple platforms (e.g. GSE132771
mixes mouse and human). It is the grouping used by
getGEOSingleCell(by = "platform"). It is NA when GEO is a single GSM accession,
and for whole-study files attached at the series level (which have no GSM).
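To show what grouping by platform amounts to, this sketch splits a made-up manifest into per-platform sample sets and sets aside the series-level files whose platform is NA. The accessions and file names are placeholders, and only the columns needed for the illustration are included.

```r
# Made-up manifest fragment; accessions and file names are placeholders.
manifest <- data.frame(
  fname = c("GSM0000001_filtered.h5", "GSM0000002_filtered.h5",
            "GSM0000003_filtered.h5", "GSE0000000_RAW.tar"),
  sample = c("GSM0000001", "GSM0000002", "GSM0000003", NA),
  platform = c("GPL00001", "GPL00002", "GPL00001", NA),
  stringsAsFactors = FALSE
)

# Rows with a known platform can be grouped; each group is one platform's samples.
per_sample <- manifest[!is.na(manifest$platform), ]
samples_by_platform <- split(per_sample$sample, per_sample$platform)

# Series-level files have no GSM and therefore no platform.
series_level <- manifest$fname[is.na(manifest$platform)]
```

Each element of `samples_by_platform` holds the GSM ids measured on one GPL, which is the kind of partition a multi-platform series needs before samples are combined.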