11 Microbiome analysis

The human microbiome is a complex ecosystem of bacteria, viruses, fungi, and other microorganisms that live in and on the human body. These microorganisms play a crucial role in human health and disease, influencing everything from digestion and metabolism to immune function and mental health. Understanding the composition and function of the human microbiome is a rapidly growing area of research with implications for a wide range of fields, including medicine, agriculture, and environmental science.

Microbiome analysis involves studying the microbial communities that inhabit different environments, such as the human gut, skin, mouth, and soil. The analysis of microbiome data typically involves quantifying the abundance of different microorganisms in a given sample and comparing the microbial communities across different samples or conditions. This can help researchers identify patterns, relationships, and associations between microbial communities and various factors, such as health status, diet, lifestyle, and environmental conditions.

In this chapter, we will assume that the data has already been processed and cleaned, and we will focus on analyzing the microbiome data using R.

11.1 Getting started

We’ll be using the mia package for microbiome analysis in R. The mia package provides a suite of tools for analyzing microbiome data, including functions for loading, processing, and visualizing microbiome data. The package is designed to work with the SummarizedExperiment and TreeSummarizedExperiment classes, which are specialized data structures for storing microbiome data in R.

To get started, you’ll need to install the mia package from GitHub using the BiocManager package. If you don’t already have the BiocManager package installed, you can install it using the following command:

# Install the mia package
BiocManager::install("microbiome/mia")

Note

Note: The mia package is part of the Bioconductor project, but in this case, we are installing it directly from GitHub using the microbiome/mia repository. This is a common practice when working with development versions of packages or with packages that are not yet available on Bioconductor.

Once the mia package is installed, you can load it into your R session using the following command:

# Load the mia package
library(mia)

Now that the mia package is loaded, we can load the microbiome dataset that we will be working with.

11.2 Accessing micriobiome datasets

There are many publicly available microbiome datasets that you can use for analysis. Bioconductor provides several packages that contain curated microbiome datasets that facilitate access to and loading microbiome data into R for analysis.

curatedMetagenomicData is a large collection of curated human microbiome datasets, provided as TreeSE objects (Pasolli et al. 2017). The resource provides curated human microbiome data including gene families, marker abundance, marker presence, pathway abundance, pathway coverage, and relative abundance for samples from different body sites. See the package homepage for more details on data availability and access.

The microbiomeDataSets package provides a collection of curated microbiome datasets for teaching and research purposes. The package contains several example datasets that can be used to explore different aspects of microbiome analysis, such as alpha and beta diversity, differential abundance analysis, and visualization.

The mia package and several other packages provide example datasets for learning and testing the functions in the package. These datasets are typically small and easy to work with, making them ideal for learning how to analyze microbiome data in R.

The MicroBioMap package provides a collection of 170,000 microbiome samples homogeneously processed. The package is designed to facilitate the use of large-scale microbiome data for research and education purposes.

11.3 Data containers

Bioconductor provides several specialized data structures for storing and working with microbiome data in R. These data structures are designed to handle the complex and high-dimensional nature of microbiome data and provide a convenient and efficient way to store, manipulate, and analyze microbiome data in R.

SummarizedExperiment (SE) (Morgan et al. 2020) is a generic and highly optimized container for complex data structures. It has become a common choice for analyzing various types of biomedical profiling data, such as RNAseq, ChIp-Seq, microarrays, flow cytometry, proteomics, and single-cell sequencing.

TreeSummarizedExperiment (Huang 2020) was developed as an extension to incorporate hierarchical information (such as phylogenetic trees and sample hierarchies) and reference sequences.

11.4 Loading a microbiome dataset

We will be using the curatedMetagenomicData package to load the FengQ_2015 dataset. This dataset contains microbiome data from a study by Feng et al. (2015). The curatedMetagenomicData package provides a convenient interface for accessing microbiome datasets that have been pre-processed and curated for analysis.

Ignore the details of the code for now, but suffice it to say that what the code will do for us is to load a dataset for downstream analysis.

# Load the FengQ_2015 dataset
library(curatedMetagenomicData)
tse <- curatedMetagenomicData("FengQ_2015.relative_abundance", dryrun = FALSE, rownames = "short")[[1]]


$`2021-03-31.FengQ_2015.relative_abundance`
dropping rows without rowTree matches:
  k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Atopobiaceae|g__Olsenella|s__Olsenella_profusa
  k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae|g__Collinsella|s__Collinsella_stercoris
  k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae|g__Enorma|s__[Collinsella]_massiliensis
  k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Bacillales_unclassified|g__Gemella|s__Gemella_bergeri
  k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Carnobacteriaceae|g__Granulicatella|s__Granulicatella_elegans
  k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Ruminococcus|s__Ruminococcus_champanellensis
  k__Bacteria|p__Firmicutes|c__Erysipelotrichia|o__Erysipelotrichales|f__Erysipelotrichaceae|g__Bulleidia|s__Bulleidia_extructa
  k__Bacteria|p__Proteobacteria|c__Betaproteobacteria|o__Burkholderiales|f__Sutterellaceae|g__Sutterella|s__Sutterella_parvirubra
  k__Bacteria|p__Synergistetes|c__Synergistia|o__Synergistales|f__Synergistaceae|g__Cloacibacillus|s__Cloacibacillus_evryensis

This code goes out to the internet and downloads the dataset for you. The tse object is a TreeSummarizedExperiment object that contains the microbiome data from the Feng et al. (2015) study. The TreeSummarizedExperiment class is a specialized data structure for storing microbiome data in R, and it is used by the mia package for microbiome analysis.

Tip

If you are working with your own microbiome data, you will often need to load it into R using functions like read.csv() or read.table() to read tabular data files. Make sure your data is properly formatted and cleaned before loading it into R for analysis.

Once the data is loaded into R, we can start exploring the data to understand its structure and contents. The tse object is a TreeSummarizedExperiment object, which is a specialized data structure for storing microbiome data in R. It works quite like a data.frame in, but with many specialized structures for storing microbiome experiment data as shown in Figure 11.1. We can use various functions to explore the data stored in the tse object. See the TreeSummarizedExperiment package for more details.

This section provides an introduction to these data containers. In microbiome data science, these containers link taxonomic abundance tables with rich side information on the features and samples. Taxonomic abundance data can be obtained by 16S rRNA amplicon or metagenomic sequencing, phylogenetic microarrays, or by other means. Many microbiome experiments include multiple versions and types of data generated independently or derived from each other through transformation or agglomeration.

Figure 11.1: Anatomy of a TreeSummarizedExperiment object. Compared to the SingleCellExperiment objects, TreeSummarizedExperiment has five additional slots. **rowTree**: the hierarchical structure on the rows of the assays. **rowLinks**: the link information between rows of the assays and the rowTree. **colTree**: the hierarchical structure on the columns of the assays. **colLinks**: the link information between columns of the assays and the colTree. **referenceSeq** (optional): the reference sequence data per feature (row).

To get an overview of the data, we can simply type the name of the object in the R console:

# Print the object
tse

class: TreeSummarizedExperiment 
dim: 601 154 
metadata(1): agglomerated_by_rank
assays(1): relative_abundance
rownames(601): [Bacteroides] pectinophilus [Butyribacterium]
  methylotrophicum ... Weissella cibaria Weissella viridescens
rowData names(7): superkingdom phylum ... genus species
colnames(154): SID31004 SID31009 ... SID532832 SID532915
colData names(28): study_name subject_id ... ldl hba1c
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: a LinkDataFrame (601 rows)
rowTree: 1 phylo tree(s) (10430 leaves)
colLinks: NULL
colTree: NULL

Observe how the output of the tse object will show you the dimensions of the data, the metadata associated with the samples and features, etc. However, the data themselves are not printed to the console due to the large size of the dataset. The TreeSummarizedExperiment object is a complex data structure that contains multiple components, including the abundance data, sample information, feature information, and other metadata.

Next, we’ll look at some of the key components of the TreeSummarizedExperiment object and how to access and manipulate them.

11.4.1 Assay data

The microbiome is the collection of all microbes (such as bacteria, viruses, fungi, etc.) in the body. When studying these microbes, data is needed, and that’s where assays come in. An assay is a way of measuring the presence and abundance of different types of microbes in a sample. For example, if you want to know how many bacteria of a certain type are in your gut, you can use an assay to measure this. When storing assays, the original data is count-based. However, the original count-based taxonomic abundance tables may undergo different transformations, such as logarithmic, Centered Log-Ratio (CLR), or relative abundance. These are typically stored in assays.

The assays slot contains the experimental data as multiple count matrices. The result of assays is a list of matrices.

assays(tse)

List of length 1
names(1): relative_abundance

Individual assays can be accessed via assay() function. The result is a matrix.

assay(tse, "relative_abundance")[1:5, 1:5]

                                   SID31004 SID31009 SID31021 SID31030 SID31071
[Bacteroides] pectinophilus               0  0.35998  0.00000  0.00000        0
[Butyribacterium] methylotrophicum        0  0.00000  0.00000  0.00000        0
[Clostridium] hylemonae                   0  0.00000  0.00000  0.00000        0
[Clostridium] innocuum                    0  0.00371  0.00384  0.01735        0
[Clostridium] leptum                      0  0.00723  0.03135  0.03833        0

11.4.2 colData

colData contains information about the samples used in the study. This information can include details such as the sample ID, the primers used in the analysis, the barcodes associated with the sample (truncated or complete), the type of sample (e.g. soil, fecal, mock) and a description of the sample.

colData(tse)

DataFrame with 154 rows and 28 columns
           study_name  subject_id   body_site antibiotics_current_use
          <character> <character> <character>             <character>
SID31004   FengQ_2015    SID31004       stool                      no
SID31009   FengQ_2015    SID31009       stool                      no
SID31021   FengQ_2015    SID31021       stool                      no
SID31030   FengQ_2015    SID31030       stool                      no
SID31071   FengQ_2015    SID31071       stool                      no
...               ...         ...         ...                     ...
SID532796  FengQ_2015   SID532796       stool                      no
SID532802  FengQ_2015   SID532802       stool                      no
SID532826  FengQ_2015   SID532826       stool                      no
SID532832  FengQ_2015   SID532832       stool                      no
SID532915  FengQ_2015   SID532915       stool                      no
          study_condition                disease       age age_category
              <character>            <character> <integer>  <character>
SID31004              CRC CRC;fatty_liver;hype..        64        adult
SID31009          control fatty_liver;hyperten..        68       senior
SID31021          control                healthy        60        adult
SID31030          adenoma adenoma;fatty_liver;..        70       senior
SID31071          control            fatty_liver        68       senior
...                   ...                    ...       ...          ...
SID532796         control fatty_liver;hyperten..        73       senior
SID532802         control                healthy        68       senior
SID532826         control T2D;fatty_liver;hype..        78       senior
SID532832         adenoma adenoma;fatty_liver;..        68       senior
SID532915         control                healthy        43        adult
               gender     country non_westernized sequencing_platform
          <character> <character>     <character>         <character>
SID31004         male         AUT              no       IlluminaHiSeq
SID31009         male         AUT              no       IlluminaHiSeq
SID31021       female         AUT              no       IlluminaHiSeq
SID31030         male         AUT              no       IlluminaHiSeq
SID31071         male         AUT              no       IlluminaHiSeq
...               ...         ...             ...                 ...
SID532796      female         AUT              no       IlluminaHiSeq
SID532802        male         AUT              no       IlluminaHiSeq
SID532826        male         AUT              no       IlluminaHiSeq
SID532832      female         AUT              no       IlluminaHiSeq
SID532915      female         AUT              no       IlluminaHiSeq
          DNA_extraction_kit        PMID number_reads number_bases
                 <character> <character>    <integer>    <numeric>
SID31004               MoBio    25758642     40898340   3649611221
SID31009               MoBio    25758642     66107961   6196998053
SID31021               MoBio    25758642     60789126   5708593447
SID31030               MoBio    25758642     50300253   4741158330
SID31071               MoBio    25758642     51945426   4913627034
...                      ...         ...          ...          ...
SID532796              MoBio    25758642     50845712   4754848864
SID532802              MoBio    25758642     41480415   3868696049
SID532826              MoBio    25758642     35346002   3348368467
SID532832              MoBio    25758642     42184599   3951224696
SID532915              MoBio    25758642     51594677   4886998833
          minimum_read_length median_read_length      NCBI_accession
                    <integer>          <integer>         <character>
SID31004                   30                 93 ERR688505;ERR688358
SID31009                   30                 96 ERR688506;ERR688359
SID31021                   30                 96 ERR688507;ERR688360
SID31030                   30                 96 ERR688508;ERR688361
SID31071                   30                 97 ERR688509;ERR688362
...                       ...                ...                 ...
SID532796                  30                 95 ERR710428;ERR710419
SID532802                  30                 96 ERR710429;ERR710420
SID532826                  30                 97 ERR710430;ERR710421
SID532832                  30                 96 ERR710431;ERR710422
SID532915                  30                 96 ERR710432;ERR710423
                         curator       BMI        diet disease_subtype
                     <character> <numeric> <character>     <character>
SID31004  Paolo_Manghi;Marisa_..     29.35  vegetarian       carcinoma
SID31009  Paolo_Manghi;Marisa_..     32.00    omnivore              NA
SID31021  Paolo_Manghi;Marisa_..     22.10    omnivore              NA
SID31030  Paolo_Manghi;Marisa_..     34.11    omnivore advancedadenoma
SID31071  Paolo_Manghi;Marisa_..     23.45    omnivore              NA
...                          ...       ...         ...             ...
SID532796 Paolo_Manghi;Marisa_..     26.56    omnivore              NA
SID532802 Paolo_Manghi;Marisa_..     23.53    omnivore              NA
SID532826 Paolo_Manghi;Marisa_..     31.22    omnivore              NA
SID532832 Paolo_Manghi;Marisa_..     27.55    omnivore advancedadenoma
SID532915 Paolo_Manghi;Marisa_..     22.65    omnivore              NA
                  tnm triglycerides       hdl       ldl     hba1c
          <character>     <numeric> <numeric> <numeric> <numeric>
SID31004       t1n0m0           172        28        92       5.2
SID31009           NA           101        50       157        NA
SID31021           NA            53        60       122        NA
SID31030           NA            89        74       146        NA
SID31071           NA           258        40       231        NA
...               ...           ...       ...       ...       ...
SID532796          NA           100        61       114       5.4
SID532802          NA            80        67       158       5.8
SID532826          NA           212        32        90       6.9
SID532832          NA           141        51       184       5.6
SID532915          NA            51        80       132       5.1

As you can see, there is a lot of information stored in the colData slot, including sample IDs, study conditions, and other metadata associated with the samples. This information is essential for understanding the context of the microbiome data and for performing downstream analyses.

11.4.3 rowData

rowData contains data on the features of the analyzed samples. This is particularly important in the microbiome field for storing taxonomic information. This taxonomic information is extremely important for understanding the composition and diversity of the microbiome in each sample analyzed. It enables identification of the different types of microorganisms present in samples. It also allows you to explore the relationships between microbiome composition and various environmental or health factors.

head(rowData(tse))

DataFrame with 6 rows and 7 columns
                                   superkingdom      phylum            class
                                    <character> <character>      <character>
[Bacteroides] pectinophilus            Bacteria  Firmicutes       Clostridia
[Butyribacterium] methylotrophicum     Bacteria  Firmicutes       Clostridia
[Clostridium] hylemonae                Bacteria  Firmicutes       Clostridia
[Clostridium] innocuum                 Bacteria  Firmicutes Erysipelotrichia
[Clostridium] leptum                   Bacteria  Firmicutes       Clostridia
[Clostridium] methylpentosum           Bacteria  Firmicutes       Clostridia
                                                order              family
                                          <character>         <character>
[Bacteroides] pectinophilus             Eubacteriales                  NA
[Butyribacterium] methylotrophicum      Eubacteriales      Clostridiaceae
[Clostridium] hylemonae                 Eubacteriales     Lachnospiraceae
[Clostridium] innocuum             Erysipelotrichales Erysipelotrichaceae
[Clostridium] leptum                    Eubacteriales    Oscillospiraceae
[Clostridium] methylpentosum            Eubacteriales    Oscillospiraceae
                                                    genus
                                              <character>
[Bacteroides] pectinophilus                            NA
[Butyribacterium] methylotrophicum            Clostridium
[Clostridium] hylemonae                 Lachnoclostridium
[Clostridium] innocuum             Erysipelatoclostridium
[Clostridium] leptum                                   NA
[Clostridium] methylpentosum                           NA
                                                  species
                                              <character>
[Bacteroides] pectinophilus        [Bacteroides] pectin..
[Butyribacterium] methylotrophicum [Butyribacterium] me..
[Clostridium] hylemonae            [Clostridium] hylemo..
[Clostridium] innocuum             [Clostridium] innocuum
[Clostridium] leptum                 [Clostridium] leptum
[Clostridium] methylpentosum       [Clostridium] methyl..

11.4.4 rowTree

Phylogenetic trees also play an important role in the microbiome field. The TreeSE class can keep track of features and node relations via two functions, rowTree and rowLinks.

A tree can be accessed via rowTree() as phylo object.

Phylogenetic trees

The phylogenetic tree is a branching diagram or “tree” showing the inferred evolutionary relationships among various biological species or other entities based upon similarities and differences in their physical or genetic characteristics. The tree of life is a phylogenetic tree that shows the evolutionary relationships among all living organisms on Earth.

In the context of microbiome analysis, phylogenetic trees are used to represent the evolutionary relationships between different microbial taxa based on their genetic sequences. These trees can help researchers understand the diversity and relatedness of different microbes in a sample and provide insights into the evolutionary history of the microbial community.

The phylo class in R is used to represent phylogenetic trees and provides functions for working with and visualizing these trees. The ggtree package is a popular package for visualizing phylogenetic trees in R and provides a wide range of options for customizing the appearance of the tree.

rowTree(tse)


Phylogenetic tree with 10430 tips and 10429 internal nodes.

Tip labels:
  k__Archaea|p__Candidatus_Micrarchaeota|c__Candidatus_Micrarchaeota_unclassified|o__Candidatus_Micrarchaeota_unclassified|f__Candidatus_Micrarchaeota_unclassified|g__Candidatus_Micrarchaeota_unclassified|s__Candidatus_Micrarchaeota_archaeon_CG1_02_55_22, k__Archaea|p__Archaea_unclassified|c__Archaea_unclassified|o__Archaea_unclassified|f__Archaea_unclassified|g__Archaea_unclassified|s__archaeon_GW2011_AR15, k__Archaea|p__Candidatus_Diapherotrites|c__Candidatus_Diapherotrites_unclassified|o__Candidatus_Diapherotrites_unclassified|f__Candidatus_Diapherotrites_unclassified|g__Candidatus_Diapherotrites_unclassified|s__Candidatus_Diapherotrites_archaeon_CG08_land_8_20_14_0_20_34_12, k__Archaea|p__Archaea_unclassified|c__Archaea_unclassified|o__Archaea_unclassified|f__Archaea_unclassified|g__Archaea_unclassified|s__archaeon_GW2011_AR10, k__Archaea|p__Candidatus_Diapherotrites|c__Candidatus_Diapherotrites_unclassified|o__Candidatus_Diapherotrites_unclassified|f__Candidatus_Diapherotrites_unclassified|g__Candidatus_Diapherotrites_unclassified|s__Candidatus_Diapherotrites_archaeon_CG11_big_fil_rev_8_21_14_0_20_37_9, k__Archaea|p__Euryarchaeota|c__Euryarchaeota_unclassified|o__Euryarchaeota_unclassified|f__Euryarchaeota_unclassified|g__Euryarchaeota_unclassified|s__Euryarchaeota_archaeon_TMED173, ...

Rooted; includes branch lengths.

The rowTree slot contains information about the hierarchical structure of the data, such as the relationships between different microbial taxa based on their genetic sequences. This information can be used to explore the evolutionary relationships between different microbes and to visualize the diversity and relatedness of the microbial community in a sample.

We can visualize the phylogenetic tree using the ggtree package in R. The ggtree package provides functions for visualizing phylogenetic trees in a variety of formats, including circular, rectangular, and radial layouts. Here, we’ll use the ggtree package to visualize the phylogenetic tree stored in the rowTree slot of the TreeSummarizedExperiment object.

library(ggtree)
ggtree(rowTree(tse))

See the ggtree package for more details on visualizing tree structures in R and, for more fun, the ggtree book.

The rowLink slot contains information about the relationships between the features and the tree structure. This information can be used to link the features in the data to the nodes in the tree and to explore the relationships between the features based on their abundance profiles.

rowLinks(tse)

LinkDataFrame with 601 rows and 5 columns
                                                  nodeLab nodeLab_alias
                                              <character>   <character>
[Bacteroides] pectinophilus        k__Bacteria|p__Firmi..    alias_3706
[Butyribacterium] methylotrophicum k__Bacteria|p__Firmi..    alias_3347
[Clostridium] hylemonae            k__Bacteria|p__Firmi..    alias_3662
[Clostridium] innocuum             k__Bacteria|p__Firmi..    alias_4453
[Clostridium] leptum               k__Bacteria|p__Firmi..    alias_3525
...                                                   ...           ...
Veillonella sp. T11011-6           k__Bacteria|p__Firmi..    alias_3074
Veillonella tobetsuensis           k__Bacteria|p__Firmi..    alias_3070
Victivallis vadensis               k__Bacteria|p__Lenti..    alias_5062
Weissella cibaria                  k__Bacteria|p__Firmi..    alias_4775
Weissella viridescens              k__Bacteria|p__Firmi..    alias_4773
                                     nodeNum    isLeaf   whichTree
                                   <integer> <logical> <character>
[Bacteroides] pectinophilus             3706      TRUE       phylo
[Butyribacterium] methylotrophicum      3347      TRUE       phylo
[Clostridium] hylemonae                 3662      TRUE       phylo
[Clostridium] innocuum                  4453      TRUE       phylo
[Clostridium] leptum                    3525      TRUE       phylo
...                                      ...       ...         ...
Veillonella sp. T11011-6                3074      TRUE       phylo
Veillonella tobetsuensis                3070      TRUE       phylo
Victivallis vadensis                    5062      TRUE       phylo
Weissella cibaria                       4775      TRUE       phylo
Weissella viridescens                   4773      TRUE       phylo

Both rowTree and rowLinks are optional components of the TreeSummarizedExperiment object, but when present, they provide valuable information about the hierarchical structure of the data and the relationships between the features in the data.

11.5 Wrangling and subsetting

The mia package provides several functions for wrangling and subsetting microbiome data. These functions allow you to filter, transform, and manipulate the data to extract the information you need for analysis.

11.5.1 Subsetting samples

You can subset the samples in a TreeSummarizedExperiment object similarly to how you would subset a data.frame. For example, to subset out data based on the age category of the samples, you can use the following code:

head(tse$age)

[1] 64 68 60 70 68 66

Age is a numeric variable. Let’s focus on older patients.

tse_subset_by_age <- tse[, tse$age > 50]
tse_subset_by_age

class: TreeSummarizedExperiment 
dim: 601 146 
metadata(1): agglomerated_by_rank
assays(1): relative_abundance
rownames(601): [Bacteroides] pectinophilus [Butyribacterium]
  methylotrophicum ... Weissella cibaria Weissella viridescens
rowData names(7): superkingdom phylum ... genus species
colnames(146): SID31004 SID31009 ... SID532826 SID532832
colData names(28): study_name subject_id ... ldl hba1c
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: a LinkDataFrame (601 rows)
rowTree: 1 phylo tree(s) (10430 leaves)
colLinks: NULL
colTree: NULL

11.5.2 Agglomerating data

The microbiome features (organisms) that are measured may be measured at different taxonomic levels. For example, the data may be available at the species, genus, family, or phylum level. Agglomerating data is the process of summarizing the data at a higher taxonomic level by combining the abundances of the lower-level taxa. This can be useful for simplifying the data and reducing the dimensionality of the data.

The agglomerateByRank function in the mia package can be used to agglomerate the data at a specified taxonomic rank. For example, to agglomerate the data at the phylum level, you can use the following code:

new_tse <- agglomerateByRank(tse, rank = "phylum")

Warning: The following values are already present in `metadata` and will be
overwritten: 'agglomerated_by_rank'. Consider using the 'name' argument to
specify alternative names.

# observe the new object, particularly the number of features (rows)
new_tse

class: TreeSummarizedExperiment 
dim: 13 154 
metadata(1): agglomerated_by_rank
assays(1): relative_abundance
rownames(13): Actinobacteria Ascomycota ... Synergistetes
  Verrucomicrobia
rowData names(7): superkingdom phylum ... genus species
colnames(154): SID31004 SID31009 ... SID532832 SID532915
colData names(28): study_name subject_id ... ldl hba1c
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: a LinkDataFrame (13 rows)
rowTree: 1 phylo tree(s) (10430 leaves)
colLinks: NULL
colTree: NULL

The resulting TreeSummarizedExperiment object contains the data aggregated at the phylum level, with the abundances of the lower-level taxa combined to create a summary at the phylum level. This can make the data easier to work with and interpret, especially when dealing with large and complex microbiome datasets.

11.6 Community indices

In the field of microbiome ecology several indices to describe samples and community of samples are available. In this vignette we just want to give a very brief introduction.

11.6.1 Alpha diversity

diversity in microbiology is measured by several indices:

species richness (total number of species)
equitability (distribution of species within a microbiome)
diversity (combination of the two)

Functions for calculating alpha and beta diversity indices are available. Using addAlpha multiple diversity indices are calculated by default and results are stored automatically in colData.

In the code below we calculate the Shannon and observed diversity indices.

tse <- addAlpha(tse, index = "shannon", assay.type = "relative_abundance")
tse <- addAlpha(tse, index = "observed", assay.type = "relative_abundance")
colnames(colData(tse))

 [1] "study_name"              "subject_id"             
 [3] "body_site"               "antibiotics_current_use"
 [5] "study_condition"         "disease"                
 [7] "age"                     "age_category"           
 [9] "gender"                  "country"                
[11] "non_westernized"         "sequencing_platform"    
[13] "DNA_extraction_kit"      "PMID"                   
[15] "number_reads"            "number_bases"           
[17] "minimum_read_length"     "median_read_length"     
[19] "NCBI_accession"          "curator"                
[21] "BMI"                     "diet"                   
[23] "disease_subtype"         "tnm"                    
[25] "triglycerides"           "hdl"                    
[27] "ldl"                     "hba1c"                  
[29] "shannon"                 "observed"

The observed index is a simple count of the number of species present in a sample, while the shannon index takes into account both the number of species and their relative abundances. These indices provide information about the diversity and richness of the microbial community in each sample.

And we can look at the resulting colData() which now contains the calculated alpha diversity indices.

colData(tse)

DataFrame with 154 rows and 30 columns
           study_name  subject_id   body_site antibiotics_current_use
          <character> <character> <character>             <character>
SID31004   FengQ_2015    SID31004       stool                      no
SID31009   FengQ_2015    SID31009       stool                      no
SID31021   FengQ_2015    SID31021       stool                      no
SID31030   FengQ_2015    SID31030       stool                      no
SID31071   FengQ_2015    SID31071       stool                      no
...               ...         ...         ...                     ...
SID532796  FengQ_2015   SID532796       stool                      no
SID532802  FengQ_2015   SID532802       stool                      no
SID532826  FengQ_2015   SID532826       stool                      no
SID532832  FengQ_2015   SID532832       stool                      no
SID532915  FengQ_2015   SID532915       stool                      no
          study_condition                disease       age age_category
              <character>            <character> <integer>  <character>
SID31004              CRC CRC;fatty_liver;hype..        64        adult
SID31009          control fatty_liver;hyperten..        68       senior
SID31021          control                healthy        60        adult
SID31030          adenoma adenoma;fatty_liver;..        70       senior
SID31071          control            fatty_liver        68       senior
...                   ...                    ...       ...          ...
SID532796         control fatty_liver;hyperten..        73       senior
SID532802         control                healthy        68       senior
SID532826         control T2D;fatty_liver;hype..        78       senior
SID532832         adenoma adenoma;fatty_liver;..        68       senior
SID532915         control                healthy        43        adult
               gender     country non_westernized sequencing_platform
          <character> <character>     <character>         <character>
SID31004         male         AUT              no       IlluminaHiSeq
SID31009         male         AUT              no       IlluminaHiSeq
SID31021       female         AUT              no       IlluminaHiSeq
SID31030         male         AUT              no       IlluminaHiSeq
SID31071         male         AUT              no       IlluminaHiSeq
...               ...         ...             ...                 ...
SID532796      female         AUT              no       IlluminaHiSeq
SID532802        male         AUT              no       IlluminaHiSeq
SID532826        male         AUT              no       IlluminaHiSeq
SID532832      female         AUT              no       IlluminaHiSeq
SID532915      female         AUT              no       IlluminaHiSeq
          DNA_extraction_kit        PMID number_reads number_bases
                 <character> <character>    <integer>    <numeric>
SID31004               MoBio    25758642     40898340   3649611221
SID31009               MoBio    25758642     66107961   6196998053
SID31021               MoBio    25758642     60789126   5708593447
SID31030               MoBio    25758642     50300253   4741158330
SID31071               MoBio    25758642     51945426   4913627034
...                      ...         ...          ...          ...
SID532796              MoBio    25758642     50845712   4754848864
SID532802              MoBio    25758642     41480415   3868696049
SID532826              MoBio    25758642     35346002   3348368467
SID532832              MoBio    25758642     42184599   3951224696
SID532915              MoBio    25758642     51594677   4886998833
          minimum_read_length median_read_length      NCBI_accession
                    <integer>          <integer>         <character>
SID31004                   30                 93 ERR688505;ERR688358
SID31009                   30                 96 ERR688506;ERR688359
SID31021                   30                 96 ERR688507;ERR688360
SID31030                   30                 96 ERR688508;ERR688361
SID31071                   30                 97 ERR688509;ERR688362
...                       ...                ...                 ...
SID532796                  30                 95 ERR710428;ERR710419
SID532802                  30                 96 ERR710429;ERR710420
SID532826                  30                 97 ERR710430;ERR710421
SID532832                  30                 96 ERR710431;ERR710422
SID532915                  30                 96 ERR710432;ERR710423
                         curator       BMI        diet disease_subtype
                     <character> <numeric> <character>     <character>
SID31004  Paolo_Manghi;Marisa_..     29.35  vegetarian       carcinoma
SID31009  Paolo_Manghi;Marisa_..     32.00    omnivore              NA
SID31021  Paolo_Manghi;Marisa_..     22.10    omnivore              NA
SID31030  Paolo_Manghi;Marisa_..     34.11    omnivore advancedadenoma
SID31071  Paolo_Manghi;Marisa_..     23.45    omnivore              NA
...                          ...       ...         ...             ...
SID532796 Paolo_Manghi;Marisa_..     26.56    omnivore              NA
SID532802 Paolo_Manghi;Marisa_..     23.53    omnivore              NA
SID532826 Paolo_Manghi;Marisa_..     31.22    omnivore              NA
SID532832 Paolo_Manghi;Marisa_..     27.55    omnivore advancedadenoma
SID532915 Paolo_Manghi;Marisa_..     22.65    omnivore              NA
                  tnm triglycerides       hdl       ldl     hba1c   shannon
          <character>     <numeric> <numeric> <numeric> <numeric> <numeric>
SID31004       t1n0m0           172        28        92       5.2   3.28881
SID31009           NA           101        50       157        NA   3.10400
SID31021           NA            53        60       122        NA   3.37028
SID31030           NA            89        74       146        NA   2.59465
SID31071           NA           258        40       231        NA   3.14486
...               ...           ...       ...       ...       ...       ...
SID532796          NA           100        61       114       5.4   3.65802
SID532802          NA            80        67       158       5.8   3.19414
SID532826          NA           212        32        90       6.9   3.15717
SID532832          NA           141        51       184       5.6   3.21046
SID532915          NA            51        80       132       5.1   3.48644
           observed
          <numeric>
SID31004        100
SID31009        105
SID31021        115
SID31030        105
SID31071        112
...             ...
SID532796       131
SID532802       109
SID532826       101
SID532832        93
SID532915       126

The scater package provides some convenient plotting functions that are designed to work with the SummarizedExperiment class.

library(scater)

Loading required package: scuttle

Loading required package: ggplot2


Attaching package: 'scater'

The following object is masked from 'package:ggtree':

    multiplot

plotColData(tse,
    "observed",
    "study_condition",
    colour_by = "gender"
) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(y = expression(Richness[Observed]))

11.6.2 Beta diversity

Community similarity refers to the way microorganisms resemble each other in terms of their composition and abundance of different microbial taxa. This can help us understand to what degree different samples resemble each other and finding key information. In microbiome analysis however, it’s more common to measure the dissimilarity/Beta diversity between two samples A and B using the Bray-Curtis measure which is defined as follows:

\[ BC_{ij} = \frac{\sum_{k} |A_{k} - B_{k}|}{\sum_{k} (A_{k} + B_{k})} \]

where \(A_{k}\) and \(B_{k}\) are the abundances of taxon \(k\) in samples A and B, respectively. The Bray-Curtis dissimilarity ranges from 0 (identical communities) to 1 (completely different communities).

The mia package provides functions for calculating beta diversity indices, such as the Bray-Curtis dissimilarity, Jaccard similarity, and UniFrac distance. These indices can be used to compare the microbial communities between different samples and to identify patterns and relationships between samples based on their microbial composition.

# Run PCoA on relabundance assay with Bray-Curtis distances
library(vegan)

tse <- runMDS(tse,
    FUN = vegdist,
    method = "bray",
    assay.type = "relative_abundance",
    name = "MDS_bray"
)

This code is a bit to unpack. It calculates the Bray-Curtis dissimilarity between samples in the tse object using the vegdist function from the vegan package. The resulting dissimilarity matrix is then used to perform a Principal Coordinate Analysis (PCoA) using the runMDS function from the scater package. The PCoA is a dimensionality reduction technique that projects the high-dimensional Bray-Curtis dissimilarity matrix onto a lower-dimensional space while preserving the pairwise distances between samples.

The resulting TreeSummarizedExperiment object now contains the Bray-Curtis dissimilarity matrix in the MDS_bray slot, which can be used to visualize the relationships between samples based on their microbial composition.

We can visualize the results of the PCoA using the plotReducedDim function from the scater package. This function creates a plot of the reduced dimensions (PCoA) and colors the samples based on a specified variable, such as the study condition

# Create ggplot object
p <- plotReducedDim(tse, "MDS_bray", colour_by = "study_condition")

print(p)

You can experiment with different variables to color the samples and explore the relationships between samples based on their microbial composition.

11.7 Microbial composition

Let’s now look at the microbial composition of the samples in the tse object. The microbial composition refers to the relative abundances of different microbial taxa in each sample. This information can provide insights into the diversity and structure of the microbial community in each sample and help identify patterns and relationships between samples based on their microbial composition.

The mia package provides functions for visualizing the microbial composition of samples, such as bar plots, heatmaps, and stacked bar plots. These plots can help you explore the relative abundances of different microbial taxa in each sample and identify patterns and relationships between samples based on their microbial composition.

11.7.1 Abundance

We can start by creating a bar plot of the top 40 most abundant taxa in the dataset. This plot shows the relative abundances of the top 40 taxa in each sample, with the taxa sorted by abundance.

library(miaViz)

Loading required package: ggraph

plotAbundanceDensity(tse,
    layout = "jitter",
    assay.type = "relative_abundance",
    n = 40
) +
    scale_x_log10(label = scales::percent)

Warning in getTopFeatures(object, top = n, assay.type = assay.type):
'getTopFeatures' is deprecated. Use 'getTop' instead.

Warning in scale_x_log10(label = scales::percent): log-10 transformation
introduced infinite values.

We can also look at the relative abundances of specific taxa in the dataset.

plotAbundanceDensity(tse,
    layout = "density",
    assay.type = "relative_abundance",
    n = 5, colour_by = "study_condition"
) +
    scale_x_log10()

Warning in getTopFeatures(object, top = n, assay.type = assay.type):
'getTopFeatures' is deprecated. Use 'getTop' instead.

Warning in scale_x_log10(): log-10 transformation introduced infinite values.

Warning: Removed 53 rows containing non-finite outside the scale range
(`stat_density()`).

11.7.2 Prevalence

Prevalence refers to the proportion of samples in which a taxon is present. This information can help identify the most common and rare taxa in the dataset and provide insights into the distribution of different taxa across samples.

Let’s agglomerate the data at the genus level and then plot the prevalence of the top 20 most prevalent genera in the dataset.

tse_genus <- agglomerateByRank(tse, rank = "genus")

Warning: The following values are already present in `metadata` and will be
overwritten: 'agglomerated_by_rank'. Consider using the 'name' argument to
specify alternative names.

head(getPrevalence(tse_genus,
    detection = 1 / 100,
    sort = TRUE, assay.type = "relative_abundance",
    as.relative = TRUE
))

   Bifidobacterium              Dorea   Faecalibacterium Mediterraneibacter 
         0.7857143          0.7662338          0.7337662          0.7272727 
      Anaerostipes            Blautia 
         0.7272727          0.7077922

11.8 Heatmaps

Heatmaps are a common way to visualize microbiome data and can provide insights into the relative abundances of different microbial taxa in each sample. To keep things simple, we’ll create a heatmap of the relative abundances of the phyla in the dataset.

We first agglomerate the data at the phylum level and then create a heatmap of the relative abundances of the phyla in the dataset.

library(pheatmap)
tse_phylum <- agglomerateByRank(tse, rank = "phylum")

Warning: The following values are already present in `metadata` and will be
overwritten: 'agglomerated_by_rank'. Consider using the 'name' argument to
specify alternative names.

pheatmap(assay(tse_phylum, "relative_abundance"))

11.9 Further fun

The mia package folks have created a book on microbome analysis in R that you can explore for more details on microbiome analysis in R. The book covers a wide range of topics related to microbiome analysis, including data loading, preprocessing, visualization, and statistical analysis. It provides detailed explanations and examples of how to work with microbiome data in R and is a valuable resource for anyone interested in learning more about microbiome analysis.