The following script shows basic analyses of 16S rRNA amplicon
sequencing data.If needed RMarkdown file can be downloaded here.
Script includes some pre-processing steps, as well as calculations,
statistical analyses and visualizations. Raw paired-end read files were
used as input to generate de novo Amplicon Sequence Variants (ASVs) with
NG-Tax
2.0.
This sample dataset comes from a project aimed at studying the role of microbial communities on micropollutant removal in bark-biochar vertical subsurface flow constructed wetlands. Since conventional wastewater treatment plants are inefficient in removing micropollutants, there is a need for post-treatment methods that are feasible and have high removal efficiency. One of such treatment methods might be constructed wetlands (CW), which are man-made structures built to imitate the mechanisms of natural wetland systems, such as shallow ponds, mangroves, trenches, etc. In CW, a basin is filled with the solid material on which microorganisms can form biofilms. Subsequently, natural processes, such as biodegradation, sorption and plant uptake, result in the removal of micropollutants.
In this project CW basins were filled with a following solid filling material:
Sand (group CW1-sand) - which is conventional filling
Combination of sand and bark-biochar (built with an innovative layer of a combination of bark and biochar in between layers of sand) (CW2-biochar)
Layer of bark-biochar only (CW3-biochar)
The water was distributed over the top of the CW and subsequently kept in the system for 8 h. One of the aims of the study was to identify the microorganisms that are present in 3 types of constructed wetlands with different solid filling material over the depth of the constructed wetland. Samples were taken from 3 CW from the following depths:
30 cm
50 cm
70 cm
90 cm
All constructed wetlands were divided into 2 parts (A and D) and samples were taken from both locations. Additionally, samples were taken in winter and summer, since the temperature and other environmental factors might have the impact on the composition of the wetlands microbiome and therefore their functional properties.
knitr::opts_chunk$set(echo = TRUE, include=TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra = "")
library(rentrez)
library(taxize)
library(phyloseq)
library(microbiome)
library(vegan)
library(microViz)
library(ggtree)
library(tibble)
library(tidyr)
library(ggplot2)
library(ggpubr)
library(pals)
library(scales)
library(RColorBrewer)
library(forcats)
library(ggsci)
library(kableExtra)
library(rstatix)
library(here)
library(Maaslin2)
library(GGally)
library(patchwork)
library(ranacapa)
library(ComplexHeatmap)
library(ape)
library(Biostrings)
library(phylotools)
library(biomeUtils)
library(MicrobiomeStat)
library(dplyr)
library(plyr)
library(stringr)
library(ggVennDiagram)
Remember to indicate the directory where all your files will be saved. The working directory should contain the biom file as exported by NG-Tax 2.0 and the metadata file. The metadata file should include descriptive information about your experiment, such as collection date, associated environment or location, used to contextualize and analyze data.
Below an example is given. This code will set the working directory to the path where the project is located.
here::i_am("basic_script.Rmd")
script_path <- here::here()
cat("The project root is located at:", script_path, "\n")
## The project root is located at: C:/Users/malin009/OneDrive - Wageningen University & Research/Dry_lab_intro/tutorials/16S/wetlands
setwd(script_path)
Phyloseq objects are commonly used for analyzing microbiome data in R. The phyloseq object can contain the following objects:
otu_table(): a table that contains in the columns the sample_ID and the rows the OTUs/ASVs, the OTU/ASV abundance (count) table per sample
tax_table():A table that presents the taxonomic assignment of each OTU
sample_data(): A table that contains the metadata of the samples
phy_tree(): The phylogenetic tree representing the evolutionary relationships between the microbial taxa in the dataset
refseq(): DNA string set - OTUs/ASVs sequences
The biom file created by NG-Tax 2.0 is composed only of otu_table and tax_table. Metadata and taxonomy tree need to be attached separately. In the metadata file, sample identifiers need to be the same as in the mapping file for NG-Tax 2.0. Additionally, if the composition of reference mocks 3 and 4 needs to be added manually, add 2 samples in the metadata file, namely refmock3 and refmock4. Only sample identifiers need to be given. All other columns can be blank.
The taxonomy tree is stored in the NG-Tax output under the name “all_otus.tree” and all sequences used to generate the tree are stored in the file “all_otus.fasta”.
ps <- read_phyloseq(otu.file = "wetlands_filling_tmock.biom",
taxonomy.file = NULL,
metadata.file = "metadata_UNLOCK_wetlands.csv",
type = "biom")
treefile <- read.tree("all_otus_mock.tree")
ps <- merge_phyloseq(ps, treefile)
print(ps)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 4074 taxa and 69 samples ]
## sample_data() Sample Data: [ 69 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 4074 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 4074 tips and 4073 internal nodes ]
ps_sum<- summarize_phyloseq(ps)
data.frame(do.call(rbind, ps_sum[c(1:10)]))
## do.call.rbind..ps_sum.c.1.10...
## 1 1] Min. number of reads = 14
## 2 2] Max. number of reads = 917500
## 3 3] Total number of reads = 3018962
## 4 4] Average number of reads = 43753.0724637681
## 5 5] Median number of reads = 17925
## 6 7] Sparsity = 0.929852084267145
## 7 6] Any OTU sum to 1 or less? NO
## 8 8] Number of singletons = 0
## 9 9] Percent of OTUs that are singletons \n (i.e. exactly one read detected across all samples)0
## 10 10] Number of sample variables are: 7
Here, you can check if your phyloseq object includes all the necessary data. Plus, the summary prints basic information of the data (reads, sparsity, singletons).
This code will return first 10 rows of otu table.
head(ps@otu_table, n=10) %>% kbl() %>%
kableExtra::kable_paper(font_size = 12) %>%
scroll_box(width = "800px", height = "300px")
1_CW1_A_D30_1 | 1_CW1_A_D50_1 | 1_CW1_A_D50_2 | 1_CW1_A_D70_1 | 1_CW1_A_D70_2 | 1_CW1_A_D90_1 | 1_CW1_D_D30_1 | 1_CW1_D_D50_1 | 1_CW2_A_D50_2 | 1_CW2_A_D70_1 | 1_CW2_A_D93_1 | 1_CW2_A_D93_2 | 1_CW2_D_D50_1 | 1_CW2_D_D70_1 | 1_CW2_D_D93_1 | 1_CW3_A_D50_1 | 1_CW3_A_D50_2 | 1_CW3_A_D70_1 | 1_CW3_A_D70_2 | 1_CW3_A_D93_1 | 1_CW3_A_D93_2 | 1_CW3_D_D50_1 | 1_CW3_D_D70_1 | 1_CW3_D_D93_1 | 1_CW3_D_D93_2 | 1_CW3_D_D93_3 | 1_CW3_D50_1 | 1_CW3_D70_1 | 1_CWI_D_D70_1 | 1_CWI_D_D70_2 | 1_CWI_D_D90_1 | 1_CWI_D_D90_2 | 2_CW1_A_50_1 | 2_CW1_A_50_2 | 2_CW1_A_90_1 | 2_CW1_A_90_2 | 2_CW1_A70_1 | 2_CW1_D_50_1 | 2_CW1_D_50_2 | 2_CW1_D_70_1 | 2_CW1_D90_1 | 2_CW2_A_50_1 | 2_CW2_A_50_2 | 2_CW2_A_70_1 | 2_CW2_A_90_1 | 2_CW2_D_50_1 | 2_CW2_D_70_1 | 2_CW2_D_70_2 | 2_CW2_D50_1 | 2_CW3_A_50_1 | 2_CW3_A_50_2 | 2_CW3_A_70_1 | 2_CW3_A_70_2 | 2_CW3_A_90_1 | 2_CW3_A_90R_1 | 2_CW3_D_50_1 | 2_CW3_D_50_2 | 2_CW3_D_70_1 | 2_CW3_D_90_1 | 2_CW3_D_90_2 | Mock3 | Mock4 | Water_1 | Water_2 | Water_3 | Water_4 | Water_5 | refmock3 | refmock4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1632941857 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 70 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1632942152 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 23 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1632942190 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 83 | 0 | 57 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 46 | 21 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1632941008 | 0 | 0 | 0 | 0 | 0 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1632942270 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 35 | 21 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1632941052 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
163294987 | 0 | 0 | 0 | 0 | 0 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
163294896 | 0 | 0 | 0 | 0 | 0 | 133 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 39 | 0 | 0 | 60 | 0 | 57 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1632942641 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1632942416 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 52 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The samples are in columns and ASVs are in rows. The values in the table indicate the absolute abundance (counts) of reads per ASV per sample.
This code will return first 10 rows of taxonomy table.
head(ps@tax_table, n=10) %>% kbl() %>%
kableExtra::kable_paper(font_size = 12) %>%
scroll_box(width = "800px", height = "300px")
Domain | Phylum | Class | Order | Family | Genus | Species | |
---|---|---|---|---|---|---|---|
1632941857 | NA | p__ | c__ | o__ | f__ | g__ | s__ |
1632942152 | NA | p__ | c__ | o__ | f__ | g__ | s__ |
1632942190 | NA | p__ | c__ | o__ | f__ | g__ | s__ |
1632941008 | NA | p__ | c__ | o__ | f__ | g__ | s__ |
1632942270 | NA | p__ | c__ | o__ | f__ | g__ | s__ |
1632941052 | NA | p__ | c__ | o__ | f__ | g__ | s__ |
163294987 | NA | p__ | c__ | o__ | f__ | g__ | s__ |
163294896 | NA | p__ | c__ | o__ | f__ | g__ | s__ |
1632942641 | NA | p__ | c__ | o__ | f__ | g__ | s__ |
1632942416 | k__Bacteria | p__Proteobacteria | c__Alphaproteobacteria | o__Rickettsiales | f__Mitochondria | g__ | s__ |
We can see that many ASVs are not assigned to any taxonomy (“NA”). What is more, in some cases there is missing genus (“g__”) or family (“f__”) level taxonomy. This can happen for several reasons, such as lack of reference sequences in the database or low sequence similarity. When there are many mismatches with a sequence in the database or the sequences cannot be distinguished with high confidence, NG-Tax does not assign taxonomy or classifies one rank lower to ensure correct taxonomic assignment.
Domain | Phylum | Class | Order | Family | Genus | Species |
---|---|---|---|---|---|---|
k__Bacteria | p__Proteobacteria | c__Alphaproteobacteria | o__Rickettsiales | f__Mitochondria | g__ | s__ |
k__Bacteria | p__Proteobacteria | c__Alphaproteobacteria | o__Rickettsiales | f__ | g__ | s__ |
k__Bacteria | p__Proteobacteria | c__Alphaproteobacteria | o__Rickettsiales | f__Mitochondria | g__ | s__ |
k__Bacteria | p__Proteobacteria | c__Alphaproteobacteria | o__Rickettsiales | f__Mitochondria | g__ | s__ |
k__Bacteria | p__Proteobacteria | c__Alphaproteobacteria | o__Rickettsiales | f__Mitochondria | g__ | s__ |
k__Bacteria | p__Proteobacteria | c__Alphaproteobacteria | o__Rickettsiales | f__Mitochondria | g__ | s__ |
k__Bacteria | p__Proteobacteria | c__Alphaproteobacteria | o__Rickettsiales | f__ | g__ | s__ |
k__Bacteria | p__Proteobacteria | c__Alphaproteobacteria | o__Rickettsiales | f__Mitochondria | g__ | s__ |
k__Bacteria | p__Cyanobacteria | c__Cyanobacteriia | o__Chloroplast | f__ | g__ | s__ |
k__Bacteria | p__Bacteroidota | c__Bacteroidia | o__Bacteroidales | f__ | g__ | s__ |
It is good to clean a bit the taxonomy table before you proceed with the analysis and for example change the names of ASVs in such a way that the highest known taxonomy name is always shown (for example when Family is known, but for Genus no annotation was found). This way each ASV will keep as much of individuality as possible. It can be done manually or using the tax_fix function from microViz package (see Preprocessing Steps below).
To inspect the phylogenetic tree attached to the phyloseq object and generated from the ASV sequences, the following code can be used:
ggtree(ps@phy_tree, options(ignore.negative.edge=TRUE))
Metadata information can also be inspected as shown below. The code will print the first 10 rows of the sample data attached to the phyloseq object.
metadata<- ps %>% meta()
head(metadata, n=10) %>% kbl() %>%
kableExtra::kable_paper(font_size = 12) %>%
scroll_box(width = "800px", height = "300px")
sample.description | sample.name | observation.unit.identifier | collection.date | season | location | depth | |
---|---|---|---|---|---|---|---|
1_CW1_A_D30_1 | Lib1_CW1-sand_A_D30_30_Jan_23_Sandfilled | 1_Lib1_CW1-sand_A_D30 | CW1-sand | 1/30/2023 | winter | A | 30 |
1_CW1_A_D50_1 | Lib1_CW1-sand_A_D50_30_Jan_23_Sandfilled | 2_Lib1_CW1-sand_A_D50 | CW1-sand | 1/30/2023 | winter | A | 50 |
1_CW1_A_D50_2 | Lib1_CW1-sand_A_D50_30_Jan_23_Sandfilled | 3_Lib1_CW1-sand_A_D50 | CW1-sand | 1/30/2023 | winter | A | 50 |
1_CW1_A_D70_1 | Lib1_CW1-sand_A_D70_30_Jan_23_Sandfilled | 4_Lib1_CW1-sand_A_D70 | CW1-sand | 1/30/2023 | winter | A | 70 |
1_CW1_A_D70_2 | Lib1_CW1-sand_A_D70_30_Jan_23_Sandfilled | 5_Lib1_CW1-sand_A_D70 | CW1-sand | 1/30/2023 | winter | A | 70 |
1_CW1_A_D90_1 | Lib1_CW1-sand_A_D90_30_Jan_23_Sandfilled | 6_Lib1_CW1-sand_A_D90 | CW1-sand | 1/30/2023 | winter | A | 90 |
1_CW1_D_D30_1 | Lib1_CW1-sand_D_D30_30_Jan_23_B/B1filled | 7_Lib1_CW1-sand_D_D30 | CW1-sand | 1/30/2023 | winter | D | 30 |
1_CW1_D_D50_1 | Lib1_CW1-sand_D_D50_30_Jan_23_B/B1filled | 8_Lib1_CW1-sand_D_D50 | CW1-sand | 1/30/2023 | winter | D | 50 |
1_CW2_A_D50_2 | Lib1_CW2-biochar_A_D50_30_Jan_23_B/B1filled | 9_Lib1_CW2-biochar_A_D50 | CW2-biochar | 1/30/2023 | winter | A | 50 |
1_CW2_A_D70_1 | Lib1_CW2-biochar_A_D70_30_Jan_23_B/B1filled | 10_Lib1_CW2-biochar_A_D70 | CW2-biochar | 1/30/2023 | winter | A | 70 |
In the process of 16S gene amplification some eukaryotic sequences, like chloroplasts or mitochondria, are likely to be amplified and should be identified and discarded (Hanshew et al., 2013). The code below shows the number of ASVs annotated as “o__Chloroplast” and “f__Mitochondria” and calculates the fraction of these ASVs. Before filtering out contaminant sequences, it is important to know their proportion to have a better overview of the data.
Chloroplast <- ntaxa(subset_taxa(ps, Order == "o__Chloroplast"))
Mitochondria <- ntaxa(subset_taxa(ps, Family == "f__Mitochondria"))
chl_mt_df<- data.frame(Taxa = c("Chloroplast", "Mitochondria"),
nASVs = c(Chloroplast, Mitochondria)) %>%
mutate(fraction = round(nASVs/ntaxa(ps), 3))
chl_mt_df
## Taxa nASVs fraction
## 1 Chloroplast 48 0.012
## 2 Mitochondria 8 0.002
The code below will discard ASVs classified as Mitochondria and Chloroplasts.
ps_filt<- subset_taxa(ps, Order != "o__Chloroplast")
ps_filt<- subset_taxa(ps_filt, Family != "f__Mitochondria")
The primers used for 16S rRNA amplification are designed to target both Bacteria and Archaea and therefore, it’s expected to find both in the dataset. Additionally, in the taxonomy table there might be other ASVs that are not annotated to any known Domain (“NA”). The table below shows the number of ASVs and their fraction classified at the Domain level, including ASVs of unknown Domain.
ps_filt@tax_table %>% data.frame() %>% group_by(Domain) %>% dplyr::count() %>%
mutate(fraction = round(n/ntaxa(ps_filt), 3))
## # A tibble: 3 × 3
## # Groups: Domain [3]
## Domain n fraction
## <chr> <int> <dbl>
## 1 NA 55 0.014
## 2 k__Archaea 47 0.012
## 3 k__Bacteria 3916 0.975
There are 55 ASVs (1.4%) not assigned to any Domain and 47 ASVs (1.2%) assigned to Archaea. You may check the ASVs sequences for unassigned taxa by searching in the “all_otus.fasta” and then Blast them to make sure these are not bacteria. The code below parses the sequences of unassigned ASVs into a separate file that can be uploaded into BLAST.
all_asvs <- readDNAStringSet("all_otus.fasta")
unassigned_asvs <- taxa_names(subset_taxa(ps, Domain == "NA") %>% tax_table())
all_asvs_df <- as.data.frame(all_asvs)
unassigned_asvs <- all_asvs_df %>% filter(row.names(all_asvs_df) %in% unassigned_asvs) %>%
rownames_to_column(var = "seq.name") %>% `colnames<-`(c("seq.name", "seq.text"))
dat2fasta(unassigned_asvs, outfile = "out.fasta")
## out.fasta has been saved to C:/Users/malin009/OneDrive - Wageningen University & Research/Dry_lab_intro/tutorials/16S/wetlands
The blast is done against “core nucleotide database” and we advise to
limit maximum target sequences to 10, this way it is easier to go
through all the data. To upload the result in R, a result of blast in
csv format should be downloaded (here: “out.csv”). It is not possible to
download result with taxonomy description, therefore packages
rentrez
and taxize
will be used to retrieve
this information. They use API to get the data from ncbi website.
blast_df <- read.csv("out.csv", header = FALSE)
colnames(blast_df) <- c("ASV", "sseqid", "per.ident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")
sseqid<- unique(blast_df$sseqid)
taxa<- genbank2uid(sseqid)
taxa_df<- ldply(taxa) %>% dplyr::rename(id = V1) %>% mutate(GenBank = sseqid, name = sapply(taxa, function(x) attr(x, "name")))
tax_ids<-unique(taxa_df$id)
tax_summary<- entrez_summary(tax_ids, db = "taxonomy")
entrez_summary<-ldply(tax_summary, data.frame)
entrez_summary_sel<-entrez_summary %>%select(taxid, division, scientificname, genus, species, genbankdivision) %>% mutate(taxid = as.character(taxid))
blast_result<- blast_df %>% left_join(taxa_df, by=c("sseqid" = "GenBank")) %>% left_join(entrez_summary_sel, by=c("id" = "taxid"))
In order to decide whether each ASV comes from bacteria or not we will leave only the top result for each ASV (the one that has the lowest E value, which is the number of expected hits of similar quality (score) that could be found just by chance). Information about classification of taxa is stored as “division”. To simplify the output we will add a new column which will inform whether a sequence comes from bacteria, archae or eukaryota. All remaining ones are “unclassified sequences”. First we need to check what unique “divisions” are in our output and decide about new, simplified grouping (Be careful - this part of code is specific to the output).
unique(blast_result$division)
## [1] "bacteria" "d-proteobacteria"
## [3] "ascomycete fungi" "unclassified sequences"
## [5] "diatoms" "proteobacteria"
## [7] "green algae" "eukaryotes"
## [9] "brown algae" "archaea"
## [11] "euryarchaeotes" "firmicutes"
## [13] "flowering plants" "high G+C Gram-positive bacteria"
## [15] "CFB group bacteria"
Bacteria<- c("bacteria", "d-proteobacteria", "proteobacteria", "firmicutes", "high G+C Gram-positive bacteria", "CFB group bacteria")
Eukaryota<- c("ascomycete fungi", "diatoms", "green algae", "eukaryotes", "brown algae", "flowering plants")
Archaea<- c("archaea", "euryarchaeotes")
blast_result<- blast_result %>% mutate(Domain = case_when(division %in% Bacteria ~ "Bacteria",
division %in% Eukaryota ~ "Eukaryota",
division %in% Archaea ~ "Archaea",
TRUE ~ division))
blast_annot<- blast_result %>% group_by(ASV) %>% slice_head(n=1)%>% select(ASV, Domain)
Bacteria_ASVs<- blast_annot %>% filter(Domain == "Bacteria") %>% pull(ASV)
Archaea_ASVs<- blast_annot %>% filter(Domain == "Archaea") %>% pull(ASV)
Eukaryota_ASVs<- blast_annot %>% filter(Domain == "Eukaryota") %>% pull(ASV)
unclassified_prokaryote_ASVs<- blast_annot %>% filter(Domain == "unclassified sequences") %>% pull(ASV)
Only ASVs annotated in blast output as bacteria will be kept in the final phyloseq object. They account for 54.5454545 % of all blasted sequences.
The code below shows how to manually change annotations in the tax_table so all ASVs that are Bacteria but not annotated with SILVA will get the Bacteria domain name. For all the other taxonomy levels the names will remain to be NA. After performing this manual annotation, we will check the the number of ASVs in each Domain.
ps_filt<-ps_filt %>% tax_mutate(ASV = taxa_names(ps_filt)) %>%
tax_mutate(Domain = case_when(ASV %in% Bacteria_ASVs ~ "k__Bacteria",
ASV %in% Archaea_ASVs ~ "k__Archaea",
ASV %in% Eukaryota_ASVs ~ "k__Eukaryota",
ASV %in% unclassified_prokaryote_ASVs ~ "k__unclassified_prokaryota",
TRUE ~ Domain))
ps_filt@tax_table %>% data.frame() %>% group_by(Domain) %>% dplyr::count() %>%
mutate(fraction = round(n/ntaxa(ps_filt), 3))
## # A tibble: 4 × 3
## # Groups: Domain [4]
## Domain n fraction
## <chr> <int> <dbl>
## 1 k__Archaea 49 0.012
## 2 k__Bacteria 3946 0.982
## 3 k__Eukaryota 16 0.004
## 4 k__unclassified_prokaryota 7 0.002
Depending on the aim of the study, ASVs belonging to both Archaea and Bacteria might be of interest. If the project focus is on bacteria only, all ASVs that are not bacteria should be filtered out. It is important to keep all taxa that are bacteria, otherwise it can disturb the relative abundance of each taxon. For the purpose of this tutorial, we decided to leave only Bacteria in the dataset.
ps_filt<- subset_taxa(ps_filt, Domain == "k__Bacteria")
ps_filt@tax_table %>% data.frame() %>% group_by(Phylum) %>% dplyr::count() %>%
arrange(desc(n)) %>% kbl() %>%
kableExtra::kable_paper(font_size = 12) %>%
scroll_box(width = "800px", height = "300px")
Phylum | n |
---|---|
p__Proteobacteria | 1117 |
p__Planctomycetota | 438 |
p__Bacteroidota | 390 |
p__Acidobacteriota | 320 |
p__Chloroflexi | 304 |
p__Verrucomicrobiota | 214 |
p__Actinobacteriota | 208 |
p__Firmicutes | 203 |
p__Myxococcota | 150 |
p__Desulfobacterota | 128 |
p__Gemmatimonadota | 77 |
p__Spirochaetota | 52 |
p__Nitrospirota | 41 |
p__Latescibacterota | 39 |
p__Bdellovibrionota | 33 |
p__Cyanobacteria | 31 |
p__ | 30 |
p__Patescibacteria | 24 |
p__Armatimonadota | 23 |
p__Fibrobacterota | 16 |
p__Zixibacteria | 15 |
p__SAR324_clade(Marine_group_B) | 11 |
p__Elusimicrobiota | 10 |
p__FCPU426 | 10 |
p__Dependentiae | 9 |
p__MBNT15 | 9 |
p__RCP2-54 | 9 |
p__Entotheonellaeota | 7 |
p__Methylomirabilota | 4 |
p__Sumerlaeota | 4 |
p__Campylobacterota | 2 |
p__Cloacimonadota | 2 |
p__PAUC34f | 2 |
p__WPS-2 | 2 |
p__AncK6 | 1 |
p__DTB120 | 1 |
p__Deferrisomatota | 1 |
p__Deinococcota | 1 |
p__Fermentibacterota | 1 |
p__Fusobacteriota | 1 |
p__Hydrogenedentes | 1 |
p__Margulisbacteria | 1 |
p__NB1-j | 1 |
p__Sva0485 | 1 |
p__Synergistota | 1 |
p__WS2 | 1 |
Now, the table includes only ASVs classified to a known domain. For other taxonomic levels, there are some unassigned ASVs. The following code will replace the uninformative or unknown entries (“unknowns”) with information from the highest known taxonomic rank (“classified”). It is important to understand the new annotation of taxa. Therefore, we recommend further reading of the tax_fix documentation. Important to understand is that when a taxon is annotated with the full name of another taxonomic rank (for example a genus is annotated with f__Enterobacteriaceae Family), it means that this taxon (genus in this example) is composed of all taxa annotated to the lower taxonomic rank (highest known - in this case Enterobacteriaceae family) but not annotated with the analyzed taxonomic rank (in this case genus).
ps_filt_fixed<- ps_filt %>%
tax_fix(
min_length = 4,
unknowns = c("g____", "f____", "o____", "g__uncultured_bacterium", "g__uncultured", "f__uncultured", "f__uncultured_bacterium", "o__uncultured", "o__uncultured_bacterium","o__uncultured_planctomycete", "o__uncultured_soil_bacterium", "c__uncultured", "c__uncultured_bacterium", "c__metagenome", "f__uncultured_archaeon", "f__uncultured_bacterium", "f__uncultured_euryarchaeote", "c__uncultured_bacterium Class", "f__metagenome Family", "o__metagenome", "o__uncultured_Chloroflexi_bacterium", "f__metagenome", "f__uncultured_Acidobacteria_bacterium", "f__uncultured_soil_bacterium", "f__uncultured_actinobacterium", "f__uncultured_cyanobacterium", "f__uncultured_delta_proteobacterium", "f__uncultured_gamma_proteobacterium", "f__uncultured_organism", "g__metagenome", "g__uncultured_Acidobacteria_bacterium", "g__uncultured_Cytophagales_bacterium", "g__uncultured_soil_bacterium", "g__uncultured_Bacteroidetes_bacterium", "g__uncultured_Chloroflexi_bacterium", "g__uncultured_organism", "g__uncultured_planctomycete"),
sep = " ", anon_unique = TRUE,
suffix_rank = "classified"
)
The genus names look like that right now:
ps_filt_fixed@tax_table %>%
data.frame() %>%
arrange(Genus)%>%
head(n=10) %>% kbl() %>%
kableExtra::kable_paper(font_size = 12) %>%
scroll_box(width = "800px", height = "300px")
Domain | Phylum | Class | Order | Family | Genus | Species | ASV |
---|---|---|---|---|---|---|---|
k__Bacteria | p__Nitrospirota | c__4-29-1 | c__4-29-1 Class | c__4-29-1 Class | c__4-29-1 Class | c__4-29-1 Class | 1632943206 |
k__Bacteria | p__Actinobacteriota | c__Acidimicrobiia | c__Acidimicrobiia Class | c__Acidimicrobiia Class | c__Acidimicrobiia Class | c__Acidimicrobiia Class | 1632943971 |
k__Bacteria | p__Actinobacteriota | c__Acidimicrobiia | c__Acidimicrobiia Class | c__Acidimicrobiia Class | c__Acidimicrobiia Class | c__Acidimicrobiia Class | 1632943272 |
k__Bacteria | p__Actinobacteriota | c__Acidimicrobiia | c__Acidimicrobiia Class | c__Acidimicrobiia Class | c__Acidimicrobiia Class | c__Acidimicrobiia Class | 1632942591 |
k__Bacteria | p__Acidobacteriota | c__Acidobacteriae | c__Acidobacteriae Class | c__Acidobacteriae Class | c__Acidobacteriae Class | c__Acidobacteriae Class | 163294725 |
k__Bacteria | p__Acidobacteriota | c__Acidobacteriae | c__Acidobacteriae Class | c__Acidobacteriae Class | c__Acidobacteriae Class | c__Acidobacteriae Class | 1632943056 |
k__Bacteria | p__Actinobacteriota | c__Actinobacteria | c__Actinobacteria Class | c__Actinobacteria Class | c__Actinobacteria Class | c__Actinobacteria Class | 1632942355 |
k__Bacteria | p__Proteobacteria | c__Alphaproteobacteria | c__Alphaproteobacteria Class | c__Alphaproteobacteria Class | c__Alphaproteobacteria Class | c__Alphaproteobacteria Class | 1632941537 |
k__Bacteria | p__Proteobacteria | c__Alphaproteobacteria | c__Alphaproteobacteria Class | c__Alphaproteobacteria Class | c__Alphaproteobacteria Class | c__Alphaproteobacteria Class | 1632942154 |
k__Bacteria | p__Proteobacteria | c__Alphaproteobacteria | c__Alphaproteobacteria Class | c__Alphaproteobacteria Class | c__Alphaproteobacteria Class | s____ | 1632942601 |
It is highly recommended to have both negative and positive controls in the sequenced set of samples. Negative controls are crucial for identifying contaminants from reagents (PCR reagents, extraction kits), used equipment or introduced during sample preparation. Typically, they are no-template controls where water used during DNA extraction or PCR amplification is added instead. Negative controls are expected to have amount of reads close to zero. Positive controls are important for ensuring the accuracy and reliability of the sequencing process. In this dataset, mock communities are used as positive controls. Mock communities are synthetic communities of known composition, in this case typical for human gut, used to benchmark NG-Tax (Mock 3 and 4).
The table below shows the number of reads per sample. It is arranged in ascending order. Visualizing the sum of reads per sample will help in case you have high number of samples.
sum_of_reads<- as.data.frame(sample_sums(ps_filt_fixed@otu_table)) %>% rownames_to_column(var = "sampleID") %>%
dplyr::rename(sum_of_reads = `sample_sums(ps_filt_fixed@otu_table)`)%>%
left_join((metadata %>% rownames_to_column(var = "sampleID")), by = "sampleID")
head((sum_of_reads %>% arrange(sum_of_reads)),n=20)%>% kbl() %>%
kableExtra::kable_paper(font_size = 12) %>%
scroll_box(width = "800px", height = "300px")
sampleID | sum_of_reads | sample.description | sample.name | observation.unit.identifier | collection.date | season | location | depth |
---|---|---|---|---|---|---|---|---|
Water_5 | 14 | Control_with_sterile_water_only_to_check | Water_control | Negative_Control | summer | NA | ||
Water_3 | 138 | Control_with_sterile_water_only_to_check | Water_control | Negative_Control | summer | NA | ||
Water_1 | 202 | Control_with_sterile_water_only_to_check | Water_control | Negative_Control | summer | NA | ||
Water_2 | 724 | Control_with_sterile_water_only_to_check | Water_control | Negative_Control | summer | NA | ||
Water_4 | 1259 | Control_with_sterile_water_only_to_check | Water_control | Negative_Control | summer | NA | ||
1_CW2_D_D93_1 | 4686 | Lib1_CW2-biochar_D_D93_30_Jan_23_B/B1filled | 15_Lib1_CW2-biochar_D_D93 | CW2-biochar | 1/30/2023 | winter | D | 93 |
2_CW3_D_50_2 | 5829 | Lib2_CW3-biochar_D_50_20_June_23_Sandfilled | 57_Lib2_CW3-biochar_D_50 | CW3-biochar | 6/20/2023 | summer | D | 50 |
1_CWI_D_D90_2 | 6018 | Lib1_CW1-sand_D_D90_30_Jan_23_B/B2filled | 32_Lib1_CW1-sand_D_D90 | CW1-sand | 1/30/2023 | winter | D | 90 |
1_CWI_D_D70_2 | 6614 | Lib1_CW1-sand_D_D70_30_Jan_23_B/B2filled | 30_Lib1_CW1-sand_D_D70 | CW1-sand | 1/30/2023 | winter | D | 70 |
2_CW1_A_50_2 | 8087 | Lib2_CW1-sand_A_50_20_June_23_B/B2filled | 34_Lib2_CW1-sand_A_50 | CW1-sand | 6/20/2023 | summer | A | 50 |
2_CW3_A_90_1 | 9219 | Lib2_CW3-biochar_A_90_20_June_23_Sandfilled | 54_Lib2_CW3-biochar_A_90 | CW3-biochar | 6/20/2023 | summer | A | 90 |
Mock3 | 9307 | Mock_3_Community_experimental | Mock_3_Community | Mock_Community | summer | NA | ||
2_CW1_D_70_1 | 9434 | Lib2_CW1-sand_D_70_20_June_23_B/B2filled | 40_Lib2_CW1-sand_D_70 | CW1-sand | 6/20/2023 | summer | D | 70 |
1_CW1_A_D90_1 | 9448 | Lib1_CW1-sand_A_D90_30_Jan_23_Sandfilled | 6_Lib1_CW1-sand_A_D90 | CW1-sand | 1/30/2023 | winter | A | 90 |
2_CW1_D_50_2 | 9919 | Lib2_CW1-sand_D_50_20_June_23_B/B2filled | 39_Lib2_CW1-sand_D_50 | CW1-sand | 6/20/2023 | summer | D | 50 |
2_CW2_D50_1 | 10470 | Lib2_CW2-biochar_D50__20_June_23_Sandfilled | 49_Lib2_CW2-biochar_D50_ | CW2-biochar | 6/20/2023 | summer | D | 50 |
1_CW3_A_D70_2 | 11188 | Lib1_CW3-biochar_A_D70_30_Jan_23_B/B2filled | 19_Lib1_CW3-biochar_A_D70 | CW3-biochar | 1/30/2023 | winter | A | 70 |
2_CW1_A70_1 | 12823 | Lib2_CW1-sand_A70__20_June_23_B/B2filled | 37_Lib2_CW1-sand_A70_ | CW1-sand | 6/20/2023 | summer | A | 70 |
1_CW3_D_D50_1 | 12976 | Lib1_CW3-biochar_D_D50_30_Jan_23_B/B2filled | 22_Lib1_CW3-biochar_D_D50 | CW3-biochar | 1/30/2023 | winter | D | 50 |
2_CW3_A_70_2 | 13035 | Lib2_CW3-biochar_A_70_20_June_23_Sandfilled | 53_Lib2_CW3-biochar_A_70 | CW3-biochar | 6/20/2023 | summer | A | 70 |
sum_of_reads %>% arrange(sum_of_reads) %>% filter(!sampleID %in% c("refmock3", "refmock4"))%>%
ggplot(aes(x = reorder(sampleID, sum_of_reads), y = sum_of_reads, fill = observation.unit.identifier)) +
geom_point(shape = 21, size = 3, color = "black")+
scale_fill_nejm()+
labs(x = "sample ID")+
theme_bw()+
theme(axis.text.x=element_text(angle=90))
As shown on the plot and in the table -negative controls have the lowest read count read count, which is 14, 138, 208, 724 and 1259. Since all experimental samples have higher read count than controls - none of the samples is discarded.
First, a check will be done to see whether the mocks that were
processed and sequenced in the laboratory, resemble the expected
(theoretical) values and can reproduce the theoretical composition. If
NG-Tax pipeline was used - the composition of theoretical mocks is added
during the NG-Tax sample processing (provided this option was marked in
the NG-Tax options) and is present in the output biom file. If another
pipeline or another mocks are used, the expected abundances of taxa in
these mocks needs to be attached to the abundance table of the sequenced
mocks. We advice to work on genus level and attach it to the ‘melted’
phyloseq data frame (the object created with `ps_melt’ function -
mock3_melt_Genus_rel
or mock4_melt_rel
), which
is the so called long table. A function that might be useful is the
?bind_rows
from dplyr package. The names of the columns in
the table that will be attached needs to be the same as the names in the
melted phyloseq. Mock 3 and 4 are different in composition therefore
they need to be analyzed separately. To do it a separate phyloseq with
mock3 samples will be created. Then a barplot will be printed to
visually assess similarity of the composition of experimental and
theoretical mocks. In the code we select the top taxa top plot (n=25),
so that the plot is not too crowded.
ps_mock3<- ps_filt_fixed %>% ps_filter(sample.name == "Mock_3_Community")
ps_mock3_Genus_rel<- ps_mock3 %>% tax_transform("compositional", rank = "Genus")
top_taxa_mock3<- tax_top(ps_mock3_Genus_rel, n=25)
mock3_melt_Genus_rel<- ps_mock3_Genus_rel %>% psmelt() %>%
mutate(Genus_top = case_when(Genus %in% top_taxa_mock3 ~ Genus,
TRUE ~ "other")) %>%
mutate(Genus_top = as.factor(Genus_top)) %>%
mutate(Genus_top = fct_relevel(Genus_top, c(top_taxa_mock3, "other")))
mock3_melt_Genus_rel %>%
ggplot(aes(x = Sample, y = Abundance, fill = Genus_top)) +
geom_bar(stat = "identity", color = "black")+
scale_fill_manual(values = distinct_palette(25, pal = "brewerPlus", add = "grey70"))+
theme_bw()+
theme(axis.title.x = element_blank(), legend.position = "right")
Apart from barplots it is also good to see how the abundance of genera in experimental mocks correlate with the abundance in theoretical mocks. To do so, You can plot scatter plots and inspect the Spearman correlation coefficient (they are shown at the intersection of the theoretical and experimental mock abundances). Additionally density plots are shown for each variable.
mock3_melt_Genus_rel %>% select(OTU, Sample, Abundance) %>%
pivot_wider(names_from = Sample, names_prefix = "Abundance_", values_from = "Abundance") %>%
column_to_rownames(var = "OTU")%>% ggpairs(upper = list(continuous = wrap("cor", method = "spearman")))
Now the same will be done with mock 4
ps_mock4<- ps_filt_fixed %>% ps_filter(sample.name == "Mock_4_Community")
ps_mock4_Genus_rel<- ps_mock4 %>% tax_transform("compositional", rank = "Genus")
top_taxa_mock4<- tax_top(ps_mock4_Genus_rel, n=25)
mock4_melt_Genus_rel<- ps_mock4_Genus_rel %>% psmelt() %>%
mutate(Genus_top = case_when(Genus %in% top_taxa_mock4 ~ Genus,
TRUE ~ "other")) %>%
mutate(Genus_top = as.factor(Genus_top)) %>%
mutate(Genus_top = fct_relevel(Genus_top, c(top_taxa_mock4, "other")))
mock4_melt_Genus_rel %>%
ggplot(aes(x = Sample, y = Abundance, fill = Genus_top)) +
geom_bar(stat = "identity", color = "black")+
scale_fill_manual(values = distinct_palette(25, pal = "brewerPlus", add = "grey70"))+
theme_bw()+
theme(axis.title.x = element_blank(), legend.position = "right")
and correlation
mock4_melt_Genus_rel %>% select(OTU, Sample, Abundance) %>%
pivot_wider(names_from = Sample, names_prefix = "Abundance_", values_from = "Abundance") %>%
column_to_rownames(var = "OTU")%>% ggpairs(upper = list(continuous = wrap("cor", method = "spearman")))
Rare taxa refer to taxa present at low relative abundance or in a small number of samples across the data set. Rare biosphere implicates low-abundance taxa as important contributors to the assessment of a- and b-diversity (Lynch et al., 2015). However, recent microbiome quality control studies indicate that many rare taxa might be created by sequencing artifacts, errors and/or contamination in SSU rRNA gene surveys. To address the challenge of identifying the “true” origin of these taxa, various approaches including trimming, filtering and/or contaminant removal have been proposed in literature (Cao et al.,2021, Reitmeier et al., 2021, Nikodemova et al., 2023, Davis et al., 2018).
NG-Tax 2.0 follows an error-handling strategy which filters artifacts and corrects for the impact of error-reads on the relative abundance estimates. By default, a user defined relative abundance cut-off is set to 0.1% of the total number of read-pairs associated with ASVs (Poncheewin et al., 2019). Thus, the output of NG-Tax 2.0 does not need additional filtering.
Depending on the origin of the sample and/or the aim of the study, NG-Tax 2.0 parameters can be changed. In studies that aim to discover rare taxa or is expected that rare taxa contribute to a larger diversity, filtering would not be advisable.
Before making any filtering decisions, a nice way to check the taxa distribution in your data set is to visualize the prevalence-abundance distribution of ASVs. Keep in mind that the minimum detection threshold of NG-Tax is set at 0.1%.
ps_filt_fixed_samples<- ps_filt_fixed %>%
ps_filter(!`observation.unit.identifier` %in% c("Mock_Community", "Negative_Control"))
taxa_prev_plot <- plot_taxa_prevalence(ps_filt_fixed_samples, "Phylum",
detection = 0)
taxa_prev_plot
Another common practice when working with this type of data is to check the number of ASVs represented by only one (singletons) or two (doubletons) sequences respectively, across all samples analyzed.
NG-Tax 2.0 error-handling assumes that singleton read-pairs that did not match existing ASV object due to a single mismatch (mutation or indel) is probably due to a random sequence error and is joined with the particular ASV. Singletons showing more than one mismatch are considered as sample specific noise and discarded.
length(which(taxa_sums(ps_filt_fixed_samples) <= 1))
## [1] 0
length(which(taxa_sums(ps_filt_fixed_samples) <= 2))
## [1] 0
In this dataset, there are no singletons (because NG-Tax filters them out) or doubletons so there is no need for filtering.
In case other bioinformatic pipelines were used for ASV generation, different approaches on how to deal with singletons and/or doubletons might be necessary before downstream analysis.
Let’s inspect the total abundance of each taxon (ASV) across all samples in the dataset. This is achieved with the taxa_sums function. To find out which taxa are the least abundant across the dataset, use the following code.
taxa_sums(ps_filt_fixed_samples) %>%
data.frame() %>%
dplyr::rename(sum = ".") %>%
arrange(sum)%>%
head(n=20) %>% kbl() %>%
kableExtra::kable_paper(font_size = 12) %>%
scroll_box(width = "200px", height = "300px")
sum |
---|
5 |
5 |
5 |
5 |
5 |
5 |
5 |
5 |
5 |
5 |
5 |
5 |
5 |
5 |
5 |
5 |
5 |
5 |
5 |
5 |
Microbiome data is inherently compositional- meaning that sequencing only provides information on the relative abundance of taxa and that the observed abundance of each taxon is dependent on the observed abundances of all other taxa- and this poses challenges to downstream statistical analysis. To address this, normalization is employed to scale the data to a common and comparable level. Thanks to data normalization, the differential abundance analysis is not confounded by the differences in the sampling fractions (which is unknown and differs per sample). Failure to normalize the data will result in a systematic bias that increases the false discovery rate (FDR) and also possible loss of power in some cases.
Below we show two methods of normalization, however other common approaches for normalization (such as scaling of counts, parameter estimation, etc.) may be applied. as exists. To know more, here is a nice overview of the available normalization methods and how they perform on comparing ecological communities by McKnight et al (2019) and by Lin & Peddada (2020).
A common method is Total-Sum Scaling (TSS) (or compositional) which transforms the absolute ASV counts into relative abundance values. For more information, check out the articles: Zhou et al. (2023) and Weiss et al. (2017).
ps_filt_fixed_samples_rel<- tax_transform(ps_filt_fixed_samples, trans = "compositional")
head(ps_filt_fixed_samples_rel@otu_table, n=10) %>% kbl() %>%
kableExtra::kable_paper(font_size = 12) %>%
scroll_box(width = "800px", height = "300px")
1_CW1_A_D30_1 | 1_CW1_A_D50_1 | 1_CW1_A_D50_2 | 1_CW1_A_D70_1 | 1_CW1_A_D70_2 | 1_CW1_A_D90_1 | 1_CW1_D_D30_1 | 1_CW1_D_D50_1 | 1_CW2_A_D50_2 | 1_CW2_A_D70_1 | 1_CW2_A_D93_1 | 1_CW2_A_D93_2 | 1_CW2_D_D50_1 | 1_CW2_D_D70_1 | 1_CW2_D_D93_1 | 1_CW3_A_D50_1 | 1_CW3_A_D50_2 | 1_CW3_A_D70_1 | 1_CW3_A_D70_2 | 1_CW3_A_D93_1 | 1_CW3_A_D93_2 | 1_CW3_D_D50_1 | 1_CW3_D_D70_1 | 1_CW3_D_D93_1 | 1_CW3_D_D93_2 | 1_CW3_D_D93_3 | 1_CW3_D50_1 | 1_CW3_D70_1 | 1_CWI_D_D70_1 | 1_CWI_D_D70_2 | 1_CWI_D_D90_1 | 1_CWI_D_D90_2 | 2_CW1_A_50_1 | 2_CW1_A_50_2 | 2_CW1_A_90_1 | 2_CW1_A_90_2 | 2_CW1_A70_1 | 2_CW1_D_50_1 | 2_CW1_D_50_2 | 2_CW1_D_70_1 | 2_CW1_D90_1 | 2_CW2_A_50_1 | 2_CW2_A_50_2 | 2_CW2_A_70_1 | 2_CW2_A_90_1 | 2_CW2_D_50_1 | 2_CW2_D_70_1 | 2_CW2_D_70_2 | 2_CW2_D50_1 | 2_CW3_A_50_1 | 2_CW3_A_50_2 | 2_CW3_A_70_1 | 2_CW3_A_70_2 | 2_CW3_A_90_1 | 2_CW3_A_90R_1 | 2_CW3_D_50_1 | 2_CW3_D_50_2 | 2_CW3_D_70_1 | 2_CW3_D_90_1 | 2_CW3_D_90_2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1632942152 | 0 | 0.000000000 | 0.000000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00116556 | 0.001072578 | 0.00000000 | 0.00000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000000 | 0.000000000 | 0.00000000 | 0.00000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000000 |
1632941008 | 0 | 0.000000000 | 0.000000000 | 0 | 0 | 0.001587638 | 0 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00000000 | 0.000000000 | 0.00000000 | 0.00000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000000 | 0.000000000 | 0.00000000 | 0.00000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000000 |
1632942677 | 0 | 0.000000000 | 0.000000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00000000 | 0.000000000 | 0.00000000 | 0.00000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.007094396 | 0.000000000 | 0.00000000 | 0.00000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000000 |
1632941543 | 0 | 0.000000000 | 0.000000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0.001680672 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00000000 | 0.000000000 | 0.00000000 | 0.00000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000000 | 0.000000000 | 0.00000000 | 0.00000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000000 |
1632942189 | 0 | 0.000000000 | 0.000000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00000000 | 0.002681444 | 0.00349239 | 0.00150585 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000000 | 0.000000000 | 0.00000000 | 0.00000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000000 |
1632943887 | 0 | 0.000000000 | 0.000000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00000000 | 0.000000000 | 0.00000000 | 0.00000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000000 | 0.000000000 | 0.00000000 | 0.00000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.001646132 |
1632942431 | 0 | 0.000000000 | 0.000000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00000000 | 0.000000000 | 0.00000000 | 0.00000000 | 0 | 0 | 0.001100968 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000000 | 0.000000000 | 0.00000000 | 0.00000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000000 |
163294897 | 0 | 0.000000000 | 0.000000000 | 0 | 0 | 0.011113463 | 0 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00000000 | 0.000000000 | 0.00000000 | 0.00000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000000 | 0.001607518 | 0.01642879 | 0.01381361 | 0 | 0 | 0.001411433 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000000 |
163294951 | 0 | 0.000000000 | 0.000000000 | 0 | 0 | 0.002434378 | 0 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00000000 | 0.000000000 | 0.00000000 | 0.00000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000000 | 0.000000000 | 0.00000000 | 0.00000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000000 |
163294410 | 0 | 0.002788586 | 0.001594434 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00000000 | 0.000000000 | 0.00000000 | 0.00000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000000 | 0.000000000 | 0.00000000 | 0.00000000 | 0 | 0 | 0.000000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000000 |
Another method of data normalization is rarefaction. It addresses another challenge with 16S sequenced microbial datasets, which is the variation between the total number of sequencing reads generated for each sample. Rarefaction is random subsampling to an even sampling depth. Sampling is usually done without replacement. Rare sequences are excluded more often when sampling with replacement (Cameron et al., 2021). Rarefaction controls the effect of uneven sequencing effort (library size) across samples in a study. This normalization method is widely used, however not always advised and for example authors of phyloseq R package do not advocate using this as a normalization procedure (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003531). There are also opposite views, as described in this article. This kind of normalization might be dataset specific, as described here, therefore it should be decided cautiously whether to rarefy or not. Given the fact that this method is highly debatable, consider applying it only in cases when there is highly uneven sequencing depth across libraries. The following part indicates how to apply rarefaction and guide the user through decision making steps. However, for the purpose of this tutorial it was decided to proceed with non-rarefied data for downstream analysis.
First, the distribution of total sequencing reads in the different samples has to be assessed. The scatter plot in the Quality Control section shows the total number of reads per sample. The variation in total reads sequenced between samples may introduce a bias when studying microbial community composition/diversity. Rarefying samples with high read counts to a certain read depth may alleviate the limitation imposed by total read count variation between samples.To make an educated decision on the rarefaction threshold, let’s explore the data through plotting rarefaction curves and histograms of read sums.
ps_filt_fixed_samples %>%
ggrare(step = 50, color = "observation.unit.identifier", se = FALSE)
sum_of_reads %>% filter(!sampleID %in% c("refmock3", "refmock4")) %>%
ggplot(aes(x=sum_of_reads)) +
geom_histogram(color = 'black', binwidth = 500, aes(fill = observation.unit.identifier))+
scale_fill_nejm()+
ggtitle("Histogram of Read Sums (binwidth = 500)")+
labs(x = "Read sums")+
theme_bw()+
theme(legend.position = "bottom", legend.title = element_blank())
sum_of_reads %>% filter(!sampleID %in% c("refmock3", "refmock4")) %>%
filter(sum_of_reads<20000)%>%
ggplot(aes(x=sum_of_reads)) +
geom_histogram(color = 'black', binwidth = 1000, aes(fill = observation.unit.identifier))+
scale_fill_nejm()+
ggtitle("Histogram of Read Sums < 20000 Reads (binwidth = 500)")+
labs(x = "Read sums")+
theme_bw()+
theme(legend.position = "bottom", legend.title = element_blank())
sum_of_reads %>% filter(!observation.unit.identifier %in% c("Mock_Community", "Negative_Control")) %>%
bind_cols(estimate_richness(ps_filt_fixed_samples, measures = "Observed")) %>%
remove_rownames() %>%
select(sampleID, sum_of_reads, Observed) %>% arrange(sum_of_reads, Observed) %>%
kbl() %>%
kableExtra::kable_paper(font_size = 12) %>%
scroll_box(width = "800px", height = "300px")
sampleID | sum_of_reads | Observed |
---|---|---|
1_CW2_D_D93_1 | 4686 | 316 |
2_CW3_D_50_2 | 5829 | 395 |
1_CWI_D_D90_2 | 6018 | 459 |
1_CWI_D_D70_2 | 6614 | 385 |
2_CW1_A_50_2 | 8087 | 403 |
2_CW3_A_90_1 | 9219 | 272 |
2_CW1_D_70_1 | 9434 | 397 |
1_CW1_A_D90_1 | 9448 | 280 |
2_CW1_D_50_2 | 9919 | 330 |
2_CW2_D50_1 | 10470 | 392 |
1_CW3_A_D70_2 | 11188 | 339 |
2_CW1_A70_1 | 12823 | 432 |
1_CW3_D_D50_1 | 12976 | 243 |
2_CW3_A_70_2 | 13035 | 272 |
1_CW3_D_D93_2 | 13140 | 282 |
1_CW1_A_D50_2 | 13798 | 397 |
1_CW3_D50_1 | 14322 | 256 |
1_CW1_A_D70_2 | 14898 | 388 |
2_CW2_A_90_1 | 15009 | 196 |
1_CW1_A_D50_1 | 15420 | 389 |
2_CW3_A_50_2 | 15618 | 391 |
1_CW3_D_D70_1 | 15896 | 307 |
1_CW2_A_D50_2 | 15992 | 159 |
1_CW1_D_D50_1 | 16149 | 331 |
1_CW2_D_D70_1 | 16410 | 277 |
1_CW3_A_D50_1 | 16419 | 301 |
1_CW3_A_D93_2 | 17266 | 309 |
2_CW2_D_70_1 | 17323 | 384 |
1_CWI_D_D90_1 | 17329 | 416 |
2_CW2_D_50_1 | 17597 | 377 |
1_CWI_D_D70_1 | 18044 | 404 |
1_CW1_D_D30_1 | 18635 | 414 |
1_CW3_D_D93_3 | 18829 | 281 |
1_CW2_D_D50_1 | 18902 | 269 |
1_CW3_D70_1 | 19089 | 289 |
1_CW3_A_D70_1 | 19733 | 375 |
2_CW3_D_90_2 | 20047 | 174 |
1_CW2_A_D70_1 | 20230 | 272 |
2_CW1_A_90_2 | 20849 | 153 |
1_CW3_D_D93_1 | 21799 | 292 |
2_CW3_A_50_1 | 22419 | 402 |
2_CW3_D_70_1 | 22746 | 297 |
2_CW1_A_90_1 | 23191 | 152 |
1_CW2_A_D93_2 | 23318 | 324 |
2_CW2_A_50_2 | 23786 | 295 |
1_CW1_A_D30_1 | 24377 | 368 |
2_CW2_D_70_2 | 24403 | 282 |
2_CW1_D90_1 | 24538 | 193 |
1_CW3_A_D50_2 | 24689 | 295 |
2_CW3_A_90R_1 | 24836 | 279 |
1_CW1_A_D70_1 | 26491 | 342 |
1_CW3_A_D93_1 | 27202 | 293 |
1_CW2_A_D93_1 | 28109 | 307 |
2_CW1_D_50_1 | 28328 | 302 |
2_CW2_A_70_1 | 29918 | 253 |
2_CW1_A_50_1 | 32279 | 332 |
2_CW3_D_90_1 | 32684 | 186 |
2_CW3_A_70_1 | 33363 | 271 |
2_CW3_D_50_1 | 33804 | 383 |
2_CW2_A_50_1 | 48326 | 287 |
The table above shows a summary of taxa richness (number of Observed ASVs) and total reads per sample. To choose the rarefaction depth the safest approach it the smallest sequencing depth, in the case of this dataset is the read sum of sample 1_CW2_D_D93_1 (324 ASVs).
Plotting rarefaction curves illustrates that the sample 1_CW2_A_D50_2 is the least ASV-rich sample (161 ASVs) with 15,997 reads in total.
Plotting histograms, and inspecting read sum distribution illustrates that there is a gap in read sum distribution between 13,024 and 11,411. Therefore 13,024 emerges as a good cut-off for rarefaction.
On one hand, cutoffs 5,198 and 13,024 can attenuate the ASV richness variation between high read count samples and the least ASV rich sample “1_CW2_A_D50_2” due to library size variation. On the other hand, 5,198 could be a too conservative rarefaction cutoff leading to loss of species richness in samples with higher read counts, but with less ASV richness, such as the sample 1_CW2_A_D50_2 with 15,997 reads. As a result, we decide to use 13,024 reads as our rarefaction cut-off.
Rule of thumb: Rarefy to a depth that retains most of the data without sacrificing too much biological information (samples and/or ASVs).
The code below rarefies the dataset to a depth of 13,024 reads and plots rarefaction curves.
ps_filt_fixed_samples_raref<- rarefy_even_depth(ps_filt_fixed_samples, sample.size = 13024, replace = FALSE)
ps_filt_fixed_samples_raref %>% ggrare(step = 50, color = "observation.unit.identifier", se = FALSE)
## rarefying sample 1_CW1_A_D30_1
## rarefying sample 1_CW1_A_D50_1
## rarefying sample 1_CW1_A_D50_2
## rarefying sample 1_CW1_A_D70_1
## rarefying sample 1_CW1_A_D70_2
## rarefying sample 1_CW1_D_D30_1
## rarefying sample 1_CW1_D_D50_1
## rarefying sample 1_CW2_A_D50_2
## rarefying sample 1_CW2_A_D70_1
## rarefying sample 1_CW2_A_D93_1
## rarefying sample 1_CW2_A_D93_2
## rarefying sample 1_CW2_D_D50_1
## rarefying sample 1_CW2_D_D70_1
## rarefying sample 1_CW3_A_D50_1
## rarefying sample 1_CW3_A_D50_2
## rarefying sample 1_CW3_A_D70_1
## rarefying sample 1_CW3_A_D93_1
## rarefying sample 1_CW3_A_D93_2
## rarefying sample 1_CW3_D_D70_1
## rarefying sample 1_CW3_D_D93_1
## rarefying sample 1_CW3_D_D93_2
## rarefying sample 1_CW3_D_D93_3
## rarefying sample 1_CW3_D50_1
## rarefying sample 1_CW3_D70_1
## rarefying sample 1_CWI_D_D70_1
## rarefying sample 1_CWI_D_D90_1
## rarefying sample 2_CW1_A_50_1
## rarefying sample 2_CW1_A_90_1
## rarefying sample 2_CW1_A_90_2
## rarefying sample 2_CW1_D_50_1
## rarefying sample 2_CW1_D90_1
## rarefying sample 2_CW2_A_50_1
## rarefying sample 2_CW2_A_50_2
## rarefying sample 2_CW2_A_70_1
## rarefying sample 2_CW2_A_90_1
## rarefying sample 2_CW2_D_50_1
## rarefying sample 2_CW2_D_70_1
## rarefying sample 2_CW2_D_70_2
## rarefying sample 2_CW3_A_50_1
## rarefying sample 2_CW3_A_50_2
## rarefying sample 2_CW3_A_70_1
## rarefying sample 2_CW3_A_70_2
## rarefying sample 2_CW3_A_90R_1
## rarefying sample 2_CW3_D_50_1
## rarefying sample 2_CW3_D_70_1
## rarefying sample 2_CW3_D_90_1
## rarefying sample 2_CW3_D_90_2
Normalized data still exhibit skewed distributions, unequal variances, and extreme values, which may limit their effectiveness in situations with significant heterogeneity. Therefore, transforming data might be necessary before performing certain statistical tests (e.g. differential abundance analysis). The selection of the appropriate transformation method depends on the chosen statistical methods. It also helps when illustrating patterns involving both low and high abundance taxa. There are different types of data transformations, such as additive log-ratio (alr), centered log-ratio (clr) and robust centered log-ratio (rclr).
See also Nearing et al.(2022) and Wang et al. (2024).
ps_filt_fixed_samples_rel_rclr<- tax_transform(ps_filt_fixed_samples_rel, trans = "rclr", chain = TRUE)
head(ps_filt_fixed_samples_rel_rclr@otu_table, n=10) %>% kbl() %>%
kableExtra::kable_paper(font_size = 12) %>%
scroll_box(width = "800px", height = "300px")
1_CW1_A_D30_1 | 1_CW1_A_D50_1 | 1_CW1_A_D50_2 | 1_CW1_A_D70_1 | 1_CW1_A_D70_2 | 1_CW1_A_D90_1 | 1_CW1_D_D30_1 | 1_CW1_D_D50_1 | 1_CW2_A_D50_2 | 1_CW2_A_D70_1 | 1_CW2_A_D93_1 | 1_CW2_A_D93_2 | 1_CW2_D_D50_1 | 1_CW2_D_D70_1 | 1_CW2_D_D93_1 | 1_CW3_A_D50_1 | 1_CW3_A_D50_2 | 1_CW3_A_D70_1 | 1_CW3_A_D70_2 | 1_CW3_A_D93_1 | 1_CW3_A_D93_2 | 1_CW3_D_D50_1 | 1_CW3_D_D70_1 | 1_CW3_D_D93_1 | 1_CW3_D_D93_2 | 1_CW3_D_D93_3 | 1_CW3_D50_1 | 1_CW3_D70_1 | 1_CWI_D_D70_1 | 1_CWI_D_D70_2 | 1_CWI_D_D90_1 | 1_CWI_D_D90_2 | 2_CW1_A_50_1 | 2_CW1_A_50_2 | 2_CW1_A_90_1 | 2_CW1_A_90_2 | 2_CW1_A70_1 | 2_CW1_D_50_1 | 2_CW1_D_50_2 | 2_CW1_D_70_1 | 2_CW1_D90_1 | 2_CW2_A_50_1 | 2_CW2_A_50_2 | 2_CW2_A_70_1 | 2_CW2_A_90_1 | 2_CW2_D_50_1 | 2_CW2_D_70_1 | 2_CW2_D_70_2 | 2_CW2_D50_1 | 2_CW3_A_50_1 | 2_CW3_A_50_2 | 2_CW3_A_70_1 | 2_CW3_A_70_2 | 2_CW3_A_90_1 | 2_CW3_A_90R_1 | 2_CW3_D_50_1 | 2_CW3_D_50_2 | 2_CW3_D_70_1 | 2_CW3_D_90_1 | 2_CW3_D_90_2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1632942152 | 0 | 0.0000000 | 0.0000000 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.5943949 | -0.7687571 | 0.0000000 | 0.0000000 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 | 0 | 0.00000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 |
1632941008 | 0 | 0.0000000 | 0.0000000 | 0 | 0 | -0.5779372 | 0 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 | 0 | 0.00000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 |
1632942677 | 0 | 0.0000000 | 0.0000000 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.051638 | 0.0000000 | 0.000000 | 0.000000 | 0 | 0 | 0.00000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 |
1632941543 | 0 | 0.0000000 | 0.0000000 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | -0.3715704 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 | 0 | 0.00000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 |
1632942189 | 0 | 0.0000000 | 0.0000000 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 | 0.1475337 | 0.3636965 | -0.4370147 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 | 0 | 0.00000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 |
1632943887 | 0 | 0.0000000 | 0.0000000 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 | 0 | 0.00000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.5417619 |
1632942431 | 0 | 0.0000000 | 0.0000000 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0 | 0 | -0.7354975 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 | 0 | 0.00000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 |
163294897 | 0 | 0.0000000 | 0.0000000 | 0 | 0 | 1.3679730 | 0 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000 | -0.2220174 | 1.301646 | 1.148943 | 0 | 0 | -0.54112 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 |
163294951 | 0 | 0.0000000 | 0.0000000 | 0 | 0 | -0.1504932 | 0 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 | 0 | 0.00000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 |
163294410 | 0 | 0.2871654 | -0.2571683 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 | 0 | 0.00000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0000000 |
It is possible to group the ASVs in the phyloseq object at different taxonomic ranks by aggregating the counts and inspect the data depending on your research question.
ps_filt_fixed_samples_phylum<- tax_agg(ps_filt_fixed_samples, rank = "Phylum")
print(ps_filt_fixed_samples_phylum)
## psExtra object - a phyloseq object with extra slots:
##
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 45 taxa and 60 samples ]
## sample_data() Sample Data: [ 60 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 45 taxa by 2 taxonomic ranks ]
##
## psExtra info:
## tax_agg = "Phylum"
To aggregate ASV counts at other taxonomic levels, set the desired rank and create a summary of the number of taxa at the different ranks.
ps_filt_fixed_samples_class<- tax_agg(ps_filt_fixed_samples, rank = "Class")
ps_filt_fixed_samples_order<- tax_agg(ps_filt_fixed_samples, rank = "Order")
ps_filt_fixed_samples_family<- tax_agg(ps_filt_fixed_samples, rank = "Family")
ps_filt_fixed_samples_genus<- tax_agg(ps_filt_fixed_samples, rank = "Genus")
summary_agg<- data.frame(Taxa = c(colnames(ps_filt_fixed_samples@tax_table)[c(2:6)], "ASVs"),
number_of_taxa=c(dim(ps_filt_fixed_samples_phylum@otu_table)[1], dim(ps_filt_fixed_samples_class@otu_table)[1], dim(ps_filt_fixed_samples_order@otu_table)[1], dim(ps_filt_fixed_samples_family@otu_table)[1], dim(ps_filt_fixed_samples_genus@otu_table)[1], dim(ps_filt_fixed_samples@otu_table)[1]))
print(summary_agg)
## Taxa number_of_taxa
## 1 Phylum 45
## 2 Class 129
## 3 Order 291
## 4 Family 466
## 5 Genus 741
## 6 ASVs 3772
It is always good idea to visualize all of the samples and quickly verify whether everything looks as expected (no strange results, samples with mainly one taxon present, if we do not expect it)
Here, we will explore microbial community composition (at order level) grouped by season and observation unit.
To vizualize data very often barplots and heatmaps showing individual samples are used.
In these barplots, the depth of sampling is shown on x axis.
barplots_season<- ps_filt_fixed_samples%>%
ps_mutate(season = as.factor(season)) %>%
ps_mutate(season = fct_relevel(season, c("winter", "summer"))) %>%
ps_arrange(season, depth)%>%
comp_barplot(group_by = "season", tax_level = "Order", tax_transform_for_plot = "compositional", n_taxa = 25, bar_width = 0.8, merge_other = FALSE, label = "depth", sample_order = "default")
barplots_season_wrapped<-wrap_plots(barplots_season) +
plot_layout(guides = 'collect', nrow = 2)
barplots_season_wrapped&
facet_wrap("observation.unit.identifier", scales = "free_x")&
theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 10), axis.ticks.y = element_blank(), legend.title = element_blank(), legend.position = "bottom", strip.text = element_text(size=12),
legend.key.size = unit(0.4, 'cm'),
legend.text = element_text(size=10))
On the heatmap below all samples and all taxa are plotted. The color indicates the abundance of taxa. Similar samples based on their taxa composition are clustered together (based on hierarchical clustering). The size of the heatmap was too big because taxa were too many to visualize and for this we applied an additional step of filtering based on prevalence. Only taxa belonging to an order that have over 100 counts in at least 5% of samples are visualized. This heatmap gives also the possibility to annotate samples and taxa with additional information.
cols_oui<- pal_nejm("default")(3)
names(cols_oui) <- unique(samdat_tbl(ps_filt_fixed_samples)$`observation.unit.identifier`)
cols_season<- c("lightblue", "yellow")
names(cols_season) <- unique(samdat_tbl(ps_filt_fixed_samples)$season)
cols_location<- c("black", "grey")
names(cols_location) <- unique(samdat_tbl(ps_filt_fixed_samples)$location)
cols_depth<- brewer.pal(5, name = "YlOrRd")
names(cols_depth) <- unique(samdat_tbl(ps_filt_fixed_samples)$depth)
ps_filt_fixed_samples %>%
tax_transform("compositional", rank = "Order") %>%
tax_filter(min_prevalence = 0.1, undetected = 500, tax_level = "Order") %>%
comp_heatmap(colors = heat_palette(sym = FALSE, "Blue-Red 3"),
tax_anno = taxAnnotation(
Prev. = anno_tax_prev(bar_width = 0.3, size = grid::unit(1, "cm"))
),
sample_anno = sampleAnnotation(
Depth = anno_sample("depth"),
Season = anno_sample("season"),
Location = anno_sample("location"),
Group = anno_sample("observation.unit.identifier"),
col = list(Group = cols_oui, Season = cols_season, Location = cols_location, Depth = cols_depth), border = FALSE))
Another debatable topic is whether to calculate alpha-diversity metrics on rarefied or non-rarefied data. Richness-based metrics like Observed ASVs, phylogenetic diversity (PD) and Chao1, as well as Shannon (Schloss PD, 2024) are sensitive to sequencing depth variations, because one sample might show higher number of ASVs just because it was sequenced more which makes rarefaction necessary. Considering these limitations, we decided to perform calculations on non-rarefied data and focus on metrics less sensitive to sequencing dept variations. Here, we calculated 4 indices that depend on abundance/prevalence of taxa and 1 index which incorporates phylogenetic differences between species (Faith’s phylogenetic diversity).
alpha_div<- estimate_richness(ps_filt_fixed_samples, measures = c("Observed", "Shannon", "Simpson", "InvSimpson"))
PD<- calculatePD(ps_filt_fixed_samples, justDF = TRUE) %>% pull(PD)
alpha_div<- alpha_div %>% mutate(PD = PD)
rownames(alpha_div) <- gsub("^X", "", rownames(alpha_div))
head(alpha_div, n=20) %>% kbl() %>%
kableExtra::kable_paper(font_size = 12) %>%
scroll_box(width = "700px", height = "300px")
Observed | Shannon | Simpson | InvSimpson | PD | |
---|---|---|---|---|---|
1_CW1_A_D30_1 | 368 | 5.612523 | 0.9944924 | 181.56645 | 29.64795 |
1_CW1_A_D50_1 | 389 | 5.715326 | 0.9953557 | 215.31554 | 34.68714 |
1_CW1_A_D50_2 | 397 | 5.740467 | 0.9955132 | 222.87770 | 34.88421 |
1_CW1_A_D70_1 | 342 | 5.366343 | 0.9900399 | 100.40064 | 30.88764 |
1_CW1_A_D70_2 | 388 | 5.588234 | 0.9934239 | 152.06686 | 35.20985 |
1_CW1_A_D90_1 | 280 | 5.340239 | 0.9926522 | 136.09582 | 28.01250 |
1_CW1_D_D30_1 | 414 | 5.789785 | 0.9957194 | 233.61273 | 30.77747 |
1_CW1_D_D50_1 | 331 | 4.931737 | 0.9684862 | 31.73210 | 29.92374 |
1_CW2_A_D50_2 | 159 | 4.173833 | 0.9688446 | 32.09715 | 15.26653 |
1_CW2_A_D70_1 | 272 | 5.100249 | 0.9894741 | 95.00413 | 24.87157 |
1_CW2_A_D93_1 | 307 | 4.829395 | 0.9666682 | 30.00135 | 30.57384 |
1_CW2_A_D93_2 | 324 | 4.857781 | 0.9655586 | 29.03483 | 31.31262 |
1_CW2_D_D50_1 | 269 | 5.040975 | 0.9855748 | 69.32308 | 24.73393 |
1_CW2_D_D70_1 | 277 | 5.054119 | 0.9866257 | 74.77035 | 24.42412 |
1_CW2_D_D93_1 | 316 | 5.514564 | 0.9947579 | 190.76185 | 28.42853 |
1_CW3_A_D50_1 | 301 | 5.254104 | 0.9905444 | 105.75699 | 27.85874 |
1_CW3_A_D50_2 | 295 | 5.216545 | 0.9901605 | 101.63087 | 26.21226 |
1_CW3_A_D70_1 | 375 | 5.648368 | 0.9951015 | 204.14372 | 29.86439 |
1_CW3_A_D70_2 | 339 | 5.530080 | 0.9942026 | 172.49120 | 29.44992 |
1_CW3_A_D93_1 | 293 | 5.244995 | 0.9910261 | 111.43475 | 28.09640 |
To compare the alpha-diversity between groups, we first need to attach metadata to the table and perform statistics using the code below.
metadata_samples<- metadata %>%filter(!`observation.unit.identifier` %in% c("Mock_Community", "Negative_Control"))
metadata_samples_alpha<- bind_cols(metadata_samples, alpha_div)
metadata_samples_long<- metadata_samples_alpha%>% pivot_longer(cols= Observed:PD, names_to = "alpha_div", values_to = "value")
kruskal_test_alpha<- metadata_samples_long %>% group_by(alpha_div) %>%
rstatix::kruskal_test(value ~ `observation.unit.identifier`) %>% mutate(p_adj_fdr = p.adjust(p, method = "fdr"))
kruskal_test_alpha %>% kbl() %>%
kableExtra::kable_paper(font_size = 12)
alpha_div | .y. | n | statistic | df | p | method | p_adj_fdr |
---|---|---|---|---|---|---|---|
InvSimpson | value | 60 | 10.095457 | 2 | 0.00642 | Kruskal-Wallis | 0.0101625 |
Observed | value | 60 | 9.624578 | 2 | 0.00813 | Kruskal-Wallis | 0.0101625 |
PD | value | 60 | 8.232205 | 2 | 0.01630 | Kruskal-Wallis | 0.0163000 |
Shannon | value | 60 | 10.783981 | 2 | 0.00455 | Kruskal-Wallis | 0.0101625 |
Simpson | value | 60 | 10.095457 | 2 | 0.00642 | Kruskal-Wallis | 0.0101625 |
Based on the Kruskal-Wallis test results, alpha-diversity differs significantly between groups for all of the indices. Using the following code, post-hoc test (Dunn’s test) can be performed to look for pairwise differences in alpha-diversity between groups.
dunn_test_alpha<- metadata_samples_long %>% group_by(alpha_div) %>%
rstatix::dunn_test(value ~ observation.unit.identifier, p.adjust.method = "fdr") %>%
add_xy_position(x = "observation.unit.identifier", fun = "mean_sd", step.increase = 0.05, scales = "free")
dunn_test_alpha %>% kbl() %>%
kableExtra::kable_paper(font_size = 12)
alpha_div | .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
InvSimpson | value | CW1-sand | CW2-biochar | 21 | 15 | -2.9390920 | 0.0032918 | 0.0098753 | ** | 250.590550 | CW1-sand…. | 1 | 2 |
InvSimpson | value | CW1-sand | CW3-biochar | 21 | 24 | -2.4501016 | 0.0142816 | 0.0214224 |
|
257.438875 | CW1-sand…. | 1 | 3 |
InvSimpson | value | CW2-biochar | CW3-biochar | 15 | 24 | 0.7944532 | 0.4269316 | 0.4269316 | ns | 264.287200 | CW2-bioc…. | 2 | 3 |
Observed | value | CW1-sand | CW2-biochar | 21 | 15 | -2.6256333 | 0.0086488 | 0.0129732 |
|
437.178100 | CW1-sand…. | 1 | 2 |
Observed | value | CW1-sand | CW3-biochar | 21 | 24 | -2.6950691 | 0.0070374 | 0.0129732 |
|
442.933750 | CW1-sand…. | 1 | 3 |
Observed | value | CW2-biochar | CW3-biochar | 15 | 24 | 0.2501066 | 0.8025049 | 0.8025049 | ns | 448.689400 | CW2-bioc…. | 2 | 3 |
PD | value | CW1-sand | CW2-biochar | 21 | 15 | -2.4309614 | 0.0150588 | 0.0225882 |
|
34.853150 | CW1-sand…. | 1 | 2 |
PD | value | CW1-sand | CW3-biochar | 21 | 24 | -2.4900241 | 0.0127734 | 0.0225882 |
|
35.186375 | CW1-sand…. | 1 | 3 |
PD | value | CW2-biochar | CW3-biochar | 15 | 24 | 0.2363063 | 0.8131950 | 0.8131950 | ns | 35.519600 | CW2-bioc…. | 2 | 3 |
Shannon | value | CW1-sand | CW2-biochar | 21 | 15 | -3.0294263 | 0.0024502 | 0.0073506 | ** | 5.918350 | CW1-sand…. | 1 | 2 |
Shannon | value | CW1-sand | CW3-biochar | 21 | 24 | -2.5459156 | 0.0108992 | 0.0163487 |
|
5.947375 | CW1-sand…. | 1 | 3 |
Shannon | value | CW2-biochar | CW3-biochar | 15 | 24 | 0.8002522 | 0.4235647 | 0.4235647 | ns | 5.976400 | CW2-bioc…. | 2 | 3 |
Simpson | value | CW1-sand | CW2-biochar | 21 | 15 | -2.9390920 | 0.0032918 | 0.0098753 | ** | 0.999150 | CW1-sand…. | 1 | 2 |
Simpson | value | CW1-sand | CW3-biochar | 21 | 24 | -2.4501016 | 0.0142816 | 0.0214224 |
|
0.999375 | CW1-sand…. | 1 | 3 |
Simpson | value | CW2-biochar | CW3-biochar | 15 | 24 | 0.7944532 | 0.4269316 | 0.4269316 | ns | 0.999600 | CW2-bioc…. | 2 | 3 |
The code below plots the alpha-diversity values and the post-hoc pairwise test results.
ggplot(metadata_samples_long, aes(x = observation.unit.identifier, y = value))+
geom_boxplot()+
geom_jitter(shape = 21, color = "black", size = 3, aes(fill = observation.unit.identifier))+
facet_wrap(~alpha_div, scales = "free", ncol = 3)+
scale_fill_nejm()+
stat_pvalue_manual(data = dunn_test_alpha,
label = "p.adj.signif", tip.length = 0.01, bracket.shorten = 0.05, hide.ns = TRUE, size = 4)+
scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
theme_bw()+
theme(axis.text.x = element_text(size = 10), legend.position = "none")
Alpha-diversity is statistically significantly higher for CW1-samples compared to the other two groups. As mentioned above, these indices are sensitive to uneven sequencing depth. For this, interpretation should happen with great caution considering only diversity indices non-sensitive to variations in sequencing depth.
To measure how distinct or similar microbial communities are across different environments, conditions or treatments, beta-diversity analysis is performed where several distance or dissimilarity metrics tailored to capture specific data characteristics are used. Common beta-diversity metrics: Bray-Curtis dissimilarity (compositional), Jaccard index (compositional)and Unifrac (phylogenetic), each emphasizing different aspects of community composition (presence/absence, abundance, phylogenetic).
The code below calculates the distance matrix using Bray-Curtis and every value in this matrix represents the pairwise dissimilarity between two samples. To reduce the complexity of the distance matrix and represent sample relationships in lower-dimensional space, Principal Coordinate Analysis (PCoA) was performed. Statistical testing (PERMANOVA) followed to assess whether the observed patterns/differences are statistically significantly associated with analyzed groups from metadata. PERMANOVA tests for differences in the centroid (average position) of groups, indicating whether groups have different overall compositions.
PCoA_ASV_bray <- ordinate(ps_filt_fixed_samples_rel, distance = "bray", method = "PCoA")
beta_ASV_bray <- phyloseq::distance(ps_filt_fixed_samples_rel, "bray")
ado_bray_group<- adonis2(beta_ASV_bray ~ observation.unit.identifier, metadata_samples)
print(ado_bray_group)
## Permutation test for adonis under reduced model
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = beta_ASV_bray ~ observation.unit.identifier, data = metadata_samples)
## Df SumOfSqs R2 F Pr(>F)
## Model 2 2.6198 0.13162 4.3199 0.001 ***
## Residual 57 17.2839 0.86838
## Total 59 19.9038 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
PERMANOVA assumes that the dispersion within groups is homogeneous, i.e. the spread of distances from the group centroid is similar across all groups. Unequal dispersions influence the interpretation of PERMANOVA results. To confirm that the observed differences in PERMANOVA are not simply due to one group having more variability in composition, it is a good practice to check for the homogenity of dispersions. For this, a beta-dispersion test will be performed. However, according to a paper by Anderson and Walsh (2013) it is most important if the sample sizes are not equal. If they are, the dispersion might be heterogenous and one could still interpret PERMANOVA results.
group.bd<- betadisper(d = beta_ASV_bray, group = metadata_samples$observation.unit.identifier, type="centroid")
boxplot(group.bd)
permutest(group.bd)
##
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
##
## Response: Distances
## Df Sum Sq Mean Sq F N.Perm Pr(>F)
## Groups 2 0.01497 0.0074843 0.6005 999 0.573
## Residuals 57 0.71044 0.0124638
Dispersions do not differ between groups, therefore PERMANOVA results can be interpreted with confidence.
The code below shows how to plot results from PCoA analysis. Here a confidence ellipsis with mean values (plotted as diamonds) is added. This kind of ellipsis shows where are the mean results in each group and it also shows the dispersion of the samples around the mean.
plot_ordination(ps_filt_fixed_samples_rel, PCoA_ASV_bray, axes = c(1,2))+
stat_conf_ellipse(geom = "polygon", alpha = 0.1, aes(color = observation.unit.identifier, fill = observation.unit.identifier))+
geom_point(size = 3, shape = 21, aes(fill = observation.unit.identifier))+
stat_mean(aes(fill = observation.unit.identifier), shape = 23, size = 4) +
annotate("text", x =- 0.3, y = 0.5, color = "black", label = paste0("p=", round(ado_bray_group$`Pr(>F)`[1], digits = 3)))+
annotate("text", x = -0.3, y = 0.45, color = "black", label = paste0("R\u00b2=", round((ado_bray_group$R2[1])*100, digits = 2), "%"))+
scale_fill_nejm(name = "observation\nunit")+
scale_color_nejm(name = "observation\nunit")+
theme_bw()+
theme(legend.position = "right")
Following the above example, beta-diversity analysis is performed using Weighted Unifrac distance metric which accounts for both phylogenetic relationships between taxa and their relative abundance.
PCoA_ASV_wunifrac <- ordinate(ps_filt_fixed_samples_rel, distance = "wunifrac", method = "PCoA")
beta_ASV_wunifrac <- phyloseq::distance(ps_filt_fixed_samples_rel, "wunifrac")
ado_wunifrac_group<- adonis2(beta_ASV_wunifrac ~ observation.unit.identifier, metadata_samples)
print(ado_wunifrac_group)
## Permutation test for adonis under reduced model
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = beta_ASV_wunifrac ~ observation.unit.identifier, data = metadata_samples)
## Df SumOfSqs R2 F Pr(>F)
## Model 2 0.10444 0.13845 4.5799 0.001 ***
## Residual 57 0.64992 0.86155
## Total 59 0.75436 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, the dispersion also needs to be checked
group_wunifrac.bd<- betadisper(d = beta_ASV_wunifrac, group = metadata_samples$observation.unit.identifier, type="centroid")
boxplot(group_wunifrac.bd)
permutest(group_wunifrac.bd)
##
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
##
## Response: Distances
## Df Sum Sq Mean Sq F N.Perm Pr(>F)
## Groups 2 0.000482 0.00024084 0.1744 999 0.841
## Residuals 57 0.078723 0.00138110
The dispersion does not differ therefore conclusions can be drawn from PERMANOVA test
plot_ordination(ps_filt_fixed_samples_rel, PCoA_ASV_wunifrac, axes = c(1,2))+
stat_conf_ellipse(geom = "polygon", alpha = 0.1, aes(color = observation.unit.identifier, fill = observation.unit.identifier))+
geom_point(size = 3, shape = 21, aes(fill = observation.unit.identifier))+
stat_mean(aes(fill = observation.unit.identifier), shape = 23, size = 4) +
annotate("text", x = 0.1, y = -0.15, color = "black", label = paste0("p=", round(ado_wunifrac_group$`Pr(>F)`[1], digits = 3)))+
annotate("text", x = 0.1, y = -0.13, color = "black", label = paste0("R\u00b2=", round((ado_wunifrac_group$R2[1])*100, digits = 2), "%"))+
scale_fill_nejm(name = "observation\nunit")+
scale_color_nejm(name = "observation\nunit")+
theme_bw()+
theme(legend.position = "right")
A common practice in microbiome analysis is to identify which taxa significantly differ between sample groupings, conditions or treatments applied in the respective study. There are different ways to perform differential abundance testing with microbiome data. The researcher needs to decide which method suits the project aim the best. In the article by Nearing et al. (2022) the performance of different testing methods for DAA is compared. Among the tools showing the most consistent results was MaAsLin2 (Multivariate Association with Linear Models 2) and below we show how to use the R package. The output is generated into separate folders and files.
MaAsLin3 is also available for use and its tutorial can be found here.
There are several statistical challenges in analyzing microbiome data due to its unique properties that have led to the development of different approaches to ensure statistically reliable results (Yang and Chen, 2022).
Among others, one question raised when processing data before DAA is whether rare taxa should be filtered out. Filtering out potentially uninformative features before running statistical tests could help address the problem of statistical power reduction due to multiple test correction. Nearing et al.(2022). However, data filtering could have unexpected effects in some cases such as false positives increases (type I errors) Bourgon et al. (2010). This is another debatable topic when someone deals with high-dimensional data.
Given that sequencing data of microbial communities are compositional, filtering out features directly affects the relative abundance of each taxon. Nevertheless, it is reasonable to exclude less prevalent features prior to DA testing because their presence might prohibit identification of significant ASVs between conditions by the DA tools.
Choosing the optimal prevalence filtering cut-off involves certain exploratory steps. The following code will keep only taxa with a prevalence of 5% (taxa present in min. 5% of samples).
ps_filt_fixed_samples_prev<- tax_filter(ps_filt_fixed_samples,
min_prevalence = 0.05)
print(ps_filt_fixed_samples_prev)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1761 taxa and 60 samples ]
## sample_data() Sample Data: [ 60 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 1761 taxa by 8 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 1761 tips and 1760 internal nodes ]
In this function other parameters can be also set, such as prevalence
detection threshold (prev_detection_threshold). Therefore for example to
filter taxa with specific minimum abundance (eg. 0.25%) in a given
amount of samples (eg. two), the code would be as follows:
tax_filter(ps_filt_fixed_samples, min_prevalence = 2, prev_detection_threshold = 0.0025)
Let’s compare the number of taxa in filtered and unfiltered phyloseq object.
The number of taxa before filtering:
ntaxa(ps_filt_fixed_samples)
## [1] 3772
The number of taxa after filtering:
ntaxa(ps_filt_fixed_samples_prev)
## [1] 1761
The number of removed taxa:
ntaxa_removed <- ntaxa(ps_filt_fixed_samples)-ntaxa(ps_filt_fixed_samples_prev)
print(ntaxa_removed)
## [1] 2011
The number of removed taxa as proportion of all taxa:
ntaxa_removed*100/ntaxa(ps_filt_fixed_samples)
## [1] 53.31389
When applying a 5% prevalence threshold, more than 50% of the taxa are removed. This means that almost half of the taxa were present in at least 5% of the samples. Taxa appearing in less than 5% of the samples were filtered out.
To check what happens to the number of taxa when less strict prevalence thresholds are applied, use the following code:
prev_thresholds <- c(0.05, 0.02, 0.01)
prev_filt_results <- data.frame(min_prevalence = prev_thresholds, num_taxa_retained = NA, perc_taxa_removed = NA)
total_taxa <- nrow(tax_table(ps_filt_fixed_samples))
for (i in seq_along(prev_thresholds)) {
ps_filt_fixed_samples_prev_all <- ps_filt_fixed_samples %>%
tax_filter(min_prevalence = prev_thresholds[i])
num_taxa_retained <- nrow(tax_table(ps_filt_fixed_samples_prev_all))
perc_taxa_removed <- 100 * (total_taxa - num_taxa_retained) / total_taxa
prev_filt_results$num_taxa_retained[i] <- num_taxa_retained
prev_filt_results$perc_taxa_removed[i] <- perc_taxa_removed
}
print(prev_filt_results)
## min_prevalence num_taxa_retained perc_taxa_removed
## 1 0.05 1761 53.31389
## 2 0.02 2480 34.25239
## 3 0.01 3772 0.00000
As mentioned above, depending on the research question, the filtering threshold could be an important consideration. For the purpose of the present analysis, we will focus on more common taxa across the samples and thus, we will apply minimum prevalence of 5%.
Since 16S rRNA gene sequencing has limited resolution at species level, usually genus level is used. However in case of this dataset many genera were not annotated, therefore we will use the highest rank with the most annotated taxa, which was order level. MaAsLin2 requires input files (feature abundance and metadata) in specific format. For more information on how to run it, check out the link. The code below will prepare the input data which needs to be a data frame. Therefore, the ASV table needs to be extracted from the phyloseq object. Because a phyloseq object grouped at order level will be used - taxonomy is already added as row names to otu_table. Differential abundance analysis in this example is performed on compositionally transformed data.
ps_filt_fixed_samples_order_rel<- tax_transform(ps_filt_fixed_samples_order, trans = "compositional")
input_sample_metadata <- ps_filt_fixed_samples_order_rel%>% meta
input_data_order_Maaslin <- as.data.frame(otu_table(ps_filt_fixed_samples_order_rel))
MaAsLin2 requires a file with features as columns and samples as rows. Any other non-numeric columns (e.g., taxonomic labels) in the ASV table should be removed.
input_data_order_Maaslin <- input_data_order_Maaslin[sapply(input_data_order_Maaslin, is.numeric)]
It is important to make sure the input_metadata is a data frame.
For pairwise comparisons done in MaAsLin2 it is important to indicate which group of samples is the reference. There are 3 ways of doing that: - setting the column describing grouping variable as a factor with 1st level assigned to a control group, - changing names of the groups by prefixing the desired reference category with strings such as “a_” - MaAsLin2 by default treats the first category in alphabetical order as the reference. - by indicating which group (in which column of the metadata) is the reference.
In this tutorial we will use the last method.
Before running MaAsLin2 we need to define the output folder.
output_directory_order <- "Maaslin2_output_order"
Run the MaAsLin2 analysis. The code below uses a linear model to fit the data and identify associations between microbial features and metadata variables. Here we have set the parameters for abundance and prevalence filtering (ASVs that will be tested need to be present in at least 5% of samples). Therefore, we will use the phyloseq object before filtering by prevalence. Data is already normalized using total-sum scaling (TSS),therefore in the options of MaAsLin2 function we choose not to normalize the data for the second time. However, we choose to log transform it. P-values are adjusted using the default Benjamini-Hochberg (BH) method. For other options please check the documentation by executing C:/ProgramData/R/win-library/4.4/Maaslin2/help/Maaslin2 in the console. You can also set random effects to the model.
fit_data_order <- Maaslin2::Maaslin2(
input_data = input_data_order_Maaslin,
input_metadata = input_sample_metadata,
output = output_directory_order,
analysis_method = "LM",
fixed_effects = "observation.unit.identifier",
normalization = "NONE",
correction = "BH",
transform = "LOG",
min_abundance = 0,
min_prevalence = 0.05,
reference = "observation.unit.identifier,CW1-sand")
The result of the run is stored in the separate folder which we set above in the working directory. Visualization output files include separate boxplots for each significant association with categorical metadata and scatterplots with fitted regression line for continuous metadata. Additionally, a heatmap showing 50 top features (at order level) significantly associated with analyzed variables. All results are also saved in tables (incl. all associations and significant associations in separate files). There is also RDS file with all residuals of the model and a log file.
Full explanation can be found here: https://huttenhower.sph.harvard.edu/maaslin/
The first few lines of the significant results looks like that:
head(fit_data_order$results, n=20) %>% kbl() %>%
kableExtra::kable_paper(font_size = 12) %>%
scroll_box(width = "1000px", height = "500px")
feature | metadata | value | coef | stderr | pval | name | qval | N | N.not.zero |
---|---|---|---|---|---|---|---|---|---|
o__Gemmatales | observation.unit.identifier | CW3-biochar | -3.698429 | 0.4957888 | 0.00e+00 | observation.unit.identifierCW3-biochar | 0.0000002 | 60 | 39 |
c__Gammaproteobacteria.Class | observation.unit.identifier | CW3-biochar | 2.704977 | 0.4215605 | 0.00e+00 | observation.unit.identifierCW3-biochar | 0.0000058 | 60 | 32 |
c__Subgroup_11.Class | observation.unit.identifier | CW3-biochar | -1.769331 | 0.3071663 | 4.00e-07 | observation.unit.identifierCW3-biochar | 0.0000362 | 60 | 25 |
o__Polyangiales | observation.unit.identifier | CW3-biochar | 3.046451 | 0.5297626 | 4.00e-07 | observation.unit.identifierCW3-biochar | 0.0000362 | 60 | 59 |
o__Gemmatales | observation.unit.identifier | CW2-biochar | -3.187981 | 0.5609210 | 5.00e-07 | observation.unit.identifierCW2-biochar | 0.0000372 | 60 | 39 |
o__Thermoanaerobaculales | observation.unit.identifier | CW3-biochar | 2.907719 | 0.5315996 | 1.00e-06 | observation.unit.identifierCW3-biochar | 0.0000685 | 60 | 34 |
o__Thermomicrobiales | observation.unit.identifier | CW3-biochar | -1.801108 | 0.3509496 | 3.60e-06 | observation.unit.identifierCW3-biochar | 0.0002024 | 60 | 28 |
o__PAUC26f | observation.unit.identifier | CW3-biochar | -1.757618 | 0.3570018 | 7.60e-06 | observation.unit.identifierCW3-biochar | 0.0003762 | 60 | 28 |
c__MB.A2.108.Class | observation.unit.identifier | CW3-biochar | -1.762682 | 0.3699913 | 1.35e-05 | observation.unit.identifierCW3-biochar | 0.0005896 | 60 | 16 |
o__Ectothiorhodospirales | observation.unit.identifier | CW2-biochar | 2.132338 | 0.4523215 | 1.61e-05 | observation.unit.identifierCW2-biochar | 0.0006331 | 60 | 30 |
o__Sphingomonadales | observation.unit.identifier | CW3-biochar | -1.857668 | 0.4003936 | 2.09e-05 | observation.unit.identifierCW3-biochar | 0.0007483 | 60 | 47 |
o__Reyranellales | observation.unit.identifier | CW3-biochar | -1.753530 | 0.3874612 | 3.11e-05 | observation.unit.identifierCW3-biochar | 0.0010209 | 60 | 31 |
o__Rhodobacterales | observation.unit.identifier | CW3-biochar | -2.405116 | 0.5548591 | 6.01e-05 | observation.unit.identifierCW3-biochar | 0.0017358 | 60 | 19 |
o__RBG.13.54.9 | observation.unit.identifier | CW3-biochar | 1.540412 | 0.3560167 | 6.17e-05 | observation.unit.identifierCW3-biochar | 0.0017358 | 60 | 45 |
c__Alphaproteobacteria.Class | observation.unit.identifier | CW3-biochar | -1.931087 | 0.4564534 | 8.55e-05 | observation.unit.identifierCW3-biochar | 0.0017857 | 60 | 41 |
o__Ectothiorhodospirales | observation.unit.identifier | CW3-biochar | 1.701981 | 0.3997995 | 7.82e-05 | observation.unit.identifierCW3-biochar | 0.0017857 | 60 | 30 |
p__FCPU426.Phylum | observation.unit.identifier | CW3-biochar | 1.297098 | 0.3035762 | 7.41e-05 | observation.unit.identifierCW3-biochar | 0.0017857 | 60 | 20 |
c__uncultured_Acidobacterium_sp..Class | observation.unit.identifier | CW3-biochar | -1.452569 | 0.3396592 | 7.32e-05 | observation.unit.identifierCW3-biochar | 0.0017857 | 60 | 17 |
c__BD2.11_terrestrial_group.Class | observation.unit.identifier | CW3-biochar | -1.608018 | 0.3802796 | 8.61e-05 | observation.unit.identifierCW3-biochar | 0.0017857 | 60 | 20 |
o__Sphingomonadales | observation.unit.identifier | CW2-biochar | -1.899004 | 0.4529937 | 9.74e-05 | observation.unit.identifierCW2-biochar | 0.0018759 | 60 | 47 |
This is the heatmap.
knitr::include_graphics("./Maaslin2_output_order/figures/heatmap.png")
heatmap, MaAslin2, order level
And one of the boxplots
knitr::include_graphics("./Maaslin2_output_order/figures/observation_unit_identifier_1.png")
boxplot, MaAslin2, order level
All MaAsLin2 analyses can be run also using other taxonomic ranks.
LinDa is another approach to perform DAA which only requires fitting linear regression models on CLR-transformed abundance data as response. This method addresses the compositional effects by correcting the bias using the mode of the regression coefficients and can be extended to mixed-effect models for correlated microbiome data (Zhou et al. (2022)).
LinDA function has been integrated into the CRAN package MicrobiomeStat.
To run LinDa, a matrix of counts/proportions (row-features, column-samples) and a data frame containing sample metadata (without NAs) are needed. In this case, we will use the same phyloseq object (relative abundance, order level) as in MaAsLin2 to create the input data for LinDA.
asv.mat <- otu_table(ps_filt_fixed_samples_order_rel) %>% as.matrix()
metadata.dat <- meta(ps_filt_fixed_samples_order_rel) %>% as.data.frame()
Before running the model, consider which are the fixed and random effects in your dataset. Here, several variables could be tested for their effect on the microbiota composition (type of CW, depth, location, season). In this example, to compare LinDa results to MaasLin2 the type of CW is used as fixed effect. Similar to MaasLin2, the reference group for comparisons is typically determined by alphabetical order of the group labels in the metadata unless specified otherwise.
Prevalence filter will be set at 0.05 as in MaasLin2 above. Other parameters that need consideration for optimizing the performance of LinDa are mentioned in the R documentation of the function and in this article by Li et al. (2023).
linda.obj.order <- linda(feature.dat = asv.mat, meta.dat = metadata.dat,
feature.dat.type = 'proportion',
formula = '~observation.unit.identifier',
prev.filter = 0.05)
## 70 features are filtered!
## The filtered data has 60 samples and 221 features will be tested!
## Fit linear models ...
## Completed.
The output message gives information about the results of the filtering (prevalence and/or abundance) and the method used for handling zeros in the dataset.
The output file is an object (list of elements) which includes the variable names of the fixed effects based on the formula used, the estimated bias of the regression coefficients due to the compositional effect and a list of data frames including the results of the DAA. All relevant information can be found here.
Let’s inspect the results of LinDA for each comparison.
linda.out <- as.data.frame(linda.obj.order$output)
head(linda.out, n=20) %>% kbl() %>%
kableExtra::kable_paper(font_size = 12) %>%
scroll_box(width = "1000px", height = "500px")
observation.unit.identifierCW2.biochar.baseMean | observation.unit.identifierCW2.biochar.log2FoldChange | observation.unit.identifierCW2.biochar.lfcSE | observation.unit.identifierCW2.biochar.stat | observation.unit.identifierCW2.biochar.pvalue | observation.unit.identifierCW2.biochar.padj | observation.unit.identifierCW2.biochar.reject | observation.unit.identifierCW2.biochar.df | observation.unit.identifierCW3.biochar.baseMean | observation.unit.identifierCW3.biochar.log2FoldChange | observation.unit.identifierCW3.biochar.lfcSE | observation.unit.identifierCW3.biochar.stat | observation.unit.identifierCW3.biochar.pvalue | observation.unit.identifierCW3.biochar.padj | observation.unit.identifierCW3.biochar.reject | observation.unit.identifierCW3.biochar.df | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
k__Bacteria Domain | 2615.1623 | -0.9464056 | 0.4701471 | -2.0129990 | 0.0488465 | 0.1587511 | FALSE | 57 | 2615.1623 | -0.9225019 | 0.4155552 | -2.2199261 | 0.0304168 | 0.0873001 | FALSE | 57 |
o__Rickettsiales | 1882.3681 | -0.8726679 | 0.3684927 | -2.3682092 | 0.0212886 | 0.0931557 | FALSE | 57 | 1882.3681 | -1.2036368 | 0.3257046 | -3.6954857 | 0.0004942 | 0.0034891 | TRUE | 57 |
o__Gemmatimonadales | 16401.9780 | -1.3938145 | 0.5580967 | -2.4974426 | 0.0154208 | 0.0774682 | FALSE | 57 | 16401.9780 | -1.8783182 | 0.4932924 | -3.8077174 | 0.0003453 | 0.0029486 | TRUE | 57 |
o__Candidatus_Woesebacteria | 870.2351 | 0.0857696 | 0.2033907 | 0.4216989 | 0.6748310 | 0.7920579 | FALSE | 57 | 870.2351 | 0.2128566 | 0.1797736 | 1.1840254 | 0.2413171 | 0.3555405 | FALSE | 57 |
o__Bacteroidales | 2727.4743 | 1.4723457 | 0.7833587 | 1.8795293 | 0.0652873 | 0.1873831 | FALSE | 57 | 2727.4743 | 1.6314697 | 0.6923979 | 2.3562605 | 0.0219211 | 0.0692081 | FALSE | 57 |
o__Candidatus_Yanofskybacteria | 837.3316 | 0.0025060 | 0.1057869 | 0.0236890 | 0.9811834 | 0.9904800 | FALSE | 57 | 837.3316 | 0.1017271 | 0.0935033 | 1.0879528 | 0.2811949 | 0.3933169 | FALSE | 57 |
o__Candidatus_Peribacteria | 809.2383 | -0.0509682 | 0.2828091 | -0.1802213 | 0.8576181 | 0.9156212 | FALSE | 57 | 809.2383 | -0.1244858 | 0.2499703 | -0.4980025 | 0.6203983 | 0.6968140 | FALSE | 57 |
o__Gemmatales | 18857.2374 | -3.1372107 | 0.5098128 | -6.1536524 | 0.0000001 | 0.0000178 | TRUE | 57 | 18857.2374 | -3.6320969 | 0.4506151 | -8.0603087 | 0.0000000 | 0.0000000 | TRUE | 57 |
o__Planctomycetales | 23581.0263 | -2.7635271 | 0.7660334 | -3.6075803 | 0.0006519 | 0.0146899 | TRUE | 57 | 23581.0263 | -1.8579543 | 0.6770843 | -2.7440517 | 0.0081000 | 0.0314055 | TRUE | 57 |
o__Pirellulales | 29513.5682 | -0.4906958 | 0.5191926 | -0.9451133 | 0.3485902 | 0.5135896 | FALSE | 57 | 29513.5682 | -0.6835982 | 0.4589057 | -1.4896266 | 0.1418353 | 0.2487745 | FALSE | 57 |
o__Opitutales | 3113.7280 | 1.5026224 | 0.6038263 | 2.4885012 | 0.0157741 | 0.0774682 | FALSE | 57 | 3113.7280 | 1.5131727 | 0.5337121 | 2.8351855 | 0.0063269 | 0.0263819 | TRUE | 57 |
o__S-BQ2-57_soil_group | 2211.2237 | -0.8820417 | 0.3743333 | -2.3563001 | 0.0219190 | 0.0931557 | FALSE | 57 | 2211.2237 | -1.2297018 | 0.3308671 | -3.7166041 | 0.0004621 | 0.0034044 | TRUE | 57 |
o__Methylacidiphilales | 700.1043 | 0.2995374 | 0.1921255 | 1.5590716 | 0.1245149 | 0.2942982 | FALSE | 57 | 700.1043 | 0.4777785 | 0.1698166 | 2.8134973 | 0.0067130 | 0.0274587 | TRUE | 57 |
o__Verrucomicrobiales | 6260.8613 | -1.8827093 | 0.4593685 | -4.0984728 | 0.0001332 | 0.0058880 | TRUE | 57 | 6260.8613 | -1.4976424 | 0.4060282 | -3.6885178 | 0.0005052 | 0.0034891 | TRUE | 57 |
o__Pedosphaerales | 13430.6233 | -0.2928035 | 0.4951305 | -0.5913663 | 0.5566131 | 0.7151831 | FALSE | 57 | 13430.6233 | -0.4257263 | 0.4376376 | -0.9727826 | 0.3347716 | 0.4377782 | FALSE | 57 |
o__Chthoniobacterales | 11366.3380 | -1.0685677 | 0.5999567 | -1.7810745 | 0.0802288 | 0.2136212 | FALSE | 57 | 11366.3380 | -1.6244625 | 0.5302918 | -3.0633367 | 0.0033392 | 0.0160426 | TRUE | 57 |
c__OM190 Class | 5451.4862 | -1.3953827 | 0.4325716 | -3.2257841 | 0.0020825 | 0.0230117 | TRUE | 57 | 5451.4862 | -1.1341031 | 0.3823429 | -2.9661934 | 0.0043989 | 0.0198056 | TRUE | 57 |
o__11-24 | 1514.9431 | -0.7997962 | 0.3304210 | -2.4205372 | 0.0187066 | 0.0879609 | FALSE | 57 | 1514.9431 | -0.6364988 | 0.2920536 | -2.1793902 | 0.0334496 | 0.0918395 | FALSE | 57 |
o__Rhizobiales | 115463.1362 | 0.1687520 | 0.4340918 | 0.3887474 | 0.6989114 | 0.7920996 | FALSE | 57 | 115463.1362 | 0.2083787 | 0.3836866 | 0.5430962 | 0.5891812 | 0.6746582 | FALSE | 57 |
o__Rhodobacterales | 3731.6519 | -1.2185165 | 0.5947999 | -2.0486158 | 0.0451144 | 0.1533890 | FALSE | 57 | 3731.6519 | -2.2240341 | 0.5257338 | -4.2303425 | 0.0000856 | 0.0011125 | TRUE | 57 |
LinDA results can be plotted using the ‘linda.plot’ function which produces the effect size plot of the differential features and volcano plots based on the output. For more details on how the function works, see here.
In this function, the object from function linda is used as input and other arguments such as the variable whose results are to be plotted, significance level (alpha), log2fc cut-off and directory to save the output plots can be set.
The output file is a list of ggplot2 objects and more specifically, a list of effect size plots and a list of volcano plots found in the output directory. In both cases, each plot corresponds to one variable. Plots are saved in pdf format in the output directory. To store the figures separately, see below.
lfc.plot1 <- linda.res.plot.order$plot.lfc[[1]]
volc.plot1 <- linda.res.plot.order$plot.volcano[[1]]
ggsave("./linda_plots/lfc.plot1.png", lfc.plot1, device = png, width = 10, height = 7)
ggsave("./linda_plots/volc.plot1.png", volc.plot1, device = png, width = 10, height = 7)
knitr::include_graphics(path = "./linda_plots/lfc.plot1.png")
This is one of the effect size plots of the differentially abundant features (taxa at order level) between the different types of wetlands in the dataset. Debiased effect size is the adjusted effect size that accounts for the compositional nature of microbiome data (corrected for systematic biases-more reliable). Biased effect size is also reported for comparative reasons and to understand the magnitude of correction applied. It allows to see how much the compositional nature of the data may have affected the raw estimates.
knitr::include_graphics(path = "./linda_plots/volc.plot1.png")
This is one of the volcano plots of the differentially abundant features (taxa at order level) between the different types of wetlands in the dataset. Taxa (blue data points) in the upper right corner are enriched in the test group (CW2 or CW3) while in the upper left corner are depleted in the test group or enriched in the reference group (CW1).
It is expected that different tools will give different output as they might test different hypotheses and use different statistical tests, transformations and they may assume different data distribution. Here, both MaAsLin2 and LinDA apply linear models, however in MaasLin2 - log transformation (default) was used, whereas in LinDA - CLR. Additionally, LinDA applied bias correction, that was initially introduced by ANCOM-BC workflow (Lin, Peddada, 2020).
To compare the results we will check how many features have statistically significantly different abundance when using both workflows. In order to do so, we will combine the results summary from MaAsLin2 and LinDA and focus on features with adjusted p value <0.05. Before merging results from both pipelines, we need to unify the names of the features, since MaAsLin2 replaced spaces, dashes and brackets with dots. Therefore, we will do the same with LinDA output.
linda.out_p.adj<- linda.out %>% rownames_to_column(var = "feature") %>%
select(feature, observation.unit.identifierCW2.biochar.padj, observation.unit.identifierCW3.biochar.padj) %>%
mutate(feature = str_replace_all(feature, " ", ".")) %>%
mutate(feature = str_replace_all(feature, "-", ".")) %>%
mutate(feature = str_replace_all(feature, "\\(", ".")) %>%
mutate(feature = str_replace_all(feature, "\\)", ".")) %>%
dplyr::rename(CW2 = observation.unit.identifierCW2.biochar.padj, CW3 = observation.unit.identifierCW3.biochar.padj) %>%
pivot_longer(cols = c(CW2, CW3), names_to = "name", values_to = "p.adj_LinDA") %>%
mutate(feature_group = paste0(feature, "_", name))
maaslin_p.adj<- fit_data_order$results %>% select(feature, value, qval) %>%
dplyr::rename(name = value, p.adj_Maaslin = qval) %>%
mutate(name = case_when(name == "CW2-biochar" ~ "CW2",
name == "CW3-biochar" ~ "CW3")) %>%
mutate(feature_group = paste0(feature, "_", name))
Although the same prevalence filtering was set in case of MaasLin2 and LinDA, the amount of tested features (taxa) is different. As we can see below all tested features included in MaAsLin2 analysis can also be found in LinDA analysis.However, it is not true for the reverse association.
all(maaslin_p.adj$feature %in% linda.out_p.adj$feature)
## [1] TRUE
all(linda.out_p.adj$feature %in% maaslin_p.adj$feature)
## [1] FALSE
The reason for that is that in the case of LinDA more features were
included in the analysis after applying the prevalence filter. In the
case of MaAsLin2, it is
{r length(fit_data_order\$results %\>% select(feature) %\>% unique %\>% pull(feature))}
whereas in the case of LinDA {r dim(linda.out)[1]}
.
Different number of features are left after prevalence filtering due to
the used formula: - MaAsLin2 keeps features when present in higher
amount of samples than the prevalence threshold (in this case >3,
which means at least 4). - LinDA keep features when at least a
prevalence threshold is reached (>=3, which means at least 3).
joined_result<- full_join(linda.out_p.adj, maaslin_p.adj[,c(3,4)], by = "feature_group")
joined_result_signif<-joined_result %>% filter(p.adj_LinDA<0.05 | p.adj_Maaslin<0.05) %>%
mutate(feature_name = paste(feature, "_", name))
The Venn diagrams presented below show the number of differentially abundant features which were detected using both pipelines (intersection of circles), only by LinDA (blue) and only by MaAsLin2 (red) when comparing pairwisely the types of CW.
LinDA_CW2<- joined_result_signif%>%filter(name == "CW2" & p.adj_LinDA<0.05) %>% pull(feature)
LinDA_CW3<- joined_result_signif%>%filter(name == "CW3" & p.adj_LinDA<0.05) %>% pull(feature)
Maaslin_CW2<- joined_result_signif%>%filter(name == "CW2" & p.adj_Maaslin<0.05) %>% pull(feature)
Maaslin_CW3<- joined_result_signif%>%filter(name == "CW3" & p.adj_Maaslin<0.05) %>% pull(feature)
Venn_CW2<- list(LinDA = LinDA_CW2, Maaslin2 = Maaslin_CW2)
Venn_CW3<- list(LinDA = LinDA_CW3, Maaslin2 = Maaslin_CW3)
Venn_CW2 <- Venn(Venn_CW2)
Venn_CW2 <- process_data(Venn_CW2, shape_id = "201")
venn_plot_CW2<-plot_venn(Venn_CW2)+
scale_fill_distiller(palette = "RdBu")+
ggtitle("CW1 vs. CW2")
Venn_CW3 <- Venn(Venn_CW3)
Venn_CW3 <- process_data(Venn_CW3, shape_id = "201")
venn_plot_CW3<- plot_venn(Venn_CW3)+
scale_fill_distiller(palette = "RdBu")+
ggtitle("CW1 vs. CW3")
ggarrange(venn_plot_CW2, venn_plot_CW3, nrow=2)
Over 70% of the total number of features were detected by both pipelines indicating a significant overlap in the results. In general, MaAsLin2 detects higher number of differentially abundant features in each comparison that were not detected by LinDA.
Features shown to be different using MaAsLin2 only:
Features shown to be different using LinDA only:
The relationship between log2 fold changes (between reference and tested group) calculated by LinDA and coefficients calculated by MaAsLin2 can be assessed by the following scatter plots. The strength of the relationship was measured by calculating a Pearson correlation coefficient (R).
linda.out_log_change<- linda.out %>% rownames_to_column(var = "feature") %>%
select(feature, CW2 = observation.unit.identifierCW2.biochar.log2FoldChange, CW3 = observation.unit.identifierCW3.biochar.log2FoldChange) %>%
mutate(feature = str_replace_all(feature, " ", ".")) %>%
mutate(feature = str_replace_all(feature, "-", ".")) %>%
mutate(feature = str_replace_all(feature, "\\(", ".")) %>%
mutate(feature = str_replace_all(feature, "\\)", ".")) %>%
pivot_longer(cols = c(CW2, CW3), names_to = "name", values_to = "log_change_LinDA") %>%
mutate(feature_name = paste0(feature, "_", name))
Maaslin_coef<- fit_data_order$results %>% select(feature, name = value, coef_Maaslin2 = coef) %>%
mutate(name = case_when(name == "CW2-biochar" ~ "CW2",
name == "CW3-biochar" ~ "CW3")) %>%
mutate(feature_name = paste0(feature, "_", name))
joined_result_coef<-linda.out_log_change %>% full_join(Maaslin_coef[,c(3,4)], by="feature_name") %>%
left_join(joined_result[,c(3,4,5)], by = c("feature_name" = "feature_group")) %>% mutate(signif = case_when(p.adj_Maaslin <=0.05 & p.adj_LinDA <= 0.05 ~ "signif_both",
p.adj_Maaslin <=0.05 & p.adj_LinDA > 0.05 ~ "signif_Maaslin2",
p.adj_Maaslin >0.05 & p.adj_LinDA <= 0.05 ~ "signif_LinDA",
is.na(p.adj_Maaslin) & p.adj_LinDA <= 0.05 ~ "signif_LinDA",
TRUE ~ "NS"))
joined_result_coef %>%
ggplot(aes(x = log_change_LinDA, y = coef_Maaslin2)) +
geom_jitter(shape = 21, alpha = 0.6, size = 2, width = 0.3, height = 0.1, aes(fill = signif))+
stat_cor(method = "pearson", label.x = -2, label.y = 2)+
facet_wrap(~name)+
scale_fill_nejm()+
theme_bw()
On the plot we can see that the direction of differences is the same. The two variables have a positive association and a strong linear relationship.
sessionInfo()
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_World.utf8 LC_CTYPE=English_World.utf8
## [3] LC_MONETARY=English_World.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_World.utf8
##
## time zone: Europe/Berlin
## tzcode source: internal
##
## attached base packages:
## [1] stats4 grid stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ggVennDiagram_1.5.2 stringr_1.5.1 plyr_1.8.9
## [4] dplyr_1.1.4 MicrobiomeStat_1.2 biomeUtils_0.022
## [7] phylotools_0.2.2 Biostrings_2.72.1 GenomeInfoDb_1.40.1
## [10] XVector_0.44.0 IRanges_2.38.1 S4Vectors_0.42.1
## [13] BiocGenerics_0.50.0 ape_5.8 ComplexHeatmap_2.20.0
## [16] ranacapa_0.1.0 patchwork_1.3.0 GGally_2.2.1
## [19] Maaslin2_1.18.0 here_1.0.1 rstatix_0.7.2
## [22] kableExtra_1.4.0 ggsci_3.2.0 forcats_1.0.0
## [25] RColorBrewer_1.1-3 scales_1.3.0 pals_1.9
## [28] ggpubr_0.6.0 tidyr_1.3.1 tibble_3.2.1
## [31] ggtree_3.12.0 microViz_0.12.6 vegan_2.6-8
## [34] lattice_0.22-6 permute_0.9-7 microbiome_1.26.0
## [37] ggplot2_3.5.1 phyloseq_1.48.0 taxize_0.10.0
## [40] rentrez_1.2.3
##
## loaded via a namespace (and not attached):
## [1] splines_4.4.1 urltools_1.7.3 ggplotify_0.1.2
## [4] triebeard_0.4.1 rpart_4.1.24 XML_3.99-0.18
## [7] lifecycle_1.0.4 Rdpack_2.6.2 doParallel_1.0.17
## [10] rprojroot_2.0.4 MASS_7.3-61 backports_1.5.0
## [13] magrittr_2.0.3 sass_0.4.9 rmarkdown_2.29
## [16] jquerylib_0.1.4 yaml_2.3.10 cowplot_1.1.3
## [19] pbapply_1.7-2 mapproj_1.2.11 DBI_1.2.3
## [22] minqa_1.2.8 ade4_1.7-22 maps_3.4.2.1
## [25] abind_1.4-8 zlibbioc_1.50.0 Rtsne_0.17
## [28] quadprog_1.5-8 purrr_1.0.4 hash_2.2.6.3
## [31] yulab.utils_0.2.0 seriation_1.5.7 circlize_0.4.16
## [34] GenomeInfoDbData_1.2.12 rmutil_1.1.10 ggrepel_0.9.6
## [37] tidytree_0.4.6 crul_1.5.0 pheatmap_1.0.12
## [40] spatial_7.3-18 svglite_2.1.3 codetools_0.2-20
## [43] getopt_1.20.4 xml2_1.3.6 tidyselect_1.2.1
## [46] shape_1.4.6.1 aplot_0.2.4 httpcode_0.3.0
## [49] UCSC.utils_1.0.0 farver_2.1.2 stable_1.1.6
## [52] lme4_1.1-36 TSP_1.2-4 matrixStats_1.5.0
## [55] jsonlite_1.8.9 GetoptLong_1.0.5 multtest_2.60.0
## [58] Formula_1.2-5 survival_3.7-0 iterators_1.0.14
## [61] systemfonts_1.2.1 foreach_1.5.2 tools_4.4.1
## [64] treeio_1.28.0 Rcpp_1.0.13 glue_1.8.0
## [67] xfun_0.49 mgcv_1.9-1 ca_0.71.1
## [70] numDeriv_2016.8-1.1 withr_3.0.2 timeSeries_4041.111
## [73] fastmap_1.2.0 boot_1.3-31 rhdf5filters_1.16.0
## [76] digest_0.6.37 R6_2.6.0 gridGraphics_0.5-1
## [79] colorspace_2.1-1 Cairo_1.6-2 dichromat_2.0-0.1
## [82] modeest_2.4.0 utf8_1.2.4 generics_0.1.3
## [85] data.table_1.16.2 DECIPHER_3.0.0 robustbase_0.99-4-1
## [88] httr_1.4.7 ggstats_0.8.0 pkgconfig_2.0.3
## [91] gtable_0.3.6 timeDate_4041.110 registry_0.5-1
## [94] picante_1.8.2 pcaPP_2.0-5 htmltools_0.5.8.1
## [97] carData_3.0-5 biomformat_1.32.0 clue_0.3-66
## [100] logging_0.10-108 Biobase_2.64.0 png_0.1-8
## [103] optparse_1.7.5 reformulas_0.4.0 ggfun_0.1.8
## [106] knitr_1.49 rstudioapi_0.17.1 reshape2_1.4.4
## [109] rjson_0.2.23 statip_0.2.3 nloptr_2.1.1
## [112] nlme_3.1-166 curl_6.2.0 cachem_1.1.0
## [115] zoo_1.8-12 rhdf5_2.48.0 GlobalOptions_0.1.2
## [118] parallel_4.4.1 fBasics_4041.97 pillar_1.10.1
## [121] vctrs_0.6.5 car_3.1-3 cluster_2.1.6
## [124] evaluate_1.0.3 mvtnorm_1.3-3 cli_3.6.3
## [127] compiler_4.4.1 rlang_1.1.4 crayon_1.5.3
## [130] ggsignif_0.6.4 labeling_0.4.3 fs_1.6.5
## [133] stringi_1.8.4 viridisLite_0.4.2 lmerTest_3.1-3
## [136] munsell_0.5.1 lazyeval_0.2.2 Matrix_1.7-1
## [139] stabledist_0.7-2 Rhdf5lib_1.26.0 statmod_1.5.0
## [142] rbibutils_2.3 igraph_2.1.1 broom_1.0.7
## [145] bslib_0.9.0 phangorn_2.12.1 biglm_0.9-3
## [148] fastmatch_1.1-6 DEoptimR_1.1-3-1