Introduction

The following script shows basic analyses of 16S rRNA amplicon sequencing data.If needed RMarkdown file can be downloaded here.
Script includes some pre-processing steps, as well as calculations, statistical analyses and visualizations. Raw paired-end read files were used as input to generate de novo Amplicon Sequence Variants (ASVs) with NG-Tax 2.0.

This sample dataset comes from a project aimed at studying the role of microbial communities on micropollutant removal in bark-biochar vertical subsurface flow constructed wetlands. Since conventional wastewater treatment plants are inefficient in removing micropollutants, there is a need for post-treatment methods that are feasible and have high removal efficiency. One of such treatment methods might be constructed wetlands (CW), which are man-made structures built to imitate the mechanisms of natural wetland systems, such as shallow ponds, mangroves, trenches, etc. In CW, a basin is filled with the solid material on which microorganisms can form biofilms. Subsequently, natural processes, such as biodegradation, sorption and plant uptake, result in the removal of micropollutants.

In this project CW basins were filled with a following solid filling material:

Sand (group CW1-sand) - which is conventional filling
Combination of sand and bark-biochar (built with an innovative layer of a combination of bark and biochar in between layers of sand) (CW2-biochar)
Layer of bark-biochar only (CW3-biochar)

The water was distributed over the top of the CW and subsequently kept in the system for 8 h. One of the aims of the study was to identify the microorganisms that are present in 3 types of constructed wetlands with different solid filling material over the depth of the constructed wetland. Samples were taken from 3 CW from the following depths:

30 cm
50 cm
70 cm
90 cm

All constructed wetlands were divided into 2 parts (A and D) and samples were taken from both locations. Additionally, samples were taken in winter and summer, since the temperature and other environmental factors might have the impact on the composition of the wetlands microbiome and therefore their functional properties.

Preparations

Attaching necessary libraries

knitr::opts_chunk$set(echo = TRUE, include=TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra = "")
library(rentrez)
library(taxize)
library(phyloseq)
library(microbiome)
library(vegan)
library(microViz)
library(ggtree)
library(tibble)
library(tidyr)
library(ggplot2)
library(ggpubr)
library(pals)
library(scales)
library(RColorBrewer)
library(forcats)
library(ggsci)
library(kableExtra)
library(rstatix)
library(here)
library(Maaslin2)
library(GGally)
library(patchwork)
library(ranacapa)
library(ComplexHeatmap)
library(ape)
library(Biostrings)
library(phylotools)
library(biomeUtils)
library(MicrobiomeStat)
library(dplyr)
library(plyr)
library(stringr)
library(ggVennDiagram)

Setting the work directory

Remember to indicate the directory where all your files will be saved. The working directory should contain the biom file as exported by NG-Tax 2.0 and the metadata file. The metadata file should include descriptive information about your experiment, such as collection date, associated environment or location, used to contextualize and analyze data.

Below an example is given. This code will set the working directory to the path where the project is located.

here::i_am("basic_script.Rmd")
script_path <- here::here()
cat("The project root is located at:", script_path, "\n")

## The project root is located at: C:/Users/malin009/OneDrive - Wageningen University & Research/Dry_lab_intro/tutorials/16S/wetlands

setwd(script_path)

Loading the NGTax output and creating a phyloseq object

Phyloseq objects are commonly used for analyzing microbiome data in R. The phyloseq object can contain the following objects:

otu_table(): a table that contains in the columns the sample_ID and the rows the OTUs/ASVs, the OTU/ASV abundance (count) table per sample
tax_table():A table that presents the taxonomic assignment of each OTU
sample_data(): A table that contains the metadata of the samples
phy_tree(): The phylogenetic tree representing the evolutionary relationships between the microbial taxa in the dataset
refseq(): DNA string set - OTUs/ASVs sequences

The biom file created by NG-Tax 2.0 is composed only of otu_table and tax_table. Metadata and taxonomy tree need to be attached separately. In the metadata file, sample identifiers need to be the same as in the mapping file for NG-Tax 2.0. Additionally, if the composition of reference mocks 3 and 4 needs to be added manually, add 2 samples in the metadata file, namely refmock3 and refmock4. Only sample identifiers need to be given. All other columns can be blank.

The taxonomy tree is stored in the NG-Tax output under the name “all_otus.tree” and all sequences used to generate the tree are stored in the file “all_otus.fasta”.

ps <- read_phyloseq(otu.file = "wetlands_filling_tmock.biom", 
                           taxonomy.file = NULL,
                           metadata.file = "metadata_UNLOCK_wetlands.csv", 
                           type = "biom")

treefile <- read.tree("all_otus_mock.tree")
ps <- merge_phyloseq(ps, treefile)
print(ps)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 4074 taxa and 69 samples ]
## sample_data() Sample Data:       [ 69 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 4074 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 4074 tips and 4073 internal nodes ]

ps_sum<- summarize_phyloseq(ps)
data.frame(do.call(rbind, ps_sum[c(1:10)]))

##                                                                          do.call.rbind..ps_sum.c.1.10...
## 1                                                                           1] Min. number of reads = 14
## 2                                                                       2] Max. number of reads = 917500
## 3                                                                     3] Total number of reads = 3018962
## 4                                                          4] Average number of reads = 43753.0724637681
## 5                                                                      5] Median number of reads = 17925
## 6                                                                        7] Sparsity = 0.929852084267145
## 7                                                                        6] Any OTU sum to 1 or less? NO
## 8                                                                            8] Number of singletons = 0
## 9  9] Percent of OTUs that are singletons \n        (i.e. exactly one read detected across all samples)0
## 10                                                                 10] Number of sample variables are: 7

Here, you can check if your phyloseq object includes all the necessary data. Plus, the summary prints basic information of the data (reads, sparsity, singletons).

Inspecting parts of the phyloseq object

This code will return first 10 rows of otu table.

head(ps@otu_table, n=10) %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "800px", height = "300px")

	1_CW1_A_D90_1	1_CW2_D_D70_1	1_CW3_A_D70_1	1_CW3_A_D70_2	1_CW3_A_D93_1	1_CW3_A_D93_2	1_CW3_D_D93_1	1_CWI_D_D90_2	2_CW1_A_50_1	2_CW1_A_90_2	2_CW1_D_50_1	2_CW2_A_50_1	2_CW2_A_70_1	2_CW3_A_70_1	2_CW3_A_70_2
1632941857	0	70	0	0	0	0	0	0	0	0	0	0	0	0	0
1632942152	0	0	23	12	0	0	0	0	0	0	0	0	0	0	0
1632942190	0	0	0	35	0	0	0	0	0	0	0	83	57	46	21
1632941008	15	0	0	0	0	0	0	0	0	0	0	0	0	0	0
1632942270	0	0	0	0	35	21	0	0	0	0	0	0	0	0	0
1632941052	13	0	0	0	0	0	0	0	0	0	0	0	0	0	0
163294987	20	0	0	0	0	0	0	0	0	0	0	0	0	0	0
163294896	133	0	0	0	0	0	0	0	39	60	57	0	0	0	0
1632942641	0	0	0	0	0	0	0	6	0	0	0	0	0	0	0
1632942416	0	0	0	0	0	0	52	0	0	0	0	0	0	0	0

The samples are in columns and ASVs are in rows. The values in the table indicate the absolute abundance (counts) of reads per ASV per sample.

This code will return first 10 rows of taxonomy table.

head(ps@tax_table, n=10) %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "800px", height = "300px")

	Domain	Phylum	Class	Order	Family	Genus	Species
1632941857	NA	p__	c__	o__	f__	g__	s__
1632942152	NA	p__	c__	o__	f__	g__	s__
1632942190	NA	p__	c__	o__	f__	g__	s__
1632941008	NA	p__	c__	o__	f__	g__	s__
1632942270	NA	p__	c__	o__	f__	g__	s__
1632941052	NA	p__	c__	o__	f__	g__	s__
163294987	NA	p__	c__	o__	f__	g__	s__
163294896	NA	p__	c__	o__	f__	g__	s__
1632942641	NA	p__	c__	o__	f__	g__	s__
1632942416	k__Bacteria	p__Proteobacteria	c__Alphaproteobacteria	o__Rickettsiales	f__Mitochondria	g__	s__

We can see that many ASVs are not assigned to any taxonomy (“NA”). What is more, in some cases there is missing genus (“g__”) or family (“f__”) level taxonomy. This can happen for several reasons, such as lack of reference sequences in the database or low sequence similarity. When there are many mismatches with a sequence in the database or the sequences cannot be distinguished with high confidence, NG-Tax does not assign taxonomy or classifies one rank lower to ensure correct taxonomic assignment.

Domain	Phylum	Class	Order	Family	Genus	Species
k__Bacteria	p__Proteobacteria	c__Alphaproteobacteria	o__Rickettsiales	f__Mitochondria	g__	s__
k__Bacteria	p__Proteobacteria	c__Alphaproteobacteria	o__Rickettsiales	f__	g__	s__
k__Bacteria	p__Proteobacteria	c__Alphaproteobacteria	o__Rickettsiales	f__Mitochondria	g__	s__
k__Bacteria	p__Proteobacteria	c__Alphaproteobacteria	o__Rickettsiales	f__Mitochondria	g__	s__
k__Bacteria	p__Proteobacteria	c__Alphaproteobacteria	o__Rickettsiales	f__Mitochondria	g__	s__
k__Bacteria	p__Proteobacteria	c__Alphaproteobacteria	o__Rickettsiales	f__Mitochondria	g__	s__
k__Bacteria	p__Proteobacteria	c__Alphaproteobacteria	o__Rickettsiales	f__	g__	s__
k__Bacteria	p__Proteobacteria	c__Alphaproteobacteria	o__Rickettsiales	f__Mitochondria	g__	s__
k__Bacteria	p__Cyanobacteria	c__Cyanobacteriia	o__Chloroplast	f__	g__	s__
k__Bacteria	p__Bacteroidota	c__Bacteroidia	o__Bacteroidales	f__	g__	s__

It is good to clean a bit the taxonomy table before you proceed with the analysis and for example change the names of ASVs in such a way that the highest known taxonomy name is always shown (for example when Family is known, but for Genus no annotation was found). This way each ASV will keep as much of individuality as possible. It can be done manually or using the tax_fix function from microViz package (see Preprocessing Steps below).

To inspect the phylogenetic tree attached to the phyloseq object and generated from the ASV sequences, the following code can be used:

ggtree(ps@phy_tree, options(ignore.negative.edge=TRUE))

Metadata information can also be inspected as shown below. The code will print the first 10 rows of the sample data attached to the phyloseq object.

metadata<- ps %>% meta()
head(metadata, n=10) %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "800px", height = "300px")

	sample.description	sample.name	observation.unit.identifier	collection.date	season	location	depth
1_CW1_A_D30_1	Lib1_CW1-sand_A_D30_30_Jan_23_Sandfilled	1_Lib1_CW1-sand_A_D30	CW1-sand	1/30/2023	winter	A	30
1_CW1_A_D50_1	Lib1_CW1-sand_A_D50_30_Jan_23_Sandfilled	2_Lib1_CW1-sand_A_D50	CW1-sand	1/30/2023	winter	A	50
1_CW1_A_D50_2	Lib1_CW1-sand_A_D50_30_Jan_23_Sandfilled	3_Lib1_CW1-sand_A_D50	CW1-sand	1/30/2023	winter	A	50
1_CW1_A_D70_1	Lib1_CW1-sand_A_D70_30_Jan_23_Sandfilled	4_Lib1_CW1-sand_A_D70	CW1-sand	1/30/2023	winter	A	70
1_CW1_A_D70_2	Lib1_CW1-sand_A_D70_30_Jan_23_Sandfilled	5_Lib1_CW1-sand_A_D70	CW1-sand	1/30/2023	winter	A	70
1_CW1_A_D90_1	Lib1_CW1-sand_A_D90_30_Jan_23_Sandfilled	6_Lib1_CW1-sand_A_D90	CW1-sand	1/30/2023	winter	A	90
1_CW1_D_D30_1	Lib1_CW1-sand_D_D30_30_Jan_23_B/B1filled	7_Lib1_CW1-sand_D_D30	CW1-sand	1/30/2023	winter	D	30
1_CW1_D_D50_1	Lib1_CW1-sand_D_D50_30_Jan_23_B/B1filled	8_Lib1_CW1-sand_D_D50	CW1-sand	1/30/2023	winter	D	50
1_CW2_A_D50_2	Lib1_CW2-biochar_A_D50_30_Jan_23_B/B1filled	9_Lib1_CW2-biochar_A_D50	CW2-biochar	1/30/2023	winter	A	50
1_CW2_A_D70_1	Lib1_CW2-biochar_A_D70_30_Jan_23_B/B1filled	10_Lib1_CW2-biochar_A_D70	CW2-biochar	1/30/2023	winter	A	70

Preprocessing steps

Cleaning taxonomy table

In the process of 16S gene amplification some eukaryotic sequences, like chloroplasts or mitochondria, are likely to be amplified and should be identified and discarded (Hanshew et al., 2013). The code below shows the number of ASVs annotated as “o__Chloroplast” and “f__Mitochondria” and calculates the fraction of these ASVs. Before filtering out contaminant sequences, it is important to know their proportion to have a better overview of the data.

Chloroplast <- ntaxa(subset_taxa(ps, Order == "o__Chloroplast"))
Mitochondria <- ntaxa(subset_taxa(ps, Family == "f__Mitochondria"))
chl_mt_df<- data.frame(Taxa = c("Chloroplast", "Mitochondria"),
                       nASVs = c(Chloroplast, Mitochondria)) %>%
  mutate(fraction = round(nASVs/ntaxa(ps), 3))
chl_mt_df

##           Taxa nASVs fraction
## 1  Chloroplast    48    0.012
## 2 Mitochondria     8    0.002

The code below will discard ASVs classified as Mitochondria and Chloroplasts.

ps_filt<- subset_taxa(ps, Order != "o__Chloroplast")
ps_filt<- subset_taxa(ps_filt, Family != "f__Mitochondria")

The primers used for 16S rRNA amplification are designed to target both Bacteria and Archaea and therefore, it’s expected to find both in the dataset. Additionally, in the taxonomy table there might be other ASVs that are not annotated to any known Domain (“NA”). The table below shows the number of ASVs and their fraction classified at the Domain level, including ASVs of unknown Domain.

ps_filt@tax_table %>% data.frame() %>% group_by(Domain) %>% dplyr::count() %>%
  mutate(fraction = round(n/ntaxa(ps_filt), 3))

## # A tibble: 3 × 3
## # Groups:   Domain [3]
##   Domain          n fraction
##   <chr>       <int>    <dbl>
## 1 NA             55    0.014
## 2 k__Archaea     47    0.012
## 3 k__Bacteria  3916    0.975

There are 55 ASVs (1.4%) not assigned to any Domain and 47 ASVs (1.2%) assigned to Archaea. You may check the ASVs sequences for unassigned taxa by searching in the “all_otus.fasta” and then Blast them to make sure these are not bacteria. The code below parses the sequences of unassigned ASVs into a separate file that can be uploaded into BLAST.

all_asvs <- readDNAStringSet("all_otus.fasta")
unassigned_asvs <- taxa_names(subset_taxa(ps, Domain == "NA") %>% tax_table())
all_asvs_df <- as.data.frame(all_asvs)
unassigned_asvs <- all_asvs_df %>% filter(row.names(all_asvs_df) %in% unassigned_asvs) %>%
       rownames_to_column(var = "seq.name") %>% `colnames<-`(c("seq.name", "seq.text"))
dat2fasta(unassigned_asvs, outfile = "out.fasta")

## out.fasta has been saved to  C:/Users/malin009/OneDrive - Wageningen University & Research/Dry_lab_intro/tutorials/16S/wetlands

The blast is done against “core nucleotide database” and we advise to limit maximum target sequences to 10, this way it is easier to go through all the data. To upload the result in R, a result of blast in csv format should be downloaded (here: “out.csv”). It is not possible to download result with taxonomy description, therefore packages rentrez and taxize will be used to retrieve this information. They use API to get the data from ncbi website.

blast_df <- read.csv("out.csv", header = FALSE)
colnames(blast_df) <- c("ASV", "sseqid", "per.ident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")
sseqid<- unique(blast_df$sseqid)
taxa<- genbank2uid(sseqid)
taxa_df<- ldply(taxa) %>% dplyr::rename(id = V1) %>% mutate(GenBank = sseqid, name = sapply(taxa, function(x) attr(x, "name")))
tax_ids<-unique(taxa_df$id)
tax_summary<- entrez_summary(tax_ids, db = "taxonomy")
entrez_summary<-ldply(tax_summary, data.frame)
entrez_summary_sel<-entrez_summary %>%select(taxid, division, scientificname, genus, species, genbankdivision) %>% mutate(taxid = as.character(taxid))
blast_result<- blast_df %>% left_join(taxa_df, by=c("sseqid" = "GenBank")) %>% left_join(entrez_summary_sel, by=c("id" = "taxid"))

In order to decide whether each ASV comes from bacteria or not we will leave only the top result for each ASV (the one that has the lowest E value, which is the number of expected hits of similar quality (score) that could be found just by chance). Information about classification of taxa is stored as “division”. To simplify the output we will add a new column which will inform whether a sequence comes from bacteria, archae or eukaryota. All remaining ones are “unclassified sequences”. First we need to check what unique “divisions” are in our output and decide about new, simplified grouping (Be careful - this part of code is specific to the output).

unique(blast_result$division)

##  [1] "bacteria"                        "d-proteobacteria"               
##  [3] "ascomycete fungi"                "unclassified sequences"         
##  [5] "diatoms"                         "proteobacteria"                 
##  [7] "green algae"                     "eukaryotes"                     
##  [9] "brown algae"                     "archaea"                        
## [11] "euryarchaeotes"                  "firmicutes"                     
## [13] "flowering plants"                "high G+C Gram-positive bacteria"
## [15] "CFB group bacteria"

Bacteria<- c("bacteria", "d-proteobacteria", "proteobacteria", "firmicutes", "high G+C Gram-positive bacteria", "CFB group bacteria")
Eukaryota<- c("ascomycete fungi", "diatoms", "green algae", "eukaryotes", "brown algae", "flowering plants")
Archaea<- c("archaea", "euryarchaeotes")
blast_result<- blast_result %>% mutate(Domain = case_when(division %in% Bacteria ~ "Bacteria",
                                                division %in% Eukaryota ~ "Eukaryota",
                                                division %in% Archaea ~ "Archaea",
                                                TRUE ~ division))
blast_annot<- blast_result %>% group_by(ASV) %>% slice_head(n=1)%>% select(ASV, Domain)
Bacteria_ASVs<- blast_annot %>% filter(Domain == "Bacteria") %>% pull(ASV)
Archaea_ASVs<- blast_annot %>% filter(Domain == "Archaea") %>% pull(ASV)
Eukaryota_ASVs<- blast_annot %>% filter(Domain == "Eukaryota") %>% pull(ASV)
unclassified_prokaryote_ASVs<- blast_annot %>% filter(Domain == "unclassified sequences") %>% pull(ASV)

Only ASVs annotated in blast output as bacteria will be kept in the final phyloseq object. They account for 54.5454545 % of all blasted sequences.

The code below shows how to manually change annotations in the tax_table so all ASVs that are Bacteria but not annotated with SILVA will get the Bacteria domain name. For all the other taxonomy levels the names will remain to be NA. After performing this manual annotation, we will check the the number of ASVs in each Domain.

ps_filt<-ps_filt %>% tax_mutate(ASV = taxa_names(ps_filt)) %>%
       tax_mutate(Domain = case_when(ASV %in% Bacteria_ASVs ~ "k__Bacteria",
                                     ASV %in% Archaea_ASVs ~ "k__Archaea",
                                     ASV %in% Eukaryota_ASVs ~ "k__Eukaryota",
                                     ASV %in% unclassified_prokaryote_ASVs ~ "k__unclassified_prokaryota",
                                               TRUE ~ Domain))
ps_filt@tax_table %>% data.frame() %>% group_by(Domain) %>% dplyr::count() %>%
  mutate(fraction = round(n/ntaxa(ps_filt), 3))

## # A tibble: 4 × 3
## # Groups:   Domain [4]
##   Domain                         n fraction
##   <chr>                      <int>    <dbl>
## 1 k__Archaea                    49    0.012
## 2 k__Bacteria                 3946    0.982
## 3 k__Eukaryota                  16    0.004
## 4 k__unclassified_prokaryota     7    0.002

Depending on the aim of the study, ASVs belonging to both Archaea and Bacteria might be of interest. If the project focus is on bacteria only, all ASVs that are not bacteria should be filtered out. It is important to keep all taxa that are bacteria, otherwise it can disturb the relative abundance of each taxon. For the purpose of this tutorial, we decided to leave only Bacteria in the dataset.

ps_filt<- subset_taxa(ps_filt, Domain == "k__Bacteria")
ps_filt@tax_table %>% data.frame() %>% group_by(Phylum) %>% dplyr::count() %>%
  arrange(desc(n)) %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "800px", height = "300px")

Phylum	n
p__Proteobacteria	1117
p__Planctomycetota	438
p__Bacteroidota	390
p__Acidobacteriota	320
p__Chloroflexi	304
p__Verrucomicrobiota	214
p__Actinobacteriota	208
p__Firmicutes	203
p__Myxococcota	150
p__Desulfobacterota	128
p__Gemmatimonadota	77
p__Spirochaetota	52
p__Nitrospirota	41
p__Latescibacterota	39
p__Bdellovibrionota	33
p__Cyanobacteria	31
p__	30
p__Patescibacteria	24
p__Armatimonadota	23
p__Fibrobacterota	16
p__Zixibacteria	15
p__SAR324_clade(Marine_group_B)	11
p__Elusimicrobiota	10
p__FCPU426	10
p__Dependentiae	9
p__MBNT15	9
p__RCP2-54	9
p__Entotheonellaeota	7
p__Methylomirabilota	4
p__Sumerlaeota	4
p__Campylobacterota	2
p__Cloacimonadota	2
p__PAUC34f	2
p__WPS-2	2
p__AncK6	1
p__DTB120	1
p__Deferrisomatota	1
p__Deinococcota	1
p__Fermentibacterota	1
p__Fusobacteriota	1
p__Hydrogenedentes	1
p__Margulisbacteria	1
p__NB1-j	1
p__Sva0485	1
p__Synergistota	1
p__WS2	1

Fixing taxonomy table

Now, the table includes only ASVs classified to a known domain. For other taxonomic levels, there are some unassigned ASVs. The following code will replace the uninformative or unknown entries (“unknowns”) with information from the highest known taxonomic rank (“classified”). It is important to understand the new annotation of taxa. Therefore, we recommend further reading of the tax_fix documentation. Important to understand is that when a taxon is annotated with the full name of another taxonomic rank (for example a genus is annotated with f__Enterobacteriaceae Family), it means that this taxon (genus in this example) is composed of all taxa annotated to the lower taxonomic rank (highest known - in this case Enterobacteriaceae family) but not annotated with the analyzed taxonomic rank (in this case genus).

ps_filt_fixed<- ps_filt %>%
 tax_fix(
  min_length = 4,
  unknowns = c("g____", "f____", "o____", "g__uncultured_bacterium", "g__uncultured", "f__uncultured", "f__uncultured_bacterium", "o__uncultured", "o__uncultured_bacterium","o__uncultured_planctomycete", "o__uncultured_soil_bacterium", "c__uncultured", "c__uncultured_bacterium", "c__metagenome", "f__uncultured_archaeon", "f__uncultured_bacterium", "f__uncultured_euryarchaeote", "c__uncultured_bacterium Class", "f__metagenome Family", "o__metagenome", "o__uncultured_Chloroflexi_bacterium", "f__metagenome", "f__uncultured_Acidobacteria_bacterium", "f__uncultured_soil_bacterium", "f__uncultured_actinobacterium", "f__uncultured_cyanobacterium", "f__uncultured_delta_proteobacterium", "f__uncultured_gamma_proteobacterium", "f__uncultured_organism", "g__metagenome", "g__uncultured_Acidobacteria_bacterium", "g__uncultured_Cytophagales_bacterium", "g__uncultured_soil_bacterium", "g__uncultured_Bacteroidetes_bacterium", "g__uncultured_Chloroflexi_bacterium", "g__uncultured_organism", "g__uncultured_planctomycete"),
  sep = " ", anon_unique = TRUE,
  suffix_rank = "classified"
 )

The genus names look like that right now:

ps_filt_fixed@tax_table %>%
  data.frame() %>%
  arrange(Genus)%>%
  head(n=10) %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "800px", height = "300px")

Domain	Phylum	Class	Order	Family	Genus	Species	ASV
k__Bacteria	p__Nitrospirota	c__4-29-1	c__4-29-1 Class	c__4-29-1 Class	c__4-29-1 Class	c__4-29-1 Class	1632943206
k__Bacteria	p__Actinobacteriota	c__Acidimicrobiia	c__Acidimicrobiia Class	c__Acidimicrobiia Class	c__Acidimicrobiia Class	c__Acidimicrobiia Class	1632943971
k__Bacteria	p__Actinobacteriota	c__Acidimicrobiia	c__Acidimicrobiia Class	c__Acidimicrobiia Class	c__Acidimicrobiia Class	c__Acidimicrobiia Class	1632943272
k__Bacteria	p__Actinobacteriota	c__Acidimicrobiia	c__Acidimicrobiia Class	c__Acidimicrobiia Class	c__Acidimicrobiia Class	c__Acidimicrobiia Class	1632942591
k__Bacteria	p__Acidobacteriota	c__Acidobacteriae	c__Acidobacteriae Class	c__Acidobacteriae Class	c__Acidobacteriae Class	c__Acidobacteriae Class	163294725
k__Bacteria	p__Acidobacteriota	c__Acidobacteriae	c__Acidobacteriae Class	c__Acidobacteriae Class	c__Acidobacteriae Class	c__Acidobacteriae Class	1632943056
k__Bacteria	p__Actinobacteriota	c__Actinobacteria	c__Actinobacteria Class	c__Actinobacteria Class	c__Actinobacteria Class	c__Actinobacteria Class	1632942355
k__Bacteria	p__Proteobacteria	c__Alphaproteobacteria	c__Alphaproteobacteria Class	c__Alphaproteobacteria Class	c__Alphaproteobacteria Class	c__Alphaproteobacteria Class	1632941537
k__Bacteria	p__Proteobacteria	c__Alphaproteobacteria	c__Alphaproteobacteria Class	c__Alphaproteobacteria Class	c__Alphaproteobacteria Class	c__Alphaproteobacteria Class	1632942154
k__Bacteria	p__Proteobacteria	c__Alphaproteobacteria	c__Alphaproteobacteria Class	c__Alphaproteobacteria Class	c__Alphaproteobacteria Class	s____	1632942601

Quality control

It is highly recommended to have both negative and positive controls in the sequenced set of samples. Negative controls are crucial for identifying contaminants from reagents (PCR reagents, extraction kits), used equipment or introduced during sample preparation. Typically, they are no-template controls where water used during DNA extraction or PCR amplification is added instead. Negative controls are expected to have amount of reads close to zero. Positive controls are important for ensuring the accuracy and reliability of the sequencing process. In this dataset, mock communities are used as positive controls. Mock communities are synthetic communities of known composition, in this case typical for human gut, used to benchmark NG-Tax (Mock 3 and 4).

Calculating sum of reads

The table below shows the number of reads per sample. It is arranged in ascending order. Visualizing the sum of reads per sample will help in case you have high number of samples.

sum_of_reads<- as.data.frame(sample_sums(ps_filt_fixed@otu_table)) %>% rownames_to_column(var = "sampleID") %>%
 dplyr::rename(sum_of_reads = `sample_sums(ps_filt_fixed@otu_table)`)%>%
  left_join((metadata %>% rownames_to_column(var = "sampleID")), by = "sampleID")
head((sum_of_reads %>% arrange(sum_of_reads)),n=20)%>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "800px", height = "300px")

sampleID	sum_of_reads	sample.description	sample.name	observation.unit.identifier	collection.date	season	location	depth
Water_5	14	Control_with_sterile_water_only_to_check	Water_control	Negative_Control		summer		NA
Water_3	138	Control_with_sterile_water_only_to_check	Water_control	Negative_Control		summer		NA
Water_1	202	Control_with_sterile_water_only_to_check	Water_control	Negative_Control		summer		NA
Water_2	724	Control_with_sterile_water_only_to_check	Water_control	Negative_Control		summer		NA
Water_4	1259	Control_with_sterile_water_only_to_check	Water_control	Negative_Control		summer		NA
1_CW2_D_D93_1	4686	Lib1_CW2-biochar_D_D93_30_Jan_23_B/B1filled	15_Lib1_CW2-biochar_D_D93	CW2-biochar	1/30/2023	winter	D	93
2_CW3_D_50_2	5829	Lib2_CW3-biochar_D_50_20_June_23_Sandfilled	57_Lib2_CW3-biochar_D_50	CW3-biochar	6/20/2023	summer	D	50
1_CWI_D_D90_2	6018	Lib1_CW1-sand_D_D90_30_Jan_23_B/B2filled	32_Lib1_CW1-sand_D_D90	CW1-sand	1/30/2023	winter	D	90
1_CWI_D_D70_2	6614	Lib1_CW1-sand_D_D70_30_Jan_23_B/B2filled	30_Lib1_CW1-sand_D_D70	CW1-sand	1/30/2023	winter	D	70
2_CW1_A_50_2	8087	Lib2_CW1-sand_A_50_20_June_23_B/B2filled	34_Lib2_CW1-sand_A_50	CW1-sand	6/20/2023	summer	A	50
2_CW3_A_90_1	9219	Lib2_CW3-biochar_A_90_20_June_23_Sandfilled	54_Lib2_CW3-biochar_A_90	CW3-biochar	6/20/2023	summer	A	90
Mock3	9307	Mock_3_Community_experimental	Mock_3_Community	Mock_Community		summer		NA
2_CW1_D_70_1	9434	Lib2_CW1-sand_D_70_20_June_23_B/B2filled	40_Lib2_CW1-sand_D_70	CW1-sand	6/20/2023	summer	D	70
1_CW1_A_D90_1	9448	Lib1_CW1-sand_A_D90_30_Jan_23_Sandfilled	6_Lib1_CW1-sand_A_D90	CW1-sand	1/30/2023	winter	A	90
2_CW1_D_50_2	9919	Lib2_CW1-sand_D_50_20_June_23_B/B2filled	39_Lib2_CW1-sand_D_50	CW1-sand	6/20/2023	summer	D	50
2_CW2_D50_1	10470	Lib2_CW2-biochar_D50__20_June_23_Sandfilled	49_Lib2_CW2-biochar_D50_	CW2-biochar	6/20/2023	summer	D	50
1_CW3_A_D70_2	11188	Lib1_CW3-biochar_A_D70_30_Jan_23_B/B2filled	19_Lib1_CW3-biochar_A_D70	CW3-biochar	1/30/2023	winter	A	70
2_CW1_A70_1	12823	Lib2_CW1-sand_A70__20_June_23_B/B2filled	37_Lib2_CW1-sand_A70_	CW1-sand	6/20/2023	summer	A	70
1_CW3_D_D50_1	12976	Lib1_CW3-biochar_D_D50_30_Jan_23_B/B2filled	22_Lib1_CW3-biochar_D_D50	CW3-biochar	1/30/2023	winter	D	50
2_CW3_A_70_2	13035	Lib2_CW3-biochar_A_70_20_June_23_Sandfilled	53_Lib2_CW3-biochar_A_70	CW3-biochar	6/20/2023	summer	A	70

sum_of_reads %>% arrange(sum_of_reads) %>% filter(!sampleID %in% c("refmock3", "refmock4"))%>%
  ggplot(aes(x = reorder(sampleID, sum_of_reads), y = sum_of_reads, fill = observation.unit.identifier)) +
  geom_point(shape = 21, size = 3, color = "black")+
  scale_fill_nejm()+
  labs(x = "sample ID")+
  theme_bw()+
  theme(axis.text.x=element_text(angle=90))

As shown on the plot and in the table -negative controls have the lowest read count read count, which is 14, 138, 208, 724 and 1259. Since all experimental samples have higher read count than controls - none of the samples is discarded.

Inspecting positive control - mock 3

First, a check will be done to see whether the mocks that were processed and sequenced in the laboratory, resemble the expected (theoretical) values and can reproduce the theoretical composition. If NG-Tax pipeline was used - the composition of theoretical mocks is added during the NG-Tax sample processing (provided this option was marked in the NG-Tax options) and is present in the output biom file. If another pipeline or another mocks are used, the expected abundances of taxa in these mocks needs to be attached to the abundance table of the sequenced mocks. We advice to work on genus level and attach it to the ‘melted’ phyloseq data frame (the object created with `ps_melt’ function - mock3_melt_Genus_rel or mock4_melt_rel), which is the so called long table. A function that might be useful is the ?bind_rows from dplyr package. The names of the columns in the table that will be attached needs to be the same as the names in the melted phyloseq. Mock 3 and 4 are different in composition therefore they need to be analyzed separately. To do it a separate phyloseq with mock3 samples will be created. Then a barplot will be printed to visually assess similarity of the composition of experimental and theoretical mocks. In the code we select the top taxa top plot (n=25), so that the plot is not too crowded.

ps_mock3<- ps_filt_fixed %>% ps_filter(sample.name == "Mock_3_Community")
ps_mock3_Genus_rel<- ps_mock3 %>% tax_transform("compositional", rank = "Genus")
top_taxa_mock3<- tax_top(ps_mock3_Genus_rel, n=25)
mock3_melt_Genus_rel<- ps_mock3_Genus_rel %>% psmelt() %>%
  mutate(Genus_top = case_when(Genus %in% top_taxa_mock3 ~ Genus,
                                                      TRUE ~ "other")) %>%
  mutate(Genus_top = as.factor(Genus_top)) %>%
  mutate(Genus_top = fct_relevel(Genus_top, c(top_taxa_mock3, "other")))

mock3_melt_Genus_rel %>%
  ggplot(aes(x = Sample, y = Abundance, fill = Genus_top)) +
       geom_bar(stat = "identity", color = "black")+
       scale_fill_manual(values = distinct_palette(25, pal = "brewerPlus", add = "grey70"))+
       theme_bw()+
       theme(axis.title.x = element_blank(), legend.position = "right")

Apart from barplots it is also good to see how the abundance of genera in experimental mocks correlate with the abundance in theoretical mocks. To do so, You can plot scatter plots and inspect the Spearman correlation coefficient (they are shown at the intersection of the theoretical and experimental mock abundances). Additionally density plots are shown for each variable.

mock3_melt_Genus_rel %>% select(OTU, Sample, Abundance) %>%
  pivot_wider(names_from = Sample, names_prefix = "Abundance_", values_from = "Abundance") %>%
  column_to_rownames(var = "OTU")%>% ggpairs(upper = list(continuous = wrap("cor", method = "spearman")))

Inspecting positive control - mock 4

Now the same will be done with mock 4

ps_mock4<- ps_filt_fixed %>% ps_filter(sample.name == "Mock_4_Community")

ps_mock4_Genus_rel<- ps_mock4 %>% tax_transform("compositional", rank = "Genus")
top_taxa_mock4<- tax_top(ps_mock4_Genus_rel, n=25)
mock4_melt_Genus_rel<- ps_mock4_Genus_rel %>% psmelt() %>%
  mutate(Genus_top = case_when(Genus %in% top_taxa_mock4 ~ Genus,
                                                      TRUE ~ "other")) %>%
  mutate(Genus_top = as.factor(Genus_top)) %>%
  mutate(Genus_top = fct_relevel(Genus_top, c(top_taxa_mock4, "other")))

mock4_melt_Genus_rel %>%
  ggplot(aes(x = Sample, y = Abundance, fill = Genus_top)) +
       geom_bar(stat = "identity", color = "black")+
       scale_fill_manual(values = distinct_palette(25, pal = "brewerPlus", add = "grey70"))+
       theme_bw()+
       theme(axis.title.x = element_blank(), legend.position = "right")

and correlation

mock4_melt_Genus_rel %>% select(OTU, Sample, Abundance) %>%
  pivot_wider(names_from = Sample, names_prefix = "Abundance_", values_from = "Abundance") %>%
  column_to_rownames(var = "OTU")%>% ggpairs(upper = list(continuous = wrap("cor", method = "spearman")))

Rare taxa exploration

Rare taxa refer to taxa present at low relative abundance or in a small number of samples across the data set. Rare biosphere implicates low-abundance taxa as important contributors to the assessment of a- and b-diversity (Lynch et al., 2015). However, recent microbiome quality control studies indicate that many rare taxa might be created by sequencing artifacts, errors and/or contamination in SSU rRNA gene surveys. To address the challenge of identifying the “true” origin of these taxa, various approaches including trimming, filtering and/or contaminant removal have been proposed in literature (Cao et al.,2021, Reitmeier et al., 2021, Nikodemova et al., 2023, Davis et al., 2018).

NG-Tax 2.0 follows an error-handling strategy which filters artifacts and corrects for the impact of error-reads on the relative abundance estimates. By default, a user defined relative abundance cut-off is set to 0.1% of the total number of read-pairs associated with ASVs (Poncheewin et al., 2019). Thus, the output of NG-Tax 2.0 does not need additional filtering.

Depending on the origin of the sample and/or the aim of the study, NG-Tax 2.0 parameters can be changed. In studies that aim to discover rare taxa or is expected that rare taxa contribute to a larger diversity, filtering would not be advisable.

Before making any filtering decisions, a nice way to check the taxa distribution in your data set is to visualize the prevalence-abundance distribution of ASVs. Keep in mind that the minimum detection threshold of NG-Tax is set at 0.1%.

ps_filt_fixed_samples<- ps_filt_fixed %>%
  ps_filter(!`observation.unit.identifier` %in% c("Mock_Community", "Negative_Control"))

taxa_prev_plot <- plot_taxa_prevalence(ps_filt_fixed_samples, "Phylum",
                                       detection = 0)
taxa_prev_plot

Check for singletons and doubletons

Another common practice when working with this type of data is to check the number of ASVs represented by only one (singletons) or two (doubletons) sequences respectively, across all samples analyzed.

NG-Tax 2.0 error-handling assumes that singleton read-pairs that did not match existing ASV object due to a single mismatch (mutation or indel) is probably due to a random sequence error and is joined with the particular ASV. Singletons showing more than one mismatch are considered as sample specific noise and discarded.

length(which(taxa_sums(ps_filt_fixed_samples) <= 1))

## [1] 0

length(which(taxa_sums(ps_filt_fixed_samples) <= 2))

## [1] 0

In this dataset, there are no singletons (because NG-Tax filters them out) or doubletons so there is no need for filtering.

In case other bioinformatic pipelines were used for ASV generation, different approaches on how to deal with singletons and/or doubletons might be necessary before downstream analysis.

Find out most or least abundant taxa in dataset

Let’s inspect the total abundance of each taxon (ASV) across all samples in the dataset. This is achieved with the taxa_sums function. To find out which taxa are the least abundant across the dataset, use the following code.

taxa_sums(ps_filt_fixed_samples) %>%
  data.frame() %>%
  dplyr::rename(sum = ".") %>%
  arrange(sum)%>%
  head(n=20) %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "200px", height = "300px")

sum
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5

Data normalization

Microbiome data is inherently compositional- meaning that sequencing only provides information on the relative abundance of taxa and that the observed abundance of each taxon is dependent on the observed abundances of all other taxa- and this poses challenges to downstream statistical analysis. To address this, normalization is employed to scale the data to a common and comparable level. Thanks to data normalization, the differential abundance analysis is not confounded by the differences in the sampling fractions (which is unknown and differs per sample). Failure to normalize the data will result in a systematic bias that increases the false discovery rate (FDR) and also possible loss of power in some cases.

Below we show two methods of normalization, however other common approaches for normalization (such as scaling of counts, parameter estimation, etc.) may be applied. as exists. To know more, here is a nice overview of the available normalization methods and how they perform on comparing ecological communities by McKnight et al (2019) and by Lin & Peddada (2020).

Total-sum scaling

A common method is Total-Sum Scaling (TSS) (or compositional) which transforms the absolute ASV counts into relative abundance values. For more information, check out the articles: Zhou et al. (2023) and Weiss et al. (2017).

ps_filt_fixed_samples_rel<- tax_transform(ps_filt_fixed_samples, trans = "compositional")
head(ps_filt_fixed_samples_rel@otu_table, n=10) %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "800px", height = "300px")

	1_CW1_A_D50_1	1_CW1_A_D50_2	1_CW1_A_D90_1	1_CW2_A_D70_1	1_CW3_A_D70_1	1_CW3_A_D70_2	1_CW3_A_D93_1	1_CW3_A_D93_2	1_CW3_D_D93_1	2_CW1_A_50_1	2_CW1_A_50_2	2_CW1_A_90_1	2_CW1_A_90_2	2_CW1_D_50_2	2_CW3_D_90_2
1632942152	0.000000000	0.000000000	0.000000000	0.000000000	0.00116556	0.001072578	0.00000000	0.00000000	0.000000000	0.000000000	0.000000000	0.00000000	0.00000000	0.000000000	0.000000000
1632941008	0.000000000	0.000000000	0.001587638	0.000000000	0.00000000	0.000000000	0.00000000	0.00000000	0.000000000	0.000000000	0.000000000	0.00000000	0.00000000	0.000000000	0.000000000
1632942677	0.000000000	0.000000000	0.000000000	0.000000000	0.00000000	0.000000000	0.00000000	0.00000000	0.000000000	0.007094396	0.000000000	0.00000000	0.00000000	0.000000000	0.000000000
1632941543	0.000000000	0.000000000	0.000000000	0.001680672	0.00000000	0.000000000	0.00000000	0.00000000	0.000000000	0.000000000	0.000000000	0.00000000	0.00000000	0.000000000	0.000000000
1632942189	0.000000000	0.000000000	0.000000000	0.000000000	0.00000000	0.002681444	0.00349239	0.00150585	0.000000000	0.000000000	0.000000000	0.00000000	0.00000000	0.000000000	0.000000000
1632943887	0.000000000	0.000000000	0.000000000	0.000000000	0.00000000	0.000000000	0.00000000	0.00000000	0.000000000	0.000000000	0.000000000	0.00000000	0.00000000	0.000000000	0.001646132
1632942431	0.000000000	0.000000000	0.000000000	0.000000000	0.00000000	0.000000000	0.00000000	0.00000000	0.001100968	0.000000000	0.000000000	0.00000000	0.00000000	0.000000000	0.000000000
163294897	0.000000000	0.000000000	0.011113463	0.000000000	0.00000000	0.000000000	0.00000000	0.00000000	0.000000000	0.000000000	0.001607518	0.01642879	0.01381361	0.001411433	0.000000000
163294951	0.000000000	0.000000000	0.002434378	0.000000000	0.00000000	0.000000000	0.00000000	0.00000000	0.000000000	0.000000000	0.000000000	0.00000000	0.00000000	0.000000000	0.000000000
163294410	0.002788586	0.001594434	0.000000000	0.000000000	0.00000000	0.000000000	0.00000000	0.00000000	0.000000000	0.000000000	0.000000000	0.00000000	0.00000000	0.000000000	0.000000000

Rarefaction

Another method of data normalization is rarefaction. It addresses another challenge with 16S sequenced microbial datasets, which is the variation between the total number of sequencing reads generated for each sample. Rarefaction is random subsampling to an even sampling depth. Sampling is usually done without replacement. Rare sequences are excluded more often when sampling with replacement (Cameron et al., 2021). Rarefaction controls the effect of uneven sequencing effort (library size) across samples in a study. This normalization method is widely used, however not always advised and for example authors of phyloseq R package do not advocate using this as a normalization procedure (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003531). There are also opposite views, as described in this article. This kind of normalization might be dataset specific, as described here, therefore it should be decided cautiously whether to rarefy or not. Given the fact that this method is highly debatable, consider applying it only in cases when there is highly uneven sequencing depth across libraries. The following part indicates how to apply rarefaction and guide the user through decision making steps. However, for the purpose of this tutorial it was decided to proceed with non-rarefied data for downstream analysis.

Decision Making

First, the distribution of total sequencing reads in the different samples has to be assessed. The scatter plot in the Quality Control section shows the total number of reads per sample. The variation in total reads sequenced between samples may introduce a bias when studying microbial community composition/diversity. Rarefying samples with high read counts to a certain read depth may alleviate the limitation imposed by total read count variation between samples.To make an educated decision on the rarefaction threshold, let’s explore the data through plotting rarefaction curves and histograms of read sums.

ps_filt_fixed_samples %>%
  ggrare(step = 50, color = "observation.unit.identifier", se = FALSE)

sum_of_reads %>% filter(!sampleID %in% c("refmock3", "refmock4")) %>%
  ggplot(aes(x=sum_of_reads)) +
  geom_histogram(color = 'black', binwidth = 500, aes(fill = observation.unit.identifier))+
  scale_fill_nejm()+
  ggtitle("Histogram of Read Sums (binwidth = 500)")+
  labs(x = "Read sums")+
  theme_bw()+
  theme(legend.position = "bottom", legend.title = element_blank())

sum_of_reads %>% filter(!sampleID %in% c("refmock3", "refmock4")) %>%
  filter(sum_of_reads<20000)%>%
  ggplot(aes(x=sum_of_reads)) +
  geom_histogram(color = 'black', binwidth = 1000, aes(fill = observation.unit.identifier))+
  scale_fill_nejm()+
  ggtitle("Histogram of Read Sums < 20000 Reads (binwidth = 500)")+
  labs(x = "Read sums")+
  theme_bw()+
  theme(legend.position = "bottom", legend.title = element_blank())

sum_of_reads %>% filter(!observation.unit.identifier %in% c("Mock_Community", "Negative_Control")) %>%
  bind_cols(estimate_richness(ps_filt_fixed_samples, measures = "Observed")) %>%
  remove_rownames() %>%
  select(sampleID, sum_of_reads, Observed) %>% arrange(sum_of_reads, Observed) %>% 
  kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "800px", height = "300px")

sampleID	sum_of_reads	Observed
1_CW2_D_D93_1	4686	316
2_CW3_D_50_2	5829	395
1_CWI_D_D90_2	6018	459
1_CWI_D_D70_2	6614	385
2_CW1_A_50_2	8087	403
2_CW3_A_90_1	9219	272
2_CW1_D_70_1	9434	397
1_CW1_A_D90_1	9448	280
2_CW1_D_50_2	9919	330
2_CW2_D50_1	10470	392
1_CW3_A_D70_2	11188	339
2_CW1_A70_1	12823	432
1_CW3_D_D50_1	12976	243
2_CW3_A_70_2	13035	272
1_CW3_D_D93_2	13140	282
1_CW1_A_D50_2	13798	397
1_CW3_D50_1	14322	256
1_CW1_A_D70_2	14898	388
2_CW2_A_90_1	15009	196
1_CW1_A_D50_1	15420	389
2_CW3_A_50_2	15618	391
1_CW3_D_D70_1	15896	307
1_CW2_A_D50_2	15992	159
1_CW1_D_D50_1	16149	331
1_CW2_D_D70_1	16410	277
1_CW3_A_D50_1	16419	301
1_CW3_A_D93_2	17266	309
2_CW2_D_70_1	17323	384
1_CWI_D_D90_1	17329	416
2_CW2_D_50_1	17597	377
1_CWI_D_D70_1	18044	404
1_CW1_D_D30_1	18635	414
1_CW3_D_D93_3	18829	281
1_CW2_D_D50_1	18902	269
1_CW3_D70_1	19089	289
1_CW3_A_D70_1	19733	375
2_CW3_D_90_2	20047	174
1_CW2_A_D70_1	20230	272
2_CW1_A_90_2	20849	153
1_CW3_D_D93_1	21799	292
2_CW3_A_50_1	22419	402
2_CW3_D_70_1	22746	297
2_CW1_A_90_1	23191	152
1_CW2_A_D93_2	23318	324
2_CW2_A_50_2	23786	295
1_CW1_A_D30_1	24377	368
2_CW2_D_70_2	24403	282
2_CW1_D90_1	24538	193
1_CW3_A_D50_2	24689	295
2_CW3_A_90R_1	24836	279
1_CW1_A_D70_1	26491	342
1_CW3_A_D93_1	27202	293
1_CW2_A_D93_1	28109	307
2_CW1_D_50_1	28328	302
2_CW2_A_70_1	29918	253
2_CW1_A_50_1	32279	332
2_CW3_D_90_1	32684	186
2_CW3_A_70_1	33363	271
2_CW3_D_50_1	33804	383
2_CW2_A_50_1	48326	287

The table above shows a summary of taxa richness (number of Observed ASVs) and total reads per sample. To choose the rarefaction depth the safest approach it the smallest sequencing depth, in the case of this dataset is the read sum of sample 1_CW2_D_D93_1 (324 ASVs).

Plotting rarefaction curves illustrates that the sample 1_CW2_A_D50_2 is the least ASV-rich sample (161 ASVs) with 15,997 reads in total.

Plotting histograms, and inspecting read sum distribution illustrates that there is a gap in read sum distribution between 13,024 and 11,411. Therefore 13,024 emerges as a good cut-off for rarefaction.

On one hand, cutoffs 5,198 and 13,024 can attenuate the ASV richness variation between high read count samples and the least ASV rich sample “1_CW2_A_D50_2” due to library size variation. On the other hand, 5,198 could be a too conservative rarefaction cutoff leading to loss of species richness in samples with higher read counts, but with less ASV richness, such as the sample 1_CW2_A_D50_2 with 15,997 reads. As a result, we decide to use 13,024 reads as our rarefaction cut-off.

Rule of thumb: Rarefy to a depth that retains most of the data without sacrificing too much biological information (samples and/or ASVs).

Rarefying

The code below rarefies the dataset to a depth of 13,024 reads and plots rarefaction curves.

ps_filt_fixed_samples_raref<- rarefy_even_depth(ps_filt_fixed_samples, sample.size = 13024, replace = FALSE)

ps_filt_fixed_samples_raref %>% ggrare(step = 50, color = "observation.unit.identifier", se = FALSE)

## rarefying sample 1_CW1_A_D30_1

## rarefying sample 1_CW1_A_D50_1

## rarefying sample 1_CW1_A_D50_2

## rarefying sample 1_CW1_A_D70_1

## rarefying sample 1_CW1_A_D70_2

## rarefying sample 1_CW1_D_D30_1

## rarefying sample 1_CW1_D_D50_1

## rarefying sample 1_CW2_A_D50_2

## rarefying sample 1_CW2_A_D70_1

## rarefying sample 1_CW2_A_D93_1

## rarefying sample 1_CW2_A_D93_2

## rarefying sample 1_CW2_D_D50_1

## rarefying sample 1_CW2_D_D70_1

## rarefying sample 1_CW3_A_D50_1

## rarefying sample 1_CW3_A_D50_2

## rarefying sample 1_CW3_A_D70_1

## rarefying sample 1_CW3_A_D93_1

## rarefying sample 1_CW3_A_D93_2

## rarefying sample 1_CW3_D_D70_1

## rarefying sample 1_CW3_D_D93_1

## rarefying sample 1_CW3_D_D93_2

## rarefying sample 1_CW3_D_D93_3

## rarefying sample 1_CW3_D50_1

## rarefying sample 1_CW3_D70_1

## rarefying sample 1_CWI_D_D70_1

## rarefying sample 1_CWI_D_D90_1

## rarefying sample 2_CW1_A_50_1

## rarefying sample 2_CW1_A_90_1

## rarefying sample 2_CW1_A_90_2

## rarefying sample 2_CW1_D_50_1

## rarefying sample 2_CW1_D90_1

## rarefying sample 2_CW2_A_50_1

## rarefying sample 2_CW2_A_50_2

## rarefying sample 2_CW2_A_70_1

## rarefying sample 2_CW2_A_90_1

## rarefying sample 2_CW2_D_50_1

## rarefying sample 2_CW2_D_70_1

## rarefying sample 2_CW2_D_70_2

## rarefying sample 2_CW3_A_50_1

## rarefying sample 2_CW3_A_50_2

## rarefying sample 2_CW3_A_70_1

## rarefying sample 2_CW3_A_70_2

## rarefying sample 2_CW3_A_90R_1

## rarefying sample 2_CW3_D_50_1

## rarefying sample 2_CW3_D_70_1

## rarefying sample 2_CW3_D_90_1

## rarefying sample 2_CW3_D_90_2

Data transformation

Normalized data still exhibit skewed distributions, unequal variances, and extreme values, which may limit their effectiveness in situations with significant heterogeneity. Therefore, transforming data might be necessary before performing certain statistical tests (e.g. differential abundance analysis). The selection of the appropriate transformation method depends on the chosen statistical methods. It also helps when illustrating patterns involving both low and high abundance taxa. There are different types of data transformations, such as additive log-ratio (alr), centered log-ratio (clr) and robust centered log-ratio (rclr).

ps_filt_fixed_samples_rel_rclr<- tax_transform(ps_filt_fixed_samples_rel, trans = "rclr", chain = TRUE)
head(ps_filt_fixed_samples_rel_rclr@otu_table, n=10) %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "800px", height = "300px")

	1_CW1_A_D50_1	1_CW1_A_D50_2	1_CW1_A_D90_1	1_CW2_A_D70_1	1_CW3_A_D70_1	1_CW3_A_D70_2	1_CW3_A_D93_1	1_CW3_A_D93_2	1_CW3_D_D93_1	2_CW1_A_50_1	2_CW1_A_50_2	2_CW1_A_90_1	2_CW1_A_90_2	2_CW1_D_50_2	2_CW3_D_90_2
1632942152	0.0000000	0.0000000	0.0000000	0.0000000	-0.5943949	-0.7687571	0.0000000	0.0000000	0.0000000	0.000000	0.0000000	0.000000	0.000000	0.00000	0.0000000
1632941008	0.0000000	0.0000000	-0.5779372	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.000000	0.0000000	0.000000	0.000000	0.00000	0.0000000
1632942677	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	1.051638	0.0000000	0.000000	0.000000	0.00000	0.0000000
1632941543	0.0000000	0.0000000	0.0000000	-0.3715704	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.000000	0.0000000	0.000000	0.000000	0.00000	0.0000000
1632942189	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.1475337	0.3636965	-0.4370147	0.0000000	0.000000	0.0000000	0.000000	0.000000	0.00000	0.0000000
1632943887	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.000000	0.0000000	0.000000	0.000000	0.00000	-0.5417619
1632942431	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	-0.7354975	0.000000	0.0000000	0.000000	0.000000	0.00000	0.0000000
163294897	0.0000000	0.0000000	1.3679730	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.000000	-0.2220174	1.301646	1.148943	-0.54112	0.0000000
163294951	0.0000000	0.0000000	-0.1504932	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.000000	0.0000000	0.000000	0.000000	0.00000	0.0000000
163294410	0.2871654	-0.2571683	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000	0.000000	0.0000000	0.000000	0.000000	0.00000	0.0000000

Aggregating data at specific taxonomic rank

It is possible to group the ASVs in the phyloseq object at different taxonomic ranks by aggregating the counts and inspect the data depending on your research question.

ps_filt_fixed_samples_phylum<- tax_agg(ps_filt_fixed_samples, rank = "Phylum")
print(ps_filt_fixed_samples_phylum)

## psExtra object - a phyloseq object with extra slots:
## 
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 45 taxa and 60 samples ]
## sample_data() Sample Data:       [ 60 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 45 taxa by 2 taxonomic ranks ]
## 
## psExtra info:
## tax_agg = "Phylum"

To aggregate ASV counts at other taxonomic levels, set the desired rank and create a summary of the number of taxa at the different ranks.

ps_filt_fixed_samples_class<- tax_agg(ps_filt_fixed_samples, rank = "Class")
ps_filt_fixed_samples_order<- tax_agg(ps_filt_fixed_samples, rank = "Order")
ps_filt_fixed_samples_family<- tax_agg(ps_filt_fixed_samples, rank = "Family")
ps_filt_fixed_samples_genus<- tax_agg(ps_filt_fixed_samples, rank = "Genus")

summary_agg<- data.frame(Taxa = c(colnames(ps_filt_fixed_samples@tax_table)[c(2:6)], "ASVs"),
                 number_of_taxa=c(dim(ps_filt_fixed_samples_phylum@otu_table)[1], dim(ps_filt_fixed_samples_class@otu_table)[1], dim(ps_filt_fixed_samples_order@otu_table)[1], dim(ps_filt_fixed_samples_family@otu_table)[1], dim(ps_filt_fixed_samples_genus@otu_table)[1], dim(ps_filt_fixed_samples@otu_table)[1]))
print(summary_agg)

##     Taxa number_of_taxa
## 1 Phylum             45
## 2  Class            129
## 3  Order            291
## 4 Family            466
## 5  Genus            741
## 6   ASVs           3772

Downstream analysis

Data exploration

It is always good idea to visualize all of the samples and quickly verify whether everything looks as expected (no strange results, samples with mainly one taxon present, if we do not expect it)

Here, we will explore microbial community composition (at order level) grouped by season and observation unit.

To vizualize data very often barplots and heatmaps showing individual samples are used.

In these barplots, the depth of sampling is shown on x axis.

barplots_season<- ps_filt_fixed_samples%>%
  ps_mutate(season = as.factor(season)) %>%
  ps_mutate(season = fct_relevel(season, c("winter", "summer"))) %>%
  ps_arrange(season, depth)%>%
  comp_barplot(group_by = "season", tax_level = "Order", tax_transform_for_plot = "compositional", n_taxa = 25, bar_width = 0.8, merge_other = FALSE, label = "depth", sample_order = "default")

barplots_season_wrapped<-wrap_plots(barplots_season) +
  plot_layout(guides = 'collect', nrow = 2)

barplots_season_wrapped&
  facet_wrap("observation.unit.identifier", scales = "free_x")&
  theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 10), axis.ticks.y = element_blank(), legend.title = element_blank(), legend.position = "bottom", strip.text = element_text(size=12),
        legend.key.size = unit(0.4, 'cm'),
        legend.text = element_text(size=10))

On the heatmap below all samples and all taxa are plotted. The color indicates the abundance of taxa. Similar samples based on their taxa composition are clustered together (based on hierarchical clustering). The size of the heatmap was too big because taxa were too many to visualize and for this we applied an additional step of filtering based on prevalence. Only taxa belonging to an order that have over 100 counts in at least 5% of samples are visualized. This heatmap gives also the possibility to annotate samples and taxa with additional information.

cols_oui<- pal_nejm("default")(3)
names(cols_oui) <- unique(samdat_tbl(ps_filt_fixed_samples)$`observation.unit.identifier`)
cols_season<- c("lightblue", "yellow")
names(cols_season) <- unique(samdat_tbl(ps_filt_fixed_samples)$season)
cols_location<- c("black", "grey")
names(cols_location) <- unique(samdat_tbl(ps_filt_fixed_samples)$location)
cols_depth<- brewer.pal(5, name = "YlOrRd")
names(cols_depth) <- unique(samdat_tbl(ps_filt_fixed_samples)$depth)

ps_filt_fixed_samples %>%
  tax_transform("compositional", rank = "Order") %>%
  tax_filter(min_prevalence = 0.1, undetected = 500, tax_level = "Order") %>%
  comp_heatmap(colors = heat_palette(sym = FALSE, "Blue-Red 3"),
               tax_anno = taxAnnotation(
      Prev. = anno_tax_prev(bar_width = 0.3, size = grid::unit(1, "cm"))
    ),
    sample_anno = sampleAnnotation(
      Depth = anno_sample("depth"),
      Season = anno_sample("season"),
      Location = anno_sample("location"),
      Group = anno_sample("observation.unit.identifier"),
      col = list(Group = cols_oui, Season = cols_season, Location = cols_location, Depth = cols_depth), border = FALSE))

Alpha diversity calculation

Another debatable topic is whether to calculate alpha-diversity metrics on rarefied or non-rarefied data. Richness-based metrics like Observed ASVs, phylogenetic diversity (PD) and Chao1, as well as Shannon (Schloss PD, 2024) are sensitive to sequencing depth variations, because one sample might show higher number of ASVs just because it was sequenced more which makes rarefaction necessary. Considering these limitations, we decided to perform calculations on non-rarefied data and focus on metrics less sensitive to sequencing dept variations. Here, we calculated 4 indices that depend on abundance/prevalence of taxa and 1 index which incorporates phylogenetic differences between species (Faith’s phylogenetic diversity).

alpha_div<- estimate_richness(ps_filt_fixed_samples, measures = c("Observed", "Shannon", "Simpson", "InvSimpson"))
PD<- calculatePD(ps_filt_fixed_samples, justDF = TRUE) %>% pull(PD)
alpha_div<- alpha_div %>% mutate(PD = PD)
rownames(alpha_div) <- gsub("^X", "", rownames(alpha_div))
head(alpha_div, n=20) %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "700px", height = "300px")

	Observed	Shannon	Simpson	InvSimpson	PD
1_CW1_A_D30_1	368	5.612523	0.9944924	181.56645	29.64795
1_CW1_A_D50_1	389	5.715326	0.9953557	215.31554	34.68714
1_CW1_A_D50_2	397	5.740467	0.9955132	222.87770	34.88421
1_CW1_A_D70_1	342	5.366343	0.9900399	100.40064	30.88764
1_CW1_A_D70_2	388	5.588234	0.9934239	152.06686	35.20985
1_CW1_A_D90_1	280	5.340239	0.9926522	136.09582	28.01250
1_CW1_D_D30_1	414	5.789785	0.9957194	233.61273	30.77747
1_CW1_D_D50_1	331	4.931737	0.9684862	31.73210	29.92374
1_CW2_A_D50_2	159	4.173833	0.9688446	32.09715	15.26653
1_CW2_A_D70_1	272	5.100249	0.9894741	95.00413	24.87157
1_CW2_A_D93_1	307	4.829395	0.9666682	30.00135	30.57384
1_CW2_A_D93_2	324	4.857781	0.9655586	29.03483	31.31262
1_CW2_D_D50_1	269	5.040975	0.9855748	69.32308	24.73393
1_CW2_D_D70_1	277	5.054119	0.9866257	74.77035	24.42412
1_CW2_D_D93_1	316	5.514564	0.9947579	190.76185	28.42853
1_CW3_A_D50_1	301	5.254104	0.9905444	105.75699	27.85874
1_CW3_A_D50_2	295	5.216545	0.9901605	101.63087	26.21226
1_CW3_A_D70_1	375	5.648368	0.9951015	204.14372	29.86439
1_CW3_A_D70_2	339	5.530080	0.9942026	172.49120	29.44992
1_CW3_A_D93_1	293	5.244995	0.9910261	111.43475	28.09640

To compare the alpha-diversity between groups, we first need to attach metadata to the table and perform statistics using the code below.

metadata_samples<- metadata %>%filter(!`observation.unit.identifier` %in% c("Mock_Community", "Negative_Control"))
metadata_samples_alpha<- bind_cols(metadata_samples, alpha_div)
metadata_samples_long<- metadata_samples_alpha%>% pivot_longer(cols= Observed:PD, names_to = "alpha_div", values_to = "value")
kruskal_test_alpha<- metadata_samples_long %>% group_by(alpha_div) %>%
    rstatix::kruskal_test(value ~ `observation.unit.identifier`) %>% mutate(p_adj_fdr = p.adjust(p, method = "fdr"))
kruskal_test_alpha %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12)

alpha_div	.y.	n	statistic	df	p	method	p_adj_fdr
InvSimpson	value	60	10.095457	2	0.00642	Kruskal-Wallis	0.0101625
Observed	value	60	9.624578	2	0.00813	Kruskal-Wallis	0.0101625
PD	value	60	8.232205	2	0.01630	Kruskal-Wallis	0.0163000
Shannon	value	60	10.783981	2	0.00455	Kruskal-Wallis	0.0101625
Simpson	value	60	10.095457	2	0.00642	Kruskal-Wallis	0.0101625

Based on the Kruskal-Wallis test results, alpha-diversity differs significantly between groups for all of the indices. Using the following code, post-hoc test (Dunn’s test) can be performed to look for pairwise differences in alpha-diversity between groups.

dunn_test_alpha<- metadata_samples_long %>% group_by(alpha_div) %>%
    rstatix::dunn_test(value ~ observation.unit.identifier, p.adjust.method = "fdr") %>%
  add_xy_position(x = "observation.unit.identifier", fun = "mean_sd", step.increase = 0.05, scales = "free")
dunn_test_alpha %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12)

alpha_div	.y.	group1	group2	n1	n2	statistic	p	p.adj	p.adj.signif	y.position	groups	xmin	xmax
InvSimpson	value	CW1-sand	CW2-biochar	21	15	-2.9390920	0.0032918	0.0098753	**	250.590550	CW1-sand….	1	2
InvSimpson	value	CW1-sand	CW3-biochar	21	24	-2.4501016	0.0142816	0.0214224		257.438875	CW1-sand….	1	3
InvSimpson	value	CW2-biochar	CW3-biochar	15	24	0.7944532	0.4269316	0.4269316	ns	264.287200	CW2-bioc….	2	3
Observed	value	CW1-sand	CW2-biochar	21	15	-2.6256333	0.0086488	0.0129732		437.178100	CW1-sand….	1	2
Observed	value	CW1-sand	CW3-biochar	21	24	-2.6950691	0.0070374	0.0129732		442.933750	CW1-sand….	1	3
Observed	value	CW2-biochar	CW3-biochar	15	24	0.2501066	0.8025049	0.8025049	ns	448.689400	CW2-bioc….	2	3
PD	value	CW1-sand	CW2-biochar	21	15	-2.4309614	0.0150588	0.0225882		34.853150	CW1-sand….	1	2
PD	value	CW1-sand	CW3-biochar	21	24	-2.4900241	0.0127734	0.0225882		35.186375	CW1-sand….	1	3
PD	value	CW2-biochar	CW3-biochar	15	24	0.2363063	0.8131950	0.8131950	ns	35.519600	CW2-bioc….	2	3
Shannon	value	CW1-sand	CW2-biochar	21	15	-3.0294263	0.0024502	0.0073506	**	5.918350	CW1-sand….	1	2
Shannon	value	CW1-sand	CW3-biochar	21	24	-2.5459156	0.0108992	0.0163487		5.947375	CW1-sand….	1	3
Shannon	value	CW2-biochar	CW3-biochar	15	24	0.8002522	0.4235647	0.4235647	ns	5.976400	CW2-bioc….	2	3
Simpson	value	CW1-sand	CW2-biochar	21	15	-2.9390920	0.0032918	0.0098753	**	0.999150	CW1-sand….	1	2
Simpson	value	CW1-sand	CW3-biochar	21	24	-2.4501016	0.0142816	0.0214224		0.999375	CW1-sand….	1	3
Simpson	value	CW2-biochar	CW3-biochar	15	24	0.7944532	0.4269316	0.4269316	ns	0.999600	CW2-bioc….	2	3

The code below plots the alpha-diversity values and the post-hoc pairwise test results.

ggplot(metadata_samples_long, aes(x = observation.unit.identifier, y = value))+
  geom_boxplot()+
  geom_jitter(shape = 21, color = "black", size = 3, aes(fill = observation.unit.identifier))+
  facet_wrap(~alpha_div, scales = "free", ncol = 3)+
  scale_fill_nejm()+
  stat_pvalue_manual(data = dunn_test_alpha,
                     label = "p.adj.signif", tip.length = 0.01, bracket.shorten = 0.05, hide.ns = TRUE, size = 4)+
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
  theme_bw()+
  theme(axis.text.x = element_text(size = 10), legend.position = "none")

Alpha-diversity is statistically significantly higher for CW1-samples compared to the other two groups. As mentioned above, these indices are sensitive to uneven sequencing depth. For this, interpretation should happen with great caution considering only diversity indices non-sensitive to variations in sequencing depth.

Beta-diversity analysis

To measure how distinct or similar microbial communities are across different environments, conditions or treatments, beta-diversity analysis is performed where several distance or dissimilarity metrics tailored to capture specific data characteristics are used. Common beta-diversity metrics: Bray-Curtis dissimilarity (compositional), Jaccard index (compositional)and Unifrac (phylogenetic), each emphasizing different aspects of community composition (presence/absence, abundance, phylogenetic).

The code below calculates the distance matrix using Bray-Curtis and every value in this matrix represents the pairwise dissimilarity between two samples. To reduce the complexity of the distance matrix and represent sample relationships in lower-dimensional space, Principal Coordinate Analysis (PCoA) was performed. Statistical testing (PERMANOVA) followed to assess whether the observed patterns/differences are statistically significantly associated with analyzed groups from metadata. PERMANOVA tests for differences in the centroid (average position) of groups, indicating whether groups have different overall compositions.

PCoA_ASV_bray <- ordinate(ps_filt_fixed_samples_rel, distance = "bray", method = "PCoA")
beta_ASV_bray <- phyloseq::distance(ps_filt_fixed_samples_rel, "bray")
ado_bray_group<- adonis2(beta_ASV_bray ~ observation.unit.identifier, metadata_samples)
print(ado_bray_group)

## Permutation test for adonis under reduced model
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = beta_ASV_bray ~ observation.unit.identifier, data = metadata_samples)
##          Df SumOfSqs      R2      F Pr(>F)    
## Model     2   2.6198 0.13162 4.3199  0.001 ***
## Residual 57  17.2839 0.86838                  
## Total    59  19.9038 1.00000                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

PERMANOVA assumes that the dispersion within groups is homogeneous, i.e. the spread of distances from the group centroid is similar across all groups. Unequal dispersions influence the interpretation of PERMANOVA results. To confirm that the observed differences in PERMANOVA are not simply due to one group having more variability in composition, it is a good practice to check for the homogenity of dispersions. For this, a beta-dispersion test will be performed. However, according to a paper by Anderson and Walsh (2013) it is most important if the sample sizes are not equal. If they are, the dispersion might be heterogenous and one could still interpret PERMANOVA results.

group.bd<- betadisper(d = beta_ASV_bray, group = metadata_samples$observation.unit.identifier, type="centroid")
boxplot(group.bd)

permutest(group.bd)

## 
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
## 
## Response: Distances
##           Df  Sum Sq   Mean Sq      F N.Perm Pr(>F)
## Groups     2 0.01497 0.0074843 0.6005    999  0.573
## Residuals 57 0.71044 0.0124638

Dispersions do not differ between groups, therefore PERMANOVA results can be interpreted with confidence.

The code below shows how to plot results from PCoA analysis. Here a confidence ellipsis with mean values (plotted as diamonds) is added. This kind of ellipsis shows where are the mean results in each group and it also shows the dispersion of the samples around the mean.

plot_ordination(ps_filt_fixed_samples_rel, PCoA_ASV_bray, axes = c(1,2))+
  stat_conf_ellipse(geom = "polygon", alpha = 0.1, aes(color = observation.unit.identifier,  fill = observation.unit.identifier))+
  geom_point(size = 3, shape = 21, aes(fill = observation.unit.identifier))+
  stat_mean(aes(fill = observation.unit.identifier), shape = 23, size = 4) + 
  annotate("text", x =- 0.3, y = 0.5, color = "black", label = paste0("p=", round(ado_bray_group$`Pr(>F)`[1], digits = 3)))+
  annotate("text", x = -0.3, y = 0.45, color = "black", label = paste0("R\u00b2=", round((ado_bray_group$R2[1])*100, digits = 2), "%"))+
  scale_fill_nejm(name = "observation\nunit")+
  scale_color_nejm(name = "observation\nunit")+
  theme_bw()+
  theme(legend.position = "right")

Following the above example, beta-diversity analysis is performed using Weighted Unifrac distance metric which accounts for both phylogenetic relationships between taxa and their relative abundance.

PCoA_ASV_wunifrac <- ordinate(ps_filt_fixed_samples_rel, distance = "wunifrac", method = "PCoA")
beta_ASV_wunifrac <- phyloseq::distance(ps_filt_fixed_samples_rel, "wunifrac")
ado_wunifrac_group<- adonis2(beta_ASV_wunifrac ~ observation.unit.identifier, metadata_samples)
print(ado_wunifrac_group)

## Permutation test for adonis under reduced model
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = beta_ASV_wunifrac ~ observation.unit.identifier, data = metadata_samples)
##          Df SumOfSqs      R2      F Pr(>F)    
## Model     2  0.10444 0.13845 4.5799  0.001 ***
## Residual 57  0.64992 0.86155                  
## Total    59  0.75436 1.00000                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here, the dispersion also needs to be checked

group_wunifrac.bd<- betadisper(d = beta_ASV_wunifrac, group = metadata_samples$observation.unit.identifier, type="centroid")
boxplot(group_wunifrac.bd)

permutest(group_wunifrac.bd)

## 
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
## 
## Response: Distances
##           Df   Sum Sq    Mean Sq      F N.Perm Pr(>F)
## Groups     2 0.000482 0.00024084 0.1744    999  0.841
## Residuals 57 0.078723 0.00138110

The dispersion does not differ therefore conclusions can be drawn from PERMANOVA test

plot_ordination(ps_filt_fixed_samples_rel, PCoA_ASV_wunifrac, axes = c(1,2))+
  stat_conf_ellipse(geom = "polygon", alpha = 0.1, aes(color = observation.unit.identifier,  fill = observation.unit.identifier))+
  geom_point(size = 3, shape = 21, aes(fill = observation.unit.identifier))+
  stat_mean(aes(fill = observation.unit.identifier), shape = 23, size = 4) + 
  annotate("text", x = 0.1, y = -0.15, color = "black", label = paste0("p=", round(ado_wunifrac_group$`Pr(>F)`[1], digits = 3)))+
  annotate("text", x = 0.1, y = -0.13, color = "black", label = paste0("R\u00b2=", round((ado_wunifrac_group$R2[1])*100, digits = 2), "%"))+
  scale_fill_nejm(name = "observation\nunit")+
  scale_color_nejm(name = "observation\nunit")+
  theme_bw()+
  theme(legend.position = "right")

Differential abundance analysis (DAA)

A common practice in microbiome analysis is to identify which taxa significantly differ between sample groupings, conditions or treatments applied in the respective study. There are different ways to perform differential abundance testing with microbiome data. The researcher needs to decide which method suits the project aim the best. In the article by Nearing et al. (2022) the performance of different testing methods for DAA is compared. Among the tools showing the most consistent results was MaAsLin2 (Multivariate Association with Linear Models 2) and below we show how to use the R package. The output is generated into separate folders and files.

MaAsLin3 is also available for use and its tutorial can be found here.

Data Preprocessing: Filtering by prevalence

There are several statistical challenges in analyzing microbiome data due to its unique properties that have led to the development of different approaches to ensure statistically reliable results (Yang and Chen, 2022).

Among others, one question raised when processing data before DAA is whether rare taxa should be filtered out. Filtering out potentially uninformative features before running statistical tests could help address the problem of statistical power reduction due to multiple test correction. Nearing et al.(2022). However, data filtering could have unexpected effects in some cases such as false positives increases (type I errors) Bourgon et al. (2010). This is another debatable topic when someone deals with high-dimensional data.

Given that sequencing data of microbial communities are compositional, filtering out features directly affects the relative abundance of each taxon. Nevertheless, it is reasonable to exclude less prevalent features prior to DA testing because their presence might prohibit identification of significant ASVs between conditions by the DA tools.

Choosing the optimal prevalence filtering cut-off involves certain exploratory steps. The following code will keep only taxa with a prevalence of 5% (taxa present in min. 5% of samples).

ps_filt_fixed_samples_prev<- tax_filter(ps_filt_fixed_samples,
  min_prevalence = 0.05)
  
print(ps_filt_fixed_samples_prev)

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1761 taxa and 60 samples ]
## sample_data() Sample Data:       [ 60 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 1761 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1761 tips and 1760 internal nodes ]

In this function other parameters can be also set, such as prevalence detection threshold (prev_detection_threshold). Therefore for example to filter taxa with specific minimum abundance (eg. 0.25%) in a given amount of samples (eg. two), the code would be as follows: tax_filter(ps_filt_fixed_samples, min_prevalence = 2, prev_detection_threshold = 0.0025)

Let’s compare the number of taxa in filtered and unfiltered phyloseq object.

The number of taxa before filtering:

ntaxa(ps_filt_fixed_samples)

## [1] 3772

The number of taxa after filtering:

ntaxa(ps_filt_fixed_samples_prev)

## [1] 1761

The number of removed taxa:

ntaxa_removed <- ntaxa(ps_filt_fixed_samples)-ntaxa(ps_filt_fixed_samples_prev)
print(ntaxa_removed)

## [1] 2011

The number of removed taxa as proportion of all taxa:

ntaxa_removed*100/ntaxa(ps_filt_fixed_samples)

## [1] 53.31389

When applying a 5% prevalence threshold, more than 50% of the taxa are removed. This means that almost half of the taxa were present in at least 5% of the samples. Taxa appearing in less than 5% of the samples were filtered out.

To check what happens to the number of taxa when less strict prevalence thresholds are applied, use the following code:

prev_thresholds <- c(0.05, 0.02, 0.01)

prev_filt_results <- data.frame(min_prevalence = prev_thresholds, num_taxa_retained = NA, perc_taxa_removed = NA)

total_taxa <- nrow(tax_table(ps_filt_fixed_samples))

for (i in seq_along(prev_thresholds)) {
  ps_filt_fixed_samples_prev_all <- ps_filt_fixed_samples %>%
    tax_filter(min_prevalence = prev_thresholds[i])
  
  num_taxa_retained <- nrow(tax_table(ps_filt_fixed_samples_prev_all))
  perc_taxa_removed <- 100 * (total_taxa - num_taxa_retained) / total_taxa
  
  prev_filt_results$num_taxa_retained[i] <- num_taxa_retained
  prev_filt_results$perc_taxa_removed[i] <- perc_taxa_removed
}

print(prev_filt_results)

##   min_prevalence num_taxa_retained perc_taxa_removed
## 1           0.05              1761          53.31389
## 2           0.02              2480          34.25239
## 3           0.01              3772           0.00000

As mentioned above, depending on the research question, the filtering threshold could be an important consideration. For the purpose of the present analysis, we will focus on more common taxa across the samples and thus, we will apply minimum prevalence of 5%.

MaAsLin2

Since 16S rRNA gene sequencing has limited resolution at species level, usually genus level is used. However in case of this dataset many genera were not annotated, therefore we will use the highest rank with the most annotated taxa, which was order level. MaAsLin2 requires input files (feature abundance and metadata) in specific format. For more information on how to run it, check out the link. The code below will prepare the input data which needs to be a data frame. Therefore, the ASV table needs to be extracted from the phyloseq object. Because a phyloseq object grouped at order level will be used - taxonomy is already added as row names to otu_table. Differential abundance analysis in this example is performed on compositionally transformed data.

ps_filt_fixed_samples_order_rel<- tax_transform(ps_filt_fixed_samples_order, trans = "compositional")
input_sample_metadata <- ps_filt_fixed_samples_order_rel%>% meta
input_data_order_Maaslin <- as.data.frame(otu_table(ps_filt_fixed_samples_order_rel))

MaAsLin2 requires a file with features as columns and samples as rows. Any other non-numeric columns (e.g., taxonomic labels) in the ASV table should be removed.

input_data_order_Maaslin <- input_data_order_Maaslin[sapply(input_data_order_Maaslin, is.numeric)]

It is important to make sure the input_metadata is a data frame.

For pairwise comparisons done in MaAsLin2 it is important to indicate which group of samples is the reference. There are 3 ways of doing that: - setting the column describing grouping variable as a factor with 1st level assigned to a control group, - changing names of the groups by prefixing the desired reference category with strings such as “a_” - MaAsLin2 by default treats the first category in alphabetical order as the reference. - by indicating which group (in which column of the metadata) is the reference.

In this tutorial we will use the last method.

Before running MaAsLin2 we need to define the output folder.

output_directory_order <- "Maaslin2_output_order"

Run the MaAsLin2 analysis. The code below uses a linear model to fit the data and identify associations between microbial features and metadata variables. Here we have set the parameters for abundance and prevalence filtering (ASVs that will be tested need to be present in at least 5% of samples). Therefore, we will use the phyloseq object before filtering by prevalence. Data is already normalized using total-sum scaling (TSS),therefore in the options of MaAsLin2 function we choose not to normalize the data for the second time. However, we choose to log transform it. P-values are adjusted using the default Benjamini-Hochberg (BH) method. For other options please check the documentation by executing C:/ProgramData/R/win-library/4.4/Maaslin2/help/Maaslin2 in the console. You can also set random effects to the model.

fit_data_order <- Maaslin2::Maaslin2(
    input_data = input_data_order_Maaslin,
    input_metadata = input_sample_metadata,
    output = output_directory_order,
    analysis_method = "LM",
    fixed_effects = "observation.unit.identifier",
    normalization = "NONE", 
    correction = "BH",
    transform = "LOG",
    min_abundance = 0,
    min_prevalence = 0.05,
    reference = "observation.unit.identifier,CW1-sand")

The result of the run is stored in the separate folder which we set above in the working directory. Visualization output files include separate boxplots for each significant association with categorical metadata and scatterplots with fitted regression line for continuous metadata. Additionally, a heatmap showing 50 top features (at order level) significantly associated with analyzed variables. All results are also saved in tables (incl. all associations and significant associations in separate files). There is also RDS file with all residuals of the model and a log file.

Full explanation can be found here: https://huttenhower.sph.harvard.edu/maaslin/

The first few lines of the significant results looks like that:

head(fit_data_order$results, n=20) %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "1000px", height = "500px")

feature	metadata	value	coef	stderr	pval	name	qval	N	N.not.zero
o__Gemmatales	observation.unit.identifier	CW3-biochar	-3.698429	0.4957888	0.00e+00	observation.unit.identifierCW3-biochar	0.0000002	60	39
c__Gammaproteobacteria.Class	observation.unit.identifier	CW3-biochar	2.704977	0.4215605	0.00e+00	observation.unit.identifierCW3-biochar	0.0000058	60	32
c__Subgroup_11.Class	observation.unit.identifier	CW3-biochar	-1.769331	0.3071663	4.00e-07	observation.unit.identifierCW3-biochar	0.0000362	60	25
o__Polyangiales	observation.unit.identifier	CW3-biochar	3.046451	0.5297626	4.00e-07	observation.unit.identifierCW3-biochar	0.0000362	60	59
o__Gemmatales	observation.unit.identifier	CW2-biochar	-3.187981	0.5609210	5.00e-07	observation.unit.identifierCW2-biochar	0.0000372	60	39
o__Thermoanaerobaculales	observation.unit.identifier	CW3-biochar	2.907719	0.5315996	1.00e-06	observation.unit.identifierCW3-biochar	0.0000685	60	34
o__Thermomicrobiales	observation.unit.identifier	CW3-biochar	-1.801108	0.3509496	3.60e-06	observation.unit.identifierCW3-biochar	0.0002024	60	28
o__PAUC26f	observation.unit.identifier	CW3-biochar	-1.757618	0.3570018	7.60e-06	observation.unit.identifierCW3-biochar	0.0003762	60	28
c__MB.A2.108.Class	observation.unit.identifier	CW3-biochar	-1.762682	0.3699913	1.35e-05	observation.unit.identifierCW3-biochar	0.0005896	60	16
o__Ectothiorhodospirales	observation.unit.identifier	CW2-biochar	2.132338	0.4523215	1.61e-05	observation.unit.identifierCW2-biochar	0.0006331	60	30
o__Sphingomonadales	observation.unit.identifier	CW3-biochar	-1.857668	0.4003936	2.09e-05	observation.unit.identifierCW3-biochar	0.0007483	60	47
o__Reyranellales	observation.unit.identifier	CW3-biochar	-1.753530	0.3874612	3.11e-05	observation.unit.identifierCW3-biochar	0.0010209	60	31
o__Rhodobacterales	observation.unit.identifier	CW3-biochar	-2.405116	0.5548591	6.01e-05	observation.unit.identifierCW3-biochar	0.0017358	60	19
o__RBG.13.54.9	observation.unit.identifier	CW3-biochar	1.540412	0.3560167	6.17e-05	observation.unit.identifierCW3-biochar	0.0017358	60	45
c__Alphaproteobacteria.Class	observation.unit.identifier	CW3-biochar	-1.931087	0.4564534	8.55e-05	observation.unit.identifierCW3-biochar	0.0017857	60	41
o__Ectothiorhodospirales	observation.unit.identifier	CW3-biochar	1.701981	0.3997995	7.82e-05	observation.unit.identifierCW3-biochar	0.0017857	60	30
p__FCPU426.Phylum	observation.unit.identifier	CW3-biochar	1.297098	0.3035762	7.41e-05	observation.unit.identifierCW3-biochar	0.0017857	60	20
c__uncultured_Acidobacterium_sp..Class	observation.unit.identifier	CW3-biochar	-1.452569	0.3396592	7.32e-05	observation.unit.identifierCW3-biochar	0.0017857	60	17
c__BD2.11_terrestrial_group.Class	observation.unit.identifier	CW3-biochar	-1.608018	0.3802796	8.61e-05	observation.unit.identifierCW3-biochar	0.0017857	60	20
o__Sphingomonadales	observation.unit.identifier	CW2-biochar	-1.899004	0.4529937	9.74e-05	observation.unit.identifierCW2-biochar	0.0018759	60	47

This is the heatmap.

knitr::include_graphics("./Maaslin2_output_order/figures/heatmap.png")

heatmap, MaAslin2, order level

And one of the boxplots

knitr::include_graphics("./Maaslin2_output_order/figures/observation_unit_identifier_1.png")

boxplot, MaAslin2, order level

All MaAsLin2 analyses can be run also using other taxonomic ranks.

Linda - Linear (Lin) Model for Differential Abundance Analysis (DAA) of High-dimensional Compositional Data

LinDa is another approach to perform DAA which only requires fitting linear regression models on CLR-transformed abundance data as response. This method addresses the compositional effects by correcting the bias using the mode of the regression coefficients and can be extended to mixed-effect models for correlated microbiome data (Zhou et al. (2022)).

LinDA function has been integrated into the CRAN package MicrobiomeStat.

To run LinDa, a matrix of counts/proportions (row-features, column-samples) and a data frame containing sample metadata (without NAs) are needed. In this case, we will use the same phyloseq object (relative abundance, order level) as in MaAsLin2 to create the input data for LinDA.

asv.mat <- otu_table(ps_filt_fixed_samples_order_rel) %>% as.matrix()
metadata.dat <- meta(ps_filt_fixed_samples_order_rel) %>% as.data.frame()

Before running the model, consider which are the fixed and random effects in your dataset. Here, several variables could be tested for their effect on the microbiota composition (type of CW, depth, location, season). In this example, to compare LinDa results to MaasLin2 the type of CW is used as fixed effect. Similar to MaasLin2, the reference group for comparisons is typically determined by alphabetical order of the group labels in the metadata unless specified otherwise.

Prevalence filter will be set at 0.05 as in MaasLin2 above. Other parameters that need consideration for optimizing the performance of LinDa are mentioned in the R documentation of the function and in this article by Li et al. (2023).

linda.obj.order <- linda(feature.dat = asv.mat, meta.dat = metadata.dat,
                         feature.dat.type = 'proportion',
                         formula = '~observation.unit.identifier', 
                         prev.filter = 0.05)

## 70  features are filtered!
## The filtered data has  60  samples and  221  features will be tested!
## Fit linear models ...
## Completed.

The output message gives information about the results of the filtering (prevalence and/or abundance) and the method used for handling zeros in the dataset.

The output file is an object (list of elements) which includes the variable names of the fixed effects based on the formula used, the estimated bias of the regression coefficients due to the compositional effect and a list of data frames including the results of the DAA. All relevant information can be found here.

Let’s inspect the results of LinDA for each comparison.

linda.out <- as.data.frame(linda.obj.order$output)
head(linda.out, n=20) %>% kbl() %>% 
   kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "1000px", height = "500px")

	observation.unit.identifierCW2.biochar.baseMean	observation.unit.identifierCW2.biochar.log2FoldChange	observation.unit.identifierCW2.biochar.lfcSE	observation.unit.identifierCW2.biochar.stat	observation.unit.identifierCW2.biochar.pvalue	observation.unit.identifierCW2.biochar.padj	observation.unit.identifierCW2.biochar.reject	observation.unit.identifierCW2.biochar.df	observation.unit.identifierCW3.biochar.baseMean	observation.unit.identifierCW3.biochar.log2FoldChange	observation.unit.identifierCW3.biochar.lfcSE	observation.unit.identifierCW3.biochar.stat	observation.unit.identifierCW3.biochar.pvalue	observation.unit.identifierCW3.biochar.padj	observation.unit.identifierCW3.biochar.reject	observation.unit.identifierCW3.biochar.df
k__Bacteria Domain	2615.1623	-0.9464056	0.4701471	-2.0129990	0.0488465	0.1587511	FALSE	57	2615.1623	-0.9225019	0.4155552	-2.2199261	0.0304168	0.0873001	FALSE	57
o__Rickettsiales	1882.3681	-0.8726679	0.3684927	-2.3682092	0.0212886	0.0931557	FALSE	57	1882.3681	-1.2036368	0.3257046	-3.6954857	0.0004942	0.0034891	TRUE	57
o__Gemmatimonadales	16401.9780	-1.3938145	0.5580967	-2.4974426	0.0154208	0.0774682	FALSE	57	16401.9780	-1.8783182	0.4932924	-3.8077174	0.0003453	0.0029486	TRUE	57
o__Candidatus_Woesebacteria	870.2351	0.0857696	0.2033907	0.4216989	0.6748310	0.7920579	FALSE	57	870.2351	0.2128566	0.1797736	1.1840254	0.2413171	0.3555405	FALSE	57
o__Bacteroidales	2727.4743	1.4723457	0.7833587	1.8795293	0.0652873	0.1873831	FALSE	57	2727.4743	1.6314697	0.6923979	2.3562605	0.0219211	0.0692081	FALSE	57
o__Candidatus_Yanofskybacteria	837.3316	0.0025060	0.1057869	0.0236890	0.9811834	0.9904800	FALSE	57	837.3316	0.1017271	0.0935033	1.0879528	0.2811949	0.3933169	FALSE	57
o__Candidatus_Peribacteria	809.2383	-0.0509682	0.2828091	-0.1802213	0.8576181	0.9156212	FALSE	57	809.2383	-0.1244858	0.2499703	-0.4980025	0.6203983	0.6968140	FALSE	57
o__Gemmatales	18857.2374	-3.1372107	0.5098128	-6.1536524	0.0000001	0.0000178	TRUE	57	18857.2374	-3.6320969	0.4506151	-8.0603087	0.0000000	0.0000000	TRUE	57
o__Planctomycetales	23581.0263	-2.7635271	0.7660334	-3.6075803	0.0006519	0.0146899	TRUE	57	23581.0263	-1.8579543	0.6770843	-2.7440517	0.0081000	0.0314055	TRUE	57
o__Pirellulales	29513.5682	-0.4906958	0.5191926	-0.9451133	0.3485902	0.5135896	FALSE	57	29513.5682	-0.6835982	0.4589057	-1.4896266	0.1418353	0.2487745	FALSE	57
o__Opitutales	3113.7280	1.5026224	0.6038263	2.4885012	0.0157741	0.0774682	FALSE	57	3113.7280	1.5131727	0.5337121	2.8351855	0.0063269	0.0263819	TRUE	57
o__S-BQ2-57_soil_group	2211.2237	-0.8820417	0.3743333	-2.3563001	0.0219190	0.0931557	FALSE	57	2211.2237	-1.2297018	0.3308671	-3.7166041	0.0004621	0.0034044	TRUE	57
o__Methylacidiphilales	700.1043	0.2995374	0.1921255	1.5590716	0.1245149	0.2942982	FALSE	57	700.1043	0.4777785	0.1698166	2.8134973	0.0067130	0.0274587	TRUE	57
o__Verrucomicrobiales	6260.8613	-1.8827093	0.4593685	-4.0984728	0.0001332	0.0058880	TRUE	57	6260.8613	-1.4976424	0.4060282	-3.6885178	0.0005052	0.0034891	TRUE	57
o__Pedosphaerales	13430.6233	-0.2928035	0.4951305	-0.5913663	0.5566131	0.7151831	FALSE	57	13430.6233	-0.4257263	0.4376376	-0.9727826	0.3347716	0.4377782	FALSE	57
o__Chthoniobacterales	11366.3380	-1.0685677	0.5999567	-1.7810745	0.0802288	0.2136212	FALSE	57	11366.3380	-1.6244625	0.5302918	-3.0633367	0.0033392	0.0160426	TRUE	57
c__OM190 Class	5451.4862	-1.3953827	0.4325716	-3.2257841	0.0020825	0.0230117	TRUE	57	5451.4862	-1.1341031	0.3823429	-2.9661934	0.0043989	0.0198056	TRUE	57
o__11-24	1514.9431	-0.7997962	0.3304210	-2.4205372	0.0187066	0.0879609	FALSE	57	1514.9431	-0.6364988	0.2920536	-2.1793902	0.0334496	0.0918395	FALSE	57
o__Rhizobiales	115463.1362	0.1687520	0.4340918	0.3887474	0.6989114	0.7920996	FALSE	57	115463.1362	0.2083787	0.3836866	0.5430962	0.5891812	0.6746582	FALSE	57
o__Rhodobacterales	3731.6519	-1.2185165	0.5947999	-2.0486158	0.0451144	0.1533890	FALSE	57	3731.6519	-2.2240341	0.5257338	-4.2303425	0.0000856	0.0011125	TRUE	57

LinDA results can be plotted using the ‘linda.plot’ function which produces the effect size plot of the differential features and volcano plots based on the output. For more details on how the function works, see here.

In this function, the object from function linda is used as input and other arguments such as the variable whose results are to be plotted, significance level (alpha), log2fc cut-off and directory to save the output plots can be set.

The output file is a list of ggplot2 objects and more specifically, a list of effect size plots and a list of volcano plots found in the output directory. In both cases, each plot corresponds to one variable. Plots are saved in pdf format in the output directory. To store the figures separately, see below.

lfc.plot1 <- linda.res.plot.order$plot.lfc[[1]]
volc.plot1 <- linda.res.plot.order$plot.volcano[[1]]

ggsave("./linda_plots/lfc.plot1.png", lfc.plot1, device = png, width = 10, height = 7)
ggsave("./linda_plots/volc.plot1.png", volc.plot1, device = png, width = 10, height = 7)

knitr::include_graphics(path = "./linda_plots/lfc.plot1.png")

This is one of the effect size plots of the differentially abundant features (taxa at order level) between the different types of wetlands in the dataset. Debiased effect size is the adjusted effect size that accounts for the compositional nature of microbiome data (corrected for systematic biases-more reliable). Biased effect size is also reported for comparative reasons and to understand the magnitude of correction applied. It allows to see how much the compositional nature of the data may have affected the raw estimates.

knitr::include_graphics(path = "./linda_plots/volc.plot1.png")

This is one of the volcano plots of the differentially abundant features (taxa at order level) between the different types of wetlands in the dataset. Taxa (blue data points) in the upper right corner are enriched in the test group (CW2 or CW3) while in the upper left corner are depleted in the test group or enriched in the reference group (CW1).

Comparison of the DAA results from MaAsLin2 and LinDA

It is expected that different tools will give different output as they might test different hypotheses and use different statistical tests, transformations and they may assume different data distribution. Here, both MaAsLin2 and LinDA apply linear models, however in MaasLin2 - log transformation (default) was used, whereas in LinDA - CLR. Additionally, LinDA applied bias correction, that was initially introduced by ANCOM-BC workflow (Lin, Peddada, 2020).

To compare the results we will check how many features have statistically significantly different abundance when using both workflows. In order to do so, we will combine the results summary from MaAsLin2 and LinDA and focus on features with adjusted p value <0.05. Before merging results from both pipelines, we need to unify the names of the features, since MaAsLin2 replaced spaces, dashes and brackets with dots. Therefore, we will do the same with LinDA output.

linda.out_p.adj<- linda.out %>% rownames_to_column(var = "feature") %>%
       select(feature, observation.unit.identifierCW2.biochar.padj, observation.unit.identifierCW3.biochar.padj) %>%
  mutate(feature = str_replace_all(feature, " ", ".")) %>%
  mutate(feature = str_replace_all(feature, "-", ".")) %>%
  mutate(feature = str_replace_all(feature, "\\(", ".")) %>%
  mutate(feature = str_replace_all(feature, "\\)", ".")) %>%
       dplyr::rename(CW2 = observation.unit.identifierCW2.biochar.padj, CW3 = observation.unit.identifierCW3.biochar.padj) %>%
       pivot_longer(cols = c(CW2, CW3), names_to = "name", values_to = "p.adj_LinDA") %>%
       mutate(feature_group = paste0(feature, "_", name))

maaslin_p.adj<- fit_data_order$results %>% select(feature, value, qval) %>%
       dplyr::rename(name = value, p.adj_Maaslin = qval) %>%
       mutate(name = case_when(name == "CW2-biochar" ~ "CW2",
                               name == "CW3-biochar" ~ "CW3")) %>%
       mutate(feature_group = paste0(feature, "_", name))

Although the same prevalence filtering was set in case of MaasLin2 and LinDA, the amount of tested features (taxa) is different. As we can see below all tested features included in MaAsLin2 analysis can also be found in LinDA analysis.However, it is not true for the reverse association.

all(maaslin_p.adj$feature %in% linda.out_p.adj$feature)

## [1] TRUE

all(linda.out_p.adj$feature %in% maaslin_p.adj$feature)

## [1] FALSE

The reason for that is that in the case of LinDA more features were included in the analysis after applying the prevalence filter. In the case of MaAsLin2, it is {r length(fit_data_order\$results %\>% select(feature) %\>% unique %\>% pull(feature))} whereas in the case of LinDA {r dim(linda.out)[1]}. Different number of features are left after prevalence filtering due to the used formula: - MaAsLin2 keeps features when present in higher amount of samples than the prevalence threshold (in this case >3, which means at least 4). - LinDA keep features when at least a prevalence threshold is reached (>=3, which means at least 3).

joined_result<- full_join(linda.out_p.adj, maaslin_p.adj[,c(3,4)], by = "feature_group")
joined_result_signif<-joined_result %>% filter(p.adj_LinDA<0.05 | p.adj_Maaslin<0.05) %>%
  mutate(feature_name = paste(feature, "_", name))

The Venn diagrams presented below show the number of differentially abundant features which were detected using both pipelines (intersection of circles), only by LinDA (blue) and only by MaAsLin2 (red) when comparing pairwisely the types of CW.

LinDA_CW2<- joined_result_signif%>%filter(name == "CW2" & p.adj_LinDA<0.05) %>% pull(feature)
LinDA_CW3<- joined_result_signif%>%filter(name == "CW3" & p.adj_LinDA<0.05) %>% pull(feature)
Maaslin_CW2<- joined_result_signif%>%filter(name == "CW2" & p.adj_Maaslin<0.05) %>% pull(feature)
Maaslin_CW3<- joined_result_signif%>%filter(name == "CW3" & p.adj_Maaslin<0.05) %>% pull(feature)
Venn_CW2<- list(LinDA = LinDA_CW2, Maaslin2 = Maaslin_CW2)
Venn_CW3<- list(LinDA = LinDA_CW3, Maaslin2 = Maaslin_CW3)
Venn_CW2 <- Venn(Venn_CW2)
Venn_CW2 <- process_data(Venn_CW2, shape_id = "201")
venn_plot_CW2<-plot_venn(Venn_CW2)+
  scale_fill_distiller(palette = "RdBu")+
  ggtitle("CW1 vs. CW2")
Venn_CW3 <- Venn(Venn_CW3)
Venn_CW3 <- process_data(Venn_CW3, shape_id = "201")
venn_plot_CW3<- plot_venn(Venn_CW3)+
  scale_fill_distiller(palette = "RdBu")+
  ggtitle("CW1 vs. CW3")

ggarrange(venn_plot_CW2, venn_plot_CW3, nrow=2)

Over 70% of the total number of features were detected by both pipelines indicating a significant overlap in the results. In general, MaAsLin2 detects higher number of differentially abundant features in each comparison that were not detected by LinDA.

Features shown to be different using MaAsLin2 only:

for comparison between CW1 and CW2: o__Rickettsiales, o__11.24, o__Immundisolibacterales, o__Kiloniellales, o__Latescibacterales, o__Nitrospirales, o__Subgroup_15, o__Propionibacteriales, o__Micrococcales, o__uncultured_actinobacterium
for comparison between CW1 and CW3: k__Bacteria.Domain, o__NRL2, o__Ferrovibrionales, o__Puniceispirillales, o__Salinisphaerales, o__Isosphaerales, c__Pla4_lineage.Class, o__Subgroup_15, o__uncultured_Anaerolineaceae_bacterium, o__Lineage_IV, c__Acidimicrobiia.Class, o__uncultured_Catenulispora_sp.

Features shown to be different using LinDA only:

for comparison between CW1 and CW2: o__053A03.B.DI.P58, o__1.20
for comparison between CW1 and CW3: o__Methylacidiphilales, o__Thiotrichales, o__Myxococcales

The relationship between log2 fold changes (between reference and tested group) calculated by LinDA and coefficients calculated by MaAsLin2 can be assessed by the following scatter plots. The strength of the relationship was measured by calculating a Pearson correlation coefficient (R).

linda.out_log_change<- linda.out %>% rownames_to_column(var = "feature") %>%
       select(feature, CW2 = observation.unit.identifierCW2.biochar.log2FoldChange, CW3 = observation.unit.identifierCW3.biochar.log2FoldChange) %>%
  mutate(feature = str_replace_all(feature, " ", ".")) %>%
  mutate(feature = str_replace_all(feature, "-", ".")) %>%
  mutate(feature = str_replace_all(feature, "\\(", ".")) %>%
  mutate(feature = str_replace_all(feature, "\\)", ".")) %>%
       pivot_longer(cols = c(CW2, CW3), names_to = "name", values_to = "log_change_LinDA") %>%
  mutate(feature_name = paste0(feature, "_", name))

Maaslin_coef<- fit_data_order$results %>% select(feature, name = value, coef_Maaslin2 = coef) %>%
       mutate(name = case_when(name == "CW2-biochar" ~ "CW2",
                               name == "CW3-biochar" ~ "CW3")) %>%
  mutate(feature_name = paste0(feature, "_", name))

joined_result_coef<-linda.out_log_change %>% full_join(Maaslin_coef[,c(3,4)], by="feature_name") %>%
       left_join(joined_result[,c(3,4,5)], by = c("feature_name" = "feature_group")) %>% mutate(signif = case_when(p.adj_Maaslin <=0.05 & p.adj_LinDA <= 0.05 ~ "signif_both",
          p.adj_Maaslin <=0.05 & p.adj_LinDA > 0.05 ~ "signif_Maaslin2",
          p.adj_Maaslin >0.05 & p.adj_LinDA <= 0.05 ~ "signif_LinDA",
          is.na(p.adj_Maaslin) & p.adj_LinDA <= 0.05 ~ "signif_LinDA",
                                                      TRUE ~ "NS"))
joined_result_coef %>% 
  ggplot(aes(x = log_change_LinDA, y = coef_Maaslin2)) +
  geom_jitter(shape = 21, alpha = 0.6, size = 2, width = 0.3, height = 0.1, aes(fill = signif))+
  stat_cor(method = "pearson", label.x = -2, label.y = 2)+
  facet_wrap(~name)+
  scale_fill_nejm()+
  theme_bw()

On the plot we can see that the direction of differences is the same. The two variables have a positive association and a strong linear relationship.

Other useful resources

Tutorials

Packages

A list of R environment based tools for microbiome data exploration, statistical analysis and visualization

sessionInfo()

## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_World.utf8  LC_CTYPE=English_World.utf8   
## [3] LC_MONETARY=English_World.utf8 LC_NUMERIC=C                  
## [5] LC_TIME=English_World.utf8    
## 
## time zone: Europe/Berlin
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    grid      stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ggVennDiagram_1.5.2   stringr_1.5.1         plyr_1.8.9           
##  [4] dplyr_1.1.4           MicrobiomeStat_1.2    biomeUtils_0.022     
##  [7] phylotools_0.2.2      Biostrings_2.72.1     GenomeInfoDb_1.40.1  
## [10] XVector_0.44.0        IRanges_2.38.1        S4Vectors_0.42.1     
## [13] BiocGenerics_0.50.0   ape_5.8               ComplexHeatmap_2.20.0
## [16] ranacapa_0.1.0        patchwork_1.3.0       GGally_2.2.1         
## [19] Maaslin2_1.18.0       here_1.0.1            rstatix_0.7.2        
## [22] kableExtra_1.4.0      ggsci_3.2.0           forcats_1.0.0        
## [25] RColorBrewer_1.1-3    scales_1.3.0          pals_1.9             
## [28] ggpubr_0.6.0          tidyr_1.3.1           tibble_3.2.1         
## [31] ggtree_3.12.0         microViz_0.12.6       vegan_2.6-8          
## [34] lattice_0.22-6        permute_0.9-7         microbiome_1.26.0    
## [37] ggplot2_3.5.1         phyloseq_1.48.0       taxize_0.10.0        
## [40] rentrez_1.2.3        
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.4.1           urltools_1.7.3          ggplotify_0.1.2        
##   [4] triebeard_0.4.1         rpart_4.1.24            XML_3.99-0.18          
##   [7] lifecycle_1.0.4         Rdpack_2.6.2            doParallel_1.0.17      
##  [10] rprojroot_2.0.4         MASS_7.3-61             backports_1.5.0        
##  [13] magrittr_2.0.3          sass_0.4.9              rmarkdown_2.29         
##  [16] jquerylib_0.1.4         yaml_2.3.10             cowplot_1.1.3          
##  [19] pbapply_1.7-2           mapproj_1.2.11          DBI_1.2.3              
##  [22] minqa_1.2.8             ade4_1.7-22             maps_3.4.2.1           
##  [25] abind_1.4-8             zlibbioc_1.50.0         Rtsne_0.17             
##  [28] quadprog_1.5-8          purrr_1.0.4             hash_2.2.6.3           
##  [31] yulab.utils_0.2.0       seriation_1.5.7         circlize_0.4.16        
##  [34] GenomeInfoDbData_1.2.12 rmutil_1.1.10           ggrepel_0.9.6          
##  [37] tidytree_0.4.6          crul_1.5.0              pheatmap_1.0.12        
##  [40] spatial_7.3-18          svglite_2.1.3           codetools_0.2-20       
##  [43] getopt_1.20.4           xml2_1.3.6              tidyselect_1.2.1       
##  [46] shape_1.4.6.1           aplot_0.2.4             httpcode_0.3.0         
##  [49] UCSC.utils_1.0.0        farver_2.1.2            stable_1.1.6           
##  [52] lme4_1.1-36             TSP_1.2-4               matrixStats_1.5.0      
##  [55] jsonlite_1.8.9          GetoptLong_1.0.5        multtest_2.60.0        
##  [58] Formula_1.2-5           survival_3.7-0          iterators_1.0.14       
##  [61] systemfonts_1.2.1       foreach_1.5.2           tools_4.4.1            
##  [64] treeio_1.28.0           Rcpp_1.0.13             glue_1.8.0             
##  [67] xfun_0.49               mgcv_1.9-1              ca_0.71.1              
##  [70] numDeriv_2016.8-1.1     withr_3.0.2             timeSeries_4041.111    
##  [73] fastmap_1.2.0           boot_1.3-31             rhdf5filters_1.16.0    
##  [76] digest_0.6.37           R6_2.6.0                gridGraphics_0.5-1     
##  [79] colorspace_2.1-1        Cairo_1.6-2             dichromat_2.0-0.1      
##  [82] modeest_2.4.0           utf8_1.2.4              generics_0.1.3         
##  [85] data.table_1.16.2       DECIPHER_3.0.0          robustbase_0.99-4-1    
##  [88] httr_1.4.7              ggstats_0.8.0           pkgconfig_2.0.3        
##  [91] gtable_0.3.6            timeDate_4041.110       registry_0.5-1         
##  [94] picante_1.8.2           pcaPP_2.0-5             htmltools_0.5.8.1      
##  [97] carData_3.0-5           biomformat_1.32.0       clue_0.3-66            
## [100] logging_0.10-108        Biobase_2.64.0          png_0.1-8              
## [103] optparse_1.7.5          reformulas_0.4.0        ggfun_0.1.8            
## [106] knitr_1.49              rstudioapi_0.17.1       reshape2_1.4.4         
## [109] rjson_0.2.23            statip_0.2.3            nloptr_2.1.1           
## [112] nlme_3.1-166            curl_6.2.0              cachem_1.1.0           
## [115] zoo_1.8-12              rhdf5_2.48.0            GlobalOptions_0.1.2    
## [118] parallel_4.4.1          fBasics_4041.97         pillar_1.10.1          
## [121] vctrs_0.6.5             car_3.1-3               cluster_2.1.6          
## [124] evaluate_1.0.3          mvtnorm_1.3-3           cli_3.6.3              
## [127] compiler_4.4.1          rlang_1.1.4             crayon_1.5.3           
## [130] ggsignif_0.6.4          labeling_0.4.3          fs_1.6.5               
## [133] stringi_1.8.4           viridisLite_0.4.2       lmerTest_3.1-3         
## [136] munsell_0.5.1           lazyeval_0.2.2          Matrix_1.7-1           
## [139] stabledist_0.7-2        Rhdf5lib_1.26.0         statmod_1.5.0          
## [142] rbibutils_2.3           igraph_2.1.1            broom_1.0.7            
## [145] bslib_0.9.0             phangorn_2.12.1         biglm_0.9-3            
## [148] fastmatch_1.1-6         DEoptimR_1.1-3-1

Amplicon sequencing tutorial using wetlands dataset

Anna Malinowska, Baris Özdinç & Asimenia Gavriilidou

2025-02-19