Introduction

The following script shows basic analyses of 16S rRNA amplicon sequencing data.If needed RMarkdown file can be downloaded here.
Script includes some pre-processing steps, as well as calculations, statistical analyses and visualizations. Raw paired-end read files were used as input to generate de novo Amplicon Sequence Variants (ASVs) with NG-Tax 2.0.

This sample dataset comes from a project aimed at studying the role of microbial communities on micropollutant removal in bark-biochar vertical subsurface flow constructed wetlands. Since conventional wastewater treatment plants are inefficient in removing micropollutants, there is a need for post-treatment methods that are feasible and have high removal efficiency. One of such treatment methods might be constructed wetlands (CW), which are man-made structures built to imitate the mechanisms of natural wetland systems, such as shallow ponds, mangroves, trenches, etc. In CW, a basin is filled with the solid material on which microorganisms can form biofilms. Subsequently, natural processes, such as biodegradation, sorption and plant uptake, result in the removal of micropollutants.

In this project CW basins were filled with a following solid filling material:

  1. Sand (group CW1-sand) - which is conventional filling

  2. Combination of sand and bark-biochar (built with an innovative layer of a combination of bark and biochar in between layers of sand) (CW2-biochar)

  3. Layer of bark-biochar only (CW3-biochar)

The water was distributed over the top of the CW and subsequently kept in the system for 8 h. One of the aims of the study was to identify the microorganisms that are present in 3 types of constructed wetlands with different solid filling material over the depth of the constructed wetland. Samples were taken from 3 CW from the following depths:

  • 30 cm

  • 50 cm

  • 70 cm

  • 90 cm

All constructed wetlands were divided into 2 parts (A and D) and samples were taken from both locations. Additionally, samples were taken in winter and summer, since the temperature and other environmental factors might have the impact on the composition of the wetlands microbiome and therefore their functional properties.

Preparations

Attaching necessary libraries

knitr::opts_chunk$set(echo = TRUE, include=TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra = "")
library(rentrez)
library(taxize)
library(phyloseq)
library(microbiome)
library(vegan)
library(microViz)
library(ggtree)
library(tibble)
library(tidyr)
library(ggplot2)
library(ggpubr)
library(pals)
library(scales)
library(RColorBrewer)
library(forcats)
library(ggsci)
library(kableExtra)
library(rstatix)
library(here)
library(Maaslin2)
library(GGally)
library(patchwork)
library(ranacapa)
library(ComplexHeatmap)
library(ape)
library(Biostrings)
library(phylotools)
library(biomeUtils)
library(MicrobiomeStat)
library(dplyr)
library(plyr)
library(stringr)
library(ggVennDiagram)

Setting the work directory

Remember to indicate the directory where all your files will be saved. The working directory should contain the biom file as exported by NG-Tax 2.0 and the metadata file. The metadata file should include descriptive information about your experiment, such as collection date, associated environment or location, used to contextualize and analyze data.

Below an example is given. This code will set the working directory to the path where the project is located.

here::i_am("basic_script.Rmd")
script_path <- here::here()
cat("The project root is located at:", script_path, "\n")
## The project root is located at: C:/Users/malin009/OneDrive - Wageningen University & Research/Dry_lab_intro/tutorials/16S/wetlands
setwd(script_path)

Loading the NGTax output and creating a phyloseq object

Phyloseq objects are commonly used for analyzing microbiome data in R. The phyloseq object can contain the following objects:

  • otu_table(): a table that contains in the columns the sample_ID and the rows the OTUs/ASVs, the OTU/ASV abundance (count) table per sample

  • tax_table():A table that presents the taxonomic assignment of each OTU

  • sample_data(): A table that contains the metadata of the samples

  • phy_tree(): The phylogenetic tree representing the evolutionary relationships between the microbial taxa in the dataset

  • refseq(): DNA string set - OTUs/ASVs sequences

The biom file created by NG-Tax 2.0 is composed only of otu_table and tax_table. Metadata and taxonomy tree need to be attached separately. In the metadata file, sample identifiers need to be the same as in the mapping file for NG-Tax 2.0. Additionally, if the composition of reference mocks 3 and 4 needs to be added manually, add 2 samples in the metadata file, namely refmock3 and refmock4. Only sample identifiers need to be given. All other columns can be blank.

The taxonomy tree is stored in the NG-Tax output under the name “all_otus.tree” and all sequences used to generate the tree are stored in the file “all_otus.fasta”.

ps <- read_phyloseq(otu.file = "wetlands_filling_tmock.biom", 
                           taxonomy.file = NULL,
                           metadata.file = "metadata_UNLOCK_wetlands.csv", 
                           type = "biom")

treefile <- read.tree("all_otus_mock.tree")
ps <- merge_phyloseq(ps, treefile)
print(ps)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 4074 taxa and 69 samples ]
## sample_data() Sample Data:       [ 69 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 4074 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 4074 tips and 4073 internal nodes ]
ps_sum<- summarize_phyloseq(ps)
data.frame(do.call(rbind, ps_sum[c(1:10)]))
##                                                                          do.call.rbind..ps_sum.c.1.10...
## 1                                                                           1] Min. number of reads = 14
## 2                                                                       2] Max. number of reads = 917500
## 3                                                                     3] Total number of reads = 3018962
## 4                                                          4] Average number of reads = 43753.0724637681
## 5                                                                      5] Median number of reads = 17925
## 6                                                                        7] Sparsity = 0.929852084267145
## 7                                                                        6] Any OTU sum to 1 or less? NO
## 8                                                                            8] Number of singletons = 0
## 9  9] Percent of OTUs that are singletons \n        (i.e. exactly one read detected across all samples)0
## 10                                                                 10] Number of sample variables are: 7

Here, you can check if your phyloseq object includes all the necessary data. Plus, the summary prints basic information of the data (reads, sparsity, singletons).

Inspecting parts of the phyloseq object

This code will return first 10 rows of otu table.

head(ps@otu_table, n=10) %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "800px", height = "300px")
1_CW1_A_D30_1 1_CW1_A_D50_1 1_CW1_A_D50_2 1_CW1_A_D70_1 1_CW1_A_D70_2 1_CW1_A_D90_1 1_CW1_D_D30_1 1_CW1_D_D50_1 1_CW2_A_D50_2 1_CW2_A_D70_1 1_CW2_A_D93_1 1_CW2_A_D93_2 1_CW2_D_D50_1 1_CW2_D_D70_1 1_CW2_D_D93_1 1_CW3_A_D50_1 1_CW3_A_D50_2 1_CW3_A_D70_1 1_CW3_A_D70_2 1_CW3_A_D93_1 1_CW3_A_D93_2 1_CW3_D_D50_1 1_CW3_D_D70_1 1_CW3_D_D93_1 1_CW3_D_D93_2 1_CW3_D_D93_3 1_CW3_D50_1 1_CW3_D70_1 1_CWI_D_D70_1 1_CWI_D_D70_2 1_CWI_D_D90_1 1_CWI_D_D90_2 2_CW1_A_50_1 2_CW1_A_50_2 2_CW1_A_90_1 2_CW1_A_90_2 2_CW1_A70_1 2_CW1_D_50_1 2_CW1_D_50_2 2_CW1_D_70_1 2_CW1_D90_1 2_CW2_A_50_1 2_CW2_A_50_2 2_CW2_A_70_1 2_CW2_A_90_1 2_CW2_D_50_1 2_CW2_D_70_1 2_CW2_D_70_2 2_CW2_D50_1 2_CW3_A_50_1 2_CW3_A_50_2 2_CW3_A_70_1 2_CW3_A_70_2 2_CW3_A_90_1 2_CW3_A_90R_1 2_CW3_D_50_1 2_CW3_D_50_2 2_CW3_D_70_1 2_CW3_D_90_1 2_CW3_D_90_2 Mock3 Mock4 Water_1 Water_2 Water_3 Water_4 Water_5 refmock3 refmock4
1632941857 0 0 0 0 0 0 0 0 0 0 0 0 0 70 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1632942152 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 23 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1632942190 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 35 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 83 0 57 0 0 0 0 0 0 0 46 21 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1632941008 0 0 0 0 0 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1632942270 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 35 21 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1632941052 0 0 0 0 0 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
163294987 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
163294896 0 0 0 0 0 133 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 39 0 0 60 0 57 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1632942641 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1632942416 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 52 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

The samples are in columns and ASVs are in rows. The values in the table indicate the absolute abundance (counts) of reads per ASV per sample.

This code will return first 10 rows of taxonomy table.

head(ps@tax_table, n=10) %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "800px", height = "300px")
Domain Phylum Class Order Family Genus Species
1632941857 NA p__ c__ o__ f__ g__ s__
1632942152 NA p__ c__ o__ f__ g__ s__
1632942190 NA p__ c__ o__ f__ g__ s__
1632941008 NA p__ c__ o__ f__ g__ s__
1632942270 NA p__ c__ o__ f__ g__ s__
1632941052 NA p__ c__ o__ f__ g__ s__
163294987 NA p__ c__ o__ f__ g__ s__
163294896 NA p__ c__ o__ f__ g__ s__
1632942641 NA p__ c__ o__ f__ g__ s__
1632942416 k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rickettsiales f__Mitochondria g__ s__

We can see that many ASVs are not assigned to any taxonomy (“NA”). What is more, in some cases there is missing genus (“g__”) or family (“f__”) level taxonomy. This can happen for several reasons, such as lack of reference sequences in the database or low sequence similarity. When there are many mismatches with a sequence in the database or the sequences cannot be distinguished with high confidence, NG-Tax does not assign taxonomy or classifies one rank lower to ensure correct taxonomic assignment.

Domain Phylum Class Order Family Genus Species
k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rickettsiales f__Mitochondria g__ s__
k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rickettsiales f__ g__ s__
k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rickettsiales f__Mitochondria g__ s__
k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rickettsiales f__Mitochondria g__ s__
k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rickettsiales f__Mitochondria g__ s__
k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rickettsiales f__Mitochondria g__ s__
k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rickettsiales f__ g__ s__
k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rickettsiales f__Mitochondria g__ s__
k__Bacteria p__Cyanobacteria c__Cyanobacteriia o__Chloroplast f__ g__ s__
k__Bacteria p__Bacteroidota c__Bacteroidia o__Bacteroidales f__ g__ s__

It is good to clean a bit the taxonomy table before you proceed with the analysis and for example change the names of ASVs in such a way that the highest known taxonomy name is always shown (for example when Family is known, but for Genus no annotation was found). This way each ASV will keep as much of individuality as possible. It can be done manually or using the tax_fix function from microViz package (see Preprocessing Steps below).

To inspect the phylogenetic tree attached to the phyloseq object and generated from the ASV sequences, the following code can be used:

ggtree(ps@phy_tree, options(ignore.negative.edge=TRUE))

Metadata information can also be inspected as shown below. The code will print the first 10 rows of the sample data attached to the phyloseq object.

metadata<- ps %>% meta()
head(metadata, n=10) %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "800px", height = "300px")
sample.description sample.name observation.unit.identifier collection.date season location depth
1_CW1_A_D30_1 Lib1_CW1-sand_A_D30_30_Jan_23_Sandfilled 1_Lib1_CW1-sand_A_D30 CW1-sand 1/30/2023 winter A 30
1_CW1_A_D50_1 Lib1_CW1-sand_A_D50_30_Jan_23_Sandfilled 2_Lib1_CW1-sand_A_D50 CW1-sand 1/30/2023 winter A 50
1_CW1_A_D50_2 Lib1_CW1-sand_A_D50_30_Jan_23_Sandfilled 3_Lib1_CW1-sand_A_D50 CW1-sand 1/30/2023 winter A 50
1_CW1_A_D70_1 Lib1_CW1-sand_A_D70_30_Jan_23_Sandfilled 4_Lib1_CW1-sand_A_D70 CW1-sand 1/30/2023 winter A 70
1_CW1_A_D70_2 Lib1_CW1-sand_A_D70_30_Jan_23_Sandfilled 5_Lib1_CW1-sand_A_D70 CW1-sand 1/30/2023 winter A 70
1_CW1_A_D90_1 Lib1_CW1-sand_A_D90_30_Jan_23_Sandfilled 6_Lib1_CW1-sand_A_D90 CW1-sand 1/30/2023 winter A 90
1_CW1_D_D30_1 Lib1_CW1-sand_D_D30_30_Jan_23_B/B1filled 7_Lib1_CW1-sand_D_D30 CW1-sand 1/30/2023 winter D 30
1_CW1_D_D50_1 Lib1_CW1-sand_D_D50_30_Jan_23_B/B1filled 8_Lib1_CW1-sand_D_D50 CW1-sand 1/30/2023 winter D 50
1_CW2_A_D50_2 Lib1_CW2-biochar_A_D50_30_Jan_23_B/B1filled 9_Lib1_CW2-biochar_A_D50 CW2-biochar 1/30/2023 winter A 50
1_CW2_A_D70_1 Lib1_CW2-biochar_A_D70_30_Jan_23_B/B1filled 10_Lib1_CW2-biochar_A_D70 CW2-biochar 1/30/2023 winter A 70

Preprocessing steps

Cleaning taxonomy table

In the process of 16S gene amplification some eukaryotic sequences, like chloroplasts or mitochondria, are likely to be amplified and should be identified and discarded (Hanshew et al., 2013). The code below shows the number of ASVs annotated as “o__Chloroplast” and “f__Mitochondria” and calculates the fraction of these ASVs. Before filtering out contaminant sequences, it is important to know their proportion to have a better overview of the data.

Chloroplast <- ntaxa(subset_taxa(ps, Order == "o__Chloroplast"))
Mitochondria <- ntaxa(subset_taxa(ps, Family == "f__Mitochondria"))
chl_mt_df<- data.frame(Taxa = c("Chloroplast", "Mitochondria"),
                       nASVs = c(Chloroplast, Mitochondria)) %>%
  mutate(fraction = round(nASVs/ntaxa(ps), 3))
chl_mt_df
##           Taxa nASVs fraction
## 1  Chloroplast    48    0.012
## 2 Mitochondria     8    0.002

The code below will discard ASVs classified as Mitochondria and Chloroplasts.

ps_filt<- subset_taxa(ps, Order != "o__Chloroplast")
ps_filt<- subset_taxa(ps_filt, Family != "f__Mitochondria")

The primers used for 16S rRNA amplification are designed to target both Bacteria and Archaea and therefore, it’s expected to find both in the dataset. Additionally, in the taxonomy table there might be other ASVs that are not annotated to any known Domain (“NA”). The table below shows the number of ASVs and their fraction classified at the Domain level, including ASVs of unknown Domain.

ps_filt@tax_table %>% data.frame() %>% group_by(Domain) %>% dplyr::count() %>%
  mutate(fraction = round(n/ntaxa(ps_filt), 3))
## # A tibble: 3 × 3
## # Groups:   Domain [3]
##   Domain          n fraction
##   <chr>       <int>    <dbl>
## 1 NA             55    0.014
## 2 k__Archaea     47    0.012
## 3 k__Bacteria  3916    0.975

There are 55 ASVs (1.4%) not assigned to any Domain and 47 ASVs (1.2%) assigned to Archaea. You may check the ASVs sequences for unassigned taxa by searching in the “all_otus.fasta” and then Blast them to make sure these are not bacteria. The code below parses the sequences of unassigned ASVs into a separate file that can be uploaded into BLAST.

all_asvs <- readDNAStringSet("all_otus.fasta")
unassigned_asvs <- taxa_names(subset_taxa(ps, Domain == "NA") %>% tax_table())
all_asvs_df <- as.data.frame(all_asvs)
unassigned_asvs <- all_asvs_df %>% filter(row.names(all_asvs_df) %in% unassigned_asvs) %>%
       rownames_to_column(var = "seq.name") %>% `colnames<-`(c("seq.name", "seq.text"))
dat2fasta(unassigned_asvs, outfile = "out.fasta")
## out.fasta has been saved to  C:/Users/malin009/OneDrive - Wageningen University & Research/Dry_lab_intro/tutorials/16S/wetlands

The blast is done against “core nucleotide database” and we advise to limit maximum target sequences to 10, this way it is easier to go through all the data. To upload the result in R, a result of blast in csv format should be downloaded (here: “out.csv”). It is not possible to download result with taxonomy description, therefore packages rentrez and taxize will be used to retrieve this information. They use API to get the data from ncbi website.

blast_df <- read.csv("out.csv", header = FALSE)
colnames(blast_df) <- c("ASV", "sseqid", "per.ident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")
sseqid<- unique(blast_df$sseqid)
taxa<- genbank2uid(sseqid)
taxa_df<- ldply(taxa) %>% dplyr::rename(id = V1) %>% mutate(GenBank = sseqid, name = sapply(taxa, function(x) attr(x, "name")))
tax_ids<-unique(taxa_df$id)
tax_summary<- entrez_summary(tax_ids, db = "taxonomy")
entrez_summary<-ldply(tax_summary, data.frame)
entrez_summary_sel<-entrez_summary %>%select(taxid, division, scientificname, genus, species, genbankdivision) %>% mutate(taxid = as.character(taxid))
blast_result<- blast_df %>% left_join(taxa_df, by=c("sseqid" = "GenBank")) %>% left_join(entrez_summary_sel, by=c("id" = "taxid"))

In order to decide whether each ASV comes from bacteria or not we will leave only the top result for each ASV (the one that has the lowest E value, which is the number of expected hits of similar quality (score) that could be found just by chance). Information about classification of taxa is stored as “division”. To simplify the output we will add a new column which will inform whether a sequence comes from bacteria, archae or eukaryota. All remaining ones are “unclassified sequences”. First we need to check what unique “divisions” are in our output and decide about new, simplified grouping (Be careful - this part of code is specific to the output).

unique(blast_result$division)
##  [1] "bacteria"                        "d-proteobacteria"               
##  [3] "ascomycete fungi"                "unclassified sequences"         
##  [5] "diatoms"                         "proteobacteria"                 
##  [7] "green algae"                     "eukaryotes"                     
##  [9] "brown algae"                     "archaea"                        
## [11] "euryarchaeotes"                  "firmicutes"                     
## [13] "flowering plants"                "high G+C Gram-positive bacteria"
## [15] "CFB group bacteria"
Bacteria<- c("bacteria", "d-proteobacteria", "proteobacteria", "firmicutes", "high G+C Gram-positive bacteria", "CFB group bacteria")
Eukaryota<- c("ascomycete fungi", "diatoms", "green algae", "eukaryotes", "brown algae", "flowering plants")
Archaea<- c("archaea", "euryarchaeotes")
blast_result<- blast_result %>% mutate(Domain = case_when(division %in% Bacteria ~ "Bacteria",
                                                division %in% Eukaryota ~ "Eukaryota",
                                                division %in% Archaea ~ "Archaea",
                                                TRUE ~ division))
blast_annot<- blast_result %>% group_by(ASV) %>% slice_head(n=1)%>% select(ASV, Domain)
Bacteria_ASVs<- blast_annot %>% filter(Domain == "Bacteria") %>% pull(ASV)
Archaea_ASVs<- blast_annot %>% filter(Domain == "Archaea") %>% pull(ASV)
Eukaryota_ASVs<- blast_annot %>% filter(Domain == "Eukaryota") %>% pull(ASV)
unclassified_prokaryote_ASVs<- blast_annot %>% filter(Domain == "unclassified sequences") %>% pull(ASV)

Only ASVs annotated in blast output as bacteria will be kept in the final phyloseq object. They account for 54.5454545 % of all blasted sequences.

The code below shows how to manually change annotations in the tax_table so all ASVs that are Bacteria but not annotated with SILVA will get the Bacteria domain name. For all the other taxonomy levels the names will remain to be NA. After performing this manual annotation, we will check the the number of ASVs in each Domain.

ps_filt<-ps_filt %>% tax_mutate(ASV = taxa_names(ps_filt)) %>%
       tax_mutate(Domain = case_when(ASV %in% Bacteria_ASVs ~ "k__Bacteria",
                                     ASV %in% Archaea_ASVs ~ "k__Archaea",
                                     ASV %in% Eukaryota_ASVs ~ "k__Eukaryota",
                                     ASV %in% unclassified_prokaryote_ASVs ~ "k__unclassified_prokaryota",
                                               TRUE ~ Domain))
ps_filt@tax_table %>% data.frame() %>% group_by(Domain) %>% dplyr::count() %>%
  mutate(fraction = round(n/ntaxa(ps_filt), 3))
## # A tibble: 4 × 3
## # Groups:   Domain [4]
##   Domain                         n fraction
##   <chr>                      <int>    <dbl>
## 1 k__Archaea                    49    0.012
## 2 k__Bacteria                 3946    0.982
## 3 k__Eukaryota                  16    0.004
## 4 k__unclassified_prokaryota     7    0.002

Depending on the aim of the study, ASVs belonging to both Archaea and Bacteria might be of interest. If the project focus is on bacteria only, all ASVs that are not bacteria should be filtered out. It is important to keep all taxa that are bacteria, otherwise it can disturb the relative abundance of each taxon. For the purpose of this tutorial, we decided to leave only Bacteria in the dataset.

ps_filt<- subset_taxa(ps_filt, Domain == "k__Bacteria")
ps_filt@tax_table %>% data.frame() %>% group_by(Phylum) %>% dplyr::count() %>%
  arrange(desc(n)) %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "800px", height = "300px")
Phylum n
p__Proteobacteria 1117
p__Planctomycetota 438
p__Bacteroidota 390
p__Acidobacteriota 320
p__Chloroflexi 304
p__Verrucomicrobiota 214
p__Actinobacteriota 208
p__Firmicutes 203
p__Myxococcota 150
p__Desulfobacterota 128
p__Gemmatimonadota 77
p__Spirochaetota 52
p__Nitrospirota 41
p__Latescibacterota 39
p__Bdellovibrionota 33
p__Cyanobacteria 31
p__ 30
p__Patescibacteria 24
p__Armatimonadota 23
p__Fibrobacterota 16
p__Zixibacteria 15
p__SAR324_clade(Marine_group_B) 11
p__Elusimicrobiota 10
p__FCPU426 10
p__Dependentiae 9
p__MBNT15 9
p__RCP2-54 9
p__Entotheonellaeota 7
p__Methylomirabilota 4
p__Sumerlaeota 4
p__Campylobacterota 2
p__Cloacimonadota 2
p__PAUC34f 2
p__WPS-2 2
p__AncK6 1
p__DTB120 1
p__Deferrisomatota 1
p__Deinococcota 1
p__Fermentibacterota 1
p__Fusobacteriota 1
p__Hydrogenedentes 1
p__Margulisbacteria 1
p__NB1-j 1
p__Sva0485 1
p__Synergistota 1
p__WS2 1

Fixing taxonomy table

Now, the table includes only ASVs classified to a known domain. For other taxonomic levels, there are some unassigned ASVs. The following code will replace the uninformative or unknown entries (“unknowns”) with information from the highest known taxonomic rank (“classified”). It is important to understand the new annotation of taxa. Therefore, we recommend further reading of the tax_fix documentation. Important to understand is that when a taxon is annotated with the full name of another taxonomic rank (for example a genus is annotated with f__Enterobacteriaceae Family), it means that this taxon (genus in this example) is composed of all taxa annotated to the lower taxonomic rank (highest known - in this case Enterobacteriaceae family) but not annotated with the analyzed taxonomic rank (in this case genus).

ps_filt_fixed<- ps_filt %>%
 tax_fix(
  min_length = 4,
  unknowns = c("g____", "f____", "o____", "g__uncultured_bacterium", "g__uncultured", "f__uncultured", "f__uncultured_bacterium", "o__uncultured", "o__uncultured_bacterium","o__uncultured_planctomycete", "o__uncultured_soil_bacterium", "c__uncultured", "c__uncultured_bacterium", "c__metagenome", "f__uncultured_archaeon", "f__uncultured_bacterium", "f__uncultured_euryarchaeote", "c__uncultured_bacterium Class", "f__metagenome Family", "o__metagenome", "o__uncultured_Chloroflexi_bacterium", "f__metagenome", "f__uncultured_Acidobacteria_bacterium", "f__uncultured_soil_bacterium", "f__uncultured_actinobacterium", "f__uncultured_cyanobacterium", "f__uncultured_delta_proteobacterium", "f__uncultured_gamma_proteobacterium", "f__uncultured_organism", "g__metagenome", "g__uncultured_Acidobacteria_bacterium", "g__uncultured_Cytophagales_bacterium", "g__uncultured_soil_bacterium", "g__uncultured_Bacteroidetes_bacterium", "g__uncultured_Chloroflexi_bacterium", "g__uncultured_organism", "g__uncultured_planctomycete"),
  sep = " ", anon_unique = TRUE,
  suffix_rank = "classified"
 )

The genus names look like that right now:

ps_filt_fixed@tax_table %>%
  data.frame() %>%
  arrange(Genus)%>%
  head(n=10) %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "800px", height = "300px")
Domain Phylum Class Order Family Genus Species ASV
k__Bacteria p__Nitrospirota c__4-29-1 c__4-29-1 Class c__4-29-1 Class c__4-29-1 Class c__4-29-1 Class 1632943206
k__Bacteria p__Actinobacteriota c__Acidimicrobiia c__Acidimicrobiia Class c__Acidimicrobiia Class c__Acidimicrobiia Class c__Acidimicrobiia Class 1632943971
k__Bacteria p__Actinobacteriota c__Acidimicrobiia c__Acidimicrobiia Class c__Acidimicrobiia Class c__Acidimicrobiia Class c__Acidimicrobiia Class 1632943272
k__Bacteria p__Actinobacteriota c__Acidimicrobiia c__Acidimicrobiia Class c__Acidimicrobiia Class c__Acidimicrobiia Class c__Acidimicrobiia Class 1632942591
k__Bacteria p__Acidobacteriota c__Acidobacteriae c__Acidobacteriae Class c__Acidobacteriae Class c__Acidobacteriae Class c__Acidobacteriae Class 163294725
k__Bacteria p__Acidobacteriota c__Acidobacteriae c__Acidobacteriae Class c__Acidobacteriae Class c__Acidobacteriae Class c__Acidobacteriae Class 1632943056
k__Bacteria p__Actinobacteriota c__Actinobacteria c__Actinobacteria Class c__Actinobacteria Class c__Actinobacteria Class c__Actinobacteria Class 1632942355
k__Bacteria p__Proteobacteria c__Alphaproteobacteria c__Alphaproteobacteria Class c__Alphaproteobacteria Class c__Alphaproteobacteria Class c__Alphaproteobacteria Class 1632941537
k__Bacteria p__Proteobacteria c__Alphaproteobacteria c__Alphaproteobacteria Class c__Alphaproteobacteria Class c__Alphaproteobacteria Class c__Alphaproteobacteria Class 1632942154
k__Bacteria p__Proteobacteria c__Alphaproteobacteria c__Alphaproteobacteria Class c__Alphaproteobacteria Class c__Alphaproteobacteria Class s____ 1632942601

Quality control

It is highly recommended to have both negative and positive controls in the sequenced set of samples. Negative controls are crucial for identifying contaminants from reagents (PCR reagents, extraction kits), used equipment or introduced during sample preparation. Typically, they are no-template controls where water used during DNA extraction or PCR amplification is added instead. Negative controls are expected to have amount of reads close to zero. Positive controls are important for ensuring the accuracy and reliability of the sequencing process. In this dataset, mock communities are used as positive controls. Mock communities are synthetic communities of known composition, in this case typical for human gut, used to benchmark NG-Tax (Mock 3 and 4).

Calculating sum of reads

The table below shows the number of reads per sample. It is arranged in ascending order. Visualizing the sum of reads per sample will help in case you have high number of samples.

sum_of_reads<- as.data.frame(sample_sums(ps_filt_fixed@otu_table)) %>% rownames_to_column(var = "sampleID") %>%
 dplyr::rename(sum_of_reads = `sample_sums(ps_filt_fixed@otu_table)`)%>%
  left_join((metadata %>% rownames_to_column(var = "sampleID")), by = "sampleID")
head((sum_of_reads %>% arrange(sum_of_reads)),n=20)%>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "800px", height = "300px")
sampleID sum_of_reads sample.description sample.name observation.unit.identifier collection.date season location depth
Water_5 14 Control_with_sterile_water_only_to_check Water_control Negative_Control summer NA
Water_3 138 Control_with_sterile_water_only_to_check Water_control Negative_Control summer NA
Water_1 202 Control_with_sterile_water_only_to_check Water_control Negative_Control summer NA
Water_2 724 Control_with_sterile_water_only_to_check Water_control Negative_Control summer NA
Water_4 1259 Control_with_sterile_water_only_to_check Water_control Negative_Control summer NA
1_CW2_D_D93_1 4686 Lib1_CW2-biochar_D_D93_30_Jan_23_B/B1filled 15_Lib1_CW2-biochar_D_D93 CW2-biochar 1/30/2023 winter D 93
2_CW3_D_50_2 5829 Lib2_CW3-biochar_D_50_20_June_23_Sandfilled 57_Lib2_CW3-biochar_D_50 CW3-biochar 6/20/2023 summer D 50
1_CWI_D_D90_2 6018 Lib1_CW1-sand_D_D90_30_Jan_23_B/B2filled 32_Lib1_CW1-sand_D_D90 CW1-sand 1/30/2023 winter D 90
1_CWI_D_D70_2 6614 Lib1_CW1-sand_D_D70_30_Jan_23_B/B2filled 30_Lib1_CW1-sand_D_D70 CW1-sand 1/30/2023 winter D 70
2_CW1_A_50_2 8087 Lib2_CW1-sand_A_50_20_June_23_B/B2filled 34_Lib2_CW1-sand_A_50 CW1-sand 6/20/2023 summer A 50
2_CW3_A_90_1 9219 Lib2_CW3-biochar_A_90_20_June_23_Sandfilled 54_Lib2_CW3-biochar_A_90 CW3-biochar 6/20/2023 summer A 90
Mock3 9307 Mock_3_Community_experimental Mock_3_Community Mock_Community summer NA
2_CW1_D_70_1 9434 Lib2_CW1-sand_D_70_20_June_23_B/B2filled 40_Lib2_CW1-sand_D_70 CW1-sand 6/20/2023 summer D 70
1_CW1_A_D90_1 9448 Lib1_CW1-sand_A_D90_30_Jan_23_Sandfilled 6_Lib1_CW1-sand_A_D90 CW1-sand 1/30/2023 winter A 90
2_CW1_D_50_2 9919 Lib2_CW1-sand_D_50_20_June_23_B/B2filled 39_Lib2_CW1-sand_D_50 CW1-sand 6/20/2023 summer D 50
2_CW2_D50_1 10470 Lib2_CW2-biochar_D50__20_June_23_Sandfilled 49_Lib2_CW2-biochar_D50_ CW2-biochar 6/20/2023 summer D 50
1_CW3_A_D70_2 11188 Lib1_CW3-biochar_A_D70_30_Jan_23_B/B2filled 19_Lib1_CW3-biochar_A_D70 CW3-biochar 1/30/2023 winter A 70
2_CW1_A70_1 12823 Lib2_CW1-sand_A70__20_June_23_B/B2filled 37_Lib2_CW1-sand_A70_ CW1-sand 6/20/2023 summer A 70
1_CW3_D_D50_1 12976 Lib1_CW3-biochar_D_D50_30_Jan_23_B/B2filled 22_Lib1_CW3-biochar_D_D50 CW3-biochar 1/30/2023 winter D 50
2_CW3_A_70_2 13035 Lib2_CW3-biochar_A_70_20_June_23_Sandfilled 53_Lib2_CW3-biochar_A_70 CW3-biochar 6/20/2023 summer A 70
sum_of_reads %>% arrange(sum_of_reads) %>% filter(!sampleID %in% c("refmock3", "refmock4"))%>%
  ggplot(aes(x = reorder(sampleID, sum_of_reads), y = sum_of_reads, fill = observation.unit.identifier)) +
  geom_point(shape = 21, size = 3, color = "black")+
  scale_fill_nejm()+
  labs(x = "sample ID")+
  theme_bw()+
  theme(axis.text.x=element_text(angle=90))

As shown on the plot and in the table -negative controls have the lowest read count read count, which is 14, 138, 208, 724 and 1259. Since all experimental samples have higher read count than controls - none of the samples is discarded.

Inspecting positive control - mock 3

First, a check will be done to see whether the mocks that were processed and sequenced in the laboratory, resemble the expected (theoretical) values and can reproduce the theoretical composition. If NG-Tax pipeline was used - the composition of theoretical mocks is added during the NG-Tax sample processing (provided this option was marked in the NG-Tax options) and is present in the output biom file. If another pipeline or another mocks are used, the expected abundances of taxa in these mocks needs to be attached to the abundance table of the sequenced mocks. We advice to work on genus level and attach it to the ‘melted’ phyloseq data frame (the object created with `ps_melt’ function - mock3_melt_Genus_rel or mock4_melt_rel), which is the so called long table. A function that might be useful is the ?bind_rows from dplyr package. The names of the columns in the table that will be attached needs to be the same as the names in the melted phyloseq. Mock 3 and 4 are different in composition therefore they need to be analyzed separately. To do it a separate phyloseq with mock3 samples will be created. Then a barplot will be printed to visually assess similarity of the composition of experimental and theoretical mocks. In the code we select the top taxa top plot (n=25), so that the plot is not too crowded.

ps_mock3<- ps_filt_fixed %>% ps_filter(sample.name == "Mock_3_Community")
ps_mock3_Genus_rel<- ps_mock3 %>% tax_transform("compositional", rank = "Genus")
top_taxa_mock3<- tax_top(ps_mock3_Genus_rel, n=25)
mock3_melt_Genus_rel<- ps_mock3_Genus_rel %>% psmelt() %>%
  mutate(Genus_top = case_when(Genus %in% top_taxa_mock3 ~ Genus,
                                                      TRUE ~ "other")) %>%
  mutate(Genus_top = as.factor(Genus_top)) %>%
  mutate(Genus_top = fct_relevel(Genus_top, c(top_taxa_mock3, "other")))

mock3_melt_Genus_rel %>%
  ggplot(aes(x = Sample, y = Abundance, fill = Genus_top)) +
       geom_bar(stat = "identity", color = "black")+
       scale_fill_manual(values = distinct_palette(25, pal = "brewerPlus", add = "grey70"))+
       theme_bw()+
       theme(axis.title.x = element_blank(), legend.position = "right")

Apart from barplots it is also good to see how the abundance of genera in experimental mocks correlate with the abundance in theoretical mocks. To do so, You can plot scatter plots and inspect the Spearman correlation coefficient (they are shown at the intersection of the theoretical and experimental mock abundances). Additionally density plots are shown for each variable.

mock3_melt_Genus_rel %>% select(OTU, Sample, Abundance) %>%
  pivot_wider(names_from = Sample, names_prefix = "Abundance_", values_from = "Abundance") %>%
  column_to_rownames(var = "OTU")%>% ggpairs(upper = list(continuous = wrap("cor", method = "spearman")))

Inspecting positive control - mock 4

Now the same will be done with mock 4

ps_mock4<- ps_filt_fixed %>% ps_filter(sample.name == "Mock_4_Community")

ps_mock4_Genus_rel<- ps_mock4 %>% tax_transform("compositional", rank = "Genus")
top_taxa_mock4<- tax_top(ps_mock4_Genus_rel, n=25)
mock4_melt_Genus_rel<- ps_mock4_Genus_rel %>% psmelt() %>%
  mutate(Genus_top = case_when(Genus %in% top_taxa_mock4 ~ Genus,
                                                      TRUE ~ "other")) %>%
  mutate(Genus_top = as.factor(Genus_top)) %>%
  mutate(Genus_top = fct_relevel(Genus_top, c(top_taxa_mock4, "other")))

mock4_melt_Genus_rel %>%
  ggplot(aes(x = Sample, y = Abundance, fill = Genus_top)) +
       geom_bar(stat = "identity", color = "black")+
       scale_fill_manual(values = distinct_palette(25, pal = "brewerPlus", add = "grey70"))+
       theme_bw()+
       theme(axis.title.x = element_blank(), legend.position = "right")

and correlation

mock4_melt_Genus_rel %>% select(OTU, Sample, Abundance) %>%
  pivot_wider(names_from = Sample, names_prefix = "Abundance_", values_from = "Abundance") %>%
  column_to_rownames(var = "OTU")%>% ggpairs(upper = list(continuous = wrap("cor", method = "spearman")))

Rare taxa exploration

Rare taxa refer to taxa present at low relative abundance or in a small number of samples across the data set. Rare biosphere implicates low-abundance taxa as important contributors to the assessment of a- and b-diversity (Lynch et al., 2015). However, recent microbiome quality control studies indicate that many rare taxa might be created by sequencing artifacts, errors and/or contamination in SSU rRNA gene surveys. To address the challenge of identifying the “true” origin of these taxa, various approaches including trimming, filtering and/or contaminant removal have been proposed in literature (Cao et al.,2021, Reitmeier et al., 2021, Nikodemova et al., 2023, Davis et al., 2018).

NG-Tax 2.0 follows an error-handling strategy which filters artifacts and corrects for the impact of error-reads on the relative abundance estimates. By default, a user defined relative abundance cut-off is set to 0.1% of the total number of read-pairs associated with ASVs (Poncheewin et al., 2019). Thus, the output of NG-Tax 2.0 does not need additional filtering.

Depending on the origin of the sample and/or the aim of the study, NG-Tax 2.0 parameters can be changed. In studies that aim to discover rare taxa or is expected that rare taxa contribute to a larger diversity, filtering would not be advisable.

Before making any filtering decisions, a nice way to check the taxa distribution in your data set is to visualize the prevalence-abundance distribution of ASVs. Keep in mind that the minimum detection threshold of NG-Tax is set at 0.1%.

ps_filt_fixed_samples<- ps_filt_fixed %>%
  ps_filter(!`observation.unit.identifier` %in% c("Mock_Community", "Negative_Control"))

taxa_prev_plot <- plot_taxa_prevalence(ps_filt_fixed_samples, "Phylum",
                                       detection = 0)
taxa_prev_plot

Check for singletons and doubletons

Another common practice when working with this type of data is to check the number of ASVs represented by only one (singletons) or two (doubletons) sequences respectively, across all samples analyzed.

NG-Tax 2.0 error-handling assumes that singleton read-pairs that did not match existing ASV object due to a single mismatch (mutation or indel) is probably due to a random sequence error and is joined with the particular ASV. Singletons showing more than one mismatch are considered as sample specific noise and discarded.

length(which(taxa_sums(ps_filt_fixed_samples) <= 1))
## [1] 0
length(which(taxa_sums(ps_filt_fixed_samples) <= 2))
## [1] 0

In this dataset, there are no singletons (because NG-Tax filters them out) or doubletons so there is no need for filtering.

In case other bioinformatic pipelines were used for ASV generation, different approaches on how to deal with singletons and/or doubletons might be necessary before downstream analysis.

Find out most or least abundant taxa in dataset

Let’s inspect the total abundance of each taxon (ASV) across all samples in the dataset. This is achieved with the taxa_sums function. To find out which taxa are the least abundant across the dataset, use the following code.

taxa_sums(ps_filt_fixed_samples) %>%
  data.frame() %>%
  dplyr::rename(sum = ".") %>%
  arrange(sum)%>%
  head(n=20) %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "200px", height = "300px")
sum
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5

Data normalization

Microbiome data is inherently compositional- meaning that sequencing only provides information on the relative abundance of taxa and that the observed abundance of each taxon is dependent on the observed abundances of all other taxa- and this poses challenges to downstream statistical analysis. To address this, normalization is employed to scale the data to a common and comparable level. Thanks to data normalization, the differential abundance analysis is not confounded by the differences in the sampling fractions (which is unknown and differs per sample). Failure to normalize the data will result in a systematic bias that increases the false discovery rate (FDR) and also possible loss of power in some cases.

Below we show two methods of normalization, however other common approaches for normalization (such as scaling of counts, parameter estimation, etc.) may be applied. as exists. To know more, here is a nice overview of the available normalization methods and how they perform on comparing ecological communities by McKnight et al (2019) and by Lin & Peddada (2020).

Total-sum scaling

A common method is Total-Sum Scaling (TSS) (or compositional) which transforms the absolute ASV counts into relative abundance values. For more information, check out the articles: Zhou et al. (2023) and Weiss et al. (2017).

ps_filt_fixed_samples_rel<- tax_transform(ps_filt_fixed_samples, trans = "compositional")
head(ps_filt_fixed_samples_rel@otu_table, n=10) %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "800px", height = "300px")
1_CW1_A_D30_1 1_CW1_A_D50_1 1_CW1_A_D50_2 1_CW1_A_D70_1 1_CW1_A_D70_2 1_CW1_A_D90_1 1_CW1_D_D30_1 1_CW1_D_D50_1 1_CW2_A_D50_2 1_CW2_A_D70_1 1_CW2_A_D93_1 1_CW2_A_D93_2 1_CW2_D_D50_1 1_CW2_D_D70_1 1_CW2_D_D93_1 1_CW3_A_D50_1 1_CW3_A_D50_2 1_CW3_A_D70_1 1_CW3_A_D70_2 1_CW3_A_D93_1 1_CW3_A_D93_2 1_CW3_D_D50_1 1_CW3_D_D70_1 1_CW3_D_D93_1 1_CW3_D_D93_2 1_CW3_D_D93_3 1_CW3_D50_1 1_CW3_D70_1 1_CWI_D_D70_1 1_CWI_D_D70_2 1_CWI_D_D90_1 1_CWI_D_D90_2 2_CW1_A_50_1 2_CW1_A_50_2 2_CW1_A_90_1 2_CW1_A_90_2 2_CW1_A70_1 2_CW1_D_50_1 2_CW1_D_50_2 2_CW1_D_70_1 2_CW1_D90_1 2_CW2_A_50_1 2_CW2_A_50_2 2_CW2_A_70_1 2_CW2_A_90_1 2_CW2_D_50_1 2_CW2_D_70_1 2_CW2_D_70_2 2_CW2_D50_1 2_CW3_A_50_1 2_CW3_A_50_2 2_CW3_A_70_1 2_CW3_A_70_2 2_CW3_A_90_1 2_CW3_A_90R_1 2_CW3_D_50_1 2_CW3_D_50_2 2_CW3_D_70_1 2_CW3_D_90_1 2_CW3_D_90_2
1632942152 0 0.000000000 0.000000000 0 0 0.000000000 0 0 0 0.000000000 0 0 0 0 0 0 0 0.00116556 0.001072578 0.00000000 0.00000000 0 0 0.000000000 0 0 0 0 0 0 0 0 0.000000000 0.000000000 0.00000000 0.00000000 0 0 0.000000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000000
1632941008 0 0.000000000 0.000000000 0 0 0.001587638 0 0 0 0.000000000 0 0 0 0 0 0 0 0.00000000 0.000000000 0.00000000 0.00000000 0 0 0.000000000 0 0 0 0 0 0 0 0 0.000000000 0.000000000 0.00000000 0.00000000 0 0 0.000000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000000
1632942677 0 0.000000000 0.000000000 0 0 0.000000000 0 0 0 0.000000000 0 0 0 0 0 0 0 0.00000000 0.000000000 0.00000000 0.00000000 0 0 0.000000000 0 0 0 0 0 0 0 0 0.007094396 0.000000000 0.00000000 0.00000000 0 0 0.000000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000000
1632941543 0 0.000000000 0.000000000 0 0 0.000000000 0 0 0 0.001680672 0 0 0 0 0 0 0 0.00000000 0.000000000 0.00000000 0.00000000 0 0 0.000000000 0 0 0 0 0 0 0 0 0.000000000 0.000000000 0.00000000 0.00000000 0 0 0.000000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000000
1632942189 0 0.000000000 0.000000000 0 0 0.000000000 0 0 0 0.000000000 0 0 0 0 0 0 0 0.00000000 0.002681444 0.00349239 0.00150585 0 0 0.000000000 0 0 0 0 0 0 0 0 0.000000000 0.000000000 0.00000000 0.00000000 0 0 0.000000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000000
1632943887 0 0.000000000 0.000000000 0 0 0.000000000 0 0 0 0.000000000 0 0 0 0 0 0 0 0.00000000 0.000000000 0.00000000 0.00000000 0 0 0.000000000 0 0 0 0 0 0 0 0 0.000000000 0.000000000 0.00000000 0.00000000 0 0 0.000000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.001646132
1632942431 0 0.000000000 0.000000000 0 0 0.000000000 0 0 0 0.000000000 0 0 0 0 0 0 0 0.00000000 0.000000000 0.00000000 0.00000000 0 0 0.001100968 0 0 0 0 0 0 0 0 0.000000000 0.000000000 0.00000000 0.00000000 0 0 0.000000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000000
163294897 0 0.000000000 0.000000000 0 0 0.011113463 0 0 0 0.000000000 0 0 0 0 0 0 0 0.00000000 0.000000000 0.00000000 0.00000000 0 0 0.000000000 0 0 0 0 0 0 0 0 0.000000000 0.001607518 0.01642879 0.01381361 0 0 0.001411433 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000000
163294951 0 0.000000000 0.000000000 0 0 0.002434378 0 0 0 0.000000000 0 0 0 0 0 0 0 0.00000000 0.000000000 0.00000000 0.00000000 0 0 0.000000000 0 0 0 0 0 0 0 0 0.000000000 0.000000000 0.00000000 0.00000000 0 0 0.000000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000000
163294410 0 0.002788586 0.001594434 0 0 0.000000000 0 0 0 0.000000000 0 0 0 0 0 0 0 0.00000000 0.000000000 0.00000000 0.00000000 0 0 0.000000000 0 0 0 0 0 0 0 0 0.000000000 0.000000000 0.00000000 0.00000000 0 0 0.000000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000000

Rarefaction

Another method of data normalization is rarefaction. It addresses another challenge with 16S sequenced microbial datasets, which is the variation between the total number of sequencing reads generated for each sample. Rarefaction is random subsampling to an even sampling depth. Sampling is usually done without replacement. Rare sequences are excluded more often when sampling with replacement (Cameron et al., 2021). Rarefaction controls the effect of uneven sequencing effort (library size) across samples in a study. This normalization method is widely used, however not always advised and for example authors of phyloseq R package do not advocate using this as a normalization procedure (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003531). There are also opposite views, as described in this article. This kind of normalization might be dataset specific, as described here, therefore it should be decided cautiously whether to rarefy or not. Given the fact that this method is highly debatable, consider applying it only in cases when there is highly uneven sequencing depth across libraries. The following part indicates how to apply rarefaction and guide the user through decision making steps. However, for the purpose of this tutorial it was decided to proceed with non-rarefied data for downstream analysis.

Decision Making

First, the distribution of total sequencing reads in the different samples has to be assessed. The scatter plot in the Quality Control section shows the total number of reads per sample. The variation in total reads sequenced between samples may introduce a bias when studying microbial community composition/diversity. Rarefying samples with high read counts to a certain read depth may alleviate the limitation imposed by total read count variation between samples.To make an educated decision on the rarefaction threshold, let’s explore the data through plotting rarefaction curves and histograms of read sums.

ps_filt_fixed_samples %>%
  ggrare(step = 50, color = "observation.unit.identifier", se = FALSE)

sum_of_reads %>% filter(!sampleID %in% c("refmock3", "refmock4")) %>%
  ggplot(aes(x=sum_of_reads)) +
  geom_histogram(color = 'black', binwidth = 500, aes(fill = observation.unit.identifier))+
  scale_fill_nejm()+
  ggtitle("Histogram of Read Sums (binwidth = 500)")+
  labs(x = "Read sums")+
  theme_bw()+
  theme(legend.position = "bottom", legend.title = element_blank())

sum_of_reads %>% filter(!sampleID %in% c("refmock3", "refmock4")) %>%
  filter(sum_of_reads<20000)%>%
  ggplot(aes(x=sum_of_reads)) +
  geom_histogram(color = 'black', binwidth = 1000, aes(fill = observation.unit.identifier))+
  scale_fill_nejm()+
  ggtitle("Histogram of Read Sums < 20000 Reads (binwidth = 500)")+
  labs(x = "Read sums")+
  theme_bw()+
  theme(legend.position = "bottom", legend.title = element_blank())

sum_of_reads %>% filter(!observation.unit.identifier %in% c("Mock_Community", "Negative_Control")) %>%
  bind_cols(estimate_richness(ps_filt_fixed_samples, measures = "Observed")) %>%
  remove_rownames() %>%
  select(sampleID, sum_of_reads, Observed) %>% arrange(sum_of_reads, Observed) %>% 
  kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "800px", height = "300px")
sampleID sum_of_reads Observed
1_CW2_D_D93_1 4686 316
2_CW3_D_50_2 5829 395
1_CWI_D_D90_2 6018 459
1_CWI_D_D70_2 6614 385
2_CW1_A_50_2 8087 403
2_CW3_A_90_1 9219 272
2_CW1_D_70_1 9434 397
1_CW1_A_D90_1 9448 280
2_CW1_D_50_2 9919 330
2_CW2_D50_1 10470 392
1_CW3_A_D70_2 11188 339
2_CW1_A70_1 12823 432
1_CW3_D_D50_1 12976 243
2_CW3_A_70_2 13035 272
1_CW3_D_D93_2 13140 282
1_CW1_A_D50_2 13798 397
1_CW3_D50_1 14322 256
1_CW1_A_D70_2 14898 388
2_CW2_A_90_1 15009 196
1_CW1_A_D50_1 15420 389
2_CW3_A_50_2 15618 391
1_CW3_D_D70_1 15896 307
1_CW2_A_D50_2 15992 159
1_CW1_D_D50_1 16149 331
1_CW2_D_D70_1 16410 277
1_CW3_A_D50_1 16419 301
1_CW3_A_D93_2 17266 309
2_CW2_D_70_1 17323 384
1_CWI_D_D90_1 17329 416
2_CW2_D_50_1 17597 377
1_CWI_D_D70_1 18044 404
1_CW1_D_D30_1 18635 414
1_CW3_D_D93_3 18829 281
1_CW2_D_D50_1 18902 269
1_CW3_D70_1 19089 289
1_CW3_A_D70_1 19733 375
2_CW3_D_90_2 20047 174
1_CW2_A_D70_1 20230 272
2_CW1_A_90_2 20849 153
1_CW3_D_D93_1 21799 292
2_CW3_A_50_1 22419 402
2_CW3_D_70_1 22746 297
2_CW1_A_90_1 23191 152
1_CW2_A_D93_2 23318 324
2_CW2_A_50_2 23786 295
1_CW1_A_D30_1 24377 368
2_CW2_D_70_2 24403 282
2_CW1_D90_1 24538 193
1_CW3_A_D50_2 24689 295
2_CW3_A_90R_1 24836 279
1_CW1_A_D70_1 26491 342
1_CW3_A_D93_1 27202 293
1_CW2_A_D93_1 28109 307
2_CW1_D_50_1 28328 302
2_CW2_A_70_1 29918 253
2_CW1_A_50_1 32279 332
2_CW3_D_90_1 32684 186
2_CW3_A_70_1 33363 271
2_CW3_D_50_1 33804 383
2_CW2_A_50_1 48326 287

The table above shows a summary of taxa richness (number of Observed ASVs) and total reads per sample. To choose the rarefaction depth the safest approach it the smallest sequencing depth, in the case of this dataset is the read sum of sample 1_CW2_D_D93_1 (324 ASVs).

Plotting rarefaction curves illustrates that the sample 1_CW2_A_D50_2 is the least ASV-rich sample (161 ASVs) with 15,997 reads in total.

Plotting histograms, and inspecting read sum distribution illustrates that there is a gap in read sum distribution between 13,024 and 11,411. Therefore 13,024 emerges as a good cut-off for rarefaction.

On one hand, cutoffs 5,198 and 13,024 can attenuate the ASV richness variation between high read count samples and the least ASV rich sample “1_CW2_A_D50_2” due to library size variation. On the other hand, 5,198 could be a too conservative rarefaction cutoff leading to loss of species richness in samples with higher read counts, but with less ASV richness, such as the sample 1_CW2_A_D50_2 with 15,997 reads. As a result, we decide to use 13,024 reads as our rarefaction cut-off.

Rule of thumb: Rarefy to a depth that retains most of the data without sacrificing too much biological information (samples and/or ASVs).

Rarefying

The code below rarefies the dataset to a depth of 13,024 reads and plots rarefaction curves.

ps_filt_fixed_samples_raref<- rarefy_even_depth(ps_filt_fixed_samples, sample.size = 13024, replace = FALSE)

ps_filt_fixed_samples_raref %>% ggrare(step = 50, color = "observation.unit.identifier", se = FALSE)
## rarefying sample 1_CW1_A_D30_1
## rarefying sample 1_CW1_A_D50_1
## rarefying sample 1_CW1_A_D50_2
## rarefying sample 1_CW1_A_D70_1
## rarefying sample 1_CW1_A_D70_2
## rarefying sample 1_CW1_D_D30_1
## rarefying sample 1_CW1_D_D50_1
## rarefying sample 1_CW2_A_D50_2
## rarefying sample 1_CW2_A_D70_1
## rarefying sample 1_CW2_A_D93_1
## rarefying sample 1_CW2_A_D93_2
## rarefying sample 1_CW2_D_D50_1
## rarefying sample 1_CW2_D_D70_1
## rarefying sample 1_CW3_A_D50_1
## rarefying sample 1_CW3_A_D50_2
## rarefying sample 1_CW3_A_D70_1
## rarefying sample 1_CW3_A_D93_1
## rarefying sample 1_CW3_A_D93_2
## rarefying sample 1_CW3_D_D70_1
## rarefying sample 1_CW3_D_D93_1
## rarefying sample 1_CW3_D_D93_2
## rarefying sample 1_CW3_D_D93_3
## rarefying sample 1_CW3_D50_1
## rarefying sample 1_CW3_D70_1
## rarefying sample 1_CWI_D_D70_1
## rarefying sample 1_CWI_D_D90_1
## rarefying sample 2_CW1_A_50_1
## rarefying sample 2_CW1_A_90_1
## rarefying sample 2_CW1_A_90_2
## rarefying sample 2_CW1_D_50_1
## rarefying sample 2_CW1_D90_1
## rarefying sample 2_CW2_A_50_1
## rarefying sample 2_CW2_A_50_2
## rarefying sample 2_CW2_A_70_1
## rarefying sample 2_CW2_A_90_1
## rarefying sample 2_CW2_D_50_1
## rarefying sample 2_CW2_D_70_1
## rarefying sample 2_CW2_D_70_2
## rarefying sample 2_CW3_A_50_1
## rarefying sample 2_CW3_A_50_2
## rarefying sample 2_CW3_A_70_1
## rarefying sample 2_CW3_A_70_2
## rarefying sample 2_CW3_A_90R_1
## rarefying sample 2_CW3_D_50_1
## rarefying sample 2_CW3_D_70_1
## rarefying sample 2_CW3_D_90_1
## rarefying sample 2_CW3_D_90_2

Data transformation

Normalized data still exhibit skewed distributions, unequal variances, and extreme values, which may limit their effectiveness in situations with significant heterogeneity. Therefore, transforming data might be necessary before performing certain statistical tests (e.g. differential abundance analysis). The selection of the appropriate transformation method depends on the chosen statistical methods. It also helps when illustrating patterns involving both low and high abundance taxa. There are different types of data transformations, such as additive log-ratio (alr), centered log-ratio (clr) and robust centered log-ratio (rclr).

See also Nearing et al.(2022) and Wang et al. (2024).

ps_filt_fixed_samples_rel_rclr<- tax_transform(ps_filt_fixed_samples_rel, trans = "rclr", chain = TRUE)
head(ps_filt_fixed_samples_rel_rclr@otu_table, n=10) %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "800px", height = "300px")
1_CW1_A_D30_1 1_CW1_A_D50_1 1_CW1_A_D50_2 1_CW1_A_D70_1 1_CW1_A_D70_2 1_CW1_A_D90_1 1_CW1_D_D30_1 1_CW1_D_D50_1 1_CW2_A_D50_2 1_CW2_A_D70_1 1_CW2_A_D93_1 1_CW2_A_D93_2 1_CW2_D_D50_1 1_CW2_D_D70_1 1_CW2_D_D93_1 1_CW3_A_D50_1 1_CW3_A_D50_2 1_CW3_A_D70_1 1_CW3_A_D70_2 1_CW3_A_D93_1 1_CW3_A_D93_2 1_CW3_D_D50_1 1_CW3_D_D70_1 1_CW3_D_D93_1 1_CW3_D_D93_2 1_CW3_D_D93_3 1_CW3_D50_1 1_CW3_D70_1 1_CWI_D_D70_1 1_CWI_D_D70_2 1_CWI_D_D90_1 1_CWI_D_D90_2 2_CW1_A_50_1 2_CW1_A_50_2 2_CW1_A_90_1 2_CW1_A_90_2 2_CW1_A70_1 2_CW1_D_50_1 2_CW1_D_50_2 2_CW1_D_70_1 2_CW1_D90_1 2_CW2_A_50_1 2_CW2_A_50_2 2_CW2_A_70_1 2_CW2_A_90_1 2_CW2_D_50_1 2_CW2_D_70_1 2_CW2_D_70_2 2_CW2_D50_1 2_CW3_A_50_1 2_CW3_A_50_2 2_CW3_A_70_1 2_CW3_A_70_2 2_CW3_A_90_1 2_CW3_A_90R_1 2_CW3_D_50_1 2_CW3_D_50_2 2_CW3_D_70_1 2_CW3_D_90_1 2_CW3_D_90_2
1632942152 0 0.0000000 0.0000000 0 0 0.0000000 0 0 0 0.0000000 0 0 0 0 0 0 0 -0.5943949 -0.7687571 0.0000000 0.0000000 0 0 0.0000000 0 0 0 0 0 0 0 0 0.000000 0.0000000 0.000000 0.000000 0 0 0.00000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0000000
1632941008 0 0.0000000 0.0000000 0 0 -0.5779372 0 0 0 0.0000000 0 0 0 0 0 0 0 0.0000000 0.0000000 0.0000000 0.0000000 0 0 0.0000000 0 0 0 0 0 0 0 0 0.000000 0.0000000 0.000000 0.000000 0 0 0.00000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0000000
1632942677 0 0.0000000 0.0000000 0 0 0.0000000 0 0 0 0.0000000 0 0 0 0 0 0 0 0.0000000 0.0000000 0.0000000 0.0000000 0 0 0.0000000 0 0 0 0 0 0 0 0 1.051638 0.0000000 0.000000 0.000000 0 0 0.00000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0000000
1632941543 0 0.0000000 0.0000000 0 0 0.0000000 0 0 0 -0.3715704 0 0 0 0 0 0 0 0.0000000 0.0000000 0.0000000 0.0000000 0 0 0.0000000 0 0 0 0 0 0 0 0 0.000000 0.0000000 0.000000 0.000000 0 0 0.00000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0000000
1632942189 0 0.0000000 0.0000000 0 0 0.0000000 0 0 0 0.0000000 0 0 0 0 0 0 0 0.0000000 0.1475337 0.3636965 -0.4370147 0 0 0.0000000 0 0 0 0 0 0 0 0 0.000000 0.0000000 0.000000 0.000000 0 0 0.00000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0000000
1632943887 0 0.0000000 0.0000000 0 0 0.0000000 0 0 0 0.0000000 0 0 0 0 0 0 0 0.0000000 0.0000000 0.0000000 0.0000000 0 0 0.0000000 0 0 0 0 0 0 0 0 0.000000 0.0000000 0.000000 0.000000 0 0 0.00000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -0.5417619
1632942431 0 0.0000000 0.0000000 0 0 0.0000000 0 0 0 0.0000000 0 0 0 0 0 0 0 0.0000000 0.0000000 0.0000000 0.0000000 0 0 -0.7354975 0 0 0 0 0 0 0 0 0.000000 0.0000000 0.000000 0.000000 0 0 0.00000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0000000
163294897 0 0.0000000 0.0000000 0 0 1.3679730 0 0 0 0.0000000 0 0 0 0 0 0 0 0.0000000 0.0000000 0.0000000 0.0000000 0 0 0.0000000 0 0 0 0 0 0 0 0 0.000000 -0.2220174 1.301646 1.148943 0 0 -0.54112 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0000000
163294951 0 0.0000000 0.0000000 0 0 -0.1504932 0 0 0 0.0000000 0 0 0 0 0 0 0 0.0000000 0.0000000 0.0000000 0.0000000 0 0 0.0000000 0 0 0 0 0 0 0 0 0.000000 0.0000000 0.000000 0.000000 0 0 0.00000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0000000
163294410 0 0.2871654 -0.2571683 0 0 0.0000000 0 0 0 0.0000000 0 0 0 0 0 0 0 0.0000000 0.0000000 0.0000000 0.0000000 0 0 0.0000000 0 0 0 0 0 0 0 0 0.000000 0.0000000 0.000000 0.000000 0 0 0.00000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0000000

Aggregating data at specific taxonomic rank

It is possible to group the ASVs in the phyloseq object at different taxonomic ranks by aggregating the counts and inspect the data depending on your research question.

ps_filt_fixed_samples_phylum<- tax_agg(ps_filt_fixed_samples, rank = "Phylum")
print(ps_filt_fixed_samples_phylum)
## psExtra object - a phyloseq object with extra slots:
## 
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 45 taxa and 60 samples ]
## sample_data() Sample Data:       [ 60 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 45 taxa by 2 taxonomic ranks ]
## 
## psExtra info:
## tax_agg = "Phylum"

To aggregate ASV counts at other taxonomic levels, set the desired rank and create a summary of the number of taxa at the different ranks.

ps_filt_fixed_samples_class<- tax_agg(ps_filt_fixed_samples, rank = "Class")
ps_filt_fixed_samples_order<- tax_agg(ps_filt_fixed_samples, rank = "Order")
ps_filt_fixed_samples_family<- tax_agg(ps_filt_fixed_samples, rank = "Family")
ps_filt_fixed_samples_genus<- tax_agg(ps_filt_fixed_samples, rank = "Genus")

summary_agg<- data.frame(Taxa = c(colnames(ps_filt_fixed_samples@tax_table)[c(2:6)], "ASVs"),
                 number_of_taxa=c(dim(ps_filt_fixed_samples_phylum@otu_table)[1], dim(ps_filt_fixed_samples_class@otu_table)[1], dim(ps_filt_fixed_samples_order@otu_table)[1], dim(ps_filt_fixed_samples_family@otu_table)[1], dim(ps_filt_fixed_samples_genus@otu_table)[1], dim(ps_filt_fixed_samples@otu_table)[1]))
print(summary_agg)
##     Taxa number_of_taxa
## 1 Phylum             45
## 2  Class            129
## 3  Order            291
## 4 Family            466
## 5  Genus            741
## 6   ASVs           3772

Downstream analysis

Data exploration

It is always good idea to visualize all of the samples and quickly verify whether everything looks as expected (no strange results, samples with mainly one taxon present, if we do not expect it)

Here, we will explore microbial community composition (at order level) grouped by season and observation unit.

To vizualize data very often barplots and heatmaps showing individual samples are used.

In these barplots, the depth of sampling is shown on x axis.

barplots_season<- ps_filt_fixed_samples%>%
  ps_mutate(season = as.factor(season)) %>%
  ps_mutate(season = fct_relevel(season, c("winter", "summer"))) %>%
  ps_arrange(season, depth)%>%
  comp_barplot(group_by = "season", tax_level = "Order", tax_transform_for_plot = "compositional", n_taxa = 25, bar_width = 0.8, merge_other = FALSE, label = "depth", sample_order = "default")

barplots_season_wrapped<-wrap_plots(barplots_season) +
  plot_layout(guides = 'collect', nrow = 2)

barplots_season_wrapped&
  facet_wrap("observation.unit.identifier", scales = "free_x")&
  theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 10), axis.ticks.y = element_blank(), legend.title = element_blank(), legend.position = "bottom", strip.text = element_text(size=12),
        legend.key.size = unit(0.4, 'cm'),
        legend.text = element_text(size=10))

On the heatmap below all samples and all taxa are plotted. The color indicates the abundance of taxa. Similar samples based on their taxa composition are clustered together (based on hierarchical clustering). The size of the heatmap was too big because taxa were too many to visualize and for this we applied an additional step of filtering based on prevalence. Only taxa belonging to an order that have over 100 counts in at least 5% of samples are visualized. This heatmap gives also the possibility to annotate samples and taxa with additional information.

cols_oui<- pal_nejm("default")(3)
names(cols_oui) <- unique(samdat_tbl(ps_filt_fixed_samples)$`observation.unit.identifier`)
cols_season<- c("lightblue", "yellow")
names(cols_season) <- unique(samdat_tbl(ps_filt_fixed_samples)$season)
cols_location<- c("black", "grey")
names(cols_location) <- unique(samdat_tbl(ps_filt_fixed_samples)$location)
cols_depth<- brewer.pal(5, name = "YlOrRd")
names(cols_depth) <- unique(samdat_tbl(ps_filt_fixed_samples)$depth)

ps_filt_fixed_samples %>%
  tax_transform("compositional", rank = "Order") %>%
  tax_filter(min_prevalence = 0.1, undetected = 500, tax_level = "Order") %>%
  comp_heatmap(colors = heat_palette(sym = FALSE, "Blue-Red 3"),
               tax_anno = taxAnnotation(
      Prev. = anno_tax_prev(bar_width = 0.3, size = grid::unit(1, "cm"))
    ),
    sample_anno = sampleAnnotation(
      Depth = anno_sample("depth"),
      Season = anno_sample("season"),
      Location = anno_sample("location"),
      Group = anno_sample("observation.unit.identifier"),
      col = list(Group = cols_oui, Season = cols_season, Location = cols_location, Depth = cols_depth), border = FALSE))

Alpha diversity calculation

Another debatable topic is whether to calculate alpha-diversity metrics on rarefied or non-rarefied data. Richness-based metrics like Observed ASVs, phylogenetic diversity (PD) and Chao1, as well as Shannon (Schloss PD, 2024) are sensitive to sequencing depth variations, because one sample might show higher number of ASVs just because it was sequenced more which makes rarefaction necessary. Considering these limitations, we decided to perform calculations on non-rarefied data and focus on metrics less sensitive to sequencing dept variations. Here, we calculated 4 indices that depend on abundance/prevalence of taxa and 1 index which incorporates phylogenetic differences between species (Faith’s phylogenetic diversity).

alpha_div<- estimate_richness(ps_filt_fixed_samples, measures = c("Observed", "Shannon", "Simpson", "InvSimpson"))
PD<- calculatePD(ps_filt_fixed_samples, justDF = TRUE) %>% pull(PD)
alpha_div<- alpha_div %>% mutate(PD = PD)
rownames(alpha_div) <- gsub("^X", "", rownames(alpha_div))
head(alpha_div, n=20) %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "700px", height = "300px")
Observed Shannon Simpson InvSimpson PD
1_CW1_A_D30_1 368 5.612523 0.9944924 181.56645 29.64795
1_CW1_A_D50_1 389 5.715326 0.9953557 215.31554 34.68714
1_CW1_A_D50_2 397 5.740467 0.9955132 222.87770 34.88421
1_CW1_A_D70_1 342 5.366343 0.9900399 100.40064 30.88764
1_CW1_A_D70_2 388 5.588234 0.9934239 152.06686 35.20985
1_CW1_A_D90_1 280 5.340239 0.9926522 136.09582 28.01250
1_CW1_D_D30_1 414 5.789785 0.9957194 233.61273 30.77747
1_CW1_D_D50_1 331 4.931737 0.9684862 31.73210 29.92374
1_CW2_A_D50_2 159 4.173833 0.9688446 32.09715 15.26653
1_CW2_A_D70_1 272 5.100249 0.9894741 95.00413 24.87157
1_CW2_A_D93_1 307 4.829395 0.9666682 30.00135 30.57384
1_CW2_A_D93_2 324 4.857781 0.9655586 29.03483 31.31262
1_CW2_D_D50_1 269 5.040975 0.9855748 69.32308 24.73393
1_CW2_D_D70_1 277 5.054119 0.9866257 74.77035 24.42412
1_CW2_D_D93_1 316 5.514564 0.9947579 190.76185 28.42853
1_CW3_A_D50_1 301 5.254104 0.9905444 105.75699 27.85874
1_CW3_A_D50_2 295 5.216545 0.9901605 101.63087 26.21226
1_CW3_A_D70_1 375 5.648368 0.9951015 204.14372 29.86439
1_CW3_A_D70_2 339 5.530080 0.9942026 172.49120 29.44992
1_CW3_A_D93_1 293 5.244995 0.9910261 111.43475 28.09640

To compare the alpha-diversity between groups, we first need to attach metadata to the table and perform statistics using the code below.

metadata_samples<- metadata %>%filter(!`observation.unit.identifier` %in% c("Mock_Community", "Negative_Control"))
metadata_samples_alpha<- bind_cols(metadata_samples, alpha_div)
metadata_samples_long<- metadata_samples_alpha%>% pivot_longer(cols= Observed:PD, names_to = "alpha_div", values_to = "value")
kruskal_test_alpha<- metadata_samples_long %>% group_by(alpha_div) %>%
    rstatix::kruskal_test(value ~ `observation.unit.identifier`) %>% mutate(p_adj_fdr = p.adjust(p, method = "fdr"))
kruskal_test_alpha %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12)
alpha_div .y. n statistic df p method p_adj_fdr
InvSimpson value 60 10.095457 2 0.00642 Kruskal-Wallis 0.0101625
Observed value 60 9.624578 2 0.00813 Kruskal-Wallis 0.0101625
PD value 60 8.232205 2 0.01630 Kruskal-Wallis 0.0163000
Shannon value 60 10.783981 2 0.00455 Kruskal-Wallis 0.0101625
Simpson value 60 10.095457 2 0.00642 Kruskal-Wallis 0.0101625

Based on the Kruskal-Wallis test results, alpha-diversity differs significantly between groups for all of the indices. Using the following code, post-hoc test (Dunn’s test) can be performed to look for pairwise differences in alpha-diversity between groups.

dunn_test_alpha<- metadata_samples_long %>% group_by(alpha_div) %>%
    rstatix::dunn_test(value ~ observation.unit.identifier, p.adjust.method = "fdr") %>%
  add_xy_position(x = "observation.unit.identifier", fun = "mean_sd", step.increase = 0.05, scales = "free")
dunn_test_alpha %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12)
alpha_div .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif y.position groups xmin xmax
InvSimpson value CW1-sand CW2-biochar 21 15 -2.9390920 0.0032918 0.0098753 ** 250.590550 CW1-sand…. 1 2
InvSimpson value CW1-sand CW3-biochar 21 24 -2.4501016 0.0142816 0.0214224
257.438875 CW1-sand…. 1 3
InvSimpson value CW2-biochar CW3-biochar 15 24 0.7944532 0.4269316 0.4269316 ns 264.287200 CW2-bioc…. 2 3
Observed value CW1-sand CW2-biochar 21 15 -2.6256333 0.0086488 0.0129732
437.178100 CW1-sand…. 1 2
Observed value CW1-sand CW3-biochar 21 24 -2.6950691 0.0070374 0.0129732
442.933750 CW1-sand…. 1 3
Observed value CW2-biochar CW3-biochar 15 24 0.2501066 0.8025049 0.8025049 ns 448.689400 CW2-bioc…. 2 3
PD value CW1-sand CW2-biochar 21 15 -2.4309614 0.0150588 0.0225882
34.853150 CW1-sand…. 1 2
PD value CW1-sand CW3-biochar 21 24 -2.4900241 0.0127734 0.0225882
35.186375 CW1-sand…. 1 3
PD value CW2-biochar CW3-biochar 15 24 0.2363063 0.8131950 0.8131950 ns 35.519600 CW2-bioc…. 2 3
Shannon value CW1-sand CW2-biochar 21 15 -3.0294263 0.0024502 0.0073506 ** 5.918350 CW1-sand…. 1 2
Shannon value CW1-sand CW3-biochar 21 24 -2.5459156 0.0108992 0.0163487
5.947375 CW1-sand…. 1 3
Shannon value CW2-biochar CW3-biochar 15 24 0.8002522 0.4235647 0.4235647 ns 5.976400 CW2-bioc…. 2 3
Simpson value CW1-sand CW2-biochar 21 15 -2.9390920 0.0032918 0.0098753 ** 0.999150 CW1-sand…. 1 2
Simpson value CW1-sand CW3-biochar 21 24 -2.4501016 0.0142816 0.0214224
0.999375 CW1-sand…. 1 3
Simpson value CW2-biochar CW3-biochar 15 24 0.7944532 0.4269316 0.4269316 ns 0.999600 CW2-bioc…. 2 3

The code below plots the alpha-diversity values and the post-hoc pairwise test results.

ggplot(metadata_samples_long, aes(x = observation.unit.identifier, y = value))+
  geom_boxplot()+
  geom_jitter(shape = 21, color = "black", size = 3, aes(fill = observation.unit.identifier))+
  facet_wrap(~alpha_div, scales = "free", ncol = 3)+
  scale_fill_nejm()+
  stat_pvalue_manual(data = dunn_test_alpha,
                     label = "p.adj.signif", tip.length = 0.01, bracket.shorten = 0.05, hide.ns = TRUE, size = 4)+
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
  theme_bw()+
  theme(axis.text.x = element_text(size = 10), legend.position = "none")

Alpha-diversity is statistically significantly higher for CW1-samples compared to the other two groups. As mentioned above, these indices are sensitive to uneven sequencing depth. For this, interpretation should happen with great caution considering only diversity indices non-sensitive to variations in sequencing depth.

Beta-diversity analysis

To measure how distinct or similar microbial communities are across different environments, conditions or treatments, beta-diversity analysis is performed where several distance or dissimilarity metrics tailored to capture specific data characteristics are used. Common beta-diversity metrics: Bray-Curtis dissimilarity (compositional), Jaccard index (compositional)and Unifrac (phylogenetic), each emphasizing different aspects of community composition (presence/absence, abundance, phylogenetic).

The code below calculates the distance matrix using Bray-Curtis and every value in this matrix represents the pairwise dissimilarity between two samples. To reduce the complexity of the distance matrix and represent sample relationships in lower-dimensional space, Principal Coordinate Analysis (PCoA) was performed. Statistical testing (PERMANOVA) followed to assess whether the observed patterns/differences are statistically significantly associated with analyzed groups from metadata. PERMANOVA tests for differences in the centroid (average position) of groups, indicating whether groups have different overall compositions.

PCoA_ASV_bray <- ordinate(ps_filt_fixed_samples_rel, distance = "bray", method = "PCoA")
beta_ASV_bray <- phyloseq::distance(ps_filt_fixed_samples_rel, "bray")
ado_bray_group<- adonis2(beta_ASV_bray ~ observation.unit.identifier, metadata_samples)
print(ado_bray_group)
## Permutation test for adonis under reduced model
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = beta_ASV_bray ~ observation.unit.identifier, data = metadata_samples)
##          Df SumOfSqs      R2      F Pr(>F)    
## Model     2   2.6198 0.13162 4.3199  0.001 ***
## Residual 57  17.2839 0.86838                  
## Total    59  19.9038 1.00000                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

PERMANOVA assumes that the dispersion within groups is homogeneous, i.e. the spread of distances from the group centroid is similar across all groups. Unequal dispersions influence the interpretation of PERMANOVA results. To confirm that the observed differences in PERMANOVA are not simply due to one group having more variability in composition, it is a good practice to check for the homogenity of dispersions. For this, a beta-dispersion test will be performed. However, according to a paper by Anderson and Walsh (2013) it is most important if the sample sizes are not equal. If they are, the dispersion might be heterogenous and one could still interpret PERMANOVA results.

group.bd<- betadisper(d = beta_ASV_bray, group = metadata_samples$observation.unit.identifier, type="centroid")
boxplot(group.bd)

permutest(group.bd)
## 
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
## 
## Response: Distances
##           Df  Sum Sq   Mean Sq      F N.Perm Pr(>F)
## Groups     2 0.01497 0.0074843 0.6005    999  0.573
## Residuals 57 0.71044 0.0124638

Dispersions do not differ between groups, therefore PERMANOVA results can be interpreted with confidence.

The code below shows how to plot results from PCoA analysis. Here a confidence ellipsis with mean values (plotted as diamonds) is added. This kind of ellipsis shows where are the mean results in each group and it also shows the dispersion of the samples around the mean.

plot_ordination(ps_filt_fixed_samples_rel, PCoA_ASV_bray, axes = c(1,2))+
  stat_conf_ellipse(geom = "polygon", alpha = 0.1, aes(color = observation.unit.identifier,  fill = observation.unit.identifier))+
  geom_point(size = 3, shape = 21, aes(fill = observation.unit.identifier))+
  stat_mean(aes(fill = observation.unit.identifier), shape = 23, size = 4) + 
  annotate("text", x =- 0.3, y = 0.5, color = "black", label = paste0("p=", round(ado_bray_group$`Pr(>F)`[1], digits = 3)))+
  annotate("text", x = -0.3, y = 0.45, color = "black", label = paste0("R\u00b2=", round((ado_bray_group$R2[1])*100, digits = 2), "%"))+
  scale_fill_nejm(name = "observation\nunit")+
  scale_color_nejm(name = "observation\nunit")+
  theme_bw()+
  theme(legend.position = "right")

Following the above example, beta-diversity analysis is performed using Weighted Unifrac distance metric which accounts for both phylogenetic relationships between taxa and their relative abundance.

PCoA_ASV_wunifrac <- ordinate(ps_filt_fixed_samples_rel, distance = "wunifrac", method = "PCoA")
beta_ASV_wunifrac <- phyloseq::distance(ps_filt_fixed_samples_rel, "wunifrac")
ado_wunifrac_group<- adonis2(beta_ASV_wunifrac ~ observation.unit.identifier, metadata_samples)
print(ado_wunifrac_group)
## Permutation test for adonis under reduced model
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = beta_ASV_wunifrac ~ observation.unit.identifier, data = metadata_samples)
##          Df SumOfSqs      R2      F Pr(>F)    
## Model     2  0.10444 0.13845 4.5799  0.001 ***
## Residual 57  0.64992 0.86155                  
## Total    59  0.75436 1.00000                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here, the dispersion also needs to be checked

group_wunifrac.bd<- betadisper(d = beta_ASV_wunifrac, group = metadata_samples$observation.unit.identifier, type="centroid")
boxplot(group_wunifrac.bd)

permutest(group_wunifrac.bd)
## 
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
## 
## Response: Distances
##           Df   Sum Sq    Mean Sq      F N.Perm Pr(>F)
## Groups     2 0.000482 0.00024084 0.1744    999  0.841
## Residuals 57 0.078723 0.00138110

The dispersion does not differ therefore conclusions can be drawn from PERMANOVA test

plot_ordination(ps_filt_fixed_samples_rel, PCoA_ASV_wunifrac, axes = c(1,2))+
  stat_conf_ellipse(geom = "polygon", alpha = 0.1, aes(color = observation.unit.identifier,  fill = observation.unit.identifier))+
  geom_point(size = 3, shape = 21, aes(fill = observation.unit.identifier))+
  stat_mean(aes(fill = observation.unit.identifier), shape = 23, size = 4) + 
  annotate("text", x = 0.1, y = -0.15, color = "black", label = paste0("p=", round(ado_wunifrac_group$`Pr(>F)`[1], digits = 3)))+
  annotate("text", x = 0.1, y = -0.13, color = "black", label = paste0("R\u00b2=", round((ado_wunifrac_group$R2[1])*100, digits = 2), "%"))+
  scale_fill_nejm(name = "observation\nunit")+
  scale_color_nejm(name = "observation\nunit")+
  theme_bw()+
  theme(legend.position = "right")

Differential abundance analysis (DAA)

A common practice in microbiome analysis is to identify which taxa significantly differ between sample groupings, conditions or treatments applied in the respective study. There are different ways to perform differential abundance testing with microbiome data. The researcher needs to decide which method suits the project aim the best. In the article by Nearing et al. (2022) the performance of different testing methods for DAA is compared. Among the tools showing the most consistent results was MaAsLin2 (Multivariate Association with Linear Models 2) and below we show how to use the R package. The output is generated into separate folders and files.

MaAsLin3 is also available for use and its tutorial can be found here.

Data Preprocessing: Filtering by prevalence

There are several statistical challenges in analyzing microbiome data due to its unique properties that have led to the development of different approaches to ensure statistically reliable results (Yang and Chen, 2022).

Among others, one question raised when processing data before DAA is whether rare taxa should be filtered out. Filtering out potentially uninformative features before running statistical tests could help address the problem of statistical power reduction due to multiple test correction. Nearing et al.(2022). However, data filtering could have unexpected effects in some cases such as false positives increases (type I errors) Bourgon et al. (2010). This is another debatable topic when someone deals with high-dimensional data.

Given that sequencing data of microbial communities are compositional, filtering out features directly affects the relative abundance of each taxon. Nevertheless, it is reasonable to exclude less prevalent features prior to DA testing because their presence might prohibit identification of significant ASVs between conditions by the DA tools.

Choosing the optimal prevalence filtering cut-off involves certain exploratory steps. The following code will keep only taxa with a prevalence of 5% (taxa present in min. 5% of samples).

ps_filt_fixed_samples_prev<- tax_filter(ps_filt_fixed_samples,
  min_prevalence = 0.05)
  
print(ps_filt_fixed_samples_prev)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1761 taxa and 60 samples ]
## sample_data() Sample Data:       [ 60 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 1761 taxa by 8 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1761 tips and 1760 internal nodes ]

In this function other parameters can be also set, such as prevalence detection threshold (prev_detection_threshold). Therefore for example to filter taxa with specific minimum abundance (eg. 0.25%) in a given amount of samples (eg. two), the code would be as follows: tax_filter(ps_filt_fixed_samples, min_prevalence = 2, prev_detection_threshold = 0.0025)

Let’s compare the number of taxa in filtered and unfiltered phyloseq object.

The number of taxa before filtering:

ntaxa(ps_filt_fixed_samples)
## [1] 3772

The number of taxa after filtering:

ntaxa(ps_filt_fixed_samples_prev)
## [1] 1761

The number of removed taxa:

ntaxa_removed <- ntaxa(ps_filt_fixed_samples)-ntaxa(ps_filt_fixed_samples_prev)
print(ntaxa_removed)
## [1] 2011

The number of removed taxa as proportion of all taxa:

ntaxa_removed*100/ntaxa(ps_filt_fixed_samples)
## [1] 53.31389

When applying a 5% prevalence threshold, more than 50% of the taxa are removed. This means that almost half of the taxa were present in at least 5% of the samples. Taxa appearing in less than 5% of the samples were filtered out.

To check what happens to the number of taxa when less strict prevalence thresholds are applied, use the following code:

prev_thresholds <- c(0.05, 0.02, 0.01)

prev_filt_results <- data.frame(min_prevalence = prev_thresholds, num_taxa_retained = NA, perc_taxa_removed = NA)

total_taxa <- nrow(tax_table(ps_filt_fixed_samples))

for (i in seq_along(prev_thresholds)) {
  ps_filt_fixed_samples_prev_all <- ps_filt_fixed_samples %>%
    tax_filter(min_prevalence = prev_thresholds[i])
  
  num_taxa_retained <- nrow(tax_table(ps_filt_fixed_samples_prev_all))
  perc_taxa_removed <- 100 * (total_taxa - num_taxa_retained) / total_taxa
  
  prev_filt_results$num_taxa_retained[i] <- num_taxa_retained
  prev_filt_results$perc_taxa_removed[i] <- perc_taxa_removed
}

print(prev_filt_results)
##   min_prevalence num_taxa_retained perc_taxa_removed
## 1           0.05              1761          53.31389
## 2           0.02              2480          34.25239
## 3           0.01              3772           0.00000

As mentioned above, depending on the research question, the filtering threshold could be an important consideration. For the purpose of the present analysis, we will focus on more common taxa across the samples and thus, we will apply minimum prevalence of 5%.

MaAsLin2

Since 16S rRNA gene sequencing has limited resolution at species level, usually genus level is used. However in case of this dataset many genera were not annotated, therefore we will use the highest rank with the most annotated taxa, which was order level. MaAsLin2 requires input files (feature abundance and metadata) in specific format. For more information on how to run it, check out the link. The code below will prepare the input data which needs to be a data frame. Therefore, the ASV table needs to be extracted from the phyloseq object. Because a phyloseq object grouped at order level will be used - taxonomy is already added as row names to otu_table. Differential abundance analysis in this example is performed on compositionally transformed data.

ps_filt_fixed_samples_order_rel<- tax_transform(ps_filt_fixed_samples_order, trans = "compositional")
input_sample_metadata <- ps_filt_fixed_samples_order_rel%>% meta
input_data_order_Maaslin <- as.data.frame(otu_table(ps_filt_fixed_samples_order_rel))

MaAsLin2 requires a file with features as columns and samples as rows. Any other non-numeric columns (e.g., taxonomic labels) in the ASV table should be removed.

input_data_order_Maaslin <- input_data_order_Maaslin[sapply(input_data_order_Maaslin, is.numeric)]

It is important to make sure the input_metadata is a data frame.

For pairwise comparisons done in MaAsLin2 it is important to indicate which group of samples is the reference. There are 3 ways of doing that: - setting the column describing grouping variable as a factor with 1st level assigned to a control group, - changing names of the groups by prefixing the desired reference category with strings such as “a_” - MaAsLin2 by default treats the first category in alphabetical order as the reference. - by indicating which group (in which column of the metadata) is the reference.

In this tutorial we will use the last method.

Before running MaAsLin2 we need to define the output folder.

output_directory_order <- "Maaslin2_output_order"

Run the MaAsLin2 analysis. The code below uses a linear model to fit the data and identify associations between microbial features and metadata variables. Here we have set the parameters for abundance and prevalence filtering (ASVs that will be tested need to be present in at least 5% of samples). Therefore, we will use the phyloseq object before filtering by prevalence. Data is already normalized using total-sum scaling (TSS),therefore in the options of MaAsLin2 function we choose not to normalize the data for the second time. However, we choose to log transform it. P-values are adjusted using the default Benjamini-Hochberg (BH) method. For other options please check the documentation by executing C:/ProgramData/R/win-library/4.4/Maaslin2/help/Maaslin2 in the console. You can also set random effects to the model.

fit_data_order <- Maaslin2::Maaslin2(
    input_data = input_data_order_Maaslin,
    input_metadata = input_sample_metadata,
    output = output_directory_order,
    analysis_method = "LM",
    fixed_effects = "observation.unit.identifier",
    normalization = "NONE", 
    correction = "BH",
    transform = "LOG",
    min_abundance = 0,
    min_prevalence = 0.05,
    reference = "observation.unit.identifier,CW1-sand")

The result of the run is stored in the separate folder which we set above in the working directory. Visualization output files include separate boxplots for each significant association with categorical metadata and scatterplots with fitted regression line for continuous metadata. Additionally, a heatmap showing 50 top features (at order level) significantly associated with analyzed variables. All results are also saved in tables (incl. all associations and significant associations in separate files). There is also RDS file with all residuals of the model and a log file.

Full explanation can be found here: https://huttenhower.sph.harvard.edu/maaslin/

The first few lines of the significant results looks like that:

head(fit_data_order$results, n=20) %>% kbl() %>%
  kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "1000px", height = "500px")
feature metadata value coef stderr pval name qval N N.not.zero
o__Gemmatales observation.unit.identifier CW3-biochar -3.698429 0.4957888 0.00e+00 observation.unit.identifierCW3-biochar 0.0000002 60 39
c__Gammaproteobacteria.Class observation.unit.identifier CW3-biochar 2.704977 0.4215605 0.00e+00 observation.unit.identifierCW3-biochar 0.0000058 60 32
c__Subgroup_11.Class observation.unit.identifier CW3-biochar -1.769331 0.3071663 4.00e-07 observation.unit.identifierCW3-biochar 0.0000362 60 25
o__Polyangiales observation.unit.identifier CW3-biochar 3.046451 0.5297626 4.00e-07 observation.unit.identifierCW3-biochar 0.0000362 60 59
o__Gemmatales observation.unit.identifier CW2-biochar -3.187981 0.5609210 5.00e-07 observation.unit.identifierCW2-biochar 0.0000372 60 39
o__Thermoanaerobaculales observation.unit.identifier CW3-biochar 2.907719 0.5315996 1.00e-06 observation.unit.identifierCW3-biochar 0.0000685 60 34
o__Thermomicrobiales observation.unit.identifier CW3-biochar -1.801108 0.3509496 3.60e-06 observation.unit.identifierCW3-biochar 0.0002024 60 28
o__PAUC26f observation.unit.identifier CW3-biochar -1.757618 0.3570018 7.60e-06 observation.unit.identifierCW3-biochar 0.0003762 60 28
c__MB.A2.108.Class observation.unit.identifier CW3-biochar -1.762682 0.3699913 1.35e-05 observation.unit.identifierCW3-biochar 0.0005896 60 16
o__Ectothiorhodospirales observation.unit.identifier CW2-biochar 2.132338 0.4523215 1.61e-05 observation.unit.identifierCW2-biochar 0.0006331 60 30
o__Sphingomonadales observation.unit.identifier CW3-biochar -1.857668 0.4003936 2.09e-05 observation.unit.identifierCW3-biochar 0.0007483 60 47
o__Reyranellales observation.unit.identifier CW3-biochar -1.753530 0.3874612 3.11e-05 observation.unit.identifierCW3-biochar 0.0010209 60 31
o__Rhodobacterales observation.unit.identifier CW3-biochar -2.405116 0.5548591 6.01e-05 observation.unit.identifierCW3-biochar 0.0017358 60 19
o__RBG.13.54.9 observation.unit.identifier CW3-biochar 1.540412 0.3560167 6.17e-05 observation.unit.identifierCW3-biochar 0.0017358 60 45
c__Alphaproteobacteria.Class observation.unit.identifier CW3-biochar -1.931087 0.4564534 8.55e-05 observation.unit.identifierCW3-biochar 0.0017857 60 41
o__Ectothiorhodospirales observation.unit.identifier CW3-biochar 1.701981 0.3997995 7.82e-05 observation.unit.identifierCW3-biochar 0.0017857 60 30
p__FCPU426.Phylum observation.unit.identifier CW3-biochar 1.297098 0.3035762 7.41e-05 observation.unit.identifierCW3-biochar 0.0017857 60 20
c__uncultured_Acidobacterium_sp..Class observation.unit.identifier CW3-biochar -1.452569 0.3396592 7.32e-05 observation.unit.identifierCW3-biochar 0.0017857 60 17
c__BD2.11_terrestrial_group.Class observation.unit.identifier CW3-biochar -1.608018 0.3802796 8.61e-05 observation.unit.identifierCW3-biochar 0.0017857 60 20
o__Sphingomonadales observation.unit.identifier CW2-biochar -1.899004 0.4529937 9.74e-05 observation.unit.identifierCW2-biochar 0.0018759 60 47

This is the heatmap.

knitr::include_graphics("./Maaslin2_output_order/figures/heatmap.png")
heatmap, MaAslin2, order level

heatmap, MaAslin2, order level

And one of the boxplots

knitr::include_graphics("./Maaslin2_output_order/figures/observation_unit_identifier_1.png")
boxplot, MaAslin2, order level

boxplot, MaAslin2, order level

All MaAsLin2 analyses can be run also using other taxonomic ranks.

Linda - Linear (Lin) Model for Differential Abundance Analysis (DAA) of High-dimensional Compositional Data

LinDa is another approach to perform DAA which only requires fitting linear regression models on CLR-transformed abundance data as response. This method addresses the compositional effects by correcting the bias using the mode of the regression coefficients and can be extended to mixed-effect models for correlated microbiome data (Zhou et al. (2022)).

LinDA function has been integrated into the CRAN package MicrobiomeStat.

To run LinDa, a matrix of counts/proportions (row-features, column-samples) and a data frame containing sample metadata (without NAs) are needed. In this case, we will use the same phyloseq object (relative abundance, order level) as in MaAsLin2 to create the input data for LinDA.

asv.mat <- otu_table(ps_filt_fixed_samples_order_rel) %>% as.matrix()
metadata.dat <- meta(ps_filt_fixed_samples_order_rel) %>% as.data.frame()

Before running the model, consider which are the fixed and random effects in your dataset. Here, several variables could be tested for their effect on the microbiota composition (type of CW, depth, location, season). In this example, to compare LinDa results to MaasLin2 the type of CW is used as fixed effect. Similar to MaasLin2, the reference group for comparisons is typically determined by alphabetical order of the group labels in the metadata unless specified otherwise.

Prevalence filter will be set at 0.05 as in MaasLin2 above. Other parameters that need consideration for optimizing the performance of LinDa are mentioned in the R documentation of the function and in this article by Li et al. (2023).

linda.obj.order <- linda(feature.dat = asv.mat, meta.dat = metadata.dat,
                         feature.dat.type = 'proportion',
                         formula = '~observation.unit.identifier', 
                         prev.filter = 0.05)
## 70  features are filtered!
## The filtered data has  60  samples and  221  features will be tested!
## Fit linear models ...
## Completed.

The output message gives information about the results of the filtering (prevalence and/or abundance) and the method used for handling zeros in the dataset.

The output file is an object (list of elements) which includes the variable names of the fixed effects based on the formula used, the estimated bias of the regression coefficients due to the compositional effect and a list of data frames including the results of the DAA. All relevant information can be found here.

Let’s inspect the results of LinDA for each comparison.

linda.out <- as.data.frame(linda.obj.order$output)
head(linda.out, n=20) %>% kbl() %>% 
   kableExtra::kable_paper(font_size = 12) %>%
  scroll_box(width = "1000px", height = "500px")
observation.unit.identifierCW2.biochar.baseMean observation.unit.identifierCW2.biochar.log2FoldChange observation.unit.identifierCW2.biochar.lfcSE observation.unit.identifierCW2.biochar.stat observation.unit.identifierCW2.biochar.pvalue observation.unit.identifierCW2.biochar.padj observation.unit.identifierCW2.biochar.reject observation.unit.identifierCW2.biochar.df observation.unit.identifierCW3.biochar.baseMean observation.unit.identifierCW3.biochar.log2FoldChange observation.unit.identifierCW3.biochar.lfcSE observation.unit.identifierCW3.biochar.stat observation.unit.identifierCW3.biochar.pvalue observation.unit.identifierCW3.biochar.padj observation.unit.identifierCW3.biochar.reject observation.unit.identifierCW3.biochar.df
k__Bacteria Domain 2615.1623 -0.9464056 0.4701471 -2.0129990 0.0488465 0.1587511 FALSE 57 2615.1623 -0.9225019 0.4155552 -2.2199261 0.0304168 0.0873001 FALSE 57
o__Rickettsiales 1882.3681 -0.8726679 0.3684927 -2.3682092 0.0212886 0.0931557 FALSE 57 1882.3681 -1.2036368 0.3257046 -3.6954857 0.0004942 0.0034891 TRUE 57
o__Gemmatimonadales 16401.9780 -1.3938145 0.5580967 -2.4974426 0.0154208 0.0774682 FALSE 57 16401.9780 -1.8783182 0.4932924 -3.8077174 0.0003453 0.0029486 TRUE 57
o__Candidatus_Woesebacteria 870.2351 0.0857696 0.2033907 0.4216989 0.6748310 0.7920579 FALSE 57 870.2351 0.2128566 0.1797736 1.1840254 0.2413171 0.3555405 FALSE 57
o__Bacteroidales 2727.4743 1.4723457 0.7833587 1.8795293 0.0652873 0.1873831 FALSE 57 2727.4743 1.6314697 0.6923979 2.3562605 0.0219211 0.0692081 FALSE 57
o__Candidatus_Yanofskybacteria 837.3316 0.0025060 0.1057869 0.0236890 0.9811834 0.9904800 FALSE 57 837.3316 0.1017271 0.0935033 1.0879528 0.2811949 0.3933169 FALSE 57
o__Candidatus_Peribacteria 809.2383 -0.0509682 0.2828091 -0.1802213 0.8576181 0.9156212 FALSE 57 809.2383 -0.1244858 0.2499703 -0.4980025 0.6203983 0.6968140 FALSE 57
o__Gemmatales 18857.2374 -3.1372107 0.5098128 -6.1536524 0.0000001 0.0000178 TRUE 57 18857.2374 -3.6320969 0.4506151 -8.0603087 0.0000000 0.0000000 TRUE 57
o__Planctomycetales 23581.0263 -2.7635271 0.7660334 -3.6075803 0.0006519 0.0146899 TRUE 57 23581.0263 -1.8579543 0.6770843 -2.7440517 0.0081000 0.0314055 TRUE 57
o__Pirellulales 29513.5682 -0.4906958 0.5191926 -0.9451133 0.3485902 0.5135896 FALSE 57 29513.5682 -0.6835982 0.4589057 -1.4896266 0.1418353 0.2487745 FALSE 57
o__Opitutales 3113.7280 1.5026224 0.6038263 2.4885012 0.0157741 0.0774682 FALSE 57 3113.7280 1.5131727 0.5337121 2.8351855 0.0063269 0.0263819 TRUE 57
o__S-BQ2-57_soil_group 2211.2237 -0.8820417 0.3743333 -2.3563001 0.0219190 0.0931557 FALSE 57 2211.2237 -1.2297018 0.3308671 -3.7166041 0.0004621 0.0034044 TRUE 57
o__Methylacidiphilales 700.1043 0.2995374 0.1921255 1.5590716 0.1245149 0.2942982 FALSE 57 700.1043 0.4777785 0.1698166 2.8134973 0.0067130 0.0274587 TRUE 57
o__Verrucomicrobiales 6260.8613 -1.8827093 0.4593685 -4.0984728 0.0001332 0.0058880 TRUE 57 6260.8613 -1.4976424 0.4060282 -3.6885178 0.0005052 0.0034891 TRUE 57
o__Pedosphaerales 13430.6233 -0.2928035 0.4951305 -0.5913663 0.5566131 0.7151831 FALSE 57 13430.6233 -0.4257263 0.4376376 -0.9727826 0.3347716 0.4377782 FALSE 57
o__Chthoniobacterales 11366.3380 -1.0685677 0.5999567 -1.7810745 0.0802288 0.2136212 FALSE 57 11366.3380 -1.6244625 0.5302918 -3.0633367 0.0033392 0.0160426 TRUE 57
c__OM190 Class 5451.4862 -1.3953827 0.4325716 -3.2257841 0.0020825 0.0230117 TRUE 57 5451.4862 -1.1341031 0.3823429 -2.9661934 0.0043989 0.0198056 TRUE 57
o__11-24 1514.9431 -0.7997962 0.3304210 -2.4205372 0.0187066 0.0879609 FALSE 57 1514.9431 -0.6364988 0.2920536 -2.1793902 0.0334496 0.0918395 FALSE 57
o__Rhizobiales 115463.1362 0.1687520 0.4340918 0.3887474 0.6989114 0.7920996 FALSE 57 115463.1362 0.2083787 0.3836866 0.5430962 0.5891812 0.6746582 FALSE 57
o__Rhodobacterales 3731.6519 -1.2185165 0.5947999 -2.0486158 0.0451144 0.1533890 FALSE 57 3731.6519 -2.2240341 0.5257338 -4.2303425 0.0000856 0.0011125 TRUE 57

LinDA results can be plotted using the ‘linda.plot’ function which produces the effect size plot of the differential features and volcano plots based on the output. For more details on how the function works, see here.

In this function, the object from function linda is used as input and other arguments such as the variable whose results are to be plotted, significance level (alpha), log2fc cut-off and directory to save the output plots can be set.

The output file is a list of ggplot2 objects and more specifically, a list of effect size plots and a list of volcano plots found in the output directory. In both cases, each plot corresponds to one variable. Plots are saved in pdf format in the output directory. To store the figures separately, see below.

lfc.plot1 <- linda.res.plot.order$plot.lfc[[1]]
volc.plot1 <- linda.res.plot.order$plot.volcano[[1]]

ggsave("./linda_plots/lfc.plot1.png", lfc.plot1, device = png, width = 10, height = 7)
ggsave("./linda_plots/volc.plot1.png", volc.plot1, device = png, width = 10, height = 7)
knitr::include_graphics(path = "./linda_plots/lfc.plot1.png")

This is one of the effect size plots of the differentially abundant features (taxa at order level) between the different types of wetlands in the dataset. Debiased effect size is the adjusted effect size that accounts for the compositional nature of microbiome data (corrected for systematic biases-more reliable). Biased effect size is also reported for comparative reasons and to understand the magnitude of correction applied. It allows to see how much the compositional nature of the data may have affected the raw estimates.

knitr::include_graphics(path = "./linda_plots/volc.plot1.png")

This is one of the volcano plots of the differentially abundant features (taxa at order level) between the different types of wetlands in the dataset. Taxa (blue data points) in the upper right corner are enriched in the test group (CW2 or CW3) while in the upper left corner are depleted in the test group or enriched in the reference group (CW1).

Comparison of the DAA results from MaAsLin2 and LinDA

It is expected that different tools will give different output as they might test different hypotheses and use different statistical tests, transformations and they may assume different data distribution. Here, both MaAsLin2 and LinDA apply linear models, however in MaasLin2 - log transformation (default) was used, whereas in LinDA - CLR. Additionally, LinDA applied bias correction, that was initially introduced by ANCOM-BC workflow (Lin, Peddada, 2020).

To compare the results we will check how many features have statistically significantly different abundance when using both workflows. In order to do so, we will combine the results summary from MaAsLin2 and LinDA and focus on features with adjusted p value <0.05. Before merging results from both pipelines, we need to unify the names of the features, since MaAsLin2 replaced spaces, dashes and brackets with dots. Therefore, we will do the same with LinDA output.

linda.out_p.adj<- linda.out %>% rownames_to_column(var = "feature") %>%
       select(feature, observation.unit.identifierCW2.biochar.padj, observation.unit.identifierCW3.biochar.padj) %>%
  mutate(feature = str_replace_all(feature, " ", ".")) %>%
  mutate(feature = str_replace_all(feature, "-", ".")) %>%
  mutate(feature = str_replace_all(feature, "\\(", ".")) %>%
  mutate(feature = str_replace_all(feature, "\\)", ".")) %>%
       dplyr::rename(CW2 = observation.unit.identifierCW2.biochar.padj, CW3 = observation.unit.identifierCW3.biochar.padj) %>%
       pivot_longer(cols = c(CW2, CW3), names_to = "name", values_to = "p.adj_LinDA") %>%
       mutate(feature_group = paste0(feature, "_", name))

maaslin_p.adj<- fit_data_order$results %>% select(feature, value, qval) %>%
       dplyr::rename(name = value, p.adj_Maaslin = qval) %>%
       mutate(name = case_when(name == "CW2-biochar" ~ "CW2",
                               name == "CW3-biochar" ~ "CW3")) %>%
       mutate(feature_group = paste0(feature, "_", name))

Although the same prevalence filtering was set in case of MaasLin2 and LinDA, the amount of tested features (taxa) is different. As we can see below all tested features included in MaAsLin2 analysis can also be found in LinDA analysis.However, it is not true for the reverse association.

all(maaslin_p.adj$feature %in% linda.out_p.adj$feature)
## [1] TRUE
all(linda.out_p.adj$feature %in% maaslin_p.adj$feature)
## [1] FALSE

The reason for that is that in the case of LinDA more features were included in the analysis after applying the prevalence filter. In the case of MaAsLin2, it is {r length(fit_data_order\$results %\>% select(feature) %\>% unique %\>% pull(feature))} whereas in the case of LinDA {r dim(linda.out)[1]}. Different number of features are left after prevalence filtering due to the used formula: - MaAsLin2 keeps features when present in higher amount of samples than the prevalence threshold (in this case >3, which means at least 4). - LinDA keep features when at least a prevalence threshold is reached (>=3, which means at least 3).

joined_result<- full_join(linda.out_p.adj, maaslin_p.adj[,c(3,4)], by = "feature_group")
joined_result_signif<-joined_result %>% filter(p.adj_LinDA<0.05 | p.adj_Maaslin<0.05) %>%
  mutate(feature_name = paste(feature, "_", name))

The Venn diagrams presented below show the number of differentially abundant features which were detected using both pipelines (intersection of circles), only by LinDA (blue) and only by MaAsLin2 (red) when comparing pairwisely the types of CW.

LinDA_CW2<- joined_result_signif%>%filter(name == "CW2" & p.adj_LinDA<0.05) %>% pull(feature)
LinDA_CW3<- joined_result_signif%>%filter(name == "CW3" & p.adj_LinDA<0.05) %>% pull(feature)
Maaslin_CW2<- joined_result_signif%>%filter(name == "CW2" & p.adj_Maaslin<0.05) %>% pull(feature)
Maaslin_CW3<- joined_result_signif%>%filter(name == "CW3" & p.adj_Maaslin<0.05) %>% pull(feature)
Venn_CW2<- list(LinDA = LinDA_CW2, Maaslin2 = Maaslin_CW2)
Venn_CW3<- list(LinDA = LinDA_CW3, Maaslin2 = Maaslin_CW3)
Venn_CW2 <- Venn(Venn_CW2)
Venn_CW2 <- process_data(Venn_CW2, shape_id = "201")
venn_plot_CW2<-plot_venn(Venn_CW2)+
  scale_fill_distiller(palette = "RdBu")+
  ggtitle("CW1 vs. CW2")
Venn_CW3 <- Venn(Venn_CW3)
Venn_CW3 <- process_data(Venn_CW3, shape_id = "201")
venn_plot_CW3<- plot_venn(Venn_CW3)+
  scale_fill_distiller(palette = "RdBu")+
  ggtitle("CW1 vs. CW3")

ggarrange(venn_plot_CW2, venn_plot_CW3, nrow=2)

Over 70% of the total number of features were detected by both pipelines indicating a significant overlap in the results. In general, MaAsLin2 detects higher number of differentially abundant features in each comparison that were not detected by LinDA.

Features shown to be different using MaAsLin2 only:

  • for comparison between CW1 and CW2: o__Rickettsiales, o__11.24, o__Immundisolibacterales, o__Kiloniellales, o__Latescibacterales, o__Nitrospirales, o__Subgroup_15, o__Propionibacteriales, o__Micrococcales, o__uncultured_actinobacterium
  • for comparison between CW1 and CW3: k__Bacteria.Domain, o__NRL2, o__Ferrovibrionales, o__Puniceispirillales, o__Salinisphaerales, o__Isosphaerales, c__Pla4_lineage.Class, o__Subgroup_15, o__uncultured_Anaerolineaceae_bacterium, o__Lineage_IV, c__Acidimicrobiia.Class, o__uncultured_Catenulispora_sp.

Features shown to be different using LinDA only:

  • for comparison between CW1 and CW2: o__053A03.B.DI.P58, o__1.20
  • for comparison between CW1 and CW3: o__Methylacidiphilales, o__Thiotrichales, o__Myxococcales

The relationship between log2 fold changes (between reference and tested group) calculated by LinDA and coefficients calculated by MaAsLin2 can be assessed by the following scatter plots. The strength of the relationship was measured by calculating a Pearson correlation coefficient (R).

linda.out_log_change<- linda.out %>% rownames_to_column(var = "feature") %>%
       select(feature, CW2 = observation.unit.identifierCW2.biochar.log2FoldChange, CW3 = observation.unit.identifierCW3.biochar.log2FoldChange) %>%
  mutate(feature = str_replace_all(feature, " ", ".")) %>%
  mutate(feature = str_replace_all(feature, "-", ".")) %>%
  mutate(feature = str_replace_all(feature, "\\(", ".")) %>%
  mutate(feature = str_replace_all(feature, "\\)", ".")) %>%
       pivot_longer(cols = c(CW2, CW3), names_to = "name", values_to = "log_change_LinDA") %>%
  mutate(feature_name = paste0(feature, "_", name))

Maaslin_coef<- fit_data_order$results %>% select(feature, name = value, coef_Maaslin2 = coef) %>%
       mutate(name = case_when(name == "CW2-biochar" ~ "CW2",
                               name == "CW3-biochar" ~ "CW3")) %>%
  mutate(feature_name = paste0(feature, "_", name))

joined_result_coef<-linda.out_log_change %>% full_join(Maaslin_coef[,c(3,4)], by="feature_name") %>%
       left_join(joined_result[,c(3,4,5)], by = c("feature_name" = "feature_group")) %>% mutate(signif = case_when(p.adj_Maaslin <=0.05 & p.adj_LinDA <= 0.05 ~ "signif_both",
          p.adj_Maaslin <=0.05 & p.adj_LinDA > 0.05 ~ "signif_Maaslin2",
          p.adj_Maaslin >0.05 & p.adj_LinDA <= 0.05 ~ "signif_LinDA",
          is.na(p.adj_Maaslin) & p.adj_LinDA <= 0.05 ~ "signif_LinDA",
                                                      TRUE ~ "NS"))
joined_result_coef %>% 
  ggplot(aes(x = log_change_LinDA, y = coef_Maaslin2)) +
  geom_jitter(shape = 21, alpha = 0.6, size = 2, width = 0.3, height = 0.1, aes(fill = signif))+
  stat_cor(method = "pearson", label.x = -2, label.y = 2)+
  facet_wrap(~name)+
  scale_fill_nejm()+
  theme_bw()

On the plot we can see that the direction of differences is the same. The two variables have a positive association and a strong linear relationship.

Other useful resources

Packages

microViz

miaViz

microbiome

microbiomeutilities

A list of R environment based tools for microbiome data exploration, statistical analysis and visualization

sessionInfo()
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_World.utf8  LC_CTYPE=English_World.utf8   
## [3] LC_MONETARY=English_World.utf8 LC_NUMERIC=C                  
## [5] LC_TIME=English_World.utf8    
## 
## time zone: Europe/Berlin
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    grid      stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ggVennDiagram_1.5.2   stringr_1.5.1         plyr_1.8.9           
##  [4] dplyr_1.1.4           MicrobiomeStat_1.2    biomeUtils_0.022     
##  [7] phylotools_0.2.2      Biostrings_2.72.1     GenomeInfoDb_1.40.1  
## [10] XVector_0.44.0        IRanges_2.38.1        S4Vectors_0.42.1     
## [13] BiocGenerics_0.50.0   ape_5.8               ComplexHeatmap_2.20.0
## [16] ranacapa_0.1.0        patchwork_1.3.0       GGally_2.2.1         
## [19] Maaslin2_1.18.0       here_1.0.1            rstatix_0.7.2        
## [22] kableExtra_1.4.0      ggsci_3.2.0           forcats_1.0.0        
## [25] RColorBrewer_1.1-3    scales_1.3.0          pals_1.9             
## [28] ggpubr_0.6.0          tidyr_1.3.1           tibble_3.2.1         
## [31] ggtree_3.12.0         microViz_0.12.6       vegan_2.6-8          
## [34] lattice_0.22-6        permute_0.9-7         microbiome_1.26.0    
## [37] ggplot2_3.5.1         phyloseq_1.48.0       taxize_0.10.0        
## [40] rentrez_1.2.3        
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.4.1           urltools_1.7.3          ggplotify_0.1.2        
##   [4] triebeard_0.4.1         rpart_4.1.24            XML_3.99-0.18          
##   [7] lifecycle_1.0.4         Rdpack_2.6.2            doParallel_1.0.17      
##  [10] rprojroot_2.0.4         MASS_7.3-61             backports_1.5.0        
##  [13] magrittr_2.0.3          sass_0.4.9              rmarkdown_2.29         
##  [16] jquerylib_0.1.4         yaml_2.3.10             cowplot_1.1.3          
##  [19] pbapply_1.7-2           mapproj_1.2.11          DBI_1.2.3              
##  [22] minqa_1.2.8             ade4_1.7-22             maps_3.4.2.1           
##  [25] abind_1.4-8             zlibbioc_1.50.0         Rtsne_0.17             
##  [28] quadprog_1.5-8          purrr_1.0.4             hash_2.2.6.3           
##  [31] yulab.utils_0.2.0       seriation_1.5.7         circlize_0.4.16        
##  [34] GenomeInfoDbData_1.2.12 rmutil_1.1.10           ggrepel_0.9.6          
##  [37] tidytree_0.4.6          crul_1.5.0              pheatmap_1.0.12        
##  [40] spatial_7.3-18          svglite_2.1.3           codetools_0.2-20       
##  [43] getopt_1.20.4           xml2_1.3.6              tidyselect_1.2.1       
##  [46] shape_1.4.6.1           aplot_0.2.4             httpcode_0.3.0         
##  [49] UCSC.utils_1.0.0        farver_2.1.2            stable_1.1.6           
##  [52] lme4_1.1-36             TSP_1.2-4               matrixStats_1.5.0      
##  [55] jsonlite_1.8.9          GetoptLong_1.0.5        multtest_2.60.0        
##  [58] Formula_1.2-5           survival_3.7-0          iterators_1.0.14       
##  [61] systemfonts_1.2.1       foreach_1.5.2           tools_4.4.1            
##  [64] treeio_1.28.0           Rcpp_1.0.13             glue_1.8.0             
##  [67] xfun_0.49               mgcv_1.9-1              ca_0.71.1              
##  [70] numDeriv_2016.8-1.1     withr_3.0.2             timeSeries_4041.111    
##  [73] fastmap_1.2.0           boot_1.3-31             rhdf5filters_1.16.0    
##  [76] digest_0.6.37           R6_2.6.0                gridGraphics_0.5-1     
##  [79] colorspace_2.1-1        Cairo_1.6-2             dichromat_2.0-0.1      
##  [82] modeest_2.4.0           utf8_1.2.4              generics_0.1.3         
##  [85] data.table_1.16.2       DECIPHER_3.0.0          robustbase_0.99-4-1    
##  [88] httr_1.4.7              ggstats_0.8.0           pkgconfig_2.0.3        
##  [91] gtable_0.3.6            timeDate_4041.110       registry_0.5-1         
##  [94] picante_1.8.2           pcaPP_2.0-5             htmltools_0.5.8.1      
##  [97] carData_3.0-5           biomformat_1.32.0       clue_0.3-66            
## [100] logging_0.10-108        Biobase_2.64.0          png_0.1-8              
## [103] optparse_1.7.5          reformulas_0.4.0        ggfun_0.1.8            
## [106] knitr_1.49              rstudioapi_0.17.1       reshape2_1.4.4         
## [109] rjson_0.2.23            statip_0.2.3            nloptr_2.1.1           
## [112] nlme_3.1-166            curl_6.2.0              cachem_1.1.0           
## [115] zoo_1.8-12              rhdf5_2.48.0            GlobalOptions_0.1.2    
## [118] parallel_4.4.1          fBasics_4041.97         pillar_1.10.1          
## [121] vctrs_0.6.5             car_3.1-3               cluster_2.1.6          
## [124] evaluate_1.0.3          mvtnorm_1.3-3           cli_3.6.3              
## [127] compiler_4.4.1          rlang_1.1.4             crayon_1.5.3           
## [130] ggsignif_0.6.4          labeling_0.4.3          fs_1.6.5               
## [133] stringi_1.8.4           viridisLite_0.4.2       lmerTest_3.1-3         
## [136] munsell_0.5.1           lazyeval_0.2.2          Matrix_1.7-1           
## [139] stabledist_0.7-2        Rhdf5lib_1.26.0         statmod_1.5.0          
## [142] rbibutils_2.3           igraph_2.1.1            broom_1.0.7            
## [145] bslib_0.9.0             phangorn_2.12.1         biglm_0.9-3            
## [148] fastmatch_1.1-6         DEoptimR_1.1-3-1