Bioinformatics Workflow overview

The UNLOCK FAIR Data Platform provides several bioinformatics workflows designed to automate data analysis. These workflows are written in CWL (Common Workflow Language), and the tools are containerized using Docker. This reduces the need to install complex dependencies while allowing us to incorporate many different tools. We prioritize publicly available, community-maintained images, but host our own when necessary.

Access and usage

Below is a summary of key workflows available at the FDP. For a full, detailed list of published workflows (including inputs, steps, and outputs), visit the UNLOCK WorkflowHub.

Development of these workflows takes place on our GitLab page: gitlab.com/m-unlock/cwl

For a short introduction on how to set up and run these workflows, see the Setup section.
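As a rough illustration of what running a CWL workflow can look like, the sketch below builds a command line for `cwltool`, the CWL reference runner. The file names `workflow_metagenomics_assembly.cwl` and `job.yaml` and the `results` output directory are placeholders, not names taken from the UNLOCK repository. The Setup section remains the authoritative guide to the runner and options used on the platform.

```python
import shlex


def build_cwltool_command(workflow, job_file, outdir="results"):
    """Build the argument list for a cwltool run (nothing is executed here)."""
    return ["cwltool", "--outdir", str(outdir), str(workflow), str(job_file)]


# Placeholder file names, chosen for illustration only.
command = build_cwltool_command("workflow_metagenomics_assembly.cwl", "job.yaml")
print(shlex.join(command))
```

Once `cwltool` and Docker are installed, a list like this could be passed to `subprocess.run(command, check=True)`.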

Workflow: Metagenomics Assembly

View on Workflowhub

This workflow assembles genomes from Illumina reads, long reads, or both. The assembled contigs or scaffolds can be binned to acquire bins/MAGs, which can then be annotated further.
Which steps are run can be customized to a certain extent, and the workflow can also be used for isolates.

Steps involved:

Illumina Quality Workflow (see below)
Long Read Quality Workflow (see below)
Taxonomic read classification (kraken2/bracken)
Short or Hybrid assembly (SPAdes)
Long read only assembly (Flye)
Assembly polishing using short reads (pypolca) (optional)
Polish Nanopore assemblies with ONT reads (Medaka) (optional)
QUAST (Assembly quality report)
Read mapping to assembly (minimap2) (optional)
Metagenomics Binning workflow (see below) (optional)
Metagenomics GEM workflow (see below) (optional)
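The choice between the two assemblers in the list above can be sketched as a small decision function. This is one plausible reading of the step descriptions (SPAdes for short-read or hybrid input, Flye for long reads only), not the workflow's actual selection logic, which is configurable and may differ. The function name and return labels are hypothetical.

```python
def choose_assembler(has_illumina: bool, has_long_reads: bool) -> str:
    """Pick an assembler label from the available read types (assumed logic)."""
    if has_illumina and has_long_reads:
        return "SPAdes (hybrid)"
    if has_illumina:
        return "SPAdes (short reads)"
    if has_long_reads:
        return "Flye (long reads only)"
    raise ValueError("at least one read type is required")


for illumina, long_reads in [(True, False), (True, True), (False, True)]:
    print(illumina, long_reads, "->", choose_assembler(illumina, long_reads))
```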

Workflow: Illumina Quality

View on Workflowhub

This workflow ensures high-quality Illumina read data before further analysis.

Steps included:

Quality plots, before and after filtering (Sequali)
Quality and length filtering (fastp)
Host read removal (Hostile)
Reference filter that removes either mapped or unmapped reads (Hostile) (optional)
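To make the idea of quality and length filtering concrete, here is a minimal sketch that keeps a read only if it is long enough and its mean Phred score is high enough. It assumes Phred+33 encoded quality strings. The thresholds and function names are illustrative assumptions, and the criterion is deliberately simpler than what fastp applies in the workflow.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def passes_filter(sequence: str, quality: str,
                  min_length: int = 50, min_mean_quality: float = 20.0) -> bool:
    """Keep a read only if it meets both the length and quality thresholds."""
    return len(sequence) >= min_length and mean_phred(quality) >= min_mean_quality


# Toy reads: 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
reads = [
    ("A" * 60, "I" * 60),  # long enough, high quality
    ("A" * 60, "#" * 60),  # long enough, low quality
    ("A" * 10, "I" * 10),  # too short
]
kept = [read for read in reads if passes_filter(*read)]
print(f"kept {len(kept)} of {len(reads)} reads")
```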

Workflow: Long Reads Quality

View on Workflowhub

This workflow ensures high-quality Nanopore/long-read data before further analysis.

Steps included:

Quality plots and reports, before and after filtering (NanoPlot)
Quality and length filtering (fastplong)
Host read removal (Hostile) (optional)
Reference filter that removes either mapped or unmapped reads (Hostile) (optional)

Workflow: Metagenomics Binning

View on Workflowhub

This workflow groups assembled sequences into genome bins representing individual microbial species (MAGs).

Steps included:

Binners: Metabat2 / MaxBin2 / SemiBin2
Bin refinement (Binette)
Eukaryotic contig classification (EukRep)
CheckM2 bin quality
BUSCO bin quality
Bin taxonomic classification (GTDB-Tk)
Combined summary and overviews of refined bins
Annotation of bins (see the Microbial genome annotation workflow below)
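Bin quality tools such as CheckM2 report completeness and contamination estimates per bin. The sketch below shows one way such values could be turned into quality tiers. It uses the completeness and contamination cut-offs commonly associated with the MIMAG standard, in simplified form (the rRNA and tRNA criteria are ignored). These thresholds are an assumption for illustration and are not stated to be applied by this workflow.

```python
def classify_bin(completeness: float, contamination: float) -> str:
    """Assign a quality tier from completeness and contamination percentages.

    Thresholds are simplified, MIMAG-style assumptions, not workflow settings.
    """
    if completeness > 90 and contamination < 5:
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    return "low"


# Hypothetical bins with made-up completeness/contamination values.
bins = {"bin_1": (97.2, 1.1), "bin_2": (71.5, 6.3), "bin_3": (42.0, 2.0)}
for name, (completeness, contamination) in bins.items():
    print(name, classify_bin(completeness, contamination))
```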

Workflow: Microbial genome annotation

View on Workflowhub

Workflow to annotate a microbial (meta-)genome.

Steps included:

Rapid & standardized annotation with Bakta
Functional annotation of novel sequences with eggNOG mapper (optional)
Classify sequences into families and predict the presence of domains and significant sites with InterProScan (optional)
Function annotation based on KEGG Orthology with KoFamScan (optional)
Conversion to RDF with SAPP (optional, default on)

Workflow: Longread 16S Classification

View on Workflowhub

Workflow for quality assessment and taxonomic classification of full length 16S sequences.

Steps included:

Long Read Quality Workflow (see above)
Emu abundance: species-level taxonomic abundance for full-length 16S reads

Workflow: Short read amplicon classification and functional prediction

View on Workflowhub

Workflow for short-read amplicon classification and functional prediction.

Steps included:

Quality plots (FastQC)
High-throughput Amplicon Analysis (NG-TAX 2)
Function prediction from marker gene sequences (PICRUSt2)

Workflow: Metagenomics GEM

View on Workflowhub

!! Important caveat: The CarveMe, MEMOTE and SMETANA Docker container images used in this workflow include the licensed CPLEX Optimizer, so we cannot make these images public. As a result, the workflow will not work out of the box. However, we have made the Docker build files available here

Steps included:

Prodigal protein prediction
CarveMe GEnome-scale Metabolic model reconstruction
MEMOTE for metabolic model testing
SMETANA Species METabolic interaction ANAlysis
