Setup#
To use the UNLOCK computational workflows, follow the steps below.
Requirements#
Docker
A CWL runner (we use cwltool)
The UNLOCK CWL repository: m-unlock/cwl/
A (powerful) server, depending on your workflow and inputs
Installation steps#
Install Docker: Follow the Get Docker installation guide in the Docker documentation, and make sure Docker is running when you execute the workflows.
Install a CWL Runner: For our workflows we use the reference runner cwltool. Installation instructions are available on the CWL GitHub. Other runners are listed here, but we have not tested them yet.
Clone the UNLOCK CWL Repository: Clone the full CWL repository with the following command:
git clone https://gitlab.com/m-unlock/cwl.git
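Before running any workflow, it can help to confirm that the required tools are available on the PATH. The following is a minimal sketch for a POSIX shell; the function names check_tool and check_requirements are illustrative and not part of the UNLOCK repository or cwltool.

```shell
#!/bin/sh
# Report whether a single command is available on the PATH.
# check_tool is a hypothetical helper name used only for this sketch.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Check every tool from the requirements list and report all missing ones.
check_requirements() {
    status=0
    for tool in docker cwltool git; do
        check_tool "$tool" || status=1
    done
    return "$status"
}
```

Calling check_requirements prints one line per tool and returns a non-zero status if any tool is missing. It does not verify that the Docker daemon is actually running; docker info can be used for that.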
Workflow execution#
CWL file structure#
The “workflows” folder in the repository contains multi-step workflows that combine the tools from the “tools” folder. These workflows are published at WorkflowHub. In this section, we will test the Metagenomic Assembly Workflow, which is a composite workflow that calls other workflows as nested subworkflows.
Note: do not change the folder structure, as the workflows call the CWL files internally using relative paths.
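Because the workflows depend on this layout, a quick check of the clone can catch a moved or missing folder early. The sketch below assumes the repository was cloned into a folder named cwl in the current directory; check_layout is a hypothetical helper name.

```shell
#!/bin/sh
# Verify that a clone still contains the "workflows" and "tools" folders.
# The default path "cwl" is an assumption based on the git clone command.
check_layout() {
    repo=${1:-cwl}
    for dir in workflows tools; do
        if [ ! -d "$repo/$dir" ]; then
            echo "missing folder: $repo/$dir" >&2
            return 1
        fi
    done
    echo "layout ok: $repo"
}
```

This only checks that both folders exist. It does not detect files that were moved inside them.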
To check all of the possible input options of the workflow, run:
cwltool cwl/workflows/workflow_metagenomics_assembly.cwl --help
(You may see warnings here - these can be ignored.)
While workflows can be executed directly with command-line arguments, defining inputs in a YAML file improves reusability and readability.
Executing a CWL workflow#
As a test run, we will execute the workflow_metagenomics_assembly.cwl workflow with a test YAML file found in the tests folder.
YAML file: tests/assembly/hybrid_small.yaml
identifier: hybrid_TEST
threads: 4
memory: 4000
run_spades: true
run_flye: true
binning: false
metagenome: true
keep_filtered_reads: true
run_medaka: true
ont_basecall_model: r941_min_hac_g507
nanopore_reads:
  - class: File
    path: http://download.systemsbiology.nl/unlock/cwl/test_data/long_reads_high_depth.fastq.gz
illumina_forward_reads:
  - class: File
    path: http://download.systemsbiology.nl/unlock/cwl/test_data/short_reads_1.fastq.gz
illumina_reverse_reads:
  - class: File
    path: http://download.systemsbiology.nl/unlock/cwl/test_data/short_reads_2.fastq.gz
filter_references:
  - class: File
    path: http://download.systemsbiology.nl/unlock/cwl/test_data/human_small.fa.gz
Parameter descriptions#
identifier: Prefix for naming the output files and folders
memory: Memory allocation for tools that have a specific memory option (this is not a general limit for the whole workflow)
threads: Number of CPU threads to use in tools that have multithreading support
run_spades: Enable SPAdes assembler (hybrid)
run_flye: Enable Flye assembler
binning: If true, runs the binning workflow
metagenome: Indicates whether the sample is a metagenome (affects assembler behaviour)
run_medaka: Enables Medaka ONT assembly polishing
ont_basecall_model: Basecalling model of the ONT reads; required for Medaka
nanopore_reads, illumina_forward_reads and illumina_reverse_reads: Paths to read files (local or HTTP-accessible)
Parameter order in the YAML (or as input arguments) does not affect the workflow behavior.
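To run the workflow on your own data, the test file can serve as a template. The sketch below writes a job file with the same keys and settings as hybrid_small.yaml and swaps in your own identifier and file paths. The function name write_job_file, its argument order and the example paths are hypothetical; the settings are copied from the test file and may need adjusting for real data.

```shell
#!/bin/sh
# Write a job file that mirrors the keys of tests/assembly/hybrid_small.yaml.
# write_job_file and its argument order are hypothetical, for illustration.
write_job_file() {
    out=$1
    identifier=$2
    nanopore=$3
    forward=$4
    reverse=$5
    reference=$6
    cat > "$out" <<EOF
identifier: $identifier
threads: 4
memory: 4000
run_spades: true
run_flye: true
binning: false
metagenome: true
keep_filtered_reads: true
run_medaka: true
ont_basecall_model: r941_min_hac_g507
nanopore_reads:
  - class: File
    path: $nanopore
illumina_forward_reads:
  - class: File
    path: $forward
illumina_reverse_reads:
  - class: File
    path: $reverse
filter_references:
  - class: File
    path: $reference
EOF
}
```

For example, write_job_file my_sample.yaml my_sample /data/ont.fastq.gz /data/r1.fastq.gz /data/r2.fastq.gz /data/host.fa.gz creates my_sample.yaml, which can then be passed to cwltool in place of the test file.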
Run the workflow#
cwltool --outdir assembly_test cwl/workflows/workflow_metagenomics_assembly.cwl cwl/tests/assembly/hybrid_small.yaml
This test dataset is very small and does not produce meaningful results. Output files (assemblies, quality reports, etc.) will be saved in the directory assembly_test, as defined by the cwltool flag --outdir.
(Test data source: Unicycler assembler sample_data)
cwltool tips:#
--cachedir .
→ Allows resuming a failed run by keeping intermediate outputs.
--tmpdir-prefix .
→ Sets a custom location for temporary files, which can be quite large.
--provenance
→ Captures workflow execution details in the folder given as its argument; this is used by default in the UNLOCK infrastructure.
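The cache and temporary-file flags can be combined with the run command shown earlier. The sketch below wraps that command in a small function. The name run_assembly, the work-directory layout (cache and tmp subfolders) and the CWLTOOL override variable are assumptions made for this example, not part of cwltool or the UNLOCK repository.

```shell
#!/bin/sh
# Run the assembly workflow with a cache and a custom temporary location.
# run_assembly and the work-directory layout are hypothetical.
# Setting CWLTOOL=echo prints the arguments instead of running cwltool.
run_assembly() {
    outdir=$1
    workdir=$2
    jobfile=$3
    mkdir -p "$workdir/cache" "$workdir/tmp"
    "${CWLTOOL:-cwltool}" \
        --outdir "$outdir" \
        --cachedir "$workdir/cache" \
        --tmpdir-prefix "$workdir/tmp/" \
        cwl/workflows/workflow_metagenomics_assembly.cwl \
        "$jobfile"
}
```

For example, run_assembly assembly_test /scratch/unlock cwl/tests/assembly/hybrid_small.yaml runs the test case with its cache and temporary files under /scratch/unlock (a placeholder path). Rerunning the same call after a failure lets cwltool reuse the cached intermediate outputs.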
A more detailed explanation of data provenance, the workflows and their use cases will be available soon.