Setup

To use the UNLOCK FAIR Data Platform computational workflows, follow the steps below to install required tools, clone the repository, and run a test workflow.

Requirements#

Docker
A CWL runner (we use cwltool)
The UNLOCK CWL repository: m-unlock/cwl/
A (powerful) server, depending on your workflow and inputs

Installation steps#

Install Docker: Follow the official installation guide (Get Docker, Docker Docs). Make sure Docker is running before executing the workflows.
Install a CWL runner: Our workflows use the reference runner, cwltool; installation instructions are available on the CWL GitHub. Other runners are listed here, but we have not tested them yet.
Clone the UNLOCK CWL Repository: Clone the full CWL repository with the following command: git clone https://gitlab.com/m-unlock/cwl.git
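
Before moving on, it can help to confirm that the tools from the steps above are reachable from your shell. The following is a minimal sketch and not part of the UNLOCK repository: the helper name `missing_tools` is hypothetical, and it only checks that the executables are on `PATH` and, if Docker is present, whether `docker info` can reach the daemon.

```shell
#!/bin/sh
# Hypothetical helper (not part of the UNLOCK repository):
# print the name of every listed tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools docker cwltool git)
if [ -n "$missing" ]; then
    # Unquoted echo joins the names onto a single line.
    echo "Missing tools: $(echo $missing)"
else
    echo "docker, cwltool and git were found on PATH."
fi

# If Docker is installed, 'docker info' fails when the daemon is not running.
if command -v docker >/dev/null 2>&1; then
    if docker info >/dev/null 2>&1; then
        echo "Docker daemon is reachable."
    else
        echo "Docker is installed but the daemon does not seem to be running."
    fi
fi
```

The script only reports what it finds and does not install anything, so it is safe to rerun after each installation step.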

Workflow execution#

Repository structure#

The workflows/ folder in the repository contains multi-step workflows that combine the tools from the tools/ folder. Each major workflow is published on WorkflowHub. In this section, we will test the Metagenomic Assembly Workflow, a composite workflow that calls other workflows as nested subworkflows.

Note: Do not change the folder structure, because the workflows reference the other CWL files by relative path.

To check all of the possible input options of the workflow, run:

cwltool cwl/workflows/workflow_metagenomics_assembly.cwl --help

(You may see warnings here - these can be ignored.)

While workflows can be executed directly with command-line arguments, defining inputs in a YAML file improves reusability and readability.

Executing a CWL workflow#

As a test run, we will execute the workflow_metagenomics_assembly.cwl workflow with a test YAML file found in the tests/ folder. YAML file: tests/assembly/hybrid_small.yaml

identifier: hybrid_TEST
threads: 4
memory: 4000
run_spades: true
run_flye: true
binning: false
metagenome: true
keep_filtered_reads: true
run_medaka: false
ont_basecall_model: r941_min_hac_g507
nanopore_reads:
   - class: File
     path: http://download.systemsbiology.nl/unlock/cwl/test_data/long_reads_high_depth.fastq.gz
illumina_forward_reads:
   - class: File
     path: http://download.systemsbiology.nl/unlock/cwl/test_data/short_reads_1.fastq.gz
illumina_reverse_reads:
   - class: File
     path: http://download.systemsbiology.nl/unlock/cwl/test_data/short_reads_2.fastq.gz
filter_references:
   - class: File
     path: http://download.systemsbiology.nl/unlock/cwl/test_data/human_small.fa.gz

Parameter descriptions#

identifier: Prefix for naming the output files and folders
memory: Memory allocation passed to tools that have their own memory option (this is not a general limit on the whole workflow)
threads: Number of CPU threads to use in tools that have multithreading support
run_spades: Enable SPAdes assembler (hybrid)
run_flye: Enable Flye assembler
binning: If true, runs the binning workflow
metagenome: Indicates whether the sample is a metagenome (affects assembler behaviour)
run_medaka: Enable Medaka polishing of the ONT assembly (set to false in this test)
ont_basecall_model: Required for Medaka
nanopore_reads, illumina_forward_reads and illumina_reverse_reads: Paths to read files (local or HTTP-accessible)

Parameter order in the YAML (or as input arguments) does not affect the workflow behavior.
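
To run the workflow on your own data, one option is to write a job file modelled on the test YAML above. The sketch below is illustrative only: the file name `my_assembly.yaml`, the identifier `my_sample` and all `/data/...` paths are placeholders, and the input keys are copied from the test file rather than from the workflow definition. The `--help` output of the workflow remains the authoritative list of inputs.

```shell
#!/bin/sh
# Write a job file modelled on tests/assembly/hybrid_small.yaml.
# The file name, identifier and /data/... paths are placeholders;
# threads and memory are kept at the test values and should be
# adjusted to your server.
job_file="my_assembly.yaml"

cat > "$job_file" <<'EOF'
identifier: my_sample
threads: 4
memory: 4000
run_spades: true
run_flye: true
binning: false
metagenome: true
keep_filtered_reads: true
run_medaka: false
ont_basecall_model: r941_min_hac_g507
nanopore_reads:
   - class: File
     path: /data/my_sample/long_reads.fastq.gz
illumina_forward_reads:
   - class: File
     path: /data/my_sample/short_reads_1.fastq.gz
illumina_reverse_reads:
   - class: File
     path: /data/my_sample/short_reads_2.fastq.gz
filter_references:
   - class: File
     path: /data/references/reference_to_filter.fa.gz
EOF

echo "Wrote $job_file with identifier: $(grep '^identifier:' "$job_file" | cut -d' ' -f2)"
```

The resulting file can be passed to cwltool in place of the test YAML. The quoted heredoc delimiter ('EOF') keeps the shell from expanding anything inside the YAML body.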

Run the workflow#

cwltool --outdir assembly_test cwl/workflows/workflow_metagenomics_assembly.cwl cwl/tests/assembly/hybrid_small.yaml

This test dataset is very small and does not produce meaningful results. Output files (assemblies, quality reports, etc.) are saved in the assembly_test directory, as set by the cwltool flag --outdir.

(Test data source: Unicycler assembler sample_data)

cwltool tips#

--cachedir . → Allows resuming a failed run by keeping intermediate outputs.
--tmpdir-prefix . → Sets a custom location for temporary files, which can be quite large.
--provenance <folder> → Captures workflow execution details in the given folder. This is used by default in the UNLOCK infrastructure.
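
As an illustration, two of the flags above can be combined with the test run in a single invocation. The sketch below only assembles and prints the command. The directory names `cwl_cache` and `cwl_tmp/` are placeholders chosen for this example, and the command is executed only when the hypothetical variable `RUN_WORKFLOW` is set to 1. `--provenance` is left out because its interaction with the other flags has not been checked here.

```shell
#!/bin/sh
# Assemble the test command with optional flags from the tips list.
# cwl_cache and cwl_tmp/ are placeholder directory names.
workflow="cwl/workflows/workflow_metagenomics_assembly.cwl"
job_file="cwl/tests/assembly/hybrid_small.yaml"

# Store the command as positional parameters so each argument stays intact.
set -- cwltool \
    --outdir assembly_test \
    --cachedir cwl_cache \
    --tmpdir-prefix cwl_tmp/ \
    "$workflow" "$job_file"

cmd="$*"
echo "Command: $cmd"

# Execute only on request, so the sketch can be used as a dry run.
if [ "${RUN_WORKFLOW:-0}" = "1" ]; then
    "$@"
fi
```

Running it with `RUN_WORKFLOW=1` from the directory that contains the cloned cwl/ folder starts the workflow; without the variable it only prints the command for inspection.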

A more detailed explanation of data provenance, the workflows and their use cases will be available soon.