Setup#
To use the UNLOCK FAIR Data Platform computational workflows, follow the steps below to install required tools, clone the repository, and run a test workflow.
Requirements#
- Docker 
- A CWL runner (we use cwltool) 
- The UNLOCK CWL repository: m-unlock/cwl/ 
- A (powerful) server, depending on your workflow and inputs 
Installation steps#
- Install Docker: Follow the installation guide: Get Docker - dockerdocs. Make sure Docker is running before executing the workflows. 
- Install a CWL Runner: For our workflows we use the reference runner cwltool. Installation instructions are available on the CWL GitHub. Other CWL runners exist, but we have not tested them yet. 
- Clone the UNLOCK CWL Repository: Clone the full CWL repository with the following command:
git clone https://gitlab.com/m-unlock/cwl.git
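Before moving on, it can help to confirm that the tools from the steps above are on your PATH. The following is a minimal sketch and not part of the UNLOCK repository: the helper name check_tool is made up for illustration, and git is included only because the clone step uses it.

```shell
#!/bin/sh
# Sketch: report whether the commands used in this setup are on PATH.
# check_tool is a hypothetical helper, not part of the UNLOCK repository.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

missing=0
for tool in docker cwltool git; do
    check_tool "$tool" || missing=$((missing + 1))
done
echo "$missing required tool(s) missing"

# If Docker is installed, also check that the daemon is reachable.
if command -v docker >/dev/null 2>&1; then
    if docker info >/dev/null 2>&1; then
        echo "ok: Docker daemon is running"
    else
        echo "warning: Docker is installed but the daemon does not respond"
    fi
fi
```

The script only reports what it finds and does not install anything.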
Workflow execution#
Repository structure#
The workflows/ folder in the repository contains multi-step workflows that combine the tools from the tools/ folder. Each major workflow is published at WorkflowHub.
In this section, we will test the Metagenomic Assembly Workflow, which is a composite workflow that calls other workflows as nested subworkflows.
Note: do not change the folder structure, as the workflow internally calls the CWL files using relative paths.
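As a quick sanity check of the folder structure, the sketch below tests that the folders named in this guide (workflows/, tools/ and tests/) are still present in the clone. The helper name check_layout is hypothetical, and the default folder name cwl assumes the repository was cloned without renaming it.

```shell
# Sketch: verify that the cloned repository still has the folders this guide mentions.
# check_layout is a hypothetical helper; "cwl" is the default clone directory name.
check_layout() {
    root="${1:-cwl}"
    for sub in workflows tools tests; do
        if [ ! -d "$root/$sub" ]; then
            echo "missing folder: $root/$sub"
            return 1
        fi
    done
    echo "layout ok: $root"
}

# Run the check against the clone in the current directory.
check_layout cwl || echo "Consider re-cloning the repository."
```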
To check all of the possible input options of the workflow, run:
cwltool cwl/workflows/workflow_metagenomics_assembly.cwl --help
(You may see warnings here - these can be ignored.)
While workflows can be executed directly with command-line arguments, defining inputs in a YAML file improves reusability and readability.
Executing a CWL workflow#
As a test run, we will execute the workflow_metagenomics_assembly.cwl workflow with a test YAML file found in the tests/ folder.
YAML file: tests/assembly/hybrid_small.yaml
identifier: hybrid_TEST
threads: 4
memory: 4000
run_spades: true
run_flye: true
binning: false
metagenome: true
keep_filtered_reads: true
run_medaka: false
ont_basecall_model: r941_min_hac_g507
nanopore_reads:
   - class: File
     path: http://download.systemsbiology.nl/unlock/cwl/test_data/long_reads_high_depth.fastq.gz
illumina_forward_reads:
   - class: File
     path: http://download.systemsbiology.nl/unlock/cwl/test_data/short_reads_1.fastq.gz
illumina_reverse_reads:
   - class: File
     path: http://download.systemsbiology.nl/unlock/cwl/test_data/short_reads_2.fastq.gz
filter_references:
   - class: File
     path: http://download.systemsbiology.nl/unlock/cwl/test_data/human_small.fa.gz
Parameter descriptions#
- identifier: Prefix for naming the output files and folders 
- memory: Memory allocation passed to tools that have their own memory option; it is not a general limit on the workflow 
- threads: Number of CPU threads to use in tools that have multithreading support 
- run_spades: Enable SPAdes assembler (hybrid) 
- run_flye: Enable Flye assembler 
- binning: If true, runs the binning workflow 
- metagenome: Indicates whether the sample is a metagenome (affects assembler behaviour) 
- run_medaka: Enable Medaka polishing of ONT assemblies (set to false in this test) 
- ont_basecall_model: ONT basecalling model, required for Medaka 
- nanopore_reads, illumina_forward_reads and illumina_reverse_reads: Paths to read files (local or HTTP-accessible) 
Parameter order in the YAML (or as input arguments) does not affect the workflow behavior.
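Since read paths may be local as well as HTTP-accessible, the test file can be adapted for your own data. The sketch below writes such a copy with a shell here-document. The parameter names and values come from the test YAML above; the output file name my_assembly.yaml, the identifier my_sample and every path under /data/my_sample/ are placeholders to replace with your own.

```shell
# Sketch: write a copy of the test job file that points at local read files.
# The output file name and every path under /data/my_sample/ are placeholders.
job_file="my_assembly.yaml"
cat > "$job_file" <<'EOF'
identifier: my_sample
threads: 4
memory: 4000
run_spades: true
run_flye: true
binning: false
metagenome: true
keep_filtered_reads: true
run_medaka: false
ont_basecall_model: r941_min_hac_g507
nanopore_reads:
   - class: File
     path: /data/my_sample/long_reads.fastq.gz
illumina_forward_reads:
   - class: File
     path: /data/my_sample/short_reads_1.fastq.gz
illumina_reverse_reads:
   - class: File
     path: /data/my_sample/short_reads_2.fastq.gz
filter_references:
   - class: File
     path: /data/my_sample/human_small.fa.gz
EOF
echo "wrote $job_file"
```

The resulting file can then be passed to cwltool in place of the test YAML.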
Run the workflow#
cwltool --outdir assembly_test cwl/workflows/workflow_metagenomics_assembly.cwl cwl/tests/assembly/hybrid_small.yaml
This test dataset is very small and does not produce meaningful results. Output files (assemblies, quality reports, etc.) will be saved in the directory assembly_test, as set by the cwltool flag --outdir.
(Test data source: Unicycler assembler sample_data)
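If you run the test repeatedly, a small wrapper can make the output folder configurable and report failures explicitly. This is a sketch under assumptions: run_assembly_test is a hypothetical helper name, and it expects to be called from the directory that contains the cloned cwl/ folder.

```shell
# Sketch: wrap the test command so the output folder can be changed and failures are reported.
# run_assembly_test is a hypothetical helper name; the paths are the ones used above.
run_assembly_test() {
    outdir="${1:-assembly_test}"
    if cwltool --outdir "$outdir" \
        cwl/workflows/workflow_metagenomics_assembly.cwl \
        cwl/tests/assembly/hybrid_small.yaml; then
        echo "workflow finished, outputs in $outdir"
    else
        status=$?
        echo "workflow failed with exit code $status" >&2
        return "$status"
    fi
}
```

Calling run_assembly_test without arguments writes to assembly_test; run_assembly_test my_output_dir writes to my_output_dir instead.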
cwltool tips:#
- --cachedir . → Allows resuming a failed run by keeping intermediate outputs.
- --tmpdir-prefix . → Sets a custom location for temporary files, which can be quite large.
- --provenance → Captures workflow execution details. This is used by default in the UNLOCK infrastructure.
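The sketch below shows how these flags could be added to the test command. The folder names cwl_cache, cwl_tmp and assembly_provenance are placeholders, and the helper names are hypothetical. The two variants are kept separate because this guide does not say whether all three flags can be combined in one run.

```shell
# Sketch: the test run with a cache folder and a custom temp location.
# The folder names cwl_cache and cwl_tmp are placeholders.
run_with_cache() {
    cwltool --cachedir "$PWD/cwl_cache" \
        --tmpdir-prefix "$PWD/cwl_tmp/" \
        --outdir assembly_test \
        cwl/workflows/workflow_metagenomics_assembly.cwl \
        cwl/tests/assembly/hybrid_small.yaml
}

# Sketch: the same test run, writing provenance to a folder (placeholder name).
run_with_provenance() {
    cwltool --provenance "$PWD/assembly_provenance" \
        --outdir assembly_test \
        cwl/workflows/workflow_metagenomics_assembly.cwl \
        cwl/tests/assembly/hybrid_small.yaml
}
```

Re-running run_with_cache after a failure reuses the same cache folder, which is what lets cwltool pick up the intermediate outputs it kept.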
A more detailed explanation of data provenance, the workflows and their use cases will be available soon.
