Execution workflows

This is where the execution workflows for APAeval live.

Overview

Execution workflows contain all steps that need to be run per method:

  1. Pre-processing: Convert the input files the APAeval team has prepared into the input files a given method consumes, if applicable. This does not include steps such as adapter trimming or read mapping, as those are already performed in our general pre-processing pipeline. Pre-processing here means converting the provided .bam, .fastq.gz, .gtf, or .gff files into a format your method can use.
  2. Method execution: Execute the method in any way necessary to compute the output files for all challenges (may require more than one run of the tool if, e.g., run in different execution modes).
  3. Post-processing: Convert the output files of the method into the formats consumed by the summary workflows as specified by the APAeval team, if applicable.

Execution workflows should be implemented in either Nextflow or Snakemake, and individual steps should be isolated in Docker/Singularity containers (Conda virtual environments are deprecated, since running on AWS requires containerized workflows).
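
As a rough illustration, a Snakemake skeleton covering the three steps could look like the sketch below. All rule names, file paths, commands, and the container image here are placeholders invented for this example, not part of the APAeval specification:

# Hypothetical three-step skeleton; every name, path, command, and
# image below is a placeholder to be replaced for a real method.

rule preprocess:
    # Step 1: convert the provided input files into the method's format
    input:
        "inputs/sample.bam"
    output:
        "preprocessed/sample.method_input.txt"
    container:
        "docker://example/method:1.0"  # placeholder image
    shell:
        "convert_input {input} > {output}"  # placeholder command

rule run_method:
    # Step 2: execute the method itself
    input:
        "preprocessed/sample.method_input.txt"
    output:
        "method_out/sample.raw.txt"
    container:
        "docker://example/method:1.0"
    shell:
        "method --in {input} --out {output}"  # placeholder command

rule postprocess:
    # Step 3: reformat the method's output for the summary workflows
    input:
        "method_out/sample.raw.txt"
    output:
        "results/sample.bed"
    container:
        "docker://example/method:1.0"
    shell:
        "reformat_output {input} > {output}"  # placeholder command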

Templates

To implement an execution workflow for a method, copy either the snakemake template or the nextflow template into the method's directory and adapt the workflow directory names as described in the template's README. Don't forget to adapt the README itself as well.

Example:

execution_workflows/
 |--QAPA/
      |--QAPA_snakemake/
           |--workflow/Snakefile
           |--config/config.QAPA.yaml
           |--envs/QAPA.yaml
           |--envs/QAPA.Dockerfile
           |-- ...
 |--MISO/
      |--MISO_nextflow/
           |-- ...

Input

Test data

For development and debugging you can use the small test input dataset we provide with this repository. Use the .bam, .fastq.gz, .gtf, and/or .gff files as input to your workflow; the .bed file serves as an example of a ground-truth file.

Parameters

Both the Snakemake template and the Nextflow template contain an example samples.csv file. Here you fill in the paths to the samples you want to run, plus any other sample-specific information required by the method you're implementing. You can, and must, adapt the fields of this samples.csv to your workflow's requirements.

Moreover, both workflow languages require additional information in config files. This is the place to specify run- or method-specific parameters.
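
As a minimal, hypothetical sketch of how a Snakemake workflow might consume these two files (the config key "samples_csv" and the column name "sample" are assumptions for this example; adapt them to your method):

# Hypothetical sketch: sample-specific info comes from samples.csv,
# run- and method-specific parameters from the config file.
import pandas as pd

configfile: "config/config.QAPA.yaml"

# "samples_csv" and the "sample" column are assumed names
samples = pd.read_csv(config["samples_csv"]).set_index("sample")

rule all:
    input:
        expand("results/{sample}.bed", sample=samples.index)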

Important notes:

  • Describe extensively in your README where parameters (sample info, method-specific parameters) have to be specified for a new run of the pipeline.
  • Parameterize your code as much as possible, so that the user only has to change the sample sheet and config file, never the code. For example, output file paths should be built from information the user has filled into the sample sheet or config file.
  • For information on how files need to be named, see below!

Output

In principle you are free to store output files however best suits you (or the method). However, the "real" and final outputs of each benchmarking run will need to be copied to a directory of the form
PATH/TO/s3-BUCKET/PARAMCODE/METHOD/

This directory must contain:

  • Output files (check formats and filenames)
  • Configuration files (with parameter settings), e.g. config.yaml and samples.csv.
  • logs/ directory with all log files created by the workflow execution.
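
For example, using the MISO run described under "Filenames" below, such a directory could look like this:

PATH/TO/s3-BUCKET/AA/MISO/
 |--AA_MISO_01.bed
 |--config.yaml
 |--samples.csv
 |--logs/
      |-- ...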

Formats

File formats for the 3 challenges are described in the output specification, which also contains the OUTCODE needed for correct naming.

Filenames

As mentioned above, it is best to parameterize filenames, such that for each run the names and codes can be set by changing only the sample sheet and config file!

File names must adhere to the following schema: PARAMCODE_METHOD_OUTCODE.ext
For the codes please refer to the following documents: execution_output_specification.md (OUTCODE) and summary_input_specification.md (PARAMCODE).

Example:
AA/MISO/AA_MISO_01.bed would be the output of MISO (your method) for the identification challenge (OUTCODE 01, which we know from execution_output_specification.md), run on dataset "P19" using 4 cores (PARAMCODE AA, which we know from summary_input_specification.md).
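
To tie this back to the parameterization advice above, here is a hypothetical Snakemake fragment that assembles such a name purely from config values. The config keys param_code, method, and out_code, as well as the input path and command, are invented for this sketch:

# Assumed entries in config.yaml:
#   param_code: "AA"
#   method: "MISO"
#   out_code: "01"
configfile: "config.yaml"

out_name = f"{config['param_code']}_{config['method']}_{config['out_code']}.bed"

rule final_output:
    input:
        "method_out/merged.raw.txt"  # placeholder path
    output:
        f"results/{out_name}"
    shell:
        "reformat_output {input} > {output}"  # placeholder command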