Workflow6: Workflow4 + Workflow5 + demo_mv #21

Closed
r06942072 opened this issue Nov 13, 2018 · 6 comments
@r06942072
Member

r06942072 commented Nov 13, 2018

Link:
https://github.com/NAL-i5K/CWL_Common-Workflow-Language/tree/dev/demo_workflow6

wget
checksums (Workflow5)
gunzip (Workflow4)
tree
mv

@r06942072 r06942072 self-assigned this Nov 13, 2018
@r06942072 r06942072 changed the title Workflow6: three things Workflow6: gitclone, wget, tree, mv Nov 14, 2018
@r06942072 r06942072 changed the title Workflow6: gitclone, wget, tree, mv Workflow6: Workflow4 + Workflow5 + demo_mv Nov 16, 2018
@hsiaoyi0504
Member

This comment is related to what I wrote earlier in #20.

The current usage of the pipeline looks like this:

cwl-runner 1st-workflow.cwl 1st-workflow-job.yml
cwl-runner block_wget.cwl 1st-workflow-job.yml
cwl-runner block_gunzip.cwl 1st-workflow-job.yml
cwl-runner block_gitclone.cwl 1st-workflow-job.yml
cwl-runner block_tree.cwl 1st-workflow-job.yml
cwl-runner block_mv.cwl 1st-workflow-job.yml

Although it would be easy to write another script that collects these commands together, using the pipeline this way looks much like not using CWL at all:

... do something
wget ...
gunzip ...
git clone ...
tree ...
mv ...

Then why do we need to use CWL?

The potential reason, as far as I can see (maybe I am wrong), is that the functionality covered by each block we define is too small. That makes it no different from running the commands one by one. As I imagine it, a block in a pipeline should group 5~10 commands together. For example, wget and gunzip should be put in the same block, together with creating the initial file directory.
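
As a rough illustration of such a coarser block, the sketch below groups directory creation, download, and decompression into one shell function. The function name `get_data`, its two arguments, and the file layout are hypothetical and are not taken from the repository.

```shell
#!/bin/sh
# Hypothetical coarse-grained block: create the data directory,
# download one gzipped file into it, and decompress it in place.
# Usage: get_data <url> <target_dir>
get_data() {
    url="$1"
    dir="$2"
    archive="$dir/$(basename "$url")"

    mkdir -p "$dir" || return 1                 # create the initial directory
    wget -q -O "$archive" "$url" || return 1    # download the archive
    gunzip -f "$archive" || return 1            # leaves the file without its .gz suffix
}
```

A call would look like `get_data "$URL" ./data`, where `$URL` stands for whichever gzipped file the step needs.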

@r06942072
Member Author

r06942072 commented Nov 26, 2018

  • The ultimate goal of CWL: do all the work with a single command.

  • Every demo_workflowX in the CWL repo is executed with a single command, like this:

cwl-runner 1st-workflow.cwl 1st-workflow-job.yml

  • We provide two arguments to the cwl-runner command.
    The first argument: a CWL file of class Workflow, which specifies all the external inputs and the steps.
    The second argument: a YAML file that specifies the input values.

  • Lego bricks (building blocks) and the Lego house (workflow)
    A workflow is basically a combination of building blocks.
    I imagine every building block in CWL as a Lego brick.
    We use a bunch of Lego bricks to build a Lego house.
    I prefer small bricks, because they make it possible to build a delicate Lego house.
    The benefit of small pieces is flexibility: we can build whatever kind of house we want.

  • Why is my design rule to wrap only a single command in each building block?
    So far, every building block is a CommandLineTool, which includes only one Linux command.
    That makes it much easier to develop and to maintain in the long term.
    We can reuse a building block by putting it into the corresponding step of any workflow.

  • One benefit of using CWL
    There is an online tool called CWL Viewer that makes a workflow easy to understand and to demo.
    CWL Viewer link: https://view.commonwl.org/
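
The brick-and-house idea can be sketched as follows. This is a minimal example under stated assumptions, not the repository's actual files: the file names (`brick_gunzip.cwl`, `brick_md5sum.cwl`, `house.cwl`, `house-job.yml`), the input and output ids, and the sample data are all hypothetical. It is written as a shell snippet so that it sits next to the `cwl-runner` command above; the run step is skipped if no CWL runner is installed.

```shell
#!/bin/sh
# All file names and ids below are hypothetical; only the CWL keywords
# (cwlVersion, class, baseCommand, inputs, outputs, steps, ...) are CWL's own.
workdir="${WORKDIR:-./lego_demo}"
mkdir -p "$workdir"

# Brick 1: a CommandLineTool wrapping the single command `gunzip -c`.
cat > "$workdir/brick_gunzip.cwl" <<'EOF'
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [gunzip, -c]
inputs:
  archive:
    type: File
    inputBinding:
      position: 1
outputs:
  unpacked:
    type: stdout
stdout: $(inputs.archive.nameroot)
EOF

# Brick 2: a CommandLineTool wrapping the single command `md5sum`.
cat > "$workdir/brick_md5sum.cwl" <<'EOF'
cwlVersion: v1.0
class: CommandLineTool
baseCommand: md5sum
inputs:
  target:
    type: File
    inputBinding:
      position: 1
outputs:
  checksum:
    type: stdout
stdout: checksum.md5
EOF

# House: a Workflow that wires the two bricks together.
cat > "$workdir/house.cwl" <<'EOF'
cwlVersion: v1.0
class: Workflow
inputs:
  archive: File
outputs:
  unpacked:
    type: File
    outputSource: unzip/unpacked
  checksum:
    type: File
    outputSource: verify/checksum
steps:
  unzip:
    run: brick_gunzip.cwl
    in:
      archive: archive
    out: [unpacked]
  verify:
    run: brick_md5sum.cwl
    in:
      target: unzip/unpacked
    out: [checksum]
EOF

# Job file: the second argument, giving a value for each workflow input.
cat > "$workdir/house-job.yml" <<'EOF'
archive:
  class: File
  path: sample.txt.gz
EOF

# Tiny sample input so the job file points at something real.
printf 'hello\n' | gzip > "$workdir/sample.txt.gz"

# One command runs the whole house, if a CWL runner is available.
if command -v cwl-runner >/dev/null 2>&1; then
    (cd "$workdir" && cwl-runner house.cwl house-job.yml)
fi
```

Each brick stays reusable: a different workflow could point its own step at `brick_md5sum.cwl` without touching the tool description.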

@hsiaoyi0504
Member

hsiaoyi0504 commented Nov 27, 2018

I still don't get why we need to create a wrapper for only one Linux command.

I agree that it is easy to develop, but in terms of long-term maintenance, a wrapper around a single command adds another layer of complexity and another possible source of bugs (compared with using the original Linux command directly). Does it really help maintainability, or does it hurt?

I prefer the small piece of lego brick, because it is able to build a delicate lego house.
The benefit of using small piece is that it can provide flexibility to achieve whatever kind of house we want to build.

I totally agree with that. However, it seems to me that wrapping a single Linux command doesn't always increase flexibility. In the gunzip case it does, because it lets us decide where the file should be placed, but I don't see such a benefit for mv, cp, wget, and tree.

@r06942072
Member Author

r06942072 commented Nov 27, 2018

This issue is worth discussing:
What should the basic unit look like in CWL?

In my design, the reason there is only one Linux command per block is simple.
The original intention of the CommandLineTool class in CWL seems to be to wrap a single tool, named in the 'baseCommand' field, so that tools stay isolated from each other.

Helpful Link:
https://www.biostars.org/p/229095/

@hsiaoyi0504
Member

hsiaoyi0504 commented Nov 27, 2018

What is the basic unit should look like in cwl ?

It's a great question, but I don't think there is a perfect answer to it.

Although I don't think there is a general answer, in our case I do have an opinion, based on projects similar to ours (see the links below). In my opinion, each block in this project should carry a somewhat high-level meaning rather than reflect how it is implemented.

My observation comes from other existing projects:

The basic units there are things like prep_align_input, process_align, and postprocess_align.
It's similar to how we prepare a block diagram for a paper. Would you put moving (mv) or copying (cp) one file to a directory as a step in your paper's block diagram?

We can probably make use of what we already have. Based on our internal wiki, organism on-boarding has been divided into several steps, and it seems to me that no step would require more than 5 blocks. For example, the first step, set up data directories and get data, can be divided into at least two blocks (set_up_data_directories and get_data), but we probably don't want to implement a block that only runs wget on one of the data files the pipeline requires.

Another thought is that each block should be unit-testable and worth testing (see https://github.com/ncbi/pgap/tree/master/progs/unit_tests).
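
To show what "unit-testable" could mean for a block of this size, here is a minimal sketch. The block name `set_up_data_directories` comes from the step described above, but its body, the subdirectory names (`raw`, `processed`), and the test function are assumptions made up for illustration.

```shell
#!/bin/sh
# Hypothetical block: create the directory layout for one organism.
# The subdirectory names are placeholders, not the project's real layout.
set_up_data_directories() {
    root="$1"
    mkdir -p "$root/raw" "$root/processed"
}

# Hypothetical unit test: run the block in a scratch directory and check
# the observable result, independent of how the block is implemented.
test_set_up_data_directories() {
    scratch=$(mktemp -d) || return 1
    status=0
    set_up_data_directories "$scratch/organism" || status=1
    [ -d "$scratch/organism/raw" ] || status=1
    [ -d "$scratch/organism/processed" ] || status=1
    rm -rf "$scratch"
    return "$status"
}

if test_set_up_data_directories; then
    echo "PASS set_up_data_directories"
else
    echo "FAIL set_up_data_directories"
fi
```

A block that only ran `mkdir` on a single path would hardly be worth a test like this, which is one way to judge whether a block is the right size.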

@r06942072
Member Author

I think I get your point.
I agree that, in the end, each block should have a high-level meaning rather than reflect how it is implemented.
But I would first focus on making sure CWL really works and helps the automatic organism onboarding pipeline.
Once the workflow is functional, the next step is to wrap the blocks into nicer units, and I believe there is a way to do that:
for example, the SubworkflowFeatureRequirement provided in CWL, which lets us build a workflow of workflows.
We can connect several blocks into a workflow and declare it as one unit when we demo the pipeline.
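
A minimal sketch of that idea, assuming CWL v1.0: an inner workflow groups a fine-grained step under a high-level name, and an outer workflow uses it as a single step. The file names (`get_data.cwl`, `pipeline.cwl`), the ids, and the choice of `gunzip -c` as the inner step are hypothetical; only `SubworkflowFeatureRequirement` and the other CWL keywords come from CWL itself.

```shell
#!/bin/sh
# File names and ids are hypothetical placeholders.
workdir="${WORKDIR:-./subworkflow_demo}"
mkdir -p "$workdir"

# Inner workflow: fine-grained steps grouped under one high-level name.
# The tool is embedded inline here only to keep the sketch short.
cat > "$workdir/get_data.cwl" <<'EOF'
cwlVersion: v1.0
class: Workflow
inputs:
  archive: File
outputs:
  data:
    type: File
    outputSource: unzip/unpacked
steps:
  unzip:
    run:
      class: CommandLineTool
      baseCommand: [gunzip, -c]
      inputs:
        archive:
          type: File
          inputBinding:
            position: 1
      outputs:
        unpacked:
          type: stdout
      stdout: unpacked.txt
    in:
      archive: archive
    out: [unpacked]
EOF

# Outer workflow: uses the inner workflow as one step, which requires
# declaring SubworkflowFeatureRequirement.
cat > "$workdir/pipeline.cwl" <<'EOF'
cwlVersion: v1.0
class: Workflow
requirements:
  SubworkflowFeatureRequirement: {}
inputs:
  archive: File
outputs:
  data:
    type: File
    outputSource: get_data/data
steps:
  get_data:
    run: get_data.cwl
    in:
      archive: archive
    out: [data]
EOF

# Optional syntax check, only if cwltool happens to be installed.
if command -v cwltool >/dev/null 2>&1; then
    cwltool --validate "$workdir/pipeline.cwl"
fi
```

In a diagram of `pipeline.cwl`, only the `get_data` step appears at the top level, while the single-command steps stay inside `get_data.cwl`.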
