-
Notifications
You must be signed in to change notification settings - Fork 638
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Offload publishing to separate jobs #5590
Comments
I'd suggest to focus on POC for S3 only. The copy should be done by adding behind the scene a regular nextflow process that will take care of copying files across buckets. When a file needs to be copied from a S3 path to another S3 pat, the file path is sent to the "copy" process and the copyPath is skipped. The process could be defined as process copyTask {
container 'public.cr.seqera.io/wave/s5cmd:v2.2.2'
input:
tuple path(source), path(target)
'''
s5cmd cp $source $target
'''
} In a second iteration, we can see how to use fusion in place of s5cmd, and batching multiple copies in the same copy task run. |
Here is the code in nf-boost for implementing a process on-the-fly: https://github.com/bentsherman/nf-boost/blob/main/plugins/nf-boost/src/main/nextflow/boost/ops/ExecOp.groovy It might be overkill because it will modify the DAG. But it's a good summary of how a process is constructed end-to-end Instead I would start by copying the bits that you need from TaskProcessor into a separate thing. I can show you how to construct the nextflow/modules/nextflow/src/main/groovy/nextflow/processor/TaskProcessor.groovy Line 623 in b23e42c
We'll also need to handle the task finalization. The task monitor normally calls |
Following the example provided by @bentsherman, I have been able to do a initial PoC with minimal changes. I have programmatically created the TaskProcessor and invoke a task when a copy is requested. I have currently tested locally (forcing to copy files with mode:'copy') and doing the cp in a task instead of calling FileHelper.copyPath. The new process appear in the list but not in the DAG. We can comment in next meeting to check if I miss something important before doing a first PR. In the next step I am going to modify it to support the s3 with s5cmd as @pditommaso suggested |
The publishing of workflow outputs is performed natively by Nextflow with a thread pool in PublishDir. This logic is used both for the
publishDir
directive and the workflow output definition.The publish throughput is limited by the thread pool size, which by default is cpus * 3. This limit can be increased, but even then, Nextflow can be limited by the network bandwidth and API limits for the head node. For example, an EC2 instance will have a limited amount of network bandwidth and API limits against S3 which will limit the number of concurrent API calls that it can effectively perform, even if it is only doing an S3-to-S3 copy.
The goal of this issue is to offload publishing operations to separate jobs, in order to increase the total amount of network bandwidth and API calls available to the pipeline.
Try to reuse the existing logic for submitting compute jobs as much as possible
As a first iteration, just use Fusion and don't worry about mounting the various CLI tools for object storage providers
As a first iteration, focus on AWS / Azure / Google, especially AWS. This feature is intended mainly for the cloud executors.
Try to batch publish operations, for now we can expose a config setting for the batch size and tune it manually
Failed publish jobs should be retried in a similar manner to PublishDir
Out of scope (possible future goals):
Supporting without Fusion. In the future we could use Wave to provide the cloud CLI tools as we do for Fargate.
Offloading the staging of remote inputs (i.e. FilePorter). The FilePorter typically isn't needed when using Fusion, so probably would only be useful if we relax the Fusion requirement
cc @pditommaso for comments
@jorgee let me know if you need any further clarifications
The text was updated successfully, but these errors were encountered: