Resume not working on Azure Batch #3605
Comments
Thanks for letting us know about this issue @alliemclean, allow me some time to circle back once I've investigated this further.
Hey @alliemclean, circling back on this one. After a lot of attempts, unfortunately I'm not seeing this occur in my environment. That said, I did tweak the configuration file to remove the fileshare - which shouldn't affect anything in the pipeline, since it isn't referenced in the analysis.
Hi @abhi18av, I tried removing the fileshare and it still didn't resume. I'm having the same problem with 2 out of 3 pipelines I'm running. The third one resumes one of the jobs in the pipeline but not the rest. The difference between the re-run jobs and the correctly resumed job is that the one that resumed only has 1 output file, whereas the others have multiple. Could this be the issue? Is there a way to check whether the cache has a problem with the process's output? Could there be anything within our Azure environment that is causing the difference in this behavior?
Mmm, that might be the case - but it could also depend on the design of those two pipelines.
Hmm, I don't particularly see why this should break the caching behavior 🤔 Is it a public NF pipeline you are running? Perhaps an nf-core one which I could use to test things as well? CC @vsmalladi for any inputs/prior experience here?
It's not an nf-core one, but I suppose I could see if I can reproduce it with one, although the example I've given is complete and sufficient to cause the issue I'm having on my end. I can try to reproduce it with this module, snpsift_split, when I have time.
@alliemclean can you quote the az parts?
@alliemclean Are you running different pipelines (with and without "-resume") from the same folder? Nextflow gets some values from the .nextflow/history file, so if that's the case, try running the Azure pipeline from a different folder. Also try running the pipeline with the "-trace nextflow" option, e.g. nextflow -trace nextflow run hello.nf -resume ....
I think the problem may be connected with the exit code for the task. Only "0" is a success code, and the exit code is one of the components that allows a task to be treated as cacheable. The log in the first screenshot is OK.
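A quick way to verify this is to look at the .exitcode file that Nextflow writes into each task work directory; a task can only be reused on resume if it recorded 0. A minimal sketch, assuming a hypothetical storage account, container, and task hash taken from your own run log:

# Download the recorded exit status for one task from the Azure work directory
# (account, container, and blob path are placeholders - substitute your own values)
az storage blob download \
  --account-name mystorageaccount \
  --container-name work \
  --name 2d/10f3f31f8dd2a93a04895f2d6dfcd7/.exitcode \
  --file exitcode.txt
cat exitcode.txt   # should print 0 for a task that can be resumed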
It was successful, but I ran it many times, so this one could have been after an unsuccessful run. Here's another try after a successful run, but Cacheable folder=null. Is there a reason it may not pick it up? The actual run folder doesn't have the az:// prefix:

Apr-21 17:33:45.559 [Actor Thread 6] TRACE nextflow.processor.TaskProcessor - Poison pill arrived; port: 0
Hi everyone, I have the same issue. I am running a very simple script with Azure Batch and it is not being cached. In my case, I believe that my Nextflow run identifies the previously successful run of the workflow, but for some reason does not find the output file (.txt), although it is in the workdir (on the Azure storage).
I assume an explanation could be that Nextflow is trying to look at a local workDir folder instead of the one in the cloud (az://nextflow-scratch/work/2d/10f3f31f8dd2a93a04895f2d6dfcd7).
I may have found the problem here. Could the people seeing this confirm which type of storage account they are using? Are they using Data Lake Gen2 storage? If so, could you have a look inside the cache dir on Azure storage and tell me if there are any contents?
We are using ADLS Gen2. The cache folder (.nextflow/cache) is only generated locally and does not appear on the Azure storage. Could that be the problem? The workDir and its contents are present on the Azure storage (config setting: ...). After running a more complex RNAseq workflow, it is interesting to note that only the first 2 processes are cached. It is unclear why exactly those are recognized as "already ran" while all the others are not.
I'm not 100% sure that Data Lake storage is fully compatible with Blob storage.
Yes, I think so. I will generate a reproducible example shortly. As Paolo says, Azure Data Lake is not 100% compatible with Blob storage despite being built on top of it, and this breaks the caching system of Nextflow. Unfortunately, Microsoft glosses over this in their documentation.
@adamrtalbot, I think it might have to do with the hierarchical namespace. We could set up an engineering meeting and see if we can solve this.
I'm fairly sure it's because the ...
If you run a Nextflow task on Azure using Data Lake Gen2 storage, it will work fine, but Nextflow will be unable to back up the cache:
If I clear out the remote directory and delete all ...
However, Nextflow has added lots of ...
I can't say for sure this is the problem, but it seems likely. For now, use a normal Azure Blob storage account, and only use a Data Lake if you don't mind losing the resume function.
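If you want to check which kind of account you are on, the Azure CLI exposes the hierarchical-namespace flag that distinguishes Data Lake Gen2 from plain Blob storage. A small sketch; the account name is a placeholder:

# Prints "true" for Data Lake Gen2 (hierarchical namespace enabled),
# "false" for an ordinary Blob storage account
az storage account show \
  --name mystorageaccount \
  --query isHnsEnabled \
  --output tsv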
Adam, can you try a run with the cloud cache? This will save the cache directly in object storage, so there is no need to use the cache backup/restore commands. To use it, simply set:

export NXF_CLOUDCACHE_PATH="az://my-bucket/cache"
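For reference, a complete invocation along those lines might look like this (the container path and script name are placeholders):

# Point the cloud cache at a container path, then launch and resume as usual
export NXF_CLOUDCACHE_PATH="az://my-bucket/cache"
nextflow run main.nf -c azure.config -resume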
OK, first run of Nextflow hello with NXF_CLOUDCACHE_PATH set, then run again with -resume:
Sorry, I only just realized that the cloud cache isn't being auto-enabled. I just made a PR to fix it, but for now you also have to enable the plugin explicitly:

plugins {
    id 'nf-cloudcache'
}

You should see the following line in your log:
Looks good
Paolo is preparing a launcher image so that we can use the cloud cache in Tower... it will likely become the default for all cloud providers, which should resolve this issue. Nextflow-only users can also use the cloud cache with the settings above.
I have just tried the solution you proposed and it works:
The caching then works. Thank you for the solution! Another comment I have is regarding the 'nf-cloudcache' plugin. We usually develop workflows locally on premises, with one config profile for local execution and another profile for Azure Batch execution. When the 'nf-cloudcache' plugin is specified in the config and we do a local execution without specifying a cloud cache dir, we get: ...
A solution would be to make plugins conditional on a chosen config profile, but this is currently not possible.
Hi @FabianDK, we just made a change to do exactly this: the plugin will be enabled automatically if NXF_CLOUDCACHE_PATH is set.
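With that change, one possible way to get the behaviour asked for above is to keep the plugin out of the config entirely and only export NXF_CLOUDCACHE_PATH when launching the Azure profile. A minimal sketch, assuming hypothetical profile names and paths:

// nextflow.config - one profile for local runs, one for Azure Batch
profiles {
    standard {
        process.executor = 'local'
    }
    azure {
        process.executor = 'azurebatch'
        workDir = 'az://work/'
    }
}

// Local run, no cloud cache involved:
//   nextflow run main.nf -profile standard
// Azure run, exporting the variable auto-enables nf-cloudcache:
//   export NXF_CLOUDCACHE_PATH="az://my-bucket/cache"
//   nextflow run main.nf -profile azure -resume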
@alliemclean is still reporting this error on Slack. I've been unable to reproduce it with the following example (test data from the original post).

main.nf:

#!/usr/bin/env nextflow
/**********************************************************************
* Parameter defaults
**********************************************************************/
params.lines=100000
params.in = "az://container/GCA_000001215.4_current_ids.vcf.gz"
params.name = "test_name"
/**********************************************************************
* Splits a VCF file in multiple files.
* One file per params.lines variants
**********************************************************************/
process split {
machineType "Standard_E*d_v5" // Not important, just for our system
cpus 1
memory '1GB'
container 'lethalfang/tabix:1.10'
cache 'deep'
input:
path(f)
output:
file("${params.name}_*.vcf")
shell:
"""
echo "Decompressing ${f}"
bgzip -cd ${f} > temp.vcf
sed -n '/^#/p' temp.vcf > header.vcf
echo "Splitting temp.vcf"
split -l ${params.lines} temp.vcf vcf_
echo "Done splitting"
counter=0
for file in vcf_*
do
sed -i '/^#/d' \$file
cat header.vcf \$file > ${params.name}_\$counter.vcf
let counter=counter+1
done
"""
}
/*********************************************************************
* Main workflow
*********************************************************************/
workflow {
// General genomics inputs
def vcf = Channel.fromPath(params.in, checkIfExists: true)
main:
// Split the VCF by number of lines (ie., 50000)
split(vcf)
}

nextflow.config:

process.executor = 'azurebatch'
azure {
storage {
accountName = 'mystorageaccount'
}
batch {
location = 'eastus'
accountName = 'mybatchaccount'
copyToolInstallMode = 'node'
autoPoolMode = true
allowPoolCreation = true
deletePoolsOnCompletion = false
auto {
autoScale = false
vmCount = 1
}
}
activeDirectory {
servicePrincipalId = "$AZURE_DIRECTORY_TENANT_ID"
servicePrincipalSecret = "$AZURE_SERVICE_PRINCIPAL_SECRET"
tenantId = "$AZURE_APPLICATION_TENANT_ID"
}
}
workDir = 'az://work/'
Hello guys! We fixed our initial problem of not being able to resume by using NXF_CLOUDCACHE_PATH, but now we can't resume the pipeline because the pools are deleted upon completion.
This will re-use the existing pools while not costing you any additional money:
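The snippet referred to above isn't shown, but a sketch of the kind of settings it points at, reusing the option names already present in the config earlier in this thread, would be to keep the pools and let autoscaling shrink them when idle (values are illustrative):

azure {
    batch {
        // Keep pools after the run so a later -resume can pick them up again
        deletePoolsOnCompletion = false
        autoPoolMode = true
        allowPoolCreation = true
        auto {
            // With autoscaling on, an idle pool should shrink back down,
            // so a retained pool should not keep incurring compute cost
            autoScale = true
            vmCount = 1
        }
    }
}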
Bug report
The -resume option works locally but not with Azure Batch. I've gone through all the troubleshooting steps and can't find the reason the cache is invalidated.
Expected behavior and actual behavior
-resume should use the cache and not re-process cached jobs.
Steps to reproduce the problem
link to input
nextflow run pipeline_test.nf \
    --name "resume_test_azure" \
    --in az://pipeline/input/GCA_000001215.4_current_ids.vcf.gz \
    --lines 10000 \
    --outdir az://pipeline/resume_results/ \
    -w az://pipeline/resume_working/ \
    -c azure.config \
    -dump-hashes \
    -resume
Program output
Uploaded logs for local and batch runs, run command lines, and config files:
test_resume.tar.gz
Environment
Additional context
-resume works with local execution but not with Azure. The cache hashes are all the same as far as I can tell.
-resume is also working on Batch (same account, same settings) with a different pipeline.
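One way to double-check that the hashes really match is to compare the -dump-hashes output of the first run and the resumed run. A rough sketch, assuming the two .nextflow.log files have been copied to hypothetical names run1.log and run2.log:

# -dump-hashes writes a "cache hash" line per task into .nextflow.log;
# any difference between the runs points at the input that invalidated the cache
grep 'cache hash' run1.log | sort > run1_hashes.txt
grep 'cache hash' run2.log | sort > run2_hashes.txt
diff run1_hashes.txt run2_hashes.txt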