
Resume not working on Azure Batch #3605

Open
alliemclean opened this issue Feb 3, 2023 · 30 comments

Comments

@alliemclean

Bug report

The resume option works locally but not with Azure Batch. I've gone through all the troubleshooting steps and can't find the reason the cache is invalidated.

Expected behavior and actual behavior

Resume should use the cache and not re-process cached jobs.

Steps to reproduce the problem

link to input

nextflow run pipeline_test.nf \
 --name "resume_test_azure" \
 --in az://pipeline/input/GCA_000001215.4_current_ids.vcf.gz \
 --lines 10000 \
 --outdir az://pipeline/resume_results/ \
 -w az://pipeline/resume_working/ \
 -c azure.config \
 -dump-hashes \
 -resume

Program output

Uploaded logs for local and batch runs, run command lines, and config files:
test_resume.tar.gz

  • nextflow.log - the azure batch run that should have resumed and did not (-dump-hashes was used)
  • nextflow.log.1.log - the original azure batch run (-dump-hashes was used)
  • nextflow.log.2.log - the local run which resumed correctly
  • nextflow.log.3.log - the original local run

Environment

  • Nextflow version: 22.10.6.5843
  • Java version: 17.0.6 2023-01-17 LTS
  • Operating system: linux
  • Bash version: GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)
  • azure plugin: [email protected]

Additional context

Resume works with local execution but not with Azure. The cache hashes are all the same as far as I can tell.

Resume is also working on Batch (same account, same settings) with a different pipeline.

@abhi18av
Member

abhi18av commented Feb 6, 2023

Thanks for letting us know about this issue, @alliemclean. Allow me some time to circle back once I've investigated this further.

@abhi18av
Member

abhi18av commented Feb 8, 2023

Hey @alliemclean, circling back on this one. After a lot of attempts, unfortunately I'm not seeing this occur in my environment.

[screenshot]


That being said, I did tweak the configuration file to remove the fileshare, which shouldn't affect anything in the pipeline since it isn't referenced in the analysis.

@alliemclean
Author

alliemclean commented Feb 9, 2023

Hi @abhi18av, I tried removing the fileshare and it still didn't resume. I'm having the same problem with 2 out of 3 pipelines I'm running. The other one resumes one of the jobs in the pipeline but not the rest. The difference between the re-run jobs and the correctly resumed job is that the resumed one has only 1 output file, whereas the others have multiple. Could this be the issue? Is there a way to check whether the cache has a problem with the process's outputs?

Could there be anything within our azure environment that is causing the difference in this behavior?

@abhi18av
Member

abhi18av commented Feb 9, 2023

Mmm, that might be the case, but it could also depend on the design of those two pipelines.

The difference between the re-run jobs and the correctly resumed job is that the one that resumed only has 1 output file whereas the others have multiple.

Hmm, I don't particularly see why this should break the caching behavior 🤔

Is it a public NF pipeline that you are running? Perhaps an nf-core one that I could use to test things as well?

CC @vsmalladi for any inputs/prior-experience here?

@alliemclean
Author

It's not an nf-core one, but I suppose I could see if I can reproduce it with one, although the example I've given is complete and sufficient to cause the issue I'm having on my end. I can try to reproduce it with this module, snpsift_split, when I have time.

@vsmalladi
Contributor

@alliemclean can you quote the az:// parts?

nextflow run pipeline_test.nf \
 --name "resume_test_azure" \
 --in "az://pipeline/input/GCA_000001215.4_current_ids.vcf.gz" \
 --lines 10000 \
 --outdir "az://pipeline/resume_results/" \
 -w "az://pipeline/resume_working/" \
 -c azure.config \
 -dump-hashes \
 -resume

@mawasak

mawasak commented Feb 28, 2023

@alliemclean Are you running different pipelines (with and without -resume) from the same folder? Nextflow gets some values from the .nextflow/history file, so if that's the case, maybe try running the Azure pipeline from a different folder.

Try running the pipeline with the -trace nextflow option, e.g. nextflow -trace nextflow run hello.nf -resume ....
Then check .nextflow.log for the line "Cacheable folder=" (is it a path or null?).
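As a quick sketch of that check (assuming the log file is .nextflow.log in the launch directory):

```shell
# Minimal sketch: scan the trace log for the cache-probe line described above.
# Assumes a prior run was launched with `nextflow -trace nextflow ...` so that
# TRACE-level TaskProcessor messages land in .nextflow.log.
grep "Cacheable folder=" .nextflow.log 2>/dev/null || echo "no cache-probe lines found"
```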

@alliemclean
Author

This is the second run and it didn't use the cache. I tried it in its own folder. Here are some parts of the log I thought might be relevant:

[log screenshots]

@mawasak

mawasak commented Mar 30, 2023

I think the problem may be connected with the exitCode for the task. Only "0" is a success code, and the exit code is one of the components that allows the task to be treated as cacheable.

The log in the first screenshot is OK.

@alliemclean
Author

It was successful, but I ran it many times, so this one could have been after an unsuccessful run. Here's another try after a successful run, but Cacheable folder=null. Is there a reason it may not pick it up? The actual run folder doesn't have the az:// prefix:

Apr-21 17:33:45.559 [Actor Thread 6] TRACE nextflow.processor.TaskProcessor - Poison pill arrived; port: 0
Apr-21 17:33:45.562 [Actor Thread 5] TRACE nextflow.processor.TaskProcessor - Process state changed to: StateObj[submitted: 1; completed: 0; poisoned: true ] -- finished: false
Apr-21 17:33:45.736 [Actor Thread 3] TRACE nextflow.processor.TaskProcessor - [split (1)] Store dir not set -- return false
Apr-21 17:33:47.253 [Actor Thread 3] TRACE nextflow.processor.TaskProcessor - [split (1)] Cacheable folder=null -- exists=false; try=1; shouldTryCache=false; entry=null
Apr-21 17:33:47.725 [Actor Thread 3] TRACE nextflow.processor.TaskProcessor - [split (1)] actual run folder: /pipeline/pipeline_name/6f/d30552c750ac633db6590e57493f5f

@FabianDK

Hi everyone,

I have the same issue: a very simple script run with Azure Batch is not being cached.

In my case, I believe my Nextflow run identifies the previously successful workflow run, but for some reason does not find the output file (.txt), although it is in the workdir (on the Azure storage).

Jun-28 17:51:41.619 [Actor Thread 5] TRACE nextflow.processor.TaskProcessor - [MAKE_TEST_FILE_HOST] Missed cache > Missing output file(s) `*.txt` expected by process `MAKE_TEST_FILE_HOST` -- folder: /nextflow-scratch/work/2d/10f3f31f8dd2a93a04895f2d6dfcd7
Jun-28 17:51:41.620 [Actor Thread 5] TRACE nextflow.file.FileSystemPathFactory - File system path factories: [nextflow.cloud.azure.file.AzPathFactory@436bd4df]

I assume an explanation could be that Nextflow is trying to look at a local workDir folder instead of the one in the cloud (az://nextflow-scratch/work/2d/10f3f31f8dd2a93a04895f2d6dfcd7).

@adamrtalbot
Collaborator

adamrtalbot commented Aug 1, 2023

I may have found the problem here. Could the people seeing this confirm which type of storage account they are using? Are you using Data Lake Gen2 storage?

If so, could you have a look inside the cache dir on Azure storage and tell me if there are any contents?

@FabianDK

FabianDK commented Aug 3, 2023

We are using ADLS Gen2. The cache folder (.nextflow/cache) is only generated locally and does not appear on the Azure storage. Could that be the problem?

The workDir and its contents are present on the Azure storage (config setting: workDir = "az://blob_container/folder/nf_work/") and so are the publishDir/copy output files (config setting: params.outdir = "az://blob_container/folder/data_tmp/nf_output"). We are not using the fileShares option.

After running a more complex RNAseq workflow, it is interesting to note that only the first 2 processes are cached. It's unclear why exactly those are recognized as "already ran" while all the others are not.

@pditommaso
Member

I'm not 100% sure that Data lake storage is fully compatible with Blob storage

https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction#built-on-azure-blob-storage

@adamrtalbot
Collaborator

We are using ADLS Gen 2. The chache folder (.nextflow/cache) is only generated locally and does not appear on the Azure storage. Could that be the problem?

Yes, I think so. I will generate a reproducible example shortly.

As Paolo says, Azure Data Lake is not 100% compatible with Blob storage despite being built on top of it, which breaks Nextflow's caching system. Unfortunately, Microsoft glosses over this in their documentation.

@vsmalladi
Contributor

@adamrtalbot, I think it might have to do with the hierarchical namespace. We could set up an engineering meeting and see if we can solve this.

@adamrtalbot
Collaborator

I'm fairly sure it's because of the empty .azure_blob_dir files in the Nextflow cache. Syncing either way breaks it on Data Lake. I haven't had a chance to make an example today.

@adamrtalbot
Collaborator

If you run a Nextflow task on Azure using Data Lake Gen2 storage, it will work fine, but Nextflow will be unable to back up the cache:

nextflow -log .nextflow.cache.log -config-ignore-includes plugin nf-tower:cache-backup >> >(tee -a $NXF_OUT_FILE) 2>&1 
Saving cache: .nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099 => az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099
WARN: Failed to backup resume metadata to remote store path: az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099 — cause: java.nio.file.FileAlreadyExistsException: az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099/b08f9190-09b6-40b4-8858-424b85f13070/.azure_blob_dir

If I clear out the remote directory and delete all .azure_blob_dir it works fine:

nextflow -log .nextflow.cache.log -config-ignore-includes plugin nf-tower:cache-backup >> >(tee -a $NXF_OUT_FILE) 2>&1 
Saving cache: .nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099 => az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099

However, Nextflow has added lots of .azure_blob_dir files to the folder. If I run again, I get another error, even though the .azure_blob_dir files do not exist at the source:

nextflow -log .nextflow.cache.log -config-ignore-includes plugin nf-tower:cache-backup >> >(tee -a $NXF_OUT_FILE) 2>&1 
Saving cache: .nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099 => az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099
WARN: Failed to backup resume metadata to remote store path: az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099 — cause: com.azure.storage.blob.models.BlobStorageException: Status code 400, "<?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidUri</Code><Message>The requested URI does not represent any resource on the server.
RequestId:e9d8006a-d01e-0004-0a35-c9f35b000000
Time:2023-08-07T13:42:33.3380342Z</Message></Error>"

I can't say for sure this is the problem, but it seems likely. For now, use a normal Azure Blob storage account, and only use a Data Lake account if you don't mind losing the resume function.

@bentsherman
Member

Adam, can you try a run with the cloud cache? This will save the cache directly in object storage, no need to use the cache backup/restore commands. To use it simply set the NXF_CLOUDCACHE_PATH environment var:

export NXF_CLOUDCACHE_PATH="az://my-bucket/cache"

@adamrtalbot
Collaborator

OK first run of Nextflow hello with, then run again with -resume:
nextflow.azuredatalake.2.log
nextflow.azuredatalake.1.log

@bentsherman
Member

Sorry, I only just realized that the cloud cache isn't being auto-enabled. I just made a PR to fix it, but for now you also have to enable the plugin explicitly:

plugins {
    id 'nf-cloudcache'
}

You should see the following line in your log:

DEBUG nextflow.cache.CacheFactory - Using Nextflow cache factory: nextflow.cache.CloudCacheFactory

@adamrtalbot
Collaborator

@bentsherman
Member

Looks good

Aug-07 18:52:05.304 [Actor Thread 8] INFO  nextflow.processor.TaskProcessor - [df/934fdb] Cached process > sayHello (3)
Aug-07 18:52:05.566 [Actor Thread 10] INFO  nextflow.processor.TaskProcessor - [39/2edd7e] Cached process > sayHello (4)
Aug-07 18:52:05.574 [Actor Thread 6] INFO  nextflow.processor.TaskProcessor - [94/3c76ee] Cached process > sayHello (2)
Aug-07 18:52:05.656 [Actor Thread 4] INFO  nextflow.processor.TaskProcessor - [61/695cf7] Cached process > sayHello (1)

@bentsherman
Member

Paolo is preparing a launcher image so that we can use the cloud cache in Tower. It will likely become the default for all cloud providers, which should resolve this issue. Nextflow-only users can also use the cloud cache with the above settings.

@FabianDK

FabianDK commented Aug 10, 2023

I have just tried the solution you proposed and it works:

  1. I added the plugins in the config:

     plugins {
         id 'nf-azure'
         id 'nf-cloudcache'
     }

  2. I ran the workflow with version 23.07.0-edge (with resume = true in the config):

     NXF_VER=23.07.0-edge NXF_CLOUDCACHE_PATH="az://path_on_adls2" nextflow -log local_log_folder/nf_log local/workflow.nf -c local/nextflow.config

The caching then works. Thank you for the solution!

Another comment I have is regarding the nf-cloudcache plugin. We usually develop workflows locally on premise, with one config profile for local execution and another profile for Azure Batch execution.

When the nf-cloudcache plugin is specified in the config and we do a local execution without specifying a cloud cache dir, we get: ERROR ~ NXF_CLOUDCACHE_PATH must be defined when using the path-based cache store.

A solution would be to make plugins conditional on a chosen config profile, but this is currently not possible.
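Until something like that exists, one hypothetical workaround is to set the variable only in a small launch wrapper when the Azure profile is chosen. The profile name, the az:// path, and the workflow path below are illustrative placeholders, not values from this thread:

```shell
# Sketch: export NXF_CLOUDCACHE_PATH only for azure-profile launches, so local
# runs never hit the "NXF_CLOUDCACHE_PATH must be defined" error.
PROFILE="${1:-local}"
if [ "$PROFILE" = "azure" ]; then
    export NXF_CLOUDCACHE_PATH="az://blob_container/folder/nf_cache"
fi
echo "profile=$PROFILE cloudcache=${NXF_CLOUDCACHE_PATH:-unset}"
# nextflow run local/workflow.nf -profile "$PROFILE" -c local/nextflow.config
```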

@bentsherman
Member

Hi @FabianDK, we just made a change to address exactly this: the plugin will be enabled automatically if NXF_CLOUDCACHE_PATH is set. That will be available in the next release.

@pditommaso pditommaso added this to the 23.10.0 milestone Sep 3, 2023
@adamrtalbot adamrtalbot reopened this May 16, 2024
@pditommaso
Member

@adamrtalbot
Collaborator

@alliemclean is still reporting this error on Slack. I've been unable to reproduce it with the following example (test data from the original post).

main.nf:

#!/usr/bin/env nextflow
/**********************************************************************
 * Parameter defaults
 **********************************************************************/
params.lines=100000
params.in = "az://container/GCA_000001215.4_current_ids.vcf.gz"
params.name = "test_name"

/**********************************************************************
 * Splits a VCF file in multiple files.
 *   One file per params.lines variants
 **********************************************************************/
process split {

    machineType "Standard_E*d_v5" // Not important, just for our system

    cpus 1
    memory '1GB'

    container 'lethalfang/tabix:1.10'
    cache 'deep'

    input:
    path(f)

    output:
    file("${params.name}_*.vcf")

    shell:
    """
    echo "Decompressing ${f}"
    bgzip -cd ${f} > temp.vcf
    sed -n '/^#/p' temp.vcf > header.vcf
    echo "Splitting temp.vcf"
    split -l ${params.lines} temp.vcf vcf_
    echo "Done splitting"

    counter=0
    for file in vcf_*
    do
        sed -i '/^#/d' \$file
        cat header.vcf \$file > ${params.name}_\$counter.vcf
        let counter=counter+1
    done
    """
}
/*********************************************************************
 *  Main workflow
 *********************************************************************/
workflow {

    // General genomics inputs
    def vcf = Channel.fromPath(params.in, checkIfExists: true)

    main:

        // Split the VCF by number of lines (ie., 50000)
        split(vcf)
}

nextflow.config

process.executor = 'azurebatch'
azure {
    storage {
        accountName = 'mystorageaccount'
    }
    batch {
        location = 'eastus'
        accountName = 'mybatchaccount'
        copyToolInstallMode = 'node'
        autoPoolMode = true
        allowPoolCreation = true
        deletePoolsOnCompletion = false
        auto {
            autoScale = false
            vmCount = 1
        }
    }
    activeDirectory {
        servicePrincipalId     = "$AZURE_DIRECTORY_TENANT_ID"
        servicePrincipalSecret = "$AZURE_SERVICE_PRINCIPAL_SECRET"
        tenantId               = "$AZURE_APPLICATION_TENANT_ID" 
    }
}
workDir = 'az://work/'

@landroutsosAIE

landroutsosAIE commented Jul 16, 2024

Hello guys! We fixed our initial problem of not being able to resume by using NXF_CLOUDCACHE_PATH, but now we can't resume the pipeline because the pools are deleted upon completion.
We are getting the error: Pool not in active state. I understand that is normal behaviour, since the pool where the process ran is deleted. We don't want to have a lot of pools active at the same time, since the Azure costs add up, but we would love to be able to resume any pipelines that didn't run to completion.
Do you have any ideas how to fix this resume error?

@adamrtalbot
Collaborator

adamrtalbot commented Jul 16, 2024

This will reuse the existing pools without costing you any additional money:

  • Turn on pool autoscaling so the nodes spin down to zero after work completes. The pool itself is free as long as there are no nodes.
  • Turn off azure.batch.deletePoolsOnCompletion so Nextflow does not delete the pools and can reuse them on each run.
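The two bullets above can be sketched as a config fragment. The pool name, VM size, and node counts here are illustrative assumptions, not values from this thread:

```groovy
// Sketch only: keep a named pool between runs while letting its nodes scale down.
process.executor = 'azurebatch'
process.queue    = 'small-pool'          // hypothetical pool name

azure {
    batch {
        allowPoolCreation = true
        deletePoolsOnCompletion = false  // keep the pool so a -resume run can reuse it
        pools {
            'small-pool' {
                vmType = 'Standard_D4_v3'  // placeholder VM size
                autoScale = true           // scale formula shrinks the pool when idle
                vmCount = 1
                maxVmCount = 4
            }
        }
    }
}
```

With the pool kept alive but scaled to zero nodes, a resumed run can submit to the same pool without hitting "Pool not in active state".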
