
Resume not working on Azure Batch #3605

Open
alliemclean opened this issue Feb 3, 2023 · 30 comments

Comments

@alliemclean

Bug report

The resume option works locally but not with Azure Batch. I've gone through all the troubleshooting steps and can't find the reason the cache is invalidated.

Expected behavior and actual behavior

Resume should use the cache and not re-process cached jobs.

Steps to reproduce the problem

link to input

nextflow run pipeline_test.nf \
 --name "resume_test_azure" \
 --in az://pipeline/input/GCA_000001215.4_current_ids.vcf.gz \
 --lines 10000 \
 --outdir az://pipeline/resume_results/ \
 -w az://pipeline/resume_working/ \
 -c azure.config \
 -dump-hashes \
 -resume

Program output

Uploaded logs for local and batch runs, run command lines, and config files:
test_resume.tar.gz

  • nextflow.log - the azure batch run that should have resumed and did not (-dump-hashes was used)
  • nextflow.log.1.log - the original azure batch run (-dump-hashes was used)
  • nextflow.log.2.log - the local run which resumed correctly
  • nextflow.log.3.log - the original local run

Environment

  • Nextflow version: 22.10.6.5843
  • Java version: 17.0.6 2023-01-17 LTS
  • Operating system: linux
  • Bash version: GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)
  • azure plugin: [email protected]

Additional context

Resume works with local execution but not with Azure. The cache hashes are all the same as far as I can tell.

Resume is also working on Batch (same account, same settings) with a different pipeline.

@abhi18av
Member

abhi18av commented Feb 6, 2023

Thanks for letting us know about this issue, @alliemclean. Allow me some time to circle back once I've investigated this further.

@abhi18av
Member

abhi18av commented Feb 8, 2023

Hey @alliemclean, circling back on this one. After a lot of attempts, unfortunately I'm not seeing this occur in my environment.

[screenshot]


That being said, I did tweak the configuration file to remove the fileshare, which shouldn't affect anything in the pipeline since it isn't referenced in the analysis.

@alliemclean
Author

alliemclean commented Feb 9, 2023

Hi @abhi18av, I tried removing the fileshare and it still didn't resume. I'm having the same problem with 2 out of 3 pipelines I'm running. The other one resumes one of the jobs in the pipeline but not the rest. The difference between the re-run jobs and the correctly resumed job is that the resumed one has only 1 output file, whereas the others have multiple. Could this be the issue? Is there a way to check whether the cache has a problem with the process's outputs?

Could there be anything within our azure environment that is causing the difference in this behavior?

@abhi18av
Member

abhi18av commented Feb 9, 2023

Mmm, that might be the case, but it could also depend on the design of those two pipelines.

The difference between the re-run jobs and the correctly resumed job is that the one that resumed only has 1 output file whereas the others have multiple.

Hmm, I don't particularly see why this should break the caching behavior 🤔

Is it a public NF pipeline that you are running? Perhaps an nf-core one that I could use to test things as well?

CC @vsmalladi for any inputs/prior-experience here?

@alliemclean
Author

It's not an nf-core one, but I suppose I could see if I can reproduce it with one, although the example I've given is complete and sufficient to cause the issue I'm having on my end. I can try to reproduce it with this module, snpsift_split, when I have time.

@vsmalladi
Contributor

@alliemclean can you quote the az:// parts?

nextflow run pipeline_test.nf \
 --name "resume_test_azure" \
 --in "az://pipeline/input/GCA_000001215.4_current_ids.vcf.gz" \
 --lines 10000 \
 --outdir "az://pipeline/resume_results/" \
 -w "az://pipeline/resume_working/" \
 -c azure.config \
 -dump-hashes \
 -resume

@mawasak

mawasak commented Feb 28, 2023

@alliemclean Are you running different pipelines (with and without -resume) from the same folder? Nextflow gets some values from the .nextflow/history file, so if that's the case, maybe try running the Azure pipeline from a different folder.

Try running the pipeline with the -trace nextflow option, e.g. nextflow -trace nextflow run hello.nf -resume ....
Then check .nextflow.log for the line "Cacheable folder=" (is it a path or null?).
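As a quick sketch of that check (assuming the log file is .nextflow.log in the launch directory):

```shell
# Minimal sketch: scan the trace log for the cache-probe line described above.
# Assumes a prior run was launched with `nextflow -trace nextflow ...` so that
# TRACE-level TaskProcessor messages land in .nextflow.log.
grep "Cacheable folder=" .nextflow.log 2>/dev/null || echo "no cache-probe lines found"
```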

@alliemclean
Author

This is the second run and it didn't use the cache. I tried it in its own folder. Here are some parts of the log I thought might be relevant:

[log screenshots]

@mawasak

mawasak commented Mar 30, 2023

I think the problem may be connected with the exitCode for the task. Only "0" is a success code, and the exit code is one of the components that allows the task to be treated as cacheable.

The log in the first screenshot is OK.

@alliemclean
Author

It was successful, but I ran it many times, so this one could have been after an unsuccessful run. Here's another try after a successful run, but Cacheable folder=null. Is there a reason it may not pick it up? The actual run folder doesn't have the az:// prefix:

Apr-21 17:33:45.559 [Actor Thread 6] TRACE nextflow.processor.TaskProcessor - Poison pill arrived; port: 0
Apr-21 17:33:45.562 [Actor Thread 5] TRACE nextflow.processor.TaskProcessor - Process state changed to: StateObj[submitted: 1; completed: 0; poisoned: true ] -- finished: false
Apr-21 17:33:45.736 [Actor Thread 3] TRACE nextflow.processor.TaskProcessor - [split (1)] Store dir not set -- return false
Apr-21 17:33:47.253 [Actor Thread 3] TRACE nextflow.processor.TaskProcessor - [split (1)] Cacheable folder=null -- exists=false; try=1; shouldTryCache=false; entry=null
Apr-21 17:33:47.725 [Actor Thread 3] TRACE nextflow.processor.TaskProcessor - [split (1)] actual run folder: /pipeline/pipeline_name/6f/d30552c750ac633db6590e57493f5f

@FabianDK

Hi everyone,

I have the same issue: a very simple script run with Azure Batch is not being cached.

In my case, I believe my Nextflow run identifies the previously successful workflow run, but for some reason does not find the output file (.txt), although it is in the workdir (on the Azure storage).

Jun-28 17:51:41.619 [Actor Thread 5] TRACE nextflow.processor.TaskProcessor - [MAKE_TEST_FILE_HOST] Missed cache > Missing output file(s) `*.txt` expected by process `MAKE_TEST_FILE_HOST` -- folder: /nextflow-scratch/work/2d/10f3f31f8dd2a93a04895f2d6dfcd7
Jun-28 17:51:41.620 [Actor Thread 5] TRACE nextflow.file.FileSystemPathFactory - File system path factories: [nextflow.cloud.azure.file.AzPathFactory@436bd4df]

I assume an explanation could be that Nextflow is trying to look at a local workDir folder instead of the one in the cloud (az://nextflow-scratch/work/2d/10f3f31f8dd2a93a04895f2d6dfcd7).

@adamrtalbot
Collaborator

adamrtalbot commented Aug 1, 2023

I may have found the problem here. Could the people seeing this confirm which type of storage account they are using? Are you using Data Lake Gen2 storage?

If so, could you have a look inside the cache dir on Azure storage and tell me if there are any contents?

@FabianDK

FabianDK commented Aug 3, 2023

We are using ADLS Gen2. The cache folder (.nextflow/cache) is only generated locally and does not appear on the Azure storage. Could that be the problem?

The workDir and its contents are present on the Azure storage (config setting: workDir = "az://blob_container/folder/nf_work/") and so are the publishDir/copy output files (config setting: params.outdir = "az://blob_container/folder/data_tmp/nf_output"). We are not using the fileShares option.

After running a more complex RNAseq workflow, it is interesting to note that only the first 2 processes are cached. It's unclear why exactly those are recognized as "already ran" while all the others are not.

@pditommaso
Member

I'm not 100% sure that Data lake storage is fully compatible with Blob storage

https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction#built-on-azure-blob-storage

@adamrtalbot
Collaborator

We are using ADLS Gen 2. The chache folder (.nextflow/cache) is only generated locally and does not appear on the Azure storage. Could that be the problem?

Yes, I think so. I will generate a reproducible example shortly.

As Paolo says, Azure Data Lake is not 100% compatible with Blob storage despite being built on top of it, which breaks Nextflow's caching system. Unfortunately, Microsoft glosses over this in their documentation.

@vsmalladi
Contributor

@adamrtalbot, I think it might have to do with the hierarchical namespace. We could set up an engineering meeting and see if we can solve this.

@adamrtalbot
Collaborator

I'm fairly sure it's because of the empty .azure_blob_dir files in the Nextflow cache. Syncing either way breaks it on Data Lake. I haven't had a chance to make an example today.

@adamrtalbot
Collaborator

If you run a Nextflow task on Azure using Data Lake Gen2 storage, it will work fine, but Nextflow will be unable to back up the cache:

nextflow -log .nextflow.cache.log -config-ignore-includes plugin nf-tower:cache-backup >> >(tee -a $NXF_OUT_FILE) 2>&1 
Saving cache: .nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099 => az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099
WARN: Failed to backup resume metadata to remote store path: az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099 — cause: java.nio.file.FileAlreadyExistsException: az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099/b08f9190-09b6-40b4-8858-424b85f13070/.azure_blob_dir

If I clear out the remote directory and delete all .azure_blob_dir it works fine:

nextflow -log .nextflow.cache.log -config-ignore-includes plugin nf-tower:cache-backup >> >(tee -a $NXF_OUT_FILE) 2>&1 
Saving cache: .nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099 => az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099

However, Nextflow has added lots of .azure_blob_dir files to the folder. If I run again, I get another error, even though the .azure_blob_dir files do not exist at the source:

nextflow -log .nextflow.cache.log -config-ignore-includes plugin nf-tower:cache-backup >> >(tee -a $NXF_OUT_FILE) 2>&1 
Saving cache: .nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099 => az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099
WARN: Failed to backup resume metadata to remote store path: az://output/scratch/13OxksF29KdZ90/.nextflow/cache/7dde5148-558d-4e73-8bda-3cc86c176099 — cause: com.azure.storage.blob.models.BlobStorageException: Status code 400, "<?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidUri</Code><Message>The requested URI does not represent any resource on the server.
RequestId:e9d8006a-d01e-0004-0a35-c9f35b000000
Time:2023-08-07T13:42:33.3380342Z</Message></Error>"

I can't say for sure this is the problem, but it seems likely. For now, use a normal Azure Blob storage account, and only use a Data Lake account if you don't mind losing the resume function.

@bentsherman
Member

Adam, can you try a run with the cloud cache? This will save the cache directly in object storage, no need to use the cache backup/restore commands. To use it simply set the NXF_CLOUDCACHE_PATH environment var:

export NXF_CLOUDCACHE_PATH="az://my-bucket/cache"

@adamrtalbot
Collaborator

OK first run of Nextflow hello with, then run again with -resume:
nextflow.azuredatalake.2.log
nextflow.azuredatalake.1.log

@bentsherman
Member

Sorry, I only just realized that the cloud cache isn't being auto-enabled. I just made a PR to fix it, but for now you also have to enable the plugin explicitly:

plugins {
    id 'nf-cloudcache'
}

You should see the following line in your log:

DEBUG nextflow.cache.CacheFactory - Using Nextflow cache factory: nextflow.cache.CloudCacheFactory

@adamrtalbot
Collaborator

@bentsherman
Member

Looks good

Aug-07 18:52:05.304 [Actor Thread 8] INFO  nextflow.processor.TaskProcessor - [df/934fdb] Cached process > sayHello (3)
Aug-07 18:52:05.566 [Actor Thread 10] INFO  nextflow.processor.TaskProcessor - [39/2edd7e] Cached process > sayHello (4)
Aug-07 18:52:05.574 [Actor Thread 6] INFO  nextflow.processor.TaskProcessor - [94/3c76ee] Cached process > sayHello (2)
Aug-07 18:52:05.656 [Actor Thread 4] INFO  nextflow.processor.TaskProcessor - [61/695cf7] Cached process > sayHello (1)

@bentsherman
Member

Paolo is preparing a launcher image so that we can use the cloud cache in Tower. It will likely become the default for all cloud providers, which should resolve this issue. Nextflow-only users can also use the cloud cache with the above settings.

@FabianDK

FabianDK commented Aug 10, 2023

I have just tried the solution you proposed and it works:

  1. I added the plugins in the config:

     plugins {
         id 'nf-azure'
         id 'nf-cloudcache'
     }

  2. I ran the workflow with version 23.07.0-edge (with resume = true in the config):

     NXF_VER=23.07.0-edge NXF_CLOUDCACHE_PATH="az://path_on_adls2" nextflow -log local_log_folder/nf_log local/workflow.nf -c local/nextflow.config

The caching then works. Thank you for the solution!

Another comment I have is regarding the nf-cloudcache plugin. We usually develop workflows locally on premise, with one config profile for local execution and another profile for Azure Batch execution.

When the nf-cloudcache plugin is specified in the config and we do a local execution without specifying a cloud cache dir, we get: ERROR ~ NXF_CLOUDCACHE_PATH must be defined when using the path-based cache store.

A solution would be to make plugins conditional on a chosen config profile, but this is currently not possible.
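Until something like that exists, one hypothetical workaround is to set the variable only in a small launch wrapper when the Azure profile is chosen. The profile name, the az:// path, and the workflow path below are illustrative placeholders, not values from this thread:

```shell
# Sketch: export NXF_CLOUDCACHE_PATH only for azure-profile launches, so local
# runs never hit the "NXF_CLOUDCACHE_PATH must be defined" error.
PROFILE="${1:-local}"
if [ "$PROFILE" = "azure" ]; then
    export NXF_CLOUDCACHE_PATH="az://blob_container/folder/nf_cache"
fi
echo "profile=$PROFILE cloudcache=${NXF_CLOUDCACHE_PATH:-unset}"
# nextflow run local/workflow.nf -profile "$PROFILE" -c local/nextflow.config
```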

@bentsherman
Member

Hi @FabianDK, we just made a change to address exactly this: the plugin will be enabled automatically if NXF_CLOUDCACHE_PATH is set. That will be available in the next release.

@pditommaso pditommaso added this to the 23.10.0 milestone Sep 3, 2023
@adamrtalbot adamrtalbot reopened this May 16, 2024
@pditommaso
Member

@adamrtalbot
Collaborator

@alliemclean is still reporting this error on Slack. I've been unable to reproduce it with the following example (test data from the original post).

main.nf:

#!/usr/bin/env nextflow
/**********************************************************************
 * Parameter defaults
 **********************************************************************/
params.lines=100000
params.in = "az://container/GCA_000001215.4_current_ids.vcf.gz"
params.name = "test_name"

/**********************************************************************
 * Splits a VCF file in multiple files.
 *   One file per params.lines variants
 **********************************************************************/
process split {

    machineType "Standard_E*d_v5" // Not important, just for our system

    cpus 1
    memory '1GB'

    container 'lethalfang/tabix:1.10'
    cache 'deep'

    input:
    path(f)

    output:
    file("${params.name}_*.vcf")

    shell:
    """
    echo "Decompressing ${f}"
    bgzip -cd ${f} > temp.vcf
    sed -n '/^#/p' temp.vcf > header.vcf
    echo "Splitting temp.vcf"
    split -l ${params.lines} temp.vcf vcf_
    echo "Done splitting"

    counter=0
    for file in vcf_*
    do
        sed -i '/^#/d' \$file
        cat header.vcf \$file > ${params.name}_\$counter.vcf
        let counter=counter+1
    done
    """
}
/*********************************************************************
 *  Main workflow
 *********************************************************************/
workflow {

    // General genomics inputs
    def vcf = Channel.fromPath(params.in, checkIfExists: true)

    main:

        // Split the VCF by number of lines (ie., 50000)
        split(vcf)
}

nextflow.config

process.executor = 'azurebatch'
azure {
    storage {
        accountName = 'mystorageaccount'
    }
    batch {
        location = 'eastus'
        accountName = 'mybatchaccount'
        copyToolInstallMode = 'node'
        autoPoolMode = true
        allowPoolCreation = true
        deletePoolsOnCompletion = false
        auto {
            autoScale = false
            vmCount = 1
        }
    }
    activeDirectory {
        servicePrincipalId     = "$AZURE_DIRECTORY_TENANT_ID"
        servicePrincipalSecret = "$AZURE_SERVICE_PRINCIPAL_SECRET"
        tenantId               = "$AZURE_APPLICATION_TENANT_ID" 
    }
}
workDir = 'az://work/'

@landroutsosAIE

landroutsosAIE commented Jul 16, 2024

Hello guys! We fixed our initial problem of not being able to resume by using NXF_CLOUDCACHE_PATH, but now we can't resume the pipeline because the pools are deleted upon completion.
We are getting the error: Pool not in active state. I understand that is normal behaviour, since the pool where the process ran is deleted. We don't want to have a lot of pools active at the same time, since the Azure costs add up, but we would love to be able to resume any pipelines that didn't run to completion.
Do you have any ideas how to fix this resume error?

@adamrtalbot
Collaborator

adamrtalbot commented Jul 16, 2024

This will reuse the existing pools without costing you any additional money:

  • Turn on pool autoscaling so the nodes spin down to zero after work completes. The pool itself is free as long as there are no nodes.
  • Turn off azure.batch.deletePoolsOnCompletion so Nextflow does not delete the pools and can reuse them on each run.
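The two bullets above can be sketched as a config fragment. The pool name, VM size, and node counts here are illustrative assumptions, not values from this thread:

```groovy
// Sketch only: keep a named pool between runs while letting its nodes scale down.
process.executor = 'azurebatch'
process.queue    = 'small-pool'          // hypothetical pool name

azure {
    batch {
        allowPoolCreation = true
        deletePoolsOnCompletion = false  // keep the pool so a -resume run can reuse it
        pools {
            'small-pool' {
                vmType = 'Standard_D4_v3'  // placeholder VM size
                autoScale = true           // scale formula shrinks the pool when idle
                vmCount = 1
                maxVmCount = 4
            }
        }
    }
}
```

With the pool kept alive but scaled to zero nodes, a resumed run can submit to the same pool without hitting "Pool not in active state".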
