Is maxErrorRetry applied to K8sResponseException 500 (tunnel disconnect)? #5604

Open
alanbchristie opened this issue Dec 13, 2024 · 6 comments

@alanbchristie

alanbchristie commented Dec 13, 2024

Bug report

This issue was originally raised as #5574 against NF 22.10.8, but having re-written the process for DSL2 and now running with NF 24.10.2 the problem persists, so I'm (re-)raising it as a new issue here.

Expected behavior and actual behavior

Nextflow receives a 500 error from the Kubernetes API and aborts the run; it would be nice if it retried, as I believe the error is transient. We have set maxErrorRetry and are using NF 24.10.2, but this does not appear to help.

The Pod was still running and actually finished successfully about 15 minutes later (as shown by the k8s status of the Pod, which was left undeleted):

  containerStatuses:
    - name: nf-0b2286d5f91a1bb1f5a67b9c8d3ada45-c297d
      state:
        terminated:
          exitCode: 0
          reason: Completed
          startedAt: '2024-12-12T21:05:05Z'
          finishedAt: '2024-12-13T00:46:54Z'

...but Nextflow's response to what is (in our case) a short-lived 500 error appears catastrophic.

Steps to reproduce the problem

We run our workflow and it fails with this error regularly, at almost exactly the same time each day (around 00:33:31); in last night's exception it occurred at exactly that time again. Although we are trying to determine the cause of the underlying disconnect, it would be nice if NF honoured maxErrorRetry. If it is retrying, there is no evidence of it in the log.

Program output

Dec-13 00:33:31.980 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=fragment:fragment_mols (888); work-dir=/work/nextflow/0b/2286d5f91a1bb1f5a67b9c8d3ada45
  error [nextflow.k8s.client.K8sResponseException]: Request GET /api/v1/namespaces/fragmentor/pods/nf-0b2286d5f91a1bb1f5a67b9c8d3ada45-c297d/status returned an error code=500 -- error trying to reach service: tunnel disconnect
Dec-13 00:33:31.997 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'fragment:fragment_mols (888)'

Caused by:
  Request GET /api/v1/namespaces/fragmentor/pods/nf-0b2286d5f91a1bb1f5a67b9c8d3ada45-c297d/status returned an error code=500 -- error trying to reach service: tunnel disconnect


nextflow.k8s.client.K8sResponseException: Request GET /api/v1/namespaces/fragmentor/pods/nf-0b2286d5f91a1bb1f5a67b9c8d3ada45-c297d/status returned an error code=500 -- error trying to reach service: tunnel disconnect
	at nextflow.k8s.client.K8sClient.makeRequestCall(K8sClient.groovy:674)
	at nextflow.k8s.client.K8sClient.makeRequest(K8sClient.groovy:630)
	at nextflow.k8s.client.K8sClient.makeRequest(K8sClient.groovy)
	at nextflow.k8s.client.K8sClient.get(K8sClient.groovy:683)
	at nextflow.k8s.client.K8sClient.podStatus(K8sClient.groovy:319)
	at nextflow.k8s.client.K8sClient.podStatus0(K8sClient.groovy:330)
	at nextflow.k8s.client.K8sClient.podState(K8sClient.groovy:463)
	at nextflow.k8s.K8sTaskHandler.getState(K8sTaskHandler.groovy:343)
	at nextflow.k8s.K8sTaskHandler.checkIfCompleted(K8sTaskHandler.groovy:413)
	at nextflow.processor.TaskPollingMonitor.checkTaskStatus(TaskPollingMonitor.groovy:651)
	at nextflow.processor.TaskPollingMonitor.checkAllTasks(TaskPollingMonitor.groovy:573)
	at nextflow.processor.TaskPollingMonitor.pollLoop(TaskPollingMonitor.groovy:443)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
	at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
	at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
	at nextflow.processor.TaskPollingMonitor$_start_closure2.doCall(TaskPollingMonitor.groovy:318)
	at nextflow.processor.TaskPollingMonitor$_start_closure2.call(TaskPollingMonitor.groovy)
	at groovy.lang.Closure.run(Closure.java:505)
	at java.base/java.lang.Thread.run(Thread.java:829)
Dec-13 00:33:32.005 [Task monitor] DEBUG nextflow.Session - Session aborted -- Cause: Request GET /api/v1/namespaces/fragmentor/pods/nf-0b2286d5f91a1bb1f5a67b9c8d3ada45-c297d/status returned an error code=500 -- error trying to reach service: tunnel disconnect

Environment

  • Nextflow version: 24.10.2
  • Java version: openjdk 11.0.25 2024-10-15
  • Operating system: Debian GNU/Linux 11 (bullseye)
  • Bash version: GNU bash, version 5.1.4(1)-release (x86_64-pc-linux-gnu)

Additional context

Running in an OpenStack Kubernetes cluster (v1.30) with the following Nextflow config:

process {
  pod = [
    nodeSelector: 'informaticsmatters.com/purpose-fragmentor=yes',
    imagePullPolicy: 'Always'
  ]
}
executor {
    name = 'k8s'
    queueSize = 400
}
k8s {
    httpConnectTimeout = '120s'
    httpReadTimeout = '120s'
    maxErrorRetry = '8'
    serviceAccount = 'fragmentor'
    storageClaimName = 'work'
    storageMountPath = '/work'
}
@bentsherman
Member

If it's returning a 500 error code then Nextflow should be retrying it:

try {
    return makeRequestCall( method, path, body )
} catch ( K8sResponseException | SocketException | SocketTimeoutException e ) {
    if ( e instanceof K8sResponseException && e.response.code != 500 )
        throw e
    if ( ++attempt > maxRetries )
        throw e
    log.debug "[K8s] API request threw socket exception: $e.message for $method $path - Retrying request (attempt=$attempt)"
    final long delay = (Math.pow(3, attempt - 1) as long) * 250
    sleep( delay )
}

I wonder if it is retrying, but the tunnel disconnect lasts longer than the total duration of the 8 retries. That seems plausible if it is a regular outage. Can you see the "API request threw socket exception..." messages in your log?
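For example (assuming the usual .nextflow.log in the launch directory), something like this should show them if they are there:

grep 'API request threw socket exception' .nextflow.log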

@alanbchristie
Author

alanbchristie commented Dec 13, 2024

There are no such messages, so it looks like maxRetries is not being set as expected?

Is there a way I can confirm whether Nextflow is reading the maxErrorRetry = '8' value from the $HOME/.nextflow/config file?

In case it's significant, I am also starting Nextflow with an explicit -c option (pointing at a file that is empty) ... but I understand that configurations are merged, so using an empty file should not disturb anything read from $HOME/.nextflow/config? After all, it must be reading the config described above, otherwise it wouldn't know to use a Kubernetes executor. Here's the initial log indicating that it has found the files...

Dec-12 17:47:07.443 [main] DEBUG nextflow.cli.CmdRun - N E X T F L O W  ~  version 24.10.2
Dec-12 17:47:07.480 [main] DEBUG nextflow.plugin.PluginsFacade - Setting up plugin manager > mode=prod; embedded=false; plugins-dir=/root/.nextflow/plugins; core-plugins: [email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected]
Dec-12 17:47:07.517 [main] INFO  o.pf4j.DefaultPluginStatusProvider - Enabled plugins: []
Dec-12 17:47:07.519 [main] INFO  o.pf4j.DefaultPluginStatusProvider - Disabled plugins: []
Dec-12 17:47:07.522 [main] INFO  org.pf4j.DefaultPluginManager - PF4J version 3.12.0 in 'deployment' mode
Dec-12 17:47:07.534 [main] INFO  org.pf4j.AbstractPluginManager - No plugins
Dec-12 17:47:07.555 [main] DEBUG nextflow.config.ConfigBuilder - Found config home: /root/.nextflow/config
Dec-12 17:47:07.557 [main] DEBUG nextflow.config.ConfigBuilder - Found config base: /root/nextflow/nextflow.config
Dec-12 17:47:07.560 [main] DEBUG nextflow.config.ConfigBuilder - User config file: /root/ansible/roles/fragment/../../../nextflow/nextflow.config
Dec-12 17:47:07.562 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /root/.nextflow/config
Dec-12 17:47:07.563 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /root/nextflow/nextflow.config
Dec-12 17:47:07.563 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /root/nextflow/nextflow.config
Dec-12 17:47:07.587 [main] DEBUG n.secret.LocalSecretsProvider - Secrets store: /root/.nextflow/secrets/store.json
Dec-12 17:47:07.590 [main] DEBUG nextflow.secret.SecretsLoader - Discovered secrets providers: [nextflow.secret.LocalSecretsProvider@39ce27f2] - activable => nextflow.secret.LocalSecretsProvider@39ce27f2
Dec-12 17:47:07.624 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Dec-12 17:47:08.173 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Dec-12 17:47:08.211 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Dec-12 17:47:08.248 [main] DEBUG nextflow.cli.CmdRun - Applied DSL=2 from script declaration

It's the /root/nextflow/nextflow.config file that exists but is empty.

The $HOME/.nextflow/config file is shown in the Additional context section of the initial issue text above.

@alanbchristie
Author

alanbchristie commented Dec 13, 2024

Looking at the code, using maxErrorRetry = '8' means that the delay before the final attempt (8) will be 3^7 × 250 ms, i.e. 2,187 × 250 ms, or about 9 minutes, and I think the total period covered by 8 retries is about 14 minutes.

This should be sufficient but it's clear the retries are not being used.

And the documentation for the parameter says "Defines the Kubernetes API max request retries (default: 4)", but clearly the default isn't being used either.

Just a thought ... it might be worth explaining the 'math' behind the parameter in the documentation so users understand how long a given number of retries will take, and perhaps providing a small example table of retries and the anticipated maximum retry period.
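For illustration (this is just my reading of the back-off code quoted above, not anything official), a quick Groovy sketch of the delays for maxErrorRetry = 8:

// Rough reproduction of the quoted back-off computation for attempts 1..8
long total = 0
(1..8).each { attempt ->
    long delay = (Math.pow(3, attempt - 1) as long) * 250   // milliseconds
    total += delay
    println "attempt=${attempt} delay=${delay} ms"
}
println "total back-off ~ ${total / 1000 / 60} minutes"      // ~13.7 minutes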

@bentsherman
Member

You can run nextflow config ... to verify the resolved config. But yes, I would expect other problems if that config wasn't being included.
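For example (just an illustration, assuming it is run from the same launch directory and profile), something like:

nextflow config -flat | grep maxErrorRetry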

You can also try setting the retry as a number:

k8s.maxErrorRetry = 8

But I don't expect it to matter because there is some logic to handle strings:

if( target.maxErrorRetry )
    result.maxErrorRetry = target.maxErrorRetry as Integer

@alanbchristie
Author

OK, thanks. It does look like it's picking up my configuration...

# nextflow config
process {
   pod = [nodeSelector:'informaticsmatters.com/purpose-fragmentor=yes', imagePullPolicy:'Always']
}

executor {
   name = 'k8s'
   queueSize = 600
}

k8s {
   httpConnectTimeout = '120s'
   httpReadTimeout = '120s'
   maxErrorRetry = '8'
   serviceAccount = 'fragmentor'
   storageClaimName = 'work'
   storageMountPath = '/work'
}

@alanbchristie
Author

alanbchristie commented Dec 13, 2024

The other thing that might be of interest is that the 'base' Nextflow in the container may not be 24.10.2, but it is running in an environment where NXF_VER=24.10.2 (and the log clearly indicates that 24.10.2 is the version that is running).

So...

If the config code is running and it does not find a target.maxErrorRetry entry in the config, why isn't the retry limit 4 (the documented default)? And if it is finding a value, why isn't it 8?
