Is maxErrorRetry applied to K8sResponseException 500 (tunnel disconnect)? #5604

Open
alanbchristie opened this issue Dec 13, 2024 · 6 comments

@alanbchristie

alanbchristie commented Dec 13, 2024

Bug report

This issue was originally raised as #5574 against NF 22.10.8, but having re-written the process for DSL2 and now running with NF 24.10.2 the problem persists, so I'm (re-)raising it as a new issue here.

Expected behavior and actual behavior

Nextflow receives a 500 error from the Kubernetes API and aborts the run; it would be nice if it retried, as I believe the error is transient. We have set maxErrorRetry and are using NF 24.10.2, but this does not appear to help.

The Pod was still running and actually finished successfully about 15 minutes later (as shown by the k8s status of the Pod, which was left undeleted):

  containerStatuses:
    - name: nf-0b2286d5f91a1bb1f5a67b9c8d3ada45-c297d
      state:
        terminated:
          exitCode: 0
          reason: Completed
          startedAt: '2024-12-12T21:05:05Z'
          finishedAt: '2024-12-13T00:46:54Z'

...but Nextflow's response to what is (in our case) a short-lived 500 error appears catastrophic.

Steps to reproduce the problem

We run our workflow and it fails with this error regularly, at almost exactly the same time each day (around 00:33:31); in last night's exception it occurred at exactly that time again. Although we are trying to determine the cause of the underlying disconnect, it would be nice if NF honoured maxErrorRetry. If it is retrying, there is no evidence of it in the log.

Program output

Dec-13 00:33:31.980 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=fragment:fragment_mols (888); work-dir=/work/nextflow/0b/2286d5f91a1bb1f5a67b9c8d3ada45
  error [nextflow.k8s.client.K8sResponseException]: Request GET /api/v1/namespaces/fragmentor/pods/nf-0b2286d5f91a1bb1f5a67b9c8d3ada45-c297d/status returned an error code=500 -- error trying to reach service: tunnel disconnect
Dec-13 00:33:31.997 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'fragment:fragment_mols (888)'

Caused by:
  Request GET /api/v1/namespaces/fragmentor/pods/nf-0b2286d5f91a1bb1f5a67b9c8d3ada45-c297d/status returned an error code=500 -- error trying to reach service: tunnel disconnect


nextflow.k8s.client.K8sResponseException: Request GET /api/v1/namespaces/fragmentor/pods/nf-0b2286d5f91a1bb1f5a67b9c8d3ada45-c297d/status returned an error code=500 -- error trying to reach service: tunnel disconnect
	at nextflow.k8s.client.K8sClient.makeRequestCall(K8sClient.groovy:674)
	at nextflow.k8s.client.K8sClient.makeRequest(K8sClient.groovy:630)
	at nextflow.k8s.client.K8sClient.makeRequest(K8sClient.groovy)
	at nextflow.k8s.client.K8sClient.get(K8sClient.groovy:683)
	at nextflow.k8s.client.K8sClient.podStatus(K8sClient.groovy:319)
	at nextflow.k8s.client.K8sClient.podStatus0(K8sClient.groovy:330)
	at nextflow.k8s.client.K8sClient.podState(K8sClient.groovy:463)
	at nextflow.k8s.K8sTaskHandler.getState(K8sTaskHandler.groovy:343)
	at nextflow.k8s.K8sTaskHandler.checkIfCompleted(K8sTaskHandler.groovy:413)
	at nextflow.processor.TaskPollingMonitor.checkTaskStatus(TaskPollingMonitor.groovy:651)
	at nextflow.processor.TaskPollingMonitor.checkAllTasks(TaskPollingMonitor.groovy:573)
	at nextflow.processor.TaskPollingMonitor.pollLoop(TaskPollingMonitor.groovy:443)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
	at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
	at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
	at nextflow.processor.TaskPollingMonitor$_start_closure2.doCall(TaskPollingMonitor.groovy:318)
	at nextflow.processor.TaskPollingMonitor$_start_closure2.call(TaskPollingMonitor.groovy)
	at groovy.lang.Closure.run(Closure.java:505)
	at java.base/java.lang.Thread.run(Thread.java:829)
Dec-13 00:33:32.005 [Task monitor] DEBUG nextflow.Session - Session aborted -- Cause: Request GET /api/v1/namespaces/fragmentor/pods/nf-0b2286d5f91a1bb1f5a67b9c8d3ada45-c297d/status returned an error code=500 -- error trying to reach service: tunnel disconnect

Environment

  • Nextflow version: 24.10.2
  • Java version: openjdk 11.0.25 2024-10-15
  • Operating system: Debian GNU/Linux 11 (bullseye)
  • Bash version: GNU bash, version 5.1.4(1)-release (x86_64-pc-linux-gnu)

Additional context

Running in an OpenStack Kubernetes cluster (v1.30) with the following Nextflow config:

process {
  pod = [
    nodeSelector: 'informaticsmatters.com/purpose-fragmentor=yes',
    imagePullPolicy: 'Always'
  ]
}
executor {
    name = 'k8s'
    queueSize = 400
}
k8s {
    httpConnectTimeout = '120s'
    httpReadTimeout = '120s'
    maxErrorRetry = '8'
    serviceAccount = 'fragmentor'
    storageClaimName = 'work'
    storageMountPath = '/work'
}
@bentsherman
Member

If it's returning a 500 error code then Nextflow should be retrying it:

try {
    return makeRequestCall( method, path, body )
} catch ( K8sResponseException | SocketException | SocketTimeoutException e ) {
    if ( e instanceof K8sResponseException && e.response.code != 500 )
        throw e
    if ( ++attempt > maxRetries )
        throw e
    log.debug "[K8s] API request threw socket exception: $e.message for $method $path - Retrying request (attempt=$attempt)"
    final long delay = (Math.pow(3, attempt - 1) as long) * 250
    sleep( delay )
}

I wonder if it is retrying, but the tunnel disconnect lasts longer than the total duration of the 8 retries. That seems plausible if it is a regular outage. Can you see the "API request threw socket exception..." messages in your log?
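For example (assuming the usual .nextflow.log in the launch directory), something like this should show them if they are there:

grep 'API request threw socket exception' .nextflow.log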

@alanbchristie
Author

alanbchristie commented Dec 13, 2024

There are no such messages, so it looks like maxRetries is not being set as expected?

Is there a way I can confirm whether Nextflow is reading the maxErrorRetry = '8' value from the $HOME/.nextflow/config file?

In case it's significant, I am also starting Nextflow with an explicit -c option (pointing at a file that is empty) ... but I understand that configurations are merged, so using an empty file should not disturb anything read from $HOME/.nextflow/config? After all, it must be reading the config described above, otherwise it wouldn't know to use a Kubernetes executor. Here's the initial log indicating that it has found the files...

Dec-12 17:47:07.443 [main] DEBUG nextflow.cli.CmdRun - N E X T F L O W  ~  version 24.10.2
Dec-12 17:47:07.480 [main] DEBUG nextflow.plugin.PluginsFacade - Setting up plugin manager > mode=prod; embedded=false; plugins-dir=/root/.nextflow/plugins; core-plugins: [email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected]
Dec-12 17:47:07.517 [main] INFO  o.pf4j.DefaultPluginStatusProvider - Enabled plugins: []
Dec-12 17:47:07.519 [main] INFO  o.pf4j.DefaultPluginStatusProvider - Disabled plugins: []
Dec-12 17:47:07.522 [main] INFO  org.pf4j.DefaultPluginManager - PF4J version 3.12.0 in 'deployment' mode
Dec-12 17:47:07.534 [main] INFO  org.pf4j.AbstractPluginManager - No plugins
Dec-12 17:47:07.555 [main] DEBUG nextflow.config.ConfigBuilder - Found config home: /root/.nextflow/config
Dec-12 17:47:07.557 [main] DEBUG nextflow.config.ConfigBuilder - Found config base: /root/nextflow/nextflow.config
Dec-12 17:47:07.560 [main] DEBUG nextflow.config.ConfigBuilder - User config file: /root/ansible/roles/fragment/../../../nextflow/nextflow.config
Dec-12 17:47:07.562 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /root/.nextflow/config
Dec-12 17:47:07.563 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /root/nextflow/nextflow.config
Dec-12 17:47:07.563 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /root/nextflow/nextflow.config
Dec-12 17:47:07.587 [main] DEBUG n.secret.LocalSecretsProvider - Secrets store: /root/.nextflow/secrets/store.json
Dec-12 17:47:07.590 [main] DEBUG nextflow.secret.SecretsLoader - Discovered secrets providers: [nextflow.secret.LocalSecretsProvider@39ce27f2] - activable => nextflow.secret.LocalSecretsProvider@39ce27f2
Dec-12 17:47:07.624 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Dec-12 17:47:08.173 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Dec-12 17:47:08.211 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Dec-12 17:47:08.248 [main] DEBUG nextflow.cli.CmdRun - Applied DSL=2 from script declaration

It's the /root/nextflow/nextflow.config file that exists but is empty.

The $HOME/.nextflow/config file is shown in the Additional context section of the initial issue text above.

@alanbchristie
Author

alanbchristie commented Dec 13, 2024

Looking at the code, using maxErrorRetry = '8' means that the delay before the final attempt (8) will be 3^7 × 250 ms, i.e. 2,187 × 250 ms, or about 9 minutes, and I think the total period covered by 8 retries is about 14 minutes.

This should be sufficient but it's clear the retries are not being used.

And the documentation for the parameter says "Defines the Kubernetes API max request retries (default: 4)", but clearly the default isn't being used either.

Just a thought ... it might be worth explaining the 'math' behind the parameter in the documentation so users understand how long a given number of retries will take, and perhaps providing a small example table of retries and the anticipated maximum retry period.
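For illustration (this is just my reading of the back-off code quoted above, not anything official), a quick Groovy sketch of the delays for maxErrorRetry = 8:

// Rough reproduction of the quoted back-off computation for attempts 1..8
long total = 0
(1..8).each { attempt ->
    long delay = (Math.pow(3, attempt - 1) as long) * 250   // milliseconds
    total += delay
    println "attempt=${attempt} delay=${delay} ms"
}
println "total back-off ~ ${total / 1000 / 60} minutes"      // ~13.7 minutes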

@bentsherman
Member

You can run nextflow config ... to verify the resolved config. But yes, I would expect other problems if that config wasn't being included.
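For example (just an illustration, assuming it is run from the same launch directory and profile), something like:

nextflow config -flat | grep maxErrorRetry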

You can also try setting the retry as a number:

k8s.maxErrorRetry = 8

But I don't expect it to matter because there is some logic to handle strings:

if( target.maxErrorRetry )
    result.maxErrorRetry = target.maxErrorRetry as Integer

@alanbchristie
Author

OK, thanks. It does look like it's picking up my configuration...

# nextflow config
process {
   pod = [nodeSelector:'informaticsmatters.com/purpose-fragmentor=yes', imagePullPolicy:'Always']
}

executor {
   name = 'k8s'
   queueSize = 600
}

k8s {
   httpConnectTimeout = '120s'
   httpReadTimeout = '120s'
   maxErrorRetry = '8'
   serviceAccount = 'fragmentor'
   storageClaimName = 'work'
   storageMountPath = '/work'
}

@alanbchristie
Author

alanbchristie commented Dec 13, 2024

The other thing that might be of interest is that the 'base' Nextflow in the container may not be 24.10.2, but it is running in an environment where NXF_VER=24.10.2 (and the log clearly indicates that 24.10.2 is the version that is running).

So...

If the config code is running and it does not find a target.maxErrorRetry entry in the config, why isn't the retry limit 4 (the documented default)? And if it is finding a value, why isn't it 8?
