-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
checkpoint: improve HTTP error handling #6130
base: develop
Are you sure you want to change the base?
Conversation
When recovering from an URL, only continue without failing if checkpoint data request fails with 404 code. If the request fails with other 4xx HTTP error code, go-spacemesh startup fails. Upon other errors, including 5xx HTTP errors and network errors, retry checkpoint data request several times. `recovery.retry-count` and `recovery.retry-interval` specify retry parameters, defaulting to 5 and 3s, respectively.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #6130 +/- ##
=======================================
Coverage 81.7% 81.8%
=======================================
Files 302 301 -1
Lines 32637 32620 -17
=======================================
+ Hits 26695 26710 +15
+ Misses 4204 4181 -23
+ Partials 1738 1729 -9 ☔ View full report in Codecov by Sentry. |
ab5cdfb
to
05a5cf6
Compare
When this flag is specified, go-spacemesh doesn't exit when recovery URI request fails. Retries are still done, though. This is needed b/c in the context of TestRecovery systest, all the nodes are created with --recovery-uri flag passed, including the bootnodes. This way, they can't successfully start unless the bootstrapper is already running. But the bootstrapper requires an working go-spacemesh node to start up, thus we get a circular dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we want to default to ignoring failing requests for checkpoints in system tests? Especially in the tests that do a checkpoint and restore these should be caught and logged and cause the test to fail early (failing to fetch the checkpoint) instead of having the restore "succeed" and then fail at assertion.
I think instead we need to make sure that the pods serving the checkpoints are up and available before starting the smeshing nodes. Wdys?
The problem is more complicated than that, I explained it in the commit message but perhaps should comment in the code: This is needed b/c in the context of TestRecovery systest, all the nodes are created with --recovery-uri flag passed, including the bootnodes. This way, they can't successfully start unless the bootstrapper is already running. But the bootstrapper requires an working go-spacemesh node to start up, thus we get a circulardependency. This circular dependency is rather hard to resolve, it involves avoiding bootstrapper restart but that's not easily doable b/c it will involve some substantial restructuring of systest code |
Added a comment explaining why go-spacemesh needs to continue after unsuccessfully retrying checkpoint recovery several times. We may want to fix this unfortunate circular dependency of node-bootstrapper startup in a followup. |
Motivation
When
recovery.recovery-uri
is configured and an error happens, such as network glitch or a temporary server error due to overload, go-spacemesh continues its execution without recovery, which may not be always intended behavior. This, for example, causesTestCheckpoint
systest flakes sometimes.The way to fix it is to retry a few times upon transient errors (such as network glitch or a 503 server error), terminating after all the retries are exhausted, continue without an error upon
404 Not Found
, and terminate go-spacemesh in other cases.Description
When recovering from an URL, only continue without failing if checkpoint data request fails with 404 code. If the request fails with other 4xx HTTP error code, go-spacemesh startup fails. Upon other errors, including 5xx HTTP errors and network errors, retry checkpoint data request several times.
recovery.retry-max
andrecovery.retry-delay
specify retry parameters, defaulting to 5 and 3s, respectively. go-spacemesh terminates if all the retries fail.Test Plan
Systests should pass