-
Notifications
You must be signed in to change notification settings - Fork 935
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a graceful shutdown timeout before closing the default ClientFactory
#5718
base: main
Are you sure you want to change the base?
Conversation
…tory` Motivation: The default `ClientFactory` closes immediately when a JVM shuts down. The closed `ClientFactory` prevents the all clients using the default factory from processing requests. This is not a problem when the client is used alone. However, many clients are used in a server to fetch or deliver data when the server receives a request. If the `ClientFactory` is closed while the server is still in graceful shutdown mode, all requests will fail. This behavior cannot be called a graceful shutdown. In this case, an `EndpointSelectionTimeoutException` is raised by `HealthCheckedEndointGroup` due to termination of the default `ClientFactory` used by `HttpHealthChecker`. ```java com.linecorp.armeria.client.UnprocessedRequestException: com.linecorp.armeria.client.endpoint.EndpointSelectionTimeoutException: Failed to select within 6400 ms an endpoint from: HealthCheckedEndpointGroup{endpoints=[], numEndpoints=0, candidates=[Endpoint{example.com, ipAddr=x.x.x.x, weight=1000}, ...], numCandidates=8, ..., initialized=true, initialSelectionTimeoutMillis=10000, selectionTimeoutMillis=6400, contextGroupChain=[]} at com.linecorp.armeria.client.UnprocessedRequestException.of(UnprocessedRequestException.java:45) at com.linecorp.armeria.client.HttpClientDelegate.earlyFailedResponse(HttpClientDelegate.java:228) ... Caused by: com.linecorp.armeria.client.endpoint.EndpointSelectionTimeoutException: Failed to select within 6400 ms an endpoint from: HealthCheckedEndpointGroup{endpoints=[], numEndpoints=0, candidates=[Endpoint{example.com, ipAddr=x.x.x.x, weight=1000}, ...], numCandidates=8, ..., initialized=true, initialSelectionTimeoutMillis=10000, selectionTimeoutMillis=6400, contextGroupChain=[]} at com.linecorp.armeria.client.endpoint.EndpointSelectionTimeoutException.get(EndpointSelectionTimeoutException.java:48) at com.linecorp.armeria.client.endpoint.AbstractEndpointSelector.lambda$select$0(AbstractEndpointSelector.java:117) at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153) ... 8 common frames omitted ``` I propose to add a delay before closing the default `ClientFactory` so that the server handles the request during graceful shutdown. Modifications: - Add `Flags.defaultClientFactoryGracefulShutdownTimeoutMillis()` that indicates the default time to wait before closing the default `ClientFactory`. - Add `TestFlagsProvider` to override `defaultClientFactoryGracefulShutdownTimeoutMillis` for rapid iterative testing. Result: - `HealthCheckedEndpointGroup` no longer raises `EndpointSelectionTimeoutException` when a server is stopped by the JVM shutdown hook. - You can now set a delay before the default `ClientFactory` shuts down via `Flags.defaultClientFactoryGracefulShutdownTimeoutMillis()`. If not set, 10 seconds is used by default.
@@ -97,6 +97,7 @@ final class DefaultFlagsProvider implements FlagsProvider { | |||
static final String DNS_CACHE_SPEC = "maximumSize=4096"; | |||
static final long DEFAULT_UNLOGGED_EXCEPTIONS_REPORT_INTERVAL_MILLIS = 10000; | |||
static final long DEFAULT_HTTP1_CONNECTION_CLOSE_DELAY_MILLIS = 3000; | |||
static final long DEFAULT_CLIENT_FACTORY_GRACEFUL_SHUTDOWN_TIMEOUT_MILLIS = 10000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure if this value was sensible. 😅 A shorter value should be needed for rapid restart and a longer value is useful for a more graceful shutdown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can probably do it when we reach 2.0:
- Move
ServerBuilder.DEFAULT_GRACEFUL_SHUTDOWN_TIMEOUT
toFlags
- Increase
ServerBuilder.DEFAULT_GRACEFUL_SHUTDOWN_TIMEOUT
value from zero - Use the value for the client factory as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Let me create an issue based on your idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question) What do you think of just disabling this timeout by default or making it shorter (1s)?
As mentioned in the PR description, this problem occurs when:
- The
DefaultClientFactory
is used within a server. gracefulShutdown
is used for the server and requests are received during the graceful shutdown period despite the health check signal.Server.closeOnJvmShutdown
is being used to shut down the server
which I don't think is the case for many users.
However, with the current code anyone who has armeria
in their classpath and loads DefaultClientFactory.class
will observe their jvm shutting down after 10s which I'm not sure is good user experience 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @jrhee17. We should not wait this long. I'd actually prefer this graceful shutdown feature to be opt-in only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value of 10 seconds was derived from Server
's the default request timeout. Since many users want to gracefully shut down the server, they will set the graceful shutdown timeout to a value longer than the request timeout. They don't want to fail requests that come in just before the health check goes down. So, I set the default client factory to 10 seconds, the same as the default request timeout.
I think this part is controversial, so I would like to discuss it further.
Currently the server's graceful shutdown is disabled for this operation, so opt-in seems to be more sensible. However, I would recommend that most LINE internal users enable defaultClientFactoryGracefulShutdownTimeoutMillis()
option, so I question whether opt-in is truly a good thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For prod, yeah we need it, but for testing and dev, where devs spent most of their time, we probably don't need it? So I'd say we should keep it opt-in only unless we have a notion of 'profile' or 'environment'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct. I wasn't sure about the default behavior either. We need to think more about this issue and decide on a direction.
unless we have a notion of 'profile' or 'environment'.
It sounds like a good idea. We may provide a different FlagsProvider
depending on profile
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks!
@@ -97,6 +97,7 @@ final class DefaultFlagsProvider implements FlagsProvider { | |||
static final String DNS_CACHE_SPEC = "maximumSize=4096"; | |||
static final long DEFAULT_UNLOGGED_EXCEPTIONS_REPORT_INTERVAL_MILLIS = 10000; | |||
static final long DEFAULT_HTTP1_CONNECTION_CLOSE_DELAY_MILLIS = 3000; | |||
static final long DEFAULT_CLIENT_FACTORY_GRACEFUL_SHUTDOWN_TIMEOUT_MILLIS = 10000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can probably do it when we reach 2.0:
- Move
ServerBuilder.DEFAULT_GRACEFUL_SHUTDOWN_TIMEOUT
toFlags
- Increase
ServerBuilder.DEFAULT_GRACEFUL_SHUTDOWN_TIMEOUT
value from zero - Use the value for the client factory as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approach looks good overall 👍 Left a couple questions 🙇
Going forward, I wonder if we may want to maintain a data structure similar to spring's DefaultSingletonBeanRegistry
and define dependencies between Server
and ClientFactory
so that startup/shutdown orders are better defined.
@@ -97,6 +97,7 @@ final class DefaultFlagsProvider implements FlagsProvider { | |||
static final String DNS_CACHE_SPEC = "maximumSize=4096"; | |||
static final long DEFAULT_UNLOGGED_EXCEPTIONS_REPORT_INTERVAL_MILLIS = 10000; | |||
static final long DEFAULT_HTTP1_CONNECTION_CLOSE_DELAY_MILLIS = 3000; | |||
static final long DEFAULT_CLIENT_FACTORY_GRACEFUL_SHUTDOWN_TIMEOUT_MILLIS = 10000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question) What do you think of just disabling this timeout by default or making it shorter (1s)?
As mentioned in the PR description, this problem occurs when:
- The
DefaultClientFactory
is used within a server. gracefulShutdown
is used for the server and requests are received during the graceful shutdown period despite the health check signal.Server.closeOnJvmShutdown
is being used to shut down the server
which I don't think is the case for many users.
However, with the current code anyone who has armeria
in their classpath and loads DefaultClientFactory.class
will observe their jvm shutting down after 10s which I'm not sure is good user experience 🤔
@@ -80,6 +81,16 @@ final class DefaultClientFactory implements ClientFactory { | |||
try { | |||
Runtime.getRuntime().addShutdownHook(new Thread(() -> { | |||
if (!shutdownHookDisabled) { | |||
final long gracefulShutdownTimeoutMillis = | |||
Flags.defaultClientFactoryGracefulShutdownTimeoutMillis(); | |||
if (gracefulShutdownTimeoutMillis > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This option is ignored if you manually close the default {@linkplain ClientFactory} with {@link ClientFactory#closeDefault()}
I saw this in the javadocs, is this true? 🤔
https://github.com/jrhee17/armeria/tree/poc/long-jvm-shutdown
...
14:21:16.132 [main] INFO com.linecorp.armeria.common.Flags - Using Tls engine: OpenSSL BoringSSL, 0x1010107f
14:21:16.165 [main] DEBUG c.l.armeria.client.ClientFactory - Closed the default client factories
14:21:26.178 [Thread-1] DEBUG c.l.armeria.client.ClientFactory - Closing the default client factories
14:21:26.179 [Thread-1] DEBUG c.l.armeria.client.ClientFactory - Closed the default client factories
Flags.defaultClientFactoryGracefulShutdownTimeoutMillis(); | ||
if (gracefulShutdownTimeoutMillis > 0) { | ||
try { | ||
Thread.sleep(gracefulShutdownTimeoutMillis); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Shouldn't we allow a user specify the graceful shutdown timeout for each
ClientFactory
? This PR adds the sleep only to the default factory, which doesn't sound right. - What should we do when a user sends a request during this period? Should we reject them?
- Can we skip waiting when there are no pending requests? Do we need grace period just like we have on the server side shutdown? Just adding a fixed amount of sleep doesn't sound right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we allow a user specify the graceful shutdown timeout for each ClientFactory?
Good idea.
Can we skip waiting when there are no pending requests? Do we need grace period just like we have on the server side shutdown? Just adding a fixed amount of sleep doesn't sound right.
If the server allows new requests for 10 seconds considering inflight requests and health check interval, the client should not be closed even if there are no active requests on the client.
What should we do when a user sends a request during this period? Should we reject them?
Since the server controls incoming traffic through health checks, I am not sure if it is right for the client to reject new requests.
The more I think about it, the problem cannot be solved without linking the lifecycles of the server and client.
@@ -97,6 +97,7 @@ final class DefaultFlagsProvider implements FlagsProvider { | |||
static final String DNS_CACHE_SPEC = "maximumSize=4096"; | |||
static final long DEFAULT_UNLOGGED_EXCEPTIONS_REPORT_INTERVAL_MILLIS = 10000; | |||
static final long DEFAULT_HTTP1_CONNECTION_CLOSE_DELAY_MILLIS = 3000; | |||
static final long DEFAULT_CLIENT_FACTORY_GRACEFUL_SHUTDOWN_TIMEOUT_MILLIS = 10000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @jrhee17. We should not wait this long. I'd actually prefer this graceful shutdown feature to be opt-in only.
…gration Motivation: I tried to add a graceful shutdown timeout to not immediately close the default `ClientFactory`. line#5718 (comment) However, it is a breaking change and no agreement has been reached on the default behavior. I think that line#5718 may take some time to be merged or may not be merged soon. This problem can be solved more easily in Spring Boot. Spring Boot starts the application in the main method and terminates with JVM shutdown. By shutting down the default `ClientFactory` using `SpringApplicationShutdownHook`, the graceful shutdown can be supported without breaking changes for Spring integration, which is used by most users. Modifications: - Use `SpringApplicationShutdownHandlers` to close the default `ClientFactory` for Spring integration. Result: In Spring integration, the default `Clientfactory` is gracefully closed after the server is shut down.
…gration (#5742) Motivation: I tried to add a graceful shutdown timeout to not immediately close the default `ClientFactory`. #5718 (comment) However, it is a breaking change and no agreement has been reached on the default behavior. I think that #5718 may take some time to be merged or may not be merged. This problem can be solved more easily in Spring Boot. Spring Boot starts the application in the main method and terminates with JVM shutdown. By shutting down the default `ClientFactory` using `SpringApplicationShutdownHook`, the graceful shutdown can be supported without breaking changes for Spring integration, which is used by most users. Modifications: - Use `SpringApplicationShutdownHandlers` to close the default `ClientFactory` for Spring integration. Result: In Spring integration, the default `ClientFactory` is now gracefully closed after the server is shut down.
Motivation:
The default
ClientFactory
closes immediately when a JVM shuts down. The closedClientFactory
prevents all clients using the default factory from processing requests.This is not a problem when the client is used alone. However, many clients are used in a server to fetch or deliver data when the server receives a request. If the
ClientFactory
is closed while the server is still in graceful shutdown mode, all requests will fail. This behavior cannot be called a graceful shutdown.In this case, an
EndpointSelectionTimeoutException
is raised byHealthCheckedEndointGroup
due to the termination of the defaultClientFactory
used byHttpHealthChecker
.I propose to add a delay before closing the default
ClientFactory
so that the server handles the request during graceful shutdown.Modifications:
Flags.defaultClientFactoryGracefulShutdownTimeoutMillis()
that indicates the default time to wait before closing the defaultClientFactory
.TestFlagsProvider
to overridedefaultClientFactoryGracefulShutdownTimeoutMillis
for rapid iterative testing.Result:
HealthCheckedEndpointGroup
no longer raisesEndpointSelectionTimeoutException
when a server is stopped by the JVM shutdown hook.ClientFactory
shuts down viaFlags.defaultClientFactoryGracefulShutdownTimeoutMillis()
. If not set, 10 seconds is used by default.