healthcheck: exclude hosts when receiving x-envoy-immediate-health-check-fail (envoyproxy#14772)

* Send x-envoy-immediate-health-check-fail on all responses that the
  health check filter processes, not just non-HC responses.
* Exclude hosts from load balancing when x-envoy-immediate-health-check-fail
  is received.
* Can be reverted via the envoy.reloadable_features.health_check.immediate_failure_exclude_from_cluster
  feature flag.

Fixes envoyproxy#9246

Signed-off-by: Matt Klein <[email protected]>
mattklein123 authored Feb 1, 2021
1 parent 241a955 commit deed328
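
The first bullet of the commit message can be read as a small decision rule on the sending side. The Python sketch below is only an illustration of that rule, not Envoy's C++ implementation: `response_headers`, its parameters, and the header value `"true"` are assumptions made for the example (the docs only say the header may be set to any value).

```python
IMMEDIATE_FAIL_HEADER = "x-envoy-immediate-health-check-fail"


def response_headers(is_health_check_request: bool,
                     health_check_failed: bool,
                     flag_enabled: bool = True) -> dict:
    """Return the extra response headers a failed instance would add.

    Hypothetical helper that models the behavior described in the commit
    message; it is not part of Envoy.
    """
    headers = {}
    if not health_check_failed:
        return headers
    # Flag on (new behavior): every response carries the header.
    # Flag off (old behavior): only non-health-check responses carry it.
    if flag_enabled or not is_health_check_request:
        headers[IMMEDIATE_FAIL_HEADER] = "true"  # assumed value; any value works
    return headers
```

With the flag enabled, a health check response from an instance in the failed state carries the header; with the flag disabled, only normal responses do.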
Showing 39 changed files with 505 additions and 225 deletions.
9 changes: 8 additions & 1 deletion api/envoy/admin/v3/clusters.proto
@@ -139,7 +139,7 @@ message HostStatus {
}

// Health status for a host.
-// [#next-free-field: 7]
+// [#next-free-field: 9]
message HostHealthStatus {
option (udpa.annotations.versioning).previous_message_type =
"envoy.admin.v2alpha.HostHealthStatus";
@@ -160,6 +160,13 @@ message HostHealthStatus {
// The host has not yet been health checked.
bool pending_active_hc = 6;

+// The host should be excluded from panic, spillover, etc. calculations because it was explicitly
+// taken out of rotation via protocol signal and is not meant to be routed to.
+bool excluded_via_immediate_hc_fail = 7;
+
+// The host failed active HC due to timeout.
+bool active_hc_timeout = 8;
+
// Health status as reported by EDS. Note: only HEALTHY and UNHEALTHY are currently supported
// here.
// [#comment:TODO(mrice32): pipe through remaining EDS health status possibilities.]
9 changes: 8 additions & 1 deletion api/envoy/admin/v4alpha/clusters.proto

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

22 changes: 3 additions & 19 deletions api/envoy/config/cluster/v3/cluster.proto
@@ -536,25 +536,9 @@ message Cluster {
// https://github.com/envoyproxy/envoy/pull/3941.
google.protobuf.Duration update_merge_window = 4;

-// If set to true, Envoy will not consider new hosts when computing load balancing weights until
-// they have been health checked for the first time. This will have no effect unless
-// active health checking is also configured.
-//
-// Ignoring a host means that for any load balancing calculations that adjust weights based
-// on the ratio of eligible hosts and total hosts (priority spillover, locality weighting and
-// panic mode) Envoy will exclude these hosts in the denominator.
-//
-// For example, with hosts in two priorities P0 and P1, where P0 looks like
-// {healthy, unhealthy (new), unhealthy (new)}
-// and where P1 looks like
-// {healthy, healthy}
-// all traffic will still hit P0, as 1 / (3 - 2) = 1.
-//
-// Enabling this will allow scaling up the number of hosts for a given cluster without entering
-// panic mode or triggering priority spillover, assuming the hosts pass the first health check.
-//
-// If panic mode is triggered, new hosts are still eligible for traffic; they simply do not
-// contribute to the calculation when deciding whether panic mode is enabled or not.
+// If set to true, Envoy will :ref:`exclude <arch_overview_load_balancing_excluded>` new hosts
+// when computing load balancing weights until they have been health checked for the first time.
+// This will have no effect unless active health checking is also configured.
bool ignore_new_hosts_until_first_hc = 5;

// If set to `true`, the cluster manager will drain all existing
22 changes: 3 additions & 19 deletions api/envoy/config/cluster/v4alpha/cluster.proto

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions api/envoy/data/core/v3/health_check_event.proto
@@ -22,6 +22,7 @@ enum HealthCheckFailureType {
ACTIVE = 0;
PASSIVE = 1;
NETWORK = 2;
+NETWORK_TIMEOUT = 3;
}

enum HealthCheckerType {

@@ -11,7 +11,8 @@ Health check

Note that the filter will automatically fail health checks and set the
:ref:`x-envoy-immediate-health-check-fail
-<config_http_filters_router_x-envoy-immediate-health-check-fail>` header if the
-:ref:`/healthcheck/fail <operations_admin_interface_healthcheck_fail>` admin endpoint has been
-called. (The :ref:`/healthcheck/ok <operations_admin_interface_healthcheck_ok>` admin endpoint
-reverses this behavior).
+<config_http_filters_router_x-envoy-immediate-health-check-fail>` header on all responses (both
+health check and normal requests) if the :ref:`/healthcheck/fail
+<operations_admin_interface_healthcheck_fail>` admin endpoint has been called. (The
+:ref:`/healthcheck/ok <operations_admin_interface_healthcheck_ok>` admin endpoint reverses this
+behavior).
13 changes: 7 additions & 6 deletions docs/root/configuration/http/http_filters/router_filter.rst
@@ -227,7 +227,7 @@ x-envoy-upstream-rq-timeout-alt-response
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Setting this header will cause Envoy to set a 204 response code (instead of 504) in the event of a request timeout.
-The actual value of the header is ignored; only its presence is considered. See also
+The actual value of the header is ignored; only its presence is considered. See also
:ref:`config_http_filters_router_x-envoy-upstream-rq-timeout-ms`.

.. _config_http_filters_router_x-envoy-upstream-rq-timeout-ms:
@@ -294,11 +294,12 @@ x-envoy-immediate-health-check-fail

If the upstream host returns this header (set to any value), Envoy will immediately assume the
upstream host has failed :ref:`active health checking <arch_overview_health_checking>` (if the
-cluster has been :ref:`configured <config_cluster_manager_cluster_hc>` for active health checking).
-This can be used to fast fail an upstream host via standard data plane processing without waiting
-for the next health check interval. The host can become healthy again via standard active health
-checks. See the :ref:`health checking overview <arch_overview_health_checking>` for more
-information.
+cluster has been :ref:`configured <config_cluster_manager_cluster_hc>` for active health checking)
+and :ref:`exclude <arch_overview_load_balancing_excluded>` it from load balancing. This can be used
+to fast fail an upstream host via standard data plane processing without waiting for the next health
+check interval. The host can become healthy again via standard active health checks. See the
+:ref:`active health checking fast failure overview <arch_overview_health_checking_fast_failure>` for
+more information.

.. _config_http_filters_router_x-envoy-ratelimited:

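
The receiving side described in the router filter hunk above can be sketched as a small state update. This is a hedged Python model, not Envoy code: the `Host` class and both functions are hypothetical names, and clearing the exclusion on a later passing health check is an assumption based on the statement that the host "can become healthy again via standard active health checks".

```python
from dataclasses import dataclass

IMMEDIATE_FAIL_HEADER = "x-envoy-immediate-health-check-fail"


@dataclass
class Host:
    """Hypothetical per-host state tracked by the receiving Envoy."""
    failed_active_hc: bool = False
    excluded: bool = False


def on_upstream_response(host, headers, cluster_has_active_hc, flag_enabled=True):
    """Update host state when a routed or health check response arrives."""
    # HTTP header names are case-insensitive; the header value is ignored.
    names = {name.lower() for name in headers}
    if IMMEDIATE_FAIL_HEADER not in names or not cluster_has_active_hc:
        return
    host.failed_active_hc = True
    # Exclusion from load balancing is the part gated by the feature flag.
    if flag_enabled:
        host.excluded = True


def on_active_hc_success(host):
    """Assumed recovery path: a passing active health check restores the host."""
    host.failed_active_hc = False
    host.excluded = False
```

Note that nothing happens when the cluster has no active health checking configured, which matches the condition stated in the docs.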
@@ -64,7 +64,7 @@ Every cluster has a statistics tree rooted at *cluster.<name>.* with the followi
upstream_rq_active, Gauge, Total active requests
upstream_rq_pending_total, Counter, Total requests pending a connection pool connection
upstream_rq_pending_overflow, Counter, Total requests that overflowed connection pool or requests (mainly for HTTP/2) circuit breaking and were failed
-upstream_rq_pending_failure_eject, Counter, Total requests that were failed due to a connection pool connection failure or remote connection termination
+upstream_rq_pending_failure_eject, Counter, Total requests that were failed due to a connection pool connection failure or remote connection termination
upstream_rq_pending_active, Gauge, Total active requests pending a connection pool connection
upstream_rq_cancelled, Counter, Total requests cancelled before obtaining a connection pool connection
upstream_rq_maintenance_mode, Counter, Total requests that resulted in an immediate 503 due to :ref:`maintenance mode<config_http_filters_router_runtime_maintenance_mode>`
@@ -87,7 +87,8 @@ Every cluster has a statistics tree rooted at *cluster.<name>.* with the followi
upstream_internal_redirect_succeed_total, Counter, Total number of times internal redirects resulted in a second upstream request.
membership_change, Counter, Total cluster membership changes
membership_healthy, Gauge, Current cluster healthy total (inclusive of both health checking and outlier detection)
-membership_degraded, Gauge, Current cluster degraded total
+membership_degraded, Gauge, Current cluster :ref:`degraded <arch_overview_load_balancing_degraded>` total
+membership_excluded, Gauge, Current cluster :ref:`excluded <arch_overview_load_balancing_excluded>` total
membership_total, Gauge, Current cluster membership total
retry_or_shadow_abandoned, Counter, Total number of times shadowing or retry buffering was canceled due to buffer limits
config_reload, Counter, Total API fetches that resulted in a config reload due to a different config
12 changes: 8 additions & 4 deletions docs/root/intro/arch_overview/upstream/health_checking.rst
@@ -121,6 +121,8 @@ Further reading:
* :ref:`/healthcheck/fail <operations_admin_interface_healthcheck_fail>` admin endpoint.
* :ref:`/healthcheck/ok <operations_admin_interface_healthcheck_ok>` admin endpoint.

+.. _arch_overview_health_checking_fast_failure:
+
Active health checking fast failure
-----------------------------------

@@ -129,10 +131,12 @@ When using active health checking along with passive health checking (:ref:`outl
large amount of active health checking traffic. In this case, it is still useful to be able to
quickly drain an upstream host when using the :ref:`/healthcheck/fail
<operations_admin_interface_healthcheck_fail>` admin endpoint. To support this, the :ref:`router
-filter <config_http_filters_router>` will respond to the :ref:`x-envoy-immediate-health-check-fail
+filter <config_http_filters_router>` *and* the HTTP active health checker will respond to the
+:ref:`x-envoy-immediate-health-check-fail
<config_http_filters_router_x-envoy-immediate-health-check-fail>` header. If this header is set by
-an upstream host, Envoy will immediately mark the host as being failed for active health check. Note
-that this only occurs if the host's cluster has active health checking :ref:`configured
+an upstream host, Envoy will immediately mark the host as being failed for active health check and
+:ref:`excluded <arch_overview_load_balancing_excluded>` from load balancing. Note that this only
+occurs if the host's cluster has active health checking :ref:`configured
<config_cluster_manager_cluster_hc>`. The :ref:`health checking filter
<config_http_filters_health_check>` will automatically set this header if Envoy has been marked as
failed via the :ref:`/healthcheck/fail <operations_admin_interface_healthcheck_fail>` admin
@@ -152,7 +156,7 @@ is that overall configuration becomes more complicated as every health check URL

The Envoy HTTP health checker supports the :ref:`service_name_matcher
<envoy_v3_api_field_config.core.v3.HealthCheck.HttpHealthCheck.service_name_matcher>` option. If this option is set,
-the health checker additionally compares the value of the *x-envoy-upstream-healthchecked-cluster*
+the health checker additionally compares the value of the *x-envoy-upstream-healthchecked-cluster*
response header to *service_name_matcher*. If the values do not match, the health check does not pass.
The upstream health check filter appends *x-envoy-upstream-healthchecked-cluster* to the response headers.
The appended value is determined by the :option:`--service-cluster` command line option.
29 changes: 29 additions & 0 deletions docs/root/intro/arch_overview/upstream/load_balancing/excluded.rst
@@ -0,0 +1,29 @@
+.. _arch_overview_load_balancing_excluded:
+
+Excluded endpoints
+------------------
+
+Certain conditions may cause Envoy to *exclude* endpoints from load balancing. Excluding a host
+means that for any load balancing calculations that adjust weights based on the ratio of eligible
+hosts and total hosts (priority spillover, locality weighting and panic mode) Envoy will exclude
+these hosts in the denominator.
+
+For example, with hosts in two priorities P0 and P1, where P0 looks like {healthy, unhealthy
+(excluded), unhealthy (excluded)} and where P1 looks like {healthy, healthy} all traffic will still
+hit P0, as 1 / (3 - 2) = 1.
+
+Excluded hosts allow scaling up or down the number of hosts for a given cluster without entering
+panic mode or triggering priority spillover.
+
+If panic mode is triggered, excluded hosts are still eligible for traffic; they simply do not
+contribute to the calculation when deciding whether panic mode is enabled or not.
+
+Currently, the following two conditions can lead to a host being excluded when using active
+health checking:
+
+* Using the :ref:`ignore_new_hosts_until_first_hc
+<envoy_api_field_Cluster.CommonLbConfig.ignore_new_hosts_until_first_hc>` cluster option.
+* Receiving the :ref:`x-envoy-immediate-health-check-fail
+<config_http_filters_router_x-envoy-immediate-health-check-fail>` header in a normal routed
+response or in response to an :ref:`HTTP active health check
+<arch_overview_health_checking_fast_failure>`.
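
The arithmetic in the new excluded.rst file (1 / (3 - 2) = 1) can be checked with a short Python sketch. It is deliberately simplified and makes assumptions: `healthy_ratio` is a hypothetical helper, it ignores other inputs to Envoy's real calculation such as the overprovisioning factor, and returning 0.0 when no eligible hosts remain is a choice made for the example.

```python
def healthy_ratio(healthy: int, total: int, excluded: int) -> float:
    """Healthy fraction of a priority with excluded hosts removed from the denominator."""
    eligible_total = total - excluded
    if eligible_total <= 0:
        return 0.0  # assumption: no eligible hosts means no healthy capacity
    return healthy / eligible_total


# P0 from the example: {healthy, unhealthy (excluded), unhealthy (excluded)}
p0_without_exclusion = healthy_ratio(healthy=1, total=3, excluded=0)
p0_with_exclusion = healthy_ratio(healthy=1, total=3, excluded=2)
```

Without exclusion P0 looks only one third healthy, which is the situation that would push traffic toward P1 or into panic mode; with the two hosts excluded the ratio is 1 and all traffic stays on P0.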

@@ -11,6 +11,7 @@ Load Balancing
locality_weight
overprovisioning
panic_threshold
+excluded
original_dst
zone_aware
subsets
15 changes: 15 additions & 0 deletions docs/root/version_history/current.rst
@@ -14,6 +14,21 @@ Minor Behavior Changes
----------------------
*Changes that may cause incompatibilities for some users, but should not for most*

+* healthcheck: the :ref:`health check filter <config_http_filters_health_check>` now sends the
+:ref:`x-envoy-immediate-health-check-fail <config_http_filters_router_x-envoy-immediate-health-check-fail>` header
+for all responses when Envoy is in the health check failed state. Additionally, receiving the
+:ref:`x-envoy-immediate-health-check-fail <config_http_filters_router_x-envoy-immediate-health-check-fail>`
+header (either in response to normal traffic or in response to an HTTP :ref:`active health check <arch_overview_health_checking>`) will
+cause Envoy to immediately :ref:`exclude <arch_overview_load_balancing_excluded>` the host from
+load balancing calculations. This has the useful property that such hosts, which are being
+explicitly told to disable traffic, will not be counted for panic routing calculations. See the
+excluded documentation for more information. This behavior can be temporarily reverted by setting
+the `envoy.reloadable_features.health_check.immediate_failure_exclude_from_cluster` feature flag
+to false. Note that the runtime flag covers *both* the health check filter responding with
+`x-envoy-immediate-health-check-fail` in all cases (versus just non-HC requests) as well as
+whether receiving `x-envoy-immediate-health-check-fail` will cause exclusion or not. Thus,
+depending on the Envoy deployment, the feature flag may need to be flipped on both downstream
+and upstream instances, depending on the reason.
* http: allow to use path canonicalizer from `googleurl <https://quiche.googlesource.com/googleurl>`_
instead of `//source/common/chromium_url`. The new path canonicalizer is enabled by default. To
revert to the legacy path canonicalizer, enable the runtime flag
9 changes: 8 additions & 1 deletion generated_api_shadow/envoy/admin/v3/clusters.proto

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

9 changes: 8 additions & 1 deletion generated_api_shadow/envoy/admin/v4alpha/clusters.proto

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

22 changes: 3 additions & 19 deletions generated_api_shadow/envoy/config/cluster/v3/cluster.proto

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
