Metadata associated with this integration can be found here. The relevant code for the integration can be found here.
-
CONSUL CLUSTER: Provides a high-level overview of metrics for a single Consul cluster.
-
CONSUL HEALTH: Provides key metrics for monitoring Consul's performance.
-
CONSUL SERVER: Provides server-specific metrics.
-
CONSUL CLIENT: Provides client-specific metrics.
-
CONSUL CLUSTER
-
Total Services: Shows the total number of services registered with the Consul cluster.
-
Total Nodes: Shows the total number of nodes in the Consul cluster's catalog. Nodes include instances running the Consul agent in either client or server mode, as well as external nodes registered with the Consul store.
-
Number of services by node: Descending list showing the number of services that are registered with a given node. The node name displayed is the Consul NodeName config value.
-
Number of Nodes by Service: Descending list showing the number of nodes that are providing a given service in the datacenter.
-
Service health check results: A list showing the results of service health checks that are registered with Consul. A check can be in one of three states: passing, warning, or critical.
-
Node health check results: Node checks are done at the individual host level. If a host fails a check, all services registered with it are marked as failed and Consul no longer returns the node in service discovery requests. The chart is a list showing the results of node health checks. A check can be in one of three states: passing, warning, or critical.
-
Total Peers: Number of Consul Raft peers (Consul agents in server mode) in a given datacenter.
-
Consul Server Map: Displays the followers and leader in a given datacenter.
-
Mean node network latency: Shows the average latency of a given node from other nodes in the Consul cluster. The dimension consul_node corresponds to the source node. The maximum and minimum values for this metric are also available.
-
Mean datacenter latency: Average latency between two datacenters. This metric has the additional dimension destination_dc. The latency is calculated between this destination datacenter and the agent's datacenter, which is given by the datacenter dimension. The maximum and minimum values for this metric are also available.
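The mean, minimum, and maximum latency values described above can be pictured with a small sketch. This is only an illustration of the aggregation, not the plugin's code; the `latency_summary` helper and the sample values are hypothetical.

```python
from statistics import mean


def latency_summary(latencies_ms):
    """Return (mean, min, max) for one node's latencies to its peers."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    return mean(latencies_ms), min(latencies_ms), max(latencies_ms)


# Hypothetical latency estimates (ms) from one source node to three other nodes.
samples = {"node-b": 1.0, "node-c": 2.0, "node-d": 6.0}
avg_ms, min_ms, max_ms = latency_summary(list(samples.values()))
```

In this picture, each summary would be reported with consul_node set to the source node.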
-
CONSUL HEALTH
-
Leadership Change Event: Event feed showing leader transition events. Each event has the new and old leader node names as dimensions.
-
Leadership Transitions: Tracks the number of leadership transitions. Frequent leadership changes may indicate that the servers are overloaded and aren't meeting the soft real-time requirements for Raft, or that there are networking problems between the servers.
-
Leader last contact with followers: Shows the time since the leader was last able to contact the follower nodes when checking its leader lease. This chart can be used as a measure for how stable the Raft timing is and how close the leader is to timing out its lease.
-
Leader latency to commit to disk: Time it takes for the leader to write log entries to disk.
-
Raft commit time: Time it takes to commit a new entry to the Raft log on the leader.
-
Number of Raft Transactions: This is a general indicator of the write load on the Consul servers.
-
Leader Time to Append Entries: Measures the time it takes the leader to replicate log entries to followers. This is a general indicator of the load pressure on the Consul servers, as well as the performance of the communication between the servers.
-
Number of RPC queries: Total number of RPC queries per interval. This chart is a general measure of all read volume.
-
Cluster Joins and Leaves: Tracks successful node joins and leaves in the Serf memberlist.
-
Leader time to reconcile: Shows the time it takes for the leader to reconcile Serf membership with what is reflected in Consul's store.
-
Serf Events: Consul provides an event feature by which custom events can be propagated across your entire datacenter. This chart shows the number of events processed by Consul agents per interval. Using this chart, you can track whether triggered events were processed by a Consul node. You can also set up a chart to track events for a selected node in the CLIENT and SERVER dashboards.
-
Serf Event Queue: Shows the average and maximum backlog of Serf events queued on Consul agents.
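One way to reason about "frequent" leadership changes is to count transitions over a trailing time window. The sketch below assumes the transition metric is sampled as a cumulative count, which may not match how the plugin reports it. The `transitions_in_window` helper and the sample data are hypothetical.

```python
def transitions_in_window(samples, window_s):
    """Count leadership transitions in the trailing window.

    samples: list of (timestamp_s, cumulative_count) tuples sorted by time.
    Assumes a cumulative counter; this is an illustration only.
    """
    if not samples:
        return 0
    latest_ts, latest_count = samples[-1]
    # The oldest sample still inside the window serves as the baseline.
    baseline = latest_count
    for ts, count in samples:
        if latest_ts - ts <= window_s:
            baseline = count
            break
    return latest_count - baseline


# Hypothetical samples taken once a minute.
history = [(0, 0), (60, 1), (120, 1), (180, 3)]
recent = transitions_in_window(history, window_s=120)
```

What counts as too many transitions depends on the cluster, so any alert threshold built on this is a judgment call.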
-
CONSUL CLIENT
-
Number of allocated heap objects: Shows the number of heap objects allocated to the Consul process. Indicates memory pressure on a Consul node.
-
Allocated Bytes: Shows the number of bytes allocated to the Consul process.
-
Number of GO routines: Shows the number of goroutines Consul is running. This is a general load pressure indicator for the Consul agent.
-
Network Latency: Shows the average, maximum, and minimum network latency between the node and other nodes in the datacenter.
-
Time to service DNS queries: Consul provides both DNS and HTTP interfaces for service discovery. This chart shows the time it takes to service forward and reverse DNS lookups by the selected node.
-
CONSUL SERVER: All charts mentioned in the Client dashboard are also present in the Server dashboard. In addition to those, the following charts are present.
All metrics reported by the Consul collectd plugin will contain the following dimensions by default:
-
datacenter: the datacenter to which the Consul agent belongs. The value for this dimension is read from the agent's configuration.
-
consul_node: the Consul node name as seen in the Consul agent's configuration.
-
consul_mode: whether the Consul agent is in client or server mode.
The metric consul.is_leader is reported by Consul servers and has the dimension consul_server_state, which can be either leader or follower.
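As an illustration of the default dimensions described above, the following sketch builds a dimension set from an agent's configuration. The `default_dimensions` helper and the `agent_config` keys (`datacenter`, `node_name`, `server`) are hypothetical; the plugin may read and name these values differently.

```python
def default_dimensions(agent_config, is_leader=None):
    """Build the default dimension set from an agent's configuration.

    agent_config is a hypothetical dict with 'datacenter', 'node_name',
    and 'server' keys.
    """
    dims = {
        "datacenter": agent_config["datacenter"],
        "consul_node": agent_config["node_name"],
        "consul_mode": "server" if agent_config["server"] else "client",
    }
    # consul_server_state applies only to consul.is_leader, which servers report.
    if agent_config["server"] and is_leader is not None:
        dims["consul_server_state"] = "leader" if is_leader else "follower"
    return dims


server_dims = default_dimensions(
    {"datacenter": "dc1", "node_name": "node-1", "server": True},
    is_leader=False,
)
client_dims = default_dimensions(
    {"datacenter": "dc1", "node_name": "node-2", "server": False}
)
```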
Additional default metrics to track:
-
consul.memberlist.msg.suspect: This metric counts the number of times an agent suspects another as failed when executing random probes as part of the gossip protocol. This metric can be an indicator of overloaded agents, network problems, or configuration errors where agents cannot connect to each other on the required ports.
-
consul.serf.member.flap: This metric tracks when an agent is marked dead and then recovers within a short time period. This metric can be an indicator of overloaded agents, network problems, or configuration errors where agents cannot connect to each other on the required ports.
-
consul.dns.stale_queries: This metric tracks when an agent serves a DNS query based on information from a server that is more than 5 seconds out of date.
Additional details:
-
plugin is always set to consul.
-
To add additional metrics from the telemetry stream or the /agent/metrics endpoint, use the configuration options mentioned in configuration. If metrics are being included individually, make sure to give valid prefixes. For example, Consul emits the metrics that track the time taken to serve HTTP requests in the form consul.http.<verb>.<path>. To enable metrics that track the time taken to service GET requests on the Key/Value endpoint, add consul.http.GET.v1.kv to the IncludeMetric configuration. To allow metrics that track the time taken to service all GET requests, add consul.http.GET to the configuration. When enhanced metrics are enabled, you can block metrics in a similar manner.
-
The metrics from the /agent/metrics endpoint are aggregated over an interval of 10 seconds. Keep this in mind when changing the default collectd interval from 10 seconds.
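The prefix behavior described for IncludeMetric can be sketched as follows. This is a minimal illustration, not the plugin's implementation. The `is_included` helper is hypothetical, and matching on whole dot-separated segments is an assumption.

```python
def is_included(metric_name, include_prefixes):
    """Return True if metric_name falls under any configured prefix.

    Assumption: a prefix matches on whole dot-separated segments, so
    'consul.http.GET' matches 'consul.http.GET.v1.kv' but not
    'consul.http.GETX'.
    """
    for prefix in include_prefixes:
        if metric_name == prefix or metric_name.startswith(prefix + "."):
            return True
    return False


# Only Key/Value GET timings are included here.
kv_only = ["consul.http.GET.v1.kv"]
# All GET timings are included here.
all_gets = ["consul.http.GET"]
```

For example, `is_included("consul.http.GET.v1.catalog", all_gets)` is true, while the same metric name checked against `kv_only` is not.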
This integration is released under the Apache 2.0 license. See LICENSE for more details.