This technical preview of the Prometheus Akamai Global Traffic Management (GTM) Metrics Exporter publishes Akamai GTM Traffic and Liveness Report data as up metrics. With GTM metrics, Prometheus can track GTM property and datacenter request traffic, as well as property liveness errors. Alerts can also be triggered from the generated metrics, e.g., domain datacenter requests exceeding a threshold or the number of liveness test failures for a property exceeding a threshold.
- Install and build the GTM exporter.
- Configure and start the GTM Exporter to generate metrics for Prometheus.
- Validate that the exporter target is live and metrics are available in Prometheus.
- Prometheus environment.
- Go environment.
- Valid API client with authorization to use the Global Traffic Management Reporting API. Akamai API Authentication provides an overview and further information pertaining to the generation of authorization credentials for API based applications and tools.
go get -u github.com/akamai/akamai-gtm-metrics-exporter
A Docker image can be generated by executing the following command:
make docker
The resulting image is named /akamai/akamai-gtm-metrics-exporter-linux-amd64:<git-branch>.
make build
make test
The exporter requires Akamai Open Edgegrid credentials to configure the GTM API connection and can get credentials from:
- An .edgerc file and section set with the exporter configuration file.
- Environment variables.
- Command line arguments.
Configuration for the GTM exporter is usually done in a file in the working directory (e.g., ./gtm_metrics_example_config.yml). An example can be found in gtm_metrics_example_config.yml. This configuration file may contain the following settings.
Configuration element | Description |
---|---|
domains | (Required) Akamai GTM domains to collect traffic metrics from |
edgerc_path | (Optional) Accessible path to Edgegrid credentials file, e.g., /home/test/.edgerc |
edgerc_section | (Optional) Section in the Edgegrid credentials file containing credentials, note: remember to include the edgerc_section if specifying an edgerc_path |
summary_window | (Optional) Rolling window for relevant metric data such as quantiles in [m]ins, [h]ours, or [d]ays. Default: 2 days (2d) |
prefill_window | (Optional) Prefill window for report data retrieval in [m]ins, [h]ours, or [d]ays. Default: 10 minutes (10m) |
timestamp_label | (Optional) Flag indicates if time series should be created with traffic timestamp as label |
traffic_timestamp | (Optional) Flag indicates if time series should be created with the traffic timestamp |
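The window values for summary_window and prefill_window combine a number with an m, h, or d suffix. The following is a minimal sketch of how such a value could be interpreted. It is not the exporter's own parser: the function name parse_window and the strict "digits plus one suffix" format are assumptions made for illustration.

```python
import re
from datetime import timedelta

# Unit suffixes from the settings table: [m]ins, [h]ours, [d]ays.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_window(value: str) -> timedelta:
    """Convert a window string such as '10m' or '2d' into a timedelta.

    Hypothetical helper; the exporter may accept other spellings.
    """
    match = re.fullmatch(r"(\d+)([mhd])", value.strip())
    if match is None:
        raise ValueError(f"unsupported window value: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


print(parse_window("2d"))   # documented summary_window default
print(parse_window("10m"))  # documented prefill_window default
```

Converting the value to a duration makes it easy to compare a configured prefill_window against other time limits before starting the exporter.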
Authentication credentials as environment variables can exist as follows.
Environment Variable | Description |
---|---|
AKAMAI_HOST | Akamai Edgegrid API server |
AKAMAI_ACCESS_TOKEN | Akamai Edgegrid API access token |
AKAMAI_CLIENT_TOKEN | Akamai Edgegrid API client token |
AKAMAI_CLIENT_SECRET | Akamai Edgegrid API client secret |
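Before starting the exporter with environment-based credentials, it can help to confirm that the variables listed above are present. The sketch below is only a pre-flight check written for this document; the helper name missing_credentials is hypothetical, and it assumes all four variables are needed together.

```python
import os

# Variable names taken from the table above.
REQUIRED_VARS = (
    "AKAMAI_HOST",
    "AKAMAI_ACCESS_TOKEN",
    "AKAMAI_CLIENT_TOKEN",
    "AKAMAI_CLIENT_SECRET",
)


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all four Edgegrid variables are set")
```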
Prometheus target configuration is minimal. The following fragment shows settings for a static configuration for a target pointing to the GTM exporter, the scrape interval and the scrape timeout.
global:
  scrape_interval: 15s
  scrape_timeout: 15s
scrape_configs:
  - job_name: 'gtm'
    static_configs:
      - targets: ['docker.for.mac.localhost:9800']
Note: targets point to GTM Exporters.
./akamai-gtm-metrics-traffic-exporter
In the log, the exporter will publish a series of INFO messages to show normal operation. Look for the "Beginning to serve on address" message to learn its port.
INFO[0000] Config file: gtm_metrics_config.yml source="main.go:165"
INFO[0000] Starting GTM Metrics exporter. (version=0.1.0, branch=master, revision=99e6b08228e8772cde72818b5dcdd1b73ae633b1) source="main.go:166"
INFO[0000] Build context: (go=go1.14.9, user=elynes@bos-lhvhpa, date=20210127-19:53:16) source="main.go:167"
INFO[0000] akamai_gtm_metrics_exporter config loaded source="main.go:261"
INFO[0000] GTM Metrics exporter start time: 2021-01-27 15:53:27.062040712 +0000 UTC source="main.go:194"
INFO[0000] Beginning to serve on address :9800 source="main.go:231"
Note: Running the exporter without the settings needed to access the GTM Traffic Reporting API will publish only build info, as shown below. To validate, open the exporter's metrics page in a browser using localhost and the port shown in the INFO startup messages (e.g., http://localhost:9800/metrics).
# HELP akamai_gtm_metrics_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which akamai_gtm_metrics_exporter was built.
# TYPE akamai_gtm_metrics_exporter_build_info gauge
akamai_gtm_metrics_exporter_build_info{branch="master",goversion="go1.15.6",revision="84667d49203590616cd6d1b07d75715eaff31392",version="0.1.0"} 1
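The same check can be scripted. The sketch below reads the "# TYPE" lines of the exposition text and reports whether build info is the only akamai_gtm_ metric family present, which would suggest the GTM API settings are missing. The function names are illustrative and the sample text is abbreviated from the output above.

```python
def metric_families(exposition: str) -> dict:
    """Map metric family name to its type, read from '# TYPE' comment lines."""
    families = {}
    for line in exposition.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            families[parts[2]] = parts[3]
    return families


def only_build_info(exposition: str) -> bool:
    """True when build info is the only akamai_gtm_ family exposed."""
    names = [n for n in metric_families(exposition) if n.startswith("akamai_gtm_")]
    return names == ["akamai_gtm_metrics_exporter_build_info"]


sample = """\
# HELP akamai_gtm_metrics_exporter_build_info A metric with a constant '1' value.
# TYPE akamai_gtm_metrics_exporter_build_info gauge
akamai_gtm_metrics_exporter_build_info{branch="master",version="0.1.0"} 1
"""
print(only_build_info(sample))
```

Against a running exporter, the text could be fetched with urllib.request.urlopen("http://localhost:9800/metrics").read().decode() and passed to the same functions.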
Use -h or --help flag to list available options.
./akamai-gtm-metrics-traffic-exporter --help
usage: akamai-gtm-metrics-exporter [<flags>]
Flags:
-h, --help Show context-sensitive help (also try --help-long and --help-man).
--config.file="gtm_metrics_config.yml"
GTM Metrics exporter configuration file. Default: `./gtm_metrics_config.yml`.
--web.listen-address=":9800"
The address to listen on for HTTP requests.
--gtm.edgegrid-host=GTM.EDGEGRID-HOST
The Akamai Edgegrid host auth credential.
--gtm.edgegrid-client-secret=GTM.EDGEGRID-CLIENT-SECRET
The Akamai Edgegrid client_secret credential.
--gtm.edgegrid-client-token=GTM.EDGEGRID-CLIENT-TOKEN
The Akamai Edgegrid client_token credential.
--gtm.edgegrid-access-token=GTM.EDGEGRID-ACCESS-TOKEN
The Akamai Edgegrid access_token credential.
--log.level="info" Only log messages with the given severity or above. Valid levels: [debug, info, warn, error, fatal]
--log.format="logger:stderr"
Set the log target and format. Example: "logger:syslog?appname=bob&local=7" or "logger:stdout?json=true".
--version Show application version.
Note: By default, the exporter expects the configuration file to exist in the current working directory (e.g., ./gtm_metrics_example_config.yml).
Invoke exporter with a configuration file path
./akamai-gtm-metrics-traffic-exporter --config.file=gtm_metrics_example_config.yml
Invoke exporter with a configuration file path and Edgegrid authentication credentials
./akamai-gtm-metrics-traffic-exporter --config.file=gtm_metrics_example_config.yml --gtm.edgegrid-host akab-abcdefghijklmnop-01234567890aaaaa.luna.akamaiapis.net --gtm.edgegrid-access-token example_provided_access_token --gtm.edgegrid-client-token example_provided_client_token --gtm.edgegrid-client-secret example_provided_client_secret
The Akamai GTM Exporter contains collectors to gather traffic information for GTM domain datacenters and properties, as well as property liveness test failures. Each of these collectors has its own configuration, metrics and behaviors.
The Datacenter collector gathers traffic data for GTM domain datacenters.
An example configuration snippet for the datacenter collector is:
domains:
  - domain_name: testdomain.akadns.net # domain to collect from (list)
    datacenters:
      - datacenter_id: 3131 # datacenter config to collect traffic metrics (list)
        property:
          - test_property # filter on property (list)
This example configuration instructs the collector to retrieve datacenter request activity from datacenter_id: 3131 and property test_property. To retrieve activity for the entire datacenter, omit the property key.
The datacenter collector gathers the following metrics from the GTM Report API that returns datacenter requests aggregated in 5 minute intervals.
Metric | Description |
---|---|
akamai_gtm_datacenter_traffic_requests_per_interval | Number of datacenter requests per 5 minute interval (per domain) |
akamai_gtm_datacenter_traffic_requests_per_interval_summary_sum | Summary aggregation of datacenter requests per 5 minute interval (per domain) |
akamai_gtm_datacenter_traffic_requests_per_interval_summary_count | Summary count of datacenter requests per 5 minute interval (per domain) |
The base labels used for datacenter metrics are domain and datacenter. A property label will be added if a property filter is specified. A timestamp label will also be added if configured for the exporter.
Note: The _sum and _count metrics are cumulative since start time.
The Property collector gathers traffic data for GTM domain properties.
An example configuration snippet for the property collector is:
domains:
  - domain_name: testdomain.akadns.net # domain to collect from (list)
    properties:
      - property_name: test_property # property config from which to collect traffic metrics (list)
        datacenter:
          - 3131 # filter on datacenter id (list)
        dc_nickname:
          - test_nickname # filter on nickname (list)
        target_name:
          - test_target # filter on target name (list)
This example configuration instructs the collector to retrieve property request activity from property_name test_property. The property requests can be further filtered by datacenter, dc_nickname or target_name. Only the first in priority order will be used. Thus, in the example above, datacenter with id 3131 is used. To retrieve request activity for the property across all its datacenters, omit the filter keys.
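The filter selection described above can be sketched as follows. This is an illustration rather than the exporter's code: the helper name effective_filter is hypothetical, and the relative order of dc_nickname and target_name is assumed from the order in which the keys are listed.

```python
# Assumed priority order: datacenter, then dc_nickname, then target_name.
FILTER_PRIORITY = ("datacenter", "dc_nickname", "target_name")


def effective_filter(property_cfg: dict):
    """Return (key, values) for the first configured filter, or None."""
    for key in FILTER_PRIORITY:
        values = property_cfg.get(key)
        if values:
            return key, values
    return None


cfg = {
    "property_name": "test_property",
    "datacenter": [3131],
    "dc_nickname": ["test_nickname"],
    "target_name": ["test_target"],
}
print(effective_filter(cfg))
```

With all three filters present, only the datacenter filter is returned, matching the behavior described for the example configuration.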
The property collector gathers the following metrics from the GTM Report API that returns property requests aggregated in 5 minute intervals.
Metric | Description |
---|---|
akamai_gtm_property_traffic_requests_per_interval | Number of property requests per 5 minute interval (per domain) |
akamai_gtm_property_traffic_requests_per_interval_summary_sum | Summary aggregation of property requests per 5 minute interval (per domain) |
akamai_gtm_property_traffic_requests_per_interval_summary_count | Summary count of property requests per 5 minute interval (per domain) |
The base labels used for property metrics are domain and property. An additional label (datacenterid, nickname or target) will be added if a property filter is specified. A timestamp label will also be added if configured for the exporter.
Note: The _sum and _count metrics are cumulative since start time.
The Liveness collector gathers liveness test failure status for domain properties.
An example configuration snippet for the liveness collector is:
domains:
  - domain_name: testdomain.akadns.net # domain to collect from (list)
    liveness_tests:
      - property_name: test_property # property config from which to collect liveness test failures
        agent_ip: 1.2.3.4 # filter on agent ip
        target_ip: 4.3.2.1 # filter on target ip
This example configuration instructs the collector to retrieve liveness test failure activity from property_name test_property. The liveness failure data can be further filtered by agent_ip or target_ip. If both are specified, target_ip will be used. Thus, in the example above, the returned test failure data will be filtered for tests associated with the target_ip specified. To retrieve all liveness test failures for the property, omit the filter keys.
The liveness collector gathers the following metrics from the GTM Report API that returns data reflecting when tests are executed and failure status.
Metric | Description |
---|---|
akamai_gtm_property_liveness_errors_datacenter_failure_duration | Datacenter failure duration (per domain, property, datacenter) |
akamai_gtm_property_liveness_errors_datacenter_failures | Number of datacenter failures (per domain, property, datacenter) |
akamai_gtm_property_liveness_errors_errors_per_datacenter_summary_count | Summary count of datacenter errors (per domain and property) |
akamai_gtm_property_liveness_errors_errors_per_datacenter_summary_sum | Summary aggregation of datacenter errors (per domain and property) |
akamai_gtm_property_liveness_errors_duration_per_datacenter_histogram_count | Histogram count of datacenter error duration (per domain and property) |
akamai_gtm_property_liveness_errors_duration_per_datacenter_histogram_sum | Histogram aggregation of datacenter error duration (per domain and property) |
akamai_gtm_property_liveness_errors_duration_per_datacenter_histogram_bucket | Histogram buckets of datacenter error duration (per domain and property) |
The histogram duration buckets (in seconds) are: 60, 1800, 3600, 7200, and 14400.
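Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, and an implicit +Inf bucket counts all observations. The sketch below applies that rule to the bucket bounds listed above. The sample durations are invented for illustration.

```python
import math

# Upper bounds in seconds, as listed above, plus the implicit +Inf bucket.
BUCKETS = (60, 1800, 3600, 7200, 14400, math.inf)


def cumulative_buckets(durations):
    """Count, for each upper bound, the observations at or below it."""
    return {le: sum(1 for d in durations if d <= le) for le in BUCKETS}


# Hypothetical failure durations in seconds.
counts = cumulative_buckets([0, 45, 2400, 9000])
for le, n in counts.items():
    print(le, n)
```

Because the counts are cumulative, the number of observations that fall between two bounds is the difference between the two bucket values.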
The base labels used for liveness metrics are domain, property and datacenter. An additional label (targetip or agentip) will be added if a property filter is specified. A timestamp label will also be added if configured for the exporter.
Note: The _sum and _count metrics are cumulative since start time.
To view GTM metric activity in the exporter, open the exporter's metrics page in a browser using localhost and the port shown in the INFO startup messages (e.g., http://localhost:9800/metrics). The following snippet shows example output with all three collectors configured.
# HELP akamai_gtm_datacenter_traffic_requests_per_interval Number of datacenter requests per 5 minute interval (per domain)
# TYPE akamai_gtm_datacenter_traffic_requests_per_interval gauge
akamai_gtm_datacenter_traffic_requests_per_interval{datacenter="3131",domain="test.akadns.net",property="testprop"} 283
# HELP akamai_gtm_datacenter_traffic_requests_per_interval_summary Number of aggregate datacenter requests per 5 minute interval (per domain)
# TYPE akamai_gtm_datacenter_traffic_requests_per_interval_summary summary
akamai_gtm_datacenter_traffic_requests_per_interval_summary_sum{datacenter="3131",domain="test.akadns.net"} 0
akamai_gtm_datacenter_traffic_requests_per_interval_summary_count{datacenter="3131",domain="test.akadns.net"} 0
# HELP akamai_gtm_metrics_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which akamai_gtm_metrics_exporter was built.
# TYPE akamai_gtm_metrics_exporter_build_info gauge
akamai_gtm_metrics_exporter_build_info{branch="",goversion="go1.14.9",revision="",version=""} 1
# HELP akamai_gtm_property_liveness_errors_datacenter_failure_duration Datacenter failure duration (per domain, property, datacenter)
# TYPE akamai_gtm_property_liveness_errors_datacenter_failure_duration gauge
akamai_gtm_property_liveness_errors_datacenter_failure_duration{datacenter="3201",domain="test.akadns.net",property="testprop"} 0
# HELP akamai_gtm_property_liveness_errors_datacenter_failures Number of datacenter failures (per domain, property, datacenter)
# TYPE akamai_gtm_property_liveness_errors_datacenter_failures counter
akamai_gtm_property_liveness_errors_datacenter_failures{datacenter="3201",domain="test.akadns.net",property="testprop"} 1
# HELP akamai_gtm_property_liveness_errors_duration_per_datacenter_histogram Histogram of datacenter error duration (per domain and property)
# TYPE akamai_gtm_property_liveness_errors_duration_per_datacenter_histogram histogram
akamai_gtm_property_liveness_errors_duration_per_datacenter_histogram_bucket{datacenter="3201",domain="test.akadns.net",property="testprop",le="60"} 3
akamai_gtm_property_liveness_errors_duration_per_datacenter_histogram_bucket{datacenter="3201",domain="test.akadns.net",property="testprop",le="1800"} 3
akamai_gtm_property_liveness_errors_duration_per_datacenter_histogram_bucket{datacenter="3201",domain="test.akadns.net",property="testprop",le="3600"} 3
akamai_gtm_property_liveness_errors_duration_per_datacenter_histogram_bucket{datacenter="3201",domain="test.akadns.net",property="testprop",le="7200"} 3
akamai_gtm_property_liveness_errors_duration_per_datacenter_histogram_bucket{datacenter="3201",domain="test.akadns.net",property="testprop",le="14400"} 3
akamai_gtm_property_liveness_errors_duration_per_datacenter_histogram_bucket{datacenter="3201",domain="test.akadns.net",property="testprop",le="+Inf"} 3
akamai_gtm_property_liveness_errors_duration_per_datacenter_histogram_sum{datacenter="3201",domain="test.akadns.net",property="testprop"} 0
akamai_gtm_property_liveness_errors_duration_per_datacenter_histogram_count{datacenter="3201",domain="test.akadns.net",property="testprop"} 3
# HELP akamai_gtm_property_liveness_errors_errors_per_datacenter_summary Summary of datacenter errors (per domain and property)
# TYPE akamai_gtm_property_liveness_errors_errors_per_datacenter_summary summary
akamai_gtm_property_liveness_errors_errors_per_datacenter_summary_sum{datacenter="3201",domain="test.akadns.net",property="testprop"} 3
akamai_gtm_property_liveness_errors_errors_per_datacenter_summary_count{datacenter="3201",domain="test.akadns.net",property="testprop"} 3
# HELP akamai_gtm_property_traffic_requests_per_interval Number of property requests per 5 minute interval (per domain)
# TYPE akamai_gtm_property_traffic_requests_per_interval gauge
akamai_gtm_property_traffic_requests_per_interval{datacenterid="3131",domain="test.akadns.net",property="testprop"} 283
# HELP akamai_gtm_property_traffic_requests_per_interval_summary Number of aggregate property requests per 5 minute interval (per domain)
# TYPE akamai_gtm_property_traffic_requests_per_interval_summary summary
akamai_gtm_property_traffic_requests_per_interval_summary_sum{domain="test.akadns.net",property="testprop"} 0
akamai_gtm_property_traffic_requests_per_interval_summary_count{domain="test.akadns.net",property="testprop"} 0
To view the metrics in Prometheus, visit Graph and execute an expression for one of the metrics. As an example, the following image shows the graph for akamai_gtm_datacenter_traffic_requests_per_interval_summary_sum.
As another example, the following expression will render an average rate of datacenter traffic for the last 30 minutes.
avg_over_time(akamai_gtm_datacenter_traffic_requests_per_interval[30m])
Prometheus' default TSDB storage bounds the timestamp window that it will accept for newly created time series metrics (roughly 2-3 hours in the past up to the current time). Because the GTM API works on its own time cycle and sometimes has no data for a given interval, the exporter provides advanced configuration options with the following defaults.
- Report timestamp.
- Summary data size of 2 days.
- Prefill set to 10 minutes.
Changing the advanced configuration defaults, though, comes with associated Prometheus behavior changes.
Adding a timestamp label may be helpful for knowing the actual time and day of the event. Adding a timestamp label to each metric time series has the side effect of creating a distinct series for each label/timestamp combination. When retrieving metrics, it is recommended to use only the desired labels in the query expression. The legend displayed when viewing graphs through the Prometheus portal will contain all generated series, which can be hundreds per day. Other viewing applications, e.g., Grafana, allow graph customization and reduce screen clutter.
The Table tab in the Prometheus portal may provide a more manageable way to view metrics with a timestamp label, for example by retrieving only the last five (5) minutes of collected metrics, e.g., akamai_gtm_datacenter_traffic_requests_per_interval{datacenter="3131",domain="testdomain.akadns.net",property="testprop"}[5m].
The summary_window configuration informs the collector as to how much data to include when calculating requests for relevant metrics such as quantiles.
The Prometheus server will reject, and not persist, the exporter's attempt to create metrics with a timestamp outside of the current time series database collection window. The Prometheus log will note a warning in this case, e.g.
level=warn ts=2021-01-12T18:56:49.492Z caller=scrape.go:1378 component="scrape manager" scrape_pool=edgedns_zone target=http://localhost:9800/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2
and continue to collect future metric data. The dropped data will not be available for further viewing, analysis or alerting. This behavior is most likely to occur if the prefill_window is configured to be greater than ~2 hours.
The prefill_window informs the collector as to how far to reach back in time and incorporate historical report data in Prometheus. This "priming" of the TSDB will provide a head start to view and analyze metric trends.
A side effect of configuring the prefill_window to be greater than the current time series open window, combined with enabling metric creation with timestamps, is that the Prometheus server will reject any metrics timestamped outside the current time series window. Aside from not creating the metrics, the log will also be cluttered with warnings to this effect.
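The interaction between prefill_window and the Prometheus acceptance window can be estimated ahead of time. The sketch below counts how many 5 minute report intervals inside a given prefill window would be older than the acceptance window. The 2 hour limit is an assumption based on the approximate figure given above, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: samples older than roughly two hours are rejected.
ACCEPT_WINDOW = timedelta(hours=2)


def dropped_intervals(prefill, now, step=timedelta(minutes=5)):
    """List interval timestamps inside the prefill window that are too old."""
    stamps = []
    t = now - prefill
    while t < now:
        if now - t > ACCEPT_WINDOW:
            stamps.append(t)
        t += step
    return stamps


now = datetime(2021, 1, 27, 16, 0, tzinfo=timezone.utc)
print(len(dropped_intervals(timedelta(minutes=10), now)))  # default prefill
print(len(dropped_intervals(timedelta(hours=3), now)))     # oversized prefill
```

With the default 10 minute prefill nothing falls outside the window, while a 3 hour prefill leaves the first hour of intervals subject to rejection.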
Post-processing of collected metrics can be designed to perform additional analysis of collected traffic data or to detect abnormalities in it. Post-processing is done on the Prometheus server. The rules that accomplish it are listed in the rule_files section of the Prometheus server configuration file. An example rules definition file, example_gtm_metrics_alerts.rules, defines recording rules that prepare for detecting excessive datacenter requests in an interval and datacenter failure durations greater than 30 minutes. The following snippets from the example rules file define the additional metrics and the expressions that produce them:
- name: gtm_datacenter_requests_over_example
  rules:
  - record: instance_datacenter:akamai_gtm_datacenter_traffic_requests_per_interval:max1m
    # labels must be literals. Can't template expressions
    expr: max_over_time(akamai_gtm_datacenter_traffic_requests_per_interval{datacenter="3131",domain="test.domain.akadns.net"}[5m])
  - record: instance_datacenter:akamai_gtm_datacenter_traffic_requests_per_interval_summary:mean
    expr: |2
        akamai_gtm_datacenter_traffic_requests_per_interval_summary_sum{datacenter="3131",domain="test.domain.akadns.net"}
      /
        akamai_gtm_datacenter_traffic_requests_per_interval_summary_count{datacenter="3131",domain="test.domain.akadns.net"}
  # record name must match the name referenced by the alert rule
  - record: instance_datacenter:akamai_gtm_datacenter_traffic_requests_per_interval_summary:sub_mean
    expr: (instance_datacenter:akamai_gtm_datacenter_traffic_requests_per_interval:max1m - instance_datacenter:akamai_gtm_datacenter_traffic_requests_per_interval_summary:mean * 2)
- name: gtm_datacenter_duration_over_example
  rules:
  - record: instance_datacenter:akamai_gtm_property_liveness_errors_duration_per_datacenter_histogram_bucket:sub
    expr: scalar(akamai_gtm_property_liveness_errors_duration_per_datacenter_histogram_bucket{domain="test.domain.akadns.net", property="testprop",datacenter="3131",le="3600"}) - scalar(akamai_gtm_property_liveness_errors_duration_per_datacenter_histogram_bucket{domain="test.domain.akadns.net", property="testprop",datacenter="3131",le="1800"})
The first snippet identifies the largest number of datacenter requests in the last five minutes, calculates the current average number of requests per interval, and compares the highest interval with a threshold set to twice the average. In this way, Prometheus records request spikes that indicate excessive datacenter load.
The second snippet identifies the number of test failures with a duration between 30 minutes (1800 secs) and one hour (3600 seconds). Prometheus records the number of these failures, potentially indicating excessive datacenter down time.
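The arithmetic behind the two snippets can be worked through outside Prometheus. The sketch below mirrors the calculations with plain numbers. The function names and sample values are invented for illustration and are not part of the example rules file.

```python
def sub_mean(recent_samples, summary_sum, summary_count):
    """Largest recent sample minus twice the running mean (sum / count).

    A result at or above zero means the spike reached the threshold.
    Raises ZeroDivisionError when summary_count is 0, whereas the
    equivalent PromQL division 0 / 0 would yield NaN.
    """
    mean = summary_sum / summary_count
    return max(recent_samples) - 2 * mean


def failures_between(bucket_le_1800, bucket_le_3600):
    """Failures lasting more than 1800 s and at most 3600 s."""
    return bucket_le_3600 - bucket_le_1800


# Hypothetical values: three recent interval counts, 24 intervals summing to 6000.
print(sub_mean([250, 283, 610], summary_sum=6000, summary_count=24))
print(failures_between(bucket_le_1800=3, bucket_le_3600=5))
```

Here the mean is 250 requests per interval, so a peak of 610 exceeds twice the mean by 110, and two failures fall in the 30 to 60 minute range.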
These newly generated metrics can be viewed on a graph or built upon, as in the following example, to detect and generate an alert.
To detect and alert on an event or abnormality, two actions must be taken. First, an alert rule must be defined that will detect the activity of interest and generate the alert. The rules example defined in example_gtm_metrics_alerts.rules provides first steps to define alerts and configure the alert manager.
Two snippets from the rules file present alert rules that check whether the number of interval datacenter requests exceeds a threshold and whether any test duration exceeds 30 minutes:
- alert: DatacenterRequestsOutOfBounds
  expr: instance_datacenter:akamai_gtm_datacenter_traffic_requests_per_interval_summary:sub_mean >= 0
  labels:
    domain: "test.domain.akadns.net"
    datacenter: "3131"
    severity: critical
  annotations:
    summary: "Datacenter requests exceeded Rolling average * 2"
    description: "Job: {{ $labels.job }} Instance: {{ $labels.instance }} has Datacenter request count (current value: {{ $value }}s) compared to rolling average"
- alert: DatacenterExcessErrorDuration
  expr: instance_datacenter:akamai_gtm_property_liveness_errors_duration_per_datacenter_histogram_bucket:sub > 0
  labels:
    domain: "test.domain.akadns.net"
    property: "testprop"
    datacenter: "3131"
    severity: critical
  annotations:
    summary: "Datacenter test error duration exceeded 30 minutes"
    description: "Job: {{ $labels.job }} Instance: {{ $labels.instance }}"
The second step is to configure the AlertManager, e.g. the receiver of the alert, to pick up the alert (based on specified criteria) and propagate it accordingly.
example_alertmanager_gtm_metrics.yml is a simple example Alertmanager configuration that receives alerts and propagates them via email.
Make sure the target is live and up in Prometheus Status > Targets.
Make sure the service definition is correct in Prometheus Status > Service Discovery.
Make sure the exporter is providing metrics to Prometheus. Visit the URL for the exporter (e.g., http://localhost:9800) and look for metrics such as the following:
# HELP akamai_gtm_datacenter_traffic_requests_per_interval Number of datacenter requests per 5 minute interval (per domain)
# TYPE akamai_gtm_datacenter_traffic_requests_per_interval gauge
akamai_gtm_datacenter_traffic_requests_per_interval{datacenter="3131",domain="testdomain.akadns.net",property="testprop"} 283
Make sure the scrape interval and timeout in the Prometheus configuration are at least 30s.
scrape_interval: 30s # By default, scrape targets every 15 seconds.
scrape_timeout: 30s
If using a docker image for the GTM exporter, Prometheus might need to explicitly reference the target appropriately.
static_configs:
  - targets: ['docker.for.mac.localhost:9800']
- Prometheus backfill time series improvements will allow loading past data more effectively.
Apache License 2.0, see LICENSE.