This example demonstrates how to create alerts for missing monitoring data with Stackdriver without generating duplicate alerts. For example, suppose that you have 100 time series and you want to find out when any one of them is missing. If one or two time series are missing, you want exactly one alert. When there is a total outage, you still want one alert, not 100.
The test app generates time series for a custom metric called task_latency_distribution, tagged with a partition label. Each partition value produces its own time series.
The example assumes that you are familiar with Stackdriver Monitoring and Alerting. It builds on the discussion in Alerting policies in depth.
Clone this repo and change to this working directory. Then enable the Stackdriver Monitoring API:
gcloud services enable monitoring.googleapis.com
In the GCP Console, go to Monitoring. If you have not already created a workspace for this project, click New workspace, and then click Add. It takes a few minutes to create the workspace. Click Alerting | Policies overview. The list should be empty at this point unless you have created policies previously.
The example code is based on the Go code in Custom metrics with OpenCensus.
Download and install the latest version of Go.
Build the test app:
go build
The instructions here are based on the article Setting up authentication for the Stackdriver client library. Note that if you run the test app on a Compute Engine instance, on Google Kubernetes Engine, on App Engine, or on Cloud Run, you do not need to create a service account or download a credentials file.
First, set the project ID in a shell variable:
export GOOGLE_CLOUD_PROJECT=[your project]
Create a service account:
SA_NAME=stackdriver-metrics-writer
gcloud iam service-accounts create $SA_NAME \
--display-name="Stackdriver Metrics Writer"
SA_ID="$SA_NAME@$GOOGLE_CLOUD_PROJECT.iam.gserviceaccount.com"
gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \
--member "serviceAccount:$SA_ID" --role "roles/monitoring.metricWriter"
Generate a credentials file and export the GOOGLE_APPLICATION_CREDENTIALS variable to point to it:
mkdir -p ~/.auth
chmod go-rwx ~/.auth
export GOOGLE_APPLICATION_CREDENTIALS=~/.auth/stackdriver_demo_credentials.json
gcloud iam service-accounts keys create $GOOGLE_APPLICATION_CREDENTIALS \
--iam-account $SA_ID
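If you want to confirm that the key was created and that the credentials file exists, a quick check like the following should work (the key IDs listed will be specific to your project):
ls -l $GOOGLE_APPLICATION_CREDENTIALS
gcloud iam service-accounts keys list --iam-account $SA_ID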
Run the program with three partitions, labeled "1", "2", and "3":
./alert-absence-demo --labels "1,2,3"
This writes three time series with the given labels. A few minutes after starting the app, you should be able to see the time series data in the Stackdriver Metrics Explorer.
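You can also query the Monitoring API directly to confirm that the time series exist. The sketch below assumes that the OpenCensus Stackdriver exporter writes the custom metric under the custom.googleapis.com/opencensus/ prefix and that GNU date is available for building the interval timestamps; adjust both if your setup differs.
curl -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data-urlencode 'filter=metric.type = starts_with("custom.googleapis.com/opencensus/")' \
  --data-urlencode "interval.startTime=$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  "https://monitoring.googleapis.com/v3/projects/$GOOGLE_CLOUD_PROJECT/timeSeries"
The response should contain one time series per partition label.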
Create a notification channel with the command:
EMAIL="your email"
CHANNEL=$(gcloud alpha monitoring channels create \
--channel-labels=email_address=$EMAIL \
--display-name="Email to project owner" \
--type=email \
--format='value(name)')
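To verify that the channel was created, you can echo the channel resource name captured in the CHANNEL variable and list the notification channels in the project:
echo "Created notification channel: $CHANNEL"
gcloud alpha monitoring channels list --format="table(name,displayName,type)"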
Create an alert policy with the command:
gcloud alpha monitoring policies create \
--notification-channels=$CHANNEL \
--documentation-from-file=policy_doc.md \
--policy-from-file=alert_policy.json
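The condition itself is defined in alert_policy.json in this repo, and the notification text comes from policy_doc.md. To confirm that the policy was created, you can list the alerting policies in the project from the command line:
gcloud alpha monitoring policies list --format="table(displayName,enabled,combiner)"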
At this point no alerts should be firing. You can confirm this on the alert policy's details page in the Stackdriver Monitoring console.
Kill the process with Ctrl-C and restart it with only two partitions:
./alert-absence-demo --labels "1,2"
An alert should be generated in about 5 minutes with a subject like "One of the timeseries is absent." You should be able to see this in the Stackdriver console, and you should also receive an email notification.
Restart the process with three partitions:
./alert-absence-demo --labels "1,2,3"
After a few minutes the alert should resolve itself.
Stop the process and restart it with only one partition:
./alert-absence-demo --labels "1"
Check that only one alert fires, even though two time series are now absent.
Stop the process and do not restart it. You should still see a single alert, indicating that all of the time series are absent.