ziquanmiao/kubernetes_datadog

Intro

This workshop assumes you have access to an existing Kubernetes cluster, but it also quickly demos how to get one running on Google Kubernetes Engine through their web console.

In addition, you will need a Datadog account and access to an API key (a free trial is available).

This repo showcases a Kubernetes-based path to deploying a simple Python Flask Application Service and a simple Java SpringBoot Application Service that both return some sample text stored in a separate postgres container.

The goal of this repo is to demonstrate the steps involved in installing a Datadog agent and showcase the product's Infrastructure Monitoring, Application Performance Monitoring, Live Process/Container Monitoring, and Log Monitoring Capabilities in a Kubernetes x Docker based environment.

This repo makes no accommodations for proxy scenarios and does not fully accommodate situations where machines are unable to pull from the internet to download packages.

Steps to Success

The gist of the setup portion is:

  1. Spin up GKE Instance
  2. Deploy!

Set up

GKE Start up

Start a GKE instance via their console. To get started, you can simply use the Standard Cluster template with something like 3 nodes, which is enough to run this demo.

Find the Cluster instance in the page and click the Connect Button followed by the Run in Cloud Shell option to spin up a browser based shell to interface with the cluster.

You will essentially SSH into a "gcloud" virtual machine terminal that is outside the scope of the cluster; however, Google preloads it with a command that scopes the local kubectl interface to interact directly with the cluster.

You will also need to start by running

kubectl create clusterrolebinding cluster-admin-binding --clusterrole cluster-admin --user $(gcloud config get-value account)

This is important for the Datadog Cluster Agent (DCA) and Kubernetes State metrics (explained later).

Cluster Exists already

Store the Datadog API key in a Kubernetes secret so it's not directly in the deployment code

kubectl create secret generic datadog-api --from-literal=token=___INSERT_API_KEY_HERE___

Do the same for the APP key, in a secret called datadog-app

kubectl create secret generic datadog-app --from-literal=token=___INSERT_APP_KEY_HERE___

Create a secret holding a 32-character token for the DCA

kubectl create secret generic datadog-auth-token --from-literal=token=12345678901234567890123456789012
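The fixed value above is fine for a demo, but a random token is easy to produce. The following is a minimal sketch, assuming a shell with head, od, and tr available; gen_token is a hypothetical helper name and not part of this repo.

```shell
# Produce 32 hexadecimal characters from 16 random bytes.
gen_token() {
  head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

TOKEN="$(gen_token)"
echo "token length: ${#TOKEN}"

# To store it under the same secret name and key as above, run:
#   kubectl create secret generic datadog-auth-token --from-literal=token="$TOKEN"
```

The kubectl line is left as a comment so the snippet can be tried without touching the cluster.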

The key is then referenced in the agent daemon file.

Optional -- Build Things

Should the cluster be unable to reach Docker Hub and pull the public images, it's probably best to build the images locally

docker build -t sample_flask:007 ./flask_app/
docker build -t sample_postgres:007 ./postgres/
docker build -t sample_springboot:007 ./SpringBoot_app/

Be aware that you need to change the yaml files so that the referenced images become the images you built above.

For example, change the referenced image in springboot_deployment.yaml to sample_springboot:007

Deploy Things

Deploy the postgres container (this needs to happen first so the service host/port envs properly load into subsequent containers)

kubectl create -f postgres_deployment.yaml

Deploy the Flask application container, turn it into a service, and create a configMap for the logs product

kubectl create -f flask_deployment.yaml

Deploy the SpringBoot application container, turn it into a service, and create a configMap for the logs product

kubectl create -f springboot_deployment.yaml

Deploy the Datadog agent container

kubectl create -f agent_daemon.yaml

Deploy a non-functional pause container to demonstrate Datadog Autodiscovery via a simple HTTP check against www.google.com

kubectl create -f pause.yaml

Deploy the Kubernetes state files to demonstrate the kubernetes_state check

kubectl create -f kubernetes

And we are done!

Use the Flask App

The Flask App offers three endpoints that return some text: FLASK_SERVICE_IP:5000/, FLASK_SERVICE_IP:5000/query, and FLASK_SERVICE_IP:5000/bad

Run kubectl get services to find the FLASK_SERVICE_IP address of the flask application service

Use the SpringBoot App

The SpringBoot App offers two endpoints that return some text: SpringBoot_SERVICE_IP:8080/ and SpringBoot_SERVICE_IP:8080/query

Run kubectl get services to find the SpringBoot_SERVICE_IP address of the SpringBoot application service

Cluster Accessible via Internet

If you used the default GKE template or know the cluster is accessible via the internet, you can use the IP found in the EXTERNAL-IP column as SERVICE_IP

Then hit one of the following:

curl FLASK_SERVICE_IP:5000/
curl FLASK_SERVICE_IP:5000/query

to see the Flask application at work.

Cluster Not Accessible Via Internet

Otherwise, you can use the address in the CLUSTER-IP column as SERVICE_IP.

You must SSH into one of the nodes in the cluster, and then you can run the same curl commands from there.

In Google, you can do this by accessing their Compute Nodes Console Page and clicking the SSH button.
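Reading the EXTERNAL-IP or CLUSTER-IP column by eye works, but the lookup can also be scripted with kubectl's jsonpath output. This is a sketch under assumptions: the function names are hypothetical, and the service name you pass in must be whatever kubectl get services lists for your deployment.

```shell
# Print the external IP of a LoadBalancer service (empty if none is assigned yet).
service_external_ip() {
  kubectl get service "$1" -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Print the cluster-internal IP of a service.
service_cluster_ip() {
  kubectl get service "$1" -o jsonpath='{.spec.clusterIP}'
}
```

For example, FLASK_SERVICE_IP="$(service_external_ip SERVICE_NAME)" followed by curl "$FLASK_SERVICE_IP:5000/", where SERVICE_NAME is a placeholder for the Flask service's actual name.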

JMX Metrics

Native Datadog Agent Fetch

Datadog's Java Agent Module after v0.17.0 can collect JMX metrics out of the box. Simply turn on the feature and point the module to the relevant Datadog Agent StatsD endpoint, and you should see jvm.* metrics in your metrics summary page.

Some points of interest

The Datadog agent container should now be deployed and acting as a collector and middleman between the services and Datadog's backend. Whether you take actions (curling the endpoints) or do nothing, metrics will be generated and sent to the Datadog account that corresponds to your supplied API key.
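Request-driven metrics and traces only show up once the endpoints receive traffic, so a small loop can help. The helper below is a sketch, not part of this repo: flask_urls and generate_traffic are hypothetical names, and the address argument is the FLASK_SERVICE_IP you found earlier.

```shell
# Print one URL per Flask endpoint for the given service address.
flask_urls() {
  for path in / /query /bad; do
    printf 'http://%s:5000%s\n' "$1" "$path"
  done
}

# Request every endpoint N times (default 10), printing status code and URL.
generate_traffic() {
  i=0
  while [ "$i" -lt "${2:-10}" ]; do
    flask_urls "$1" | while read -r url; do
      curl -s -o /dev/null -w '%{http_code} %{url_effective}\n' "$url"
    done
    i=$((i + 1))
  done
}
```

Calling generate_traffic FLASK_SERVICE_IP 20 sends 60 requests in total, 20 to each of the three endpoints.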

Below is a quick discussion on some points of interest

Infrastructure Product

This part pertains to the ingestion of timeseries data, status checks, and events.

By deploying the agent referencing the Datadog Container Image in the agent_daemon.yaml file, the agent automatically comes prepackaged with system level (CPU, Mem, IO, Disk), Kubernetes, and Docker level checks.

The gist of the setup portion is:

  1. Deploy the agent daemon with the proper environment variables, volume and volumeMount arguments
  2. Deploy relevant applications with annotations
  3. Validate that metrics go to the agent and end up in our application

System Metric Requirements

A volumeMount and volume for the proc directory are required from the host level

In the Datadog web application you can reference the host map, and filter on the particular hostname to see what is going on.

Kubernetes/Docker

docker.sock and cgroup volumeMounts and volumes are required to be attached in the daemonset

Autodiscovery

Application/Service Pods and Containers go up and down. The Datadog agent traditionally requires modifying the hardcoded host/port values of the corresponding configuration files (example with postgres) and an agent restart to collect data for installed software.

Rather than having a Mechanical Turk sit on standby ready to make the changes, the containerized agent makes this process automatic using Autodiscovery, where the agent monitors the annotations of deployments and automatically establishes checks as pods come and go.

To set up Autodiscovery, you will need to set up volumes and mountPaths to hold the ephemeral configuration files

Postgres Example

Autodiscovery of the postgres pod in this cluster is straightforward: simply annotate the pod in its configuration file, adding the typical check sections required.

Note that the annotation arguments must match exactly for the agent to properly connect to the container.

Prometheus

Many services (like Kubernetes itself) use Prometheus to expose custom internal metrics specific to the service.

You can see what the structure of the Prometheus metrics looks like by running:

curl SpringBoot_SERVICE_IP:8080/actuator/prometheus

Datadog can natively read the text format of Prometheus-produced metrics and turn them into custom metrics. This repo doesn't cover it in depth, but collecting Prometheus metrics can be done simply via annotations at the deployment level, as seen here. Note that this example fails on purpose so you can see what an error looks like in agent status.
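The Prometheus text output mixes comment lines (starting with #, such as HELP and TYPE) with the actual samples. To get a quick overview of which metrics an endpoint exposes, the comments can be filtered out. This is a minimal sketch; prom_samples and prom_metric_names are hypothetical helper names.

```shell
# Keep only sample lines: drop "#" comment lines and blank lines.
prom_samples() {
  grep -v -e '^#' -e '^[[:space:]]*$'
}

# Print the distinct metric names (the text before any "{" label block or space).
prom_metric_names() {
  prom_samples | sed -e 's/[{ ].*$//' | sort -u
}
```

For example: curl -s SpringBoot_SERVICE_IP:8080/actuator/prometheus | prom_metric_names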

Read more about it via our documentation

Live Processes/Containers

Live Process/Container Monitoring is the capability to get container and process level granularity for all monitored systems. This feature provides not only standard system level metrics at the process/container level, but also the initial run commands used to set up the process/container.

Simply add the DD_PROCESS_AGENT_ENABLED env variable in the daemonset to turn on this feature
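For quick experiments, the variable can also be set on a running DaemonSet with kubectl set env instead of editing the yaml. This is a sketch under assumptions: enable_agent_flag is a hypothetical helper, and the DaemonSet name must be whatever kubectl get daemonsets shows for your agent.

```shell
# Turn an agent feature flag on by setting an env variable to "true" on a DaemonSet.
# Usage: enable_agent_flag DAEMONSET_NAME VARIABLE_NAME
enable_agent_flag() {
  kubectl set env "daemonset/$1" "$2=true"
}
```

For example, enable_agent_flag DAEMONSET_NAME DD_PROCESS_AGENT_ENABLED. A change made this way is not written back to agent_daemon.yaml, so re-creating the agent from the file reverts it.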

Requirements

Sometimes passwords are revealed in the initial run commands. The agent comes equipped to scrub a standard set of sensitive arguments (such as passwd) from them.

We again need docker.sock to get container information.

Validation

Run kubectl get pods to get the pod name of the agent container.

Run kubectl exec -it POD_NAME -- bash to hop into the container

Run agent status to see the status summary and look for the integrations section to confirm the agent is collecting metrics

Run cat /var/log/datadog/agent.log to see logs pertaining to the agent
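The validation steps can also be run without an interactive shell. The sketch below assumes the agent pod's name contains the string "agent", which may not hold for your cluster; find_agent_pod and agent_exec are hypothetical helper names.

```shell
# Print the name of the first pod whose name contains the given pattern.
# The default pattern "agent" is an assumption about how the pod is named.
find_agent_pod() {
  kubectl get pods -o name | grep -- "${1:-agent}" | head -n 1 | sed 's|^pod/||'
}

# Run a command inside the agent pod found above.
agent_exec() {
  pod="$(find_agent_pod)"
  [ -n "$pod" ] || { echo "no agent pod found" >&2; return 1; }
  kubectl exec "$pod" -- "$@"
}
```

For example, agent_exec agent status prints the status summary, and agent_exec cat /var/log/datadog/agent.log dumps the agent log.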

APM Tracing

The same agent that handles infrastructure metrics can also receive trace data from a designated APM module -- these modules sit on top of your applications and forward payloads to a local Datadog agent, which relays them to our backend.

Applications spin up in pods, and their payloads need to reach the agent pod. In this example, we set up a route between the pods through a port exposed at the host level.

Requirements

From the agent daemon side

Enable the agent to receive traces via environment variables in the agent deployment

Create a port connection to host via the 8126 port

From the application Side

Provide the deploy file with a link to the host level for port 8126 via environment variables, so that applications can reference the host/port values to fire traces to

Flask specific

Install the ddtrace module

In the app.py code, import the ddtrace module and patch both sqlalchemy and the Flask app object.

Note: like any tracing module, ddtrace only instruments what its integrations cover. If certain spans are not being captured, you can always instrument them manually.

SpringBoot specific

In the Dockerfile setting up SpringBoot, make sure you initiate the runtime commands to mount dd-java-agent.jar as a javaagent as seen here

Networking must also allow SpringBoot to fire to the Agent pod's 8126 port, which requires the corresponding piece in springboot_deployment and in agent_daemon.

Note: like any tracing module, the Java agent only instruments the technologies it is compatible with, which you can read about here. If certain clauses in your code are not being captured, you can always instrument them manually.

Trace Search

Datadog also recently integrated their APM platform with the logs platform, so you can think about APM traces as if they were logs. As of 10/11/18, you need to whitelist which APM services and endpoints you want collected via this agent_daemon configuration setting.

Validation

Agent Side

Run kubectl get pods to get the pod name of the agent container.

Run kubectl exec -it POD_NAME -- bash to hop into the container

Run agent status to see the status summary and look for the tracing section to confirm the agent has tracing turned on

Run cat /var/log/datadog/trace-agent.log to see logs pertaining to the trace agent

Datadog Side

Head over to Trace Services Page and look for your service level metrics and traces!

Logs

The same agent that periodically polls for infrastructure metrics and live process metrics, and relays trace transactions, can also be set up to tail log instances.

Simply turn on DD_LOGS_ENABLED via the environment variable in the agent daemon file.

Tailing Flask logs via Config Maps

Tailing logs from Flask is pretty easy using Kubernetes config maps.

Agent Side

Simply set up the mountPath directory connected via the host and the volume in the agent daemon

Flask side

Set up the corresponding mounts for volume and mountPath so we can connect the flask pod to the relevant agent pod via the host

In the app, set up the app level logging configurations so logs are properly pushed to the right log file when routines are run (example)

Validation

Agent Side

Run kubectl get pods to get the pod name of the agent container.

Run kubectl exec -it POD_NAME -- bash to hop into the container

Run agent status to see the status summary and look for the logging section to confirm the agent has logs turned on

Run cat /var/log/datadog/agent.log to see relevant log instances pertaining to the log agent

Datadog Side

Navigate to the logs explorer page and look for flask logs!
