Agent Check: AWS Inferentia and AWS Trainium Monitoring

Overview

This check monitors AWS Neuron through the Datadog Agent. It enables monitoring of the Inferentia and Trainium devices and delivers insights into your machine learning model's performance.

Setup

Follow the instructions below to install and configure this check for an Agent running on an EC2 instance. For containerized environments, see the Autodiscovery Integration Templates for guidance on applying these instructions.
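
For the containerized case, an Autodiscovery template can be attached to the pod as an annotation. The fragment below is only a sketch: the pod name, the container name neuron-monitor, the image placeholder, the port 8000, and the openmetrics_endpoint option are all assumptions, so check them against the sample aws_neuron.d/conf.yaml and your own deployment.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: neuron-workload              # hypothetical pod name
  annotations:
    # "neuron-monitor" must match the container name under spec.containers.
    # %%host%% is the Autodiscovery template variable for the container's IP.
    ad.datadoghq.com/neuron-monitor.checks: |
      {
        "aws_neuron": {
          "init_config": {},
          "instances": [
            {"openmetrics_endpoint": "http://%%host%%:8000"}
          ]
        }
      }
spec:
  containers:
    - name: neuron-monitor           # hypothetical container name
      image: "<YOUR_IMAGE>"          # placeholder, not a real image
```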

Installation

The AWS Neuron check is included in the Datadog Agent package, so the check itself requires no additional installation on your server.

You do, however, need to install the AWS Neuron Tools package, which is not bundled with the Agent.

Configuration

Metrics

  1. Ensure that Neuron Monitor is running and exposing its Prometheus endpoint.

  2. Edit the aws_neuron.d/conf.yaml file, which is located in the conf.d/ folder at the root of your Agent's configuration directory, to start collecting your AWS Neuron performance data. See the sample aws_neuron.d/conf.yaml for all available configuration options.

  3. Restart the Agent.
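
As a rough illustration of step 2, a minimal aws_neuron.d/conf.yaml might look like the sketch below. It assumes the check reads an openmetrics_endpoint option and that Neuron Monitor's Prometheus endpoint listens on localhost port 8000. Both are assumptions, so confirm the option name and the address against the sample aws_neuron.d/conf.yaml and your own Neuron Monitor setup.

```yaml
init_config:

instances:
    # Assumed address of the Prometheus endpoint exposed by Neuron Monitor;
    # replace the host and port with the values used on your instance.
  - openmetrics_endpoint: http://localhost:8000
```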

Logs

The AWS Neuron integration can collect logs from the Neuron containers and forward them to Datadog.

  1. Collecting logs is disabled by default in the Datadog Agent. Enable it in your datadog.yaml file:

    logs_enabled: true

  2. Uncomment and edit the logs configuration block in your aws_neuron.d/conf.yaml file. For example:

    logs:
      - type: docker
        source: aws_neuron
        service: aws_neuron

In Kubernetes environments, log collection is likewise disabled by default in the Datadog Agent. To enable it, see Kubernetes Log Collection.

Then, set Log Integrations as pod annotations. This can also be configured with a file, a configmap, or a key-value store. For more information, see the configuration section of Kubernetes Log Collection.
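
A pod annotation for log collection could be sketched as follows. The source and service values mirror the logs block shown earlier; the container name neuron-monitor is hypothetical and must be replaced with the name of the container whose logs you want to collect.

```yaml
metadata:
  annotations:
    # Replace "neuron-monitor" (hypothetical) with your container's name.
    ad.datadoghq.com/neuron-monitor.logs: '[{"source": "aws_neuron", "service": "aws_neuron"}]'
```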

Validation

Run the Agent's status subcommand and look for aws_neuron under the Checks section.

Data Collected

Metrics

See metadata.csv for a list of metrics provided by this integration.

Events

The AWS Neuron integration does not include any events.

Service Checks

See service_checks.json for a list of service checks provided by this integration.

Troubleshooting

In containerized environments, ensure that the Agent has network access to the endpoints specified in the aws_neuron.d/conf.yaml file.

Need help? Contact Datadog support.