This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3

Neuron Monitor collects and exposes metrics from Neuron devices and the Neuron runtime, giving visibility into hardware utilization, performance counters, memory usage, and device health status. It can export metrics in formats compatible with observability platforms such as Prometheus, so it integrates with existing monitoring and alerting infrastructure. Operators can use it to track Neuron device performance, identify bottlenecks, and troubleshoot issues in production environments.

For detailed information about Neuron Monitor, see the Neuron Monitor User Guide.

Note

Neuron Monitor does not currently support environments using the Neuron DRA (Dynamic Resource Allocation) Driver.

Deploy Neuron Monitor DaemonSet

Step 1: Download the Configuration

Download the Neuron Monitor YAML file: k8s-neuron-monitor-daemonset.yml

Step 2: Apply the Configuration

Apply the Neuron Monitor YAML to create a DaemonSet on the cluster:

kubectl apply -f k8s-neuron-monitor-daemonset.yml

Step 3: Verify Installation

Verify that the Neuron Monitor DaemonSet is running:

kubectl get ds neuron-monitor --namespace neuron-monitor

Expected output (example for a cluster with 2 nodes):

NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-monitor   2         2         2       2            2           <none>          27h
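
The readiness check above can also be scripted. The following is a minimal sketch, not part of the Neuron tooling: `ds_is_ready` is a hypothetical helper name, and it assumes the default table layout shown above, where DESIRED is the second column and READY is the fourth.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Neuron Monitor): reads the table printed by
# `kubectl get ds` on stdin and succeeds only when DESIRED equals READY and is non-zero.
ds_is_ready() {
  awk '
    NR == 2 { found = 1; ok = ($2 == $4 && $2 > 0) }   # row 1 is the header, row 2 is the DaemonSet
    END     { if (!(found && ok)) exit 1 }
  '
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get ds neuron-monitor --namespace neuron-monitor | ds_is_ready && echo "DaemonSet ready"
```

As an alternative that does not parse table output, `kubectl rollout status daemonset/neuron-monitor --namespace neuron-monitor` blocks until the DaemonSet rollout completes.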

Step 4: Get Pod Names

Retrieve the Neuron Monitor pod names:

kubectl get pods --namespace neuron-monitor

Expected output:

NAME                   READY   STATUS    RESTARTS   AGE
neuron-monitor-slsxf   1/1     Running   0          17m
neuron-monitor-wc4f5   1/1     Running   0          17m
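
If the pod names are needed in a script, they can be extracted from the listing above. This is a small sketch under the assumption that the default table layout is used (name in the first column, STATUS in the third); `running_pods` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads the table printed by `kubectl get pods` on stdin
# and prints the name of every pod whose STATUS column is "Running".
running_pods() {
  awk 'NR > 1 && $3 == "Running" { print $1 }'   # NR > 1 skips the header row
}

# Usage against a live cluster (requires kubectl access):
#   for pod in $(kubectl get pods --namespace neuron-monitor | running_pods); do
#     echo "found pod: $pod"
#   done
```

For scripts that do not need the status filter, `kubectl get pods --namespace neuron-monitor -o name` prints one `pod/<name>` entry per line without a header.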

Step 5: Verify Prometheus Endpoint

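Verify that the Prometheus metrics endpoint is available by querying it from inside one of the pods (replace neuron-monitor-wc4f5 with a pod name returned in Step 4):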

kubectl exec neuron-monitor-wc4f5 --namespace neuron-monitor -- wget -q --output-document - http://127.0.0.1:8000

Expected output (sample metrics, truncated):

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 362.0
python_gc_objects_collected_total{generation="1"} 0.0
python_gc_objects_collected_total{generation="2"} 0.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
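
The endpoint returns the Prometheus text exposition format, in which each metric is announced by a `# TYPE <name> <type>` line. To get a quick overview of which metrics a pod exposes, the output can be filtered. The sketch below assumes that format; `list_metric_types` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads Prometheus exposition text on stdin and prints
# "<metric name> <metric type>" for every "# TYPE" line.
list_metric_types() {
  awk '$1 == "#" && $2 == "TYPE" { print $3, $4 }'
}

# Usage against a live cluster (requires kubectl access; substitute your own pod name):
#   kubectl exec neuron-monitor-wc4f5 --namespace neuron-monitor -- \
#     wget -q --output-document - http://127.0.0.1:8000 | list_metric_types
```

An empty result suggests that the endpoint responded without any metric metadata, which is worth investigating before pointing a Prometheus scrape job at the pod.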
