This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3
Deploy Neuron Node Problem Detector and Recovery
The Neuron Node Problem Detector and Recovery is a resiliency component that continuously monitors the health of Neuron devices on each Kubernetes node. It detects hardware and software errors such as device failures, driver problems, and runtime errors, and integrates with the Kubernetes Node Problem Detector framework to report Neuron-specific conditions. When it detects an unrecoverable issue, it can remediate automatically by marking the node as unhealthy and triggering node replacement, which prevents workloads from being scheduled on faulty hardware. The component can also publish CloudWatch metrics under the NeuronHealthCheck namespace for monitoring and alerting.
Requirements
Before deploying the Neuron Node Problem Detector and Recovery, ensure the following requirements are met:
Neuron Driver: Version 2.15 or later
Neuron Runtime: SDK 2.18 or later
Prerequisites: All prerequisites for Kubernetes containers and the Neuron Node Problem Detector must be satisfied
Installation
Note: The installation pulls the container image from the upstream Node Problem Detector repository at registry.k8s.io/node-problem-detector.
Install the Neuron Node Problem Detector and Recovery as a DaemonSet using Helm:
helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart
Enable Node Recovery
By default, the Neuron Node Problem Detector runs in monitor-only mode. To enable automatic node recovery functionality:
helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
--set "npd.nodeRecovery.enabled=true"
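The same setting can also be kept in a values file instead of being passed on the command line. The following is a minimal sketch: the YAML nesting mirrors the `npd.nodeRecovery.enabled` key used above, and the file name `neuron-values.yaml` is an arbitrary choice for illustration, not something the chart requires.

```shell
# Write a values file equivalent to: --set "npd.nodeRecovery.enabled=true"
# The file name neuron-values.yaml is an assumption made for this sketch.
cat > neuron-values.yaml <<'EOF'
npd:
  nodeRecovery:
    enabled: true
EOF

# Apply it (requires cluster access, so it is left commented out here):
#   helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
#     -f neuron-values.yaml
#
# Show the user-supplied values of the installed release:
#   helm get values neuron-helm-chart
```

Keeping the setting in a file makes it easier to review and version alongside other cluster configuration.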
Verify Installation
Verify that the Node Problem Detector pods are running:
kubectl get pod -n neuron-healthcheck-system
Expected output (example with 4 nodes in cluster):
NAME                          READY   STATUS    RESTARTS   AGE
node-problem-detector-7qcrj   1/1     Running   0          59s
node-problem-detector-j45t5   1/1     Running   0          59s
node-problem-detector-mr2cl   1/1     Running   0          59s
node-problem-detector-vpjtk   1/1     Running   0          59s
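On larger clusters it can be easier to count unhealthy pods than to read the table. The sketch below assumes the default `kubectl get pod` table layout shown above, where STATUS is the third column. The helper name `count_not_running` is hypothetical and is not part of any Neuron tooling.

```shell
# Count pods whose STATUS column is anything other than "Running".
# Reads `kubectl get pod` table output on stdin and skips the header row.
count_not_running() {
  awk 'NR > 1 && $3 != "Running" { n++ } END { print n + 0 }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get pod -n neuron-healthcheck-system | count_not_running
```

A result of 0 means every listed pod reports Running. Any other value is a cue to inspect the affected pods with `kubectl describe pod`.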
Monitoring and Metrics
When an unrecoverable error occurs, the Neuron Node Problem Detector:
Publishes metrics to CloudWatch under the NeuronHealthCheck namespace
Updates the node’s NodeCondition, which can be viewed using: kubectl describe node <node-name>
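Both signals can also be checked from a script. The following is a rough sketch rather than an official procedure. It relies on the convention that a node condition other than `Ready` signals a problem when its status is `True`. The helper names are hypothetical, and this page does not state the exact condition type the detector sets, so the filter is deliberately generic. The CloudWatch call assumes that AWS CLI credentials and a region are already configured.

```shell
# Print a node's conditions as "type<TAB>status<TAB>reason", one per line.
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
}

# Keep only conditions that signal a problem:
# status "True" on any condition type other than "Ready".
problem_conditions() {
  awk -F '\t' '$1 != "Ready" && $2 == "True" { print $1 ": " $3 }'
}

# List the metric names currently present in the NeuronHealthCheck namespace.
list_health_metrics() {
  aws cloudwatch list-metrics --namespace NeuronHealthCheck \
    --query 'Metrics[].MetricName' --output text
}

# Usage (requires cluster and AWS access):
#   node_conditions <node-name> | problem_conditions
#   list_health_metrics
```

If `problem_conditions` prints nothing, the node reports no active problem conditions.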