This document is relevant for: Inf1, Inf2, Trn1, Trn2

The Neuron node problem detector and recovery artifact checks the health of Neuron devices on each Kubernetes node. When it detects an unrecoverable Neuron error, it triggers a node replacement. To get started with the Neuron node problem detector and recovery, make sure that the requirements below are satisfied, then follow the installation steps:

  • The Neuron node problem detector and recovery requires Neuron driver 2.15 or later and a Neuron runtime from SDK 2.18 or later.

  • Make sure prerequisites are satisfied. This includes prerequisites for getting started with Kubernetes containers and prerequisites for the Neuron node problem detector and recovery.

  • Install the Neuron node problem detector and recovery as a DaemonSet on the cluster with the following command:

    Note

    The installation pulls the container image from the upstream node problem detector repository, registry.k8s.io/node-problem-detector.

    helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart
    
  • By default, the Neuron node problem detector and recovery runs in monitor-only mode. To enable the recovery functionality, set npd.nodeRecovery.enabled to true:

    helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
        --set "npd.nodeRecovery.enabled=true"
    
  • Verify that the Neuron node problem detector and recovery is running:

    kubectl get pod -n neuron-healthcheck-system
    

    Expected result (with 4 nodes in cluster):

    NAME                          READY   STATUS    RESTARTS   AGE
    node-problem-detector-7qcrj   1/1     Running   0          59s
    node-problem-detector-j45t5   1/1     Running   0          59s
    node-problem-detector-mr2cl   1/1     Running   0          59s
    node-problem-detector-vpjtk   1/1     Running   0          59s
    
  • When an unrecoverable error occurs, the Neuron node problem detector and recovery publishes a metric under the CloudWatch namespace NeuronHealthCheck. The error is also reflected in the node's NodeCondition, which you can inspect with kubectl describe node.
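
The checks described in the steps above can be wrapped in a few small shell helpers. This is a minimal sketch, not part of the Neuron tooling: the function names are hypothetical, it assumes kubectl is pointed at the cluster, and the CloudWatch helper assumes the AWS CLI is configured with credentials and the region where the metrics are published. Only the namespace neuron-healthcheck-system and the CloudWatch namespace NeuronHealthCheck come from this page.

```shell
# Hypothetical helper functions; the names are illustrative only.

# Kubernetes namespace used in the verification step on this page.
NPD_NAMESPACE="neuron-healthcheck-system"

# Show the detector pods together with the node each one runs on.
npd_pods() {
  kubectl get pods -n "$NPD_NAMESPACE" -o wide
}

# Print type, status, and reason for every condition on one node.
# Usage: node_conditions <node-name>
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
}

# List whatever metrics currently exist in the NeuronHealthCheck namespace.
# Assumption: the AWS CLI region matches the region the metrics are published in.
neuron_health_metrics() {
  aws cloudwatch list-metrics --namespace NeuronHealthCheck
}
```

After sourcing the snippet, node_conditions <node-name> gives a compact view of the same conditions that kubectl describe node prints. The exact condition types and metric names reported for Neuron errors are not listed on this page, so the helpers print everything rather than filtering on a guessed name.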
