This document is relevant for: Inf1, Inf2, Trn1, Trn1n

Deploy Containers with Neuron#

In this section you will find resources to help you use containers to deploy and accelerate your deep learning models on Inferentia and Trainium enabled instances.

The section is organized by target deployment environment and use case. In most cases, we recommend using a preconfigured Deep Learning Container (DLC) from AWS. Each DLC comes with all of the Neuron components installed and is specific to the chosen ML framework.

Locate Neuron DLC image

The PyTorch Neuron DLC images are published to Amazon ECR Public, which is the recommended registry for most cases. If you are working within AWS SageMaker, use the Amazon ECR URL instead of the Amazon ECR Public one because of SageMaker restrictions. TensorFlow DLCs are not updated with the latest release; for earlier releases, please check here.

PyTorch 2.1.2 (inference)
  • Neuron packages: aws-neuronx-tools, neuronx_distributed, torch-neuronx, transformers-neuronx
  • Supported EC2 instance types: trn1 and inf2
  • Python version options: 3.10 (py310)
  • ECR Public repo URL: https://gallery.ecr.aws/neuron/pytorch-inference-neuronx
  • Image details: https://github.com/aws-neuron/deep-learning-containers#pytorch-inference-neuronx
  • Other packages: torchserve

PyTorch 2.1.2 (training)
  • Neuron packages: aws-neuronx-tools, neuronx_distributed, torch-neuronx
  • Supported EC2 instance types: trn1 and inf2
  • Python version options: 3.10 (py310)
  • ECR Public repo URL: https://gallery.ecr.aws/neuron/pytorch-training-neuronx
  • Image details: https://github.com/aws-neuron/deep-learning-containers#pytorch-training-neuronx

PyTorch 1.13.1 (inference, inf1)
  • Neuron packages: aws-neuronx-tools, torch-neuron
  • Supported EC2 instance types: inf1
  • Python version options: 3.10 (py310)
  • ECR Public repo URL: https://gallery.ecr.aws/neuron/pytorch-inference-neuron
  • Image details: https://github.com/aws-neuron/deep-learning-containers#pytorch-inference-neuron
  • Other packages: torchserve

PyTorch 1.13.1 (inference, trn1 and inf2)
  • Neuron packages: aws-neuronx-tools, neuronx_distributed, torch-neuronx, transformers-neuronx
  • Supported EC2 instance types: trn1 and inf2
  • Python version options: 3.10 (py310)
  • ECR Public repo URL: https://gallery.ecr.aws/neuron/pytorch-inference-neuronx
  • Image details: https://github.com/aws-neuron/deep-learning-containers#pytorch-inference-neuronx
  • Other packages: torchserve

PyTorch 1.13.1 (training)
  • Neuron packages: aws-neuronx-tools, neuronx_distributed, torch-neuronx
  • Supported EC2 instance types: trn1 and inf2
  • Python version options: 3.10 (py310)
  • ECR Public repo URL: https://gallery.ecr.aws/neuron/pytorch-training-neuronx
  • Image details: https://github.com/aws-neuron/deep-learning-containers#pytorch-training-neuronx

Find the latest Neuron DLC images

On the DLC release page, search for "Neuron" to get the ECR repo location of a specific Neuron DLC release.
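For example, after identifying the repository for your framework and job type, you can pull the image locally with Docker. The tag below is a placeholder; use one of the tags listed in the ECR Public gallery for that repository.

# Pull a Neuron DLC from ECR Public (replace <tag> with a tag from the gallery)
docker pull public.ecr.aws/neuron/pytorch-inference-neuronx:<tag>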

Getting Started
Launch Trn1 Instance
  • Follow the instructions at launch an Amazon EC2 Instance to launch an instance. When choosing the instance type at the EC2 console, make sure to select the correct instance type.

  • To get more information about instance sizes and pricing, see the Trn1 web page, Inf2 web page, and Inf1 web page.

  • Select your Amazon Machine Image (AMI) of choice. Note that Neuron supports the Amazon Linux 2 AMI (HVM) - Kernel 5.10.

  • When launching a Trn1, please adjust your primary EBS volume size to a minimum of 512GB.

  • After launching the instance, follow the instructions in Connect to your instance to connect.

Note

If you face a connectivity issue during the model loading process on a Trn1 instance running Ubuntu, it is likely caused by an Ubuntu limitation with multiple network interfaces. To solve this problem, please follow the steps mentioned here.

Users are highly encouraged to use DLAMI to launch the instances, since DLAMIs come with the required fix.

Install Drivers
# Configure Linux for Neuron repository updates

sudo tee /etc/yum.repos.d/neuron.repo > /dev/null <<EOF
[neuron]
name=Neuron YUM Repository
baseurl=https://yum.repos.neuron.amazonaws.com
enabled=1
metadata_expire=0
EOF
sudo rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

# Update OS packages
sudo yum update -y

# Install OS headers
sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y

# Remove preinstalled packages and Install Neuron Driver and Runtime
sudo yum remove aws-neuron-dkms -y
sudo yum remove aws-neuronx-dkms -y
sudo yum install aws-neuronx-dkms-2.*  -y

# Install EFA Driver(only required for multi-instance training)
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
wget https://efa-installer.amazonaws.com/aws-efa-installer.key && gpg --import aws-efa-installer.key
cat aws-efa-installer.key | gpg --fingerprint
wget https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz.sig && gpg --verify ./aws-efa-installer-latest.tar.gz.sig
tar -xvf aws-efa-installer-latest.tar.gz
cd aws-efa-installer && sudo bash efa_installer.sh --yes
cd
sudo rm -rf aws-efa-installer-latest.tar.gz aws-efa-installer
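
As a quick optional sanity check, you can confirm that the Neuron driver module is loaded and that the Neuron devices are visible; the number of /dev/neuron* entries depends on the instance size:

# Verify the Neuron driver is loaded and devices are present
lsmod | grep neuron
ls /dev/neuron*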
Install Docker
sudo yum install -y docker
# Start the Docker service and enable it at boot
sudo systemctl enable --now docker
sudo usermod -aG docker $USER

Log out and log back in to refresh group membership.

Verify Docker
docker run hello-world

Expected result:

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
(amd64)
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/

For more examples and ideas, visit:
https://docs.docker.com/get-started/
Verify Neuron Component

Once the environment is set up, a container can be started with --device=/dev/neuron# to specify the desired set of Inferentia/Trainium devices to be exposed to the container. To find the available Neuron devices on your instance, use the command ls /dev/neuron*.

When running neuron-ls inside a container, you will only see the set of exposed Trainium devices. For example:

docker run --device=/dev/neuron0 neuron-test neuron-ls

This would produce the following output on a trn1.32xlarge:

+--------+--------+--------+---------+
| NEURON | NEURON | NEURON |   PCI   |
| DEVICE | CORES  | MEMORY |   BDF   |
+--------+--------+--------+---------+
| 0      | 2      | 32 GB  | 10:1c.0 |
+--------+--------+--------+---------+
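
To expose more than one device, pass multiple --device flags. For example, the following (using the same placeholder image name as above) exposes two devices to the container:

docker run --device=/dev/neuron0 --device=/dev/neuron1 neuron-test neuron-ls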
Build and Run Docker Image
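
As a minimal sketch of this flow (the base image tag and application contents below are placeholders, not a prescribed layout), you could extend a Neuron DLC and run the result with a Neuron device exposed:

# Write a minimal Dockerfile that extends a Neuron DLC base image
cat > Dockerfile <<'EOF'
FROM public.ecr.aws/neuron/pytorch-inference-neuronx:<tag>
# Add your application here, for example:
# COPY app.py /opt/ml/app.py
# CMD ["python3", "/opt/ml/app.py"]
EOF

# Build the image and run it with one Neuron device exposed
docker build -t neuron-test .
docker run --device=/dev/neuron0 neuron-test neuron-ls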
Run Tutorial
Install Drivers
# Configure Linux for Neuron repository updates
sudo tee /etc/yum.repos.d/neuron.repo > /dev/null <<EOF
[neuron]
name=Neuron YUM Repository
baseurl=https://yum.repos.neuron.amazonaws.com
enabled=1
metadata_expire=0
EOF
sudo rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

# Update OS packages
sudo yum update -y

################################################################################################################
# To install or update to Neuron versions 1.19.1 and newer from previous releases:
# - DO NOT skip 'aws-neuron-dkms' install or upgrade step, you MUST install or upgrade to latest Neuron driver
################################################################################################################

# Install OS headers
sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y

# Install Neuron Driver
sudo yum install aws-neuron-dkms -y

####################################################################################
# Warning: If Linux kernel is updated as a result of OS package update
#          Neuron driver (aws-neuron-dkms) should be re-installed after reboot
####################################################################################
Install Docker
sudo yum install -y docker
# Start the Docker service and enable it at boot
sudo systemctl enable --now docker
sudo usermod -aG docker $USER

Log out and log back in to refresh group membership.

Verify Docker
docker run hello-world

Expected result:

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
(amd64)
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/

For more examples and ideas, visit:
https://docs.docker.com/get-started/
Verify Neuron Component

Once the environment is set up, a container can be started with --device=/dev/neuron# to specify the desired set of Inferentia/Trainium devices to be exposed to the container. To find the available Neuron devices on your instance, use the command ls /dev/neuron*.

When running neuron-ls inside a container, you will only see the set of exposed Inferentia devices. For example:

docker run --device=/dev/neuron0 neuron-test neuron-ls

This would produce the following output on an inf1.xlarge:

+--------------+---------+--------+-----------+-----------+------+------+
|   PCI BDF    | LOGICAL | NEURON |  MEMORY   |  MEMORY   | EAST | WEST |
|              |   ID    | CORES  | CHANNEL 0 | CHANNEL 1 |      |      |
+--------------+---------+--------+-----------+-----------+------+------+
| 0000:00:1f.0 |       0 |      4 | 4096 MB   | 4096 MB   |    0 |    0 |
+--------------+---------+--------+-----------+-----------+------+------+
Run Tutorial
Kubernetes Getting Started
Prerequisite
  • Working kubernetes cluster

  • Inf1/Trn1 instances as worker nodes with attached roles allowing ECR read access to retrieve container images from ECR: arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly

  • Tutorial Docker environment setup to install the required packages on the worker nodes. With EKS, the EKS optimized accelerated AMI has the necessary Neuron components installed.

  • Kubernetes node object has instance-type set to inf1/trn1 types. For example, "node.kubernetes.io/instance-type": "inf1.2xlarge" or "node.kubernetes.io/instance-type": "trn1.2xlarge". A quick way to confirm this label is shown below.
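
To confirm the label on your worker nodes (a quick check, assuming kubectl access to the cluster), list the nodes with the instance-type label column:

kubectl get nodes -L node.kubernetes.io/instance-type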

Prerequisite for Neuron Problem Detector Plugin

Neuron node problem detection and recovery is authorized via IAM roles for service accounts. For more information, see IAM roles for service accounts in the Amazon EKS User Guide. This documentation shows how to configure an IAM role for service accounts using the command line tool eksctl. Follow the instructions below to configure IAM authorization for service accounts:

  • Install the eksctl CLI using instructions listed at https://eksctl.io/installation/.

  • Create a policy as shown below:

    Policy template

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "autoscaling:SetInstanceHealth",
                    "autoscaling:DescribeAutoScalingInstances"
                ],
                "Effect": "Allow",
                "Resource": <arn of the Auto Scaling group corresponding to the Neuron nodes for the cluster>
            },
            {
                "Action": [
                    "ec2:DescribeInstances"
                ],
                "Effect": "Allow",
                "Resource": "*",
                "Condition": {
                    "ForAllValues:StringEquals": {
                        "ec2:ResourceTag/aws:autoscaling:groupName": <name of the Auto Scaling group corresponding to the Neuron nodes for the cluster>
                    }
                }
            },
            {
                "Action": [
                    "cloudwatch:PutMetricData"
                ],
                "Effect": "Allow",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        "cloudwatch:Namespace": "NeuronHealthCheck"
                    }
                }
            }
        ]
    }
    

    To create the policy, the AWS CLI can be used as shown below, where npd-policy-trimmed.json is the JSON policy constructed from the template above.

    aws iam create-policy   \
      --policy-name NeuronProblemDetectorPolicy \
      --policy-document file://npd-policy-trimmed.json
    
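    The next step needs this policy's ARN. One way to look it up afterwards (a sketch using the AWS CLI) is:

    aws iam list-policies --scope Local \
      --query "Policies[?PolicyName=='NeuronProblemDetectorPolicy'].Arn" --output text
    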
  • Associate the authorization with the service account using the following script:

    #!/bin/bash
    CLUSTER_NAME=<eks cluster name>
    REGION_CODE=$(aws configure get region)
    POLICY_ARN=<policy arn for NeuronProblemDetectorPolicy>
    
    eksctl create iamserviceaccount \
        --name node-problem-detector \
        --namespace neuron-healthcheck-system \
        --cluster $CLUSTER_NAME \
        --attach-policy-arn $POLICY_ARN \
        --approve \
        --role-name neuron-problem-detector-role-$CLUSTER_NAME \
        --region $REGION_CODE \
        --override-existing-serviceaccounts
    
  • Verify that the service account is annotated correctly. An example is shown below:

    kubectl describe sa node-problem-detector -n neuron-healthcheck-system
    Name:                node-problem-detector
    Namespace:           neuron-healthcheck-system
    Labels:              app.kubernetes.io/managed-by=eksctl
    Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::111111111111:role/neuron-problem-detector-role-cluster1
    Image pull secrets:  <none>
    Mountable secrets:   <none>
    Tokens:              <none>
    Events:              <none>
    
  • To cleanup, deletion of the service account can be done using the following command:

    #!/bin/bash
    CLUSTER_NAME=<eks cluster name>
    REGION_CODE=$(aws configure get region)
    
    eksctl delete iamserviceaccount \
        --name node-problem-detector \
        --namespace neuron-healthcheck-system \
        --cluster $CLUSTER_NAME \
        --approve \
        --region $REGION_CODE \
    
Deploy Neuron Device Plugin

The Neuron device plugin exposes Neuron cores and devices to Kubernetes as resources. aws.amazon.com/neuroncore, aws.amazon.com/neurondevice, and aws.amazon.com/neuron are the resources that the Neuron device plugin registers with Kubernetes. aws.amazon.com/neuroncore is used for allocating Neuron cores to a container. aws.amazon.com/neurondevice is used for allocating Neuron devices to a container; when neurondevice is used, all cores belonging to the device are allocated to the container. aws.amazon.com/neuron also allocates Neuron devices, and the resource name 'neuron' is recommended for allocating devices to a container. Neuron will end support for the resource name 'neurondevice' in a future release; please check announcements for updates.
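
As an illustration of the recommended resource name (a minimal sketch; the pod name and image are placeholders, not part of the plugin itself), a pod that should receive one full Neuron device can request it through its resource limits:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: neuron-device-example   # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: app
      image: <IMAGE_NAME>       # placeholder image with Neuron tools installed
      command: ["neuron-ls"]
      resources:
        limits:
          aws.amazon.com/neuron: 1
EOF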

  • Make sure the prerequisites are satisfied

  • Download the neuron device plugin yaml file. k8s-neuron-device-plugin.yml

  • Download the neuron device plugin rbac yaml file. This enables permissions for device plugin to update the node and Pod annotations. k8s-neuron-device-plugin-rbac.yml

  • Apply the Neuron device plugin as a daemonset on the cluster with the following command

    kubectl apply -f k8s-neuron-device-plugin-rbac.yml
    kubectl apply -f k8s-neuron-device-plugin.yml
    
  • Verify that neuron device plugin is running

    kubectl get ds neuron-device-plugin-daemonset --namespace kube-system
    

    Expected result (with 2 nodes in cluster):

    NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
    neuron-device-plugin-daemonset   2         2         2       2            2           <none>          27h
    
  • Verify that the node has allocatable neuron cores and devices with the following command

    kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronCore:.status.allocatable.aws\.amazon\.com/neuroncore"
    

    Expected result:

    NAME                                          NeuronCore
    ip-192-168-65-41.us-west-2.compute.internal   32
    ip-192-168-87-81.us-west-2.compute.internal   32
    
    kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronDevice:.status.allocatable.aws\.amazon\.com/neurondevice"
    

    Expected result:

    NAME                                          NeuronDevice
    ip-192-168-65-41.us-west-2.compute.internal   16
    ip-192-168-87-81.us-west-2.compute.internal   16
    
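    Similarly, the aws.amazon.com/neuron resource can be checked with the same pattern (assuming the plugin has registered it on your nodes):

    kubectl get nodes "-o=custom-columns=NAME:.metadata.name,Neuron:.status.allocatable.aws\.amazon\.com/neuron"
    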
Deploy Neuron Scheduler Extension

The Neuron scheduler extension is required for scheduling pods that require more than one Neuron core or device resource. For a graphical depiction of how the Neuron scheduler extension works, see Neuron Scheduler Extension Flow Diagram. The Neuron scheduler extension finds sets of directly connected devices with minimal communication latency when scheduling containers. On Inf1 and Inf2 instance types where Neuron devices are connected through a ring topology, the scheduler finds sets of contiguous devices. For example, for a container requesting 3 Neuron devices the scheduler might assign Neuron devices 0,1,2 to the container if they are available but never devices 0,2,4 because those devices are not directly connected. On Trn1.32xlarge and Trn1n.32xlarge instance types where devices are connected through a 2D torus topology, the Neuron scheduler enforces additional constraints that containers request 1, 4, 8, or all 16 devices. If your container requires a different number of devices, such as 2 or 5, we recommend that you use an Inf2 instance instead of Trn1 to benefit from more advanced topology.

The Neuron scheduler extension applies different rules when finding devices to allocate to a container on Inf1 and Inf2 instances than on Trn1. These rules ensure that when users request a specific number of resources, Neuron delivers consistent and high performance regardless of which cores and devices are assigned to the container.

On Inf1 and Inf2, Neuron devices are connected through a ring topology. There is no restriction on the number of devices requested as long as it does not exceed the number of devices on a node. When the user requests N devices, the scheduler finds a node where N contiguous devices are available; it never allocates non-contiguous devices to the same container. The figure below shows examples of device sets on an inf2.48xlarge node that could be assigned to a container given a request for 2 devices.

eks-inf2-device-set

Devices on Trn1.32xlarge and Trn1n.32xlarge nodes are connected via a 2D torus topology. On Trn1 nodes, containers can request 1, 4, 8, or all 16 devices. If you request an invalid number of devices, such as 7, your pod will not be scheduled and you will receive a warning: Instance type trn1.32xlarge does not support requests for device: 7. Please request a different number of devices.

When requesting 4 devices, your container will be allocated one of the following sets of devices if they are available. eks-trn1-device-set4

When requesting 8 devices, your container will be allocated one of the following sets of devices if they are available. eks-trn1-device-set8

For all instance types, requesting one or all Neuron cores or devices is valid.

In cluster environments where there is no access to the default scheduler, the Neuron scheduler extension can be used with another scheduler. A new scheduler is added alongside the default scheduler, the Neuron scheduler extension is registered with it, and pods that run Neuron workloads then use this new scheduler. EKS does not yet natively support the Neuron scheduler extension, so in an EKS environment this is the only way to add it.

  • Make sure Neuron device plugin is running

  • Download the my-scheduler yaml file: my-scheduler.yml

  • Download the scheduler extension k8s-neuron-scheduler-eks.yml

  • Apply the neuron-scheduler-extension

    kubectl apply -f k8s-neuron-scheduler-eks.yml
    
  • Apply the my-scheduler.yml

    kubectl apply -f my-scheduler.yml
    
  • Check that there are no errors in the my-scheduler pod logs and that the k8s-neuron-scheduler pod is bound to a node

    kubectl logs -n kube-system my-scheduler-79bd4cb788-hq2sq
    
    I1012 15:30:21.629611       1 scheduler.go:604] "Successfully bound pod to node" pod="kube-system/k8s-neuron-scheduler-5d9d9d7988-xcpqm" node="ip-192-168-2-25.ec2.internal" evaluatedNodes=1 feasibleNodes=1
    
  • When running new pods that need to use the Neuron scheduler extension, make sure they use my-scheduler as the scheduler. A sample pod spec is below:

    apiVersion: v1
    kind: Pod
    metadata:
      name: <POD_NAME>
    spec:
      restartPolicy: Never
      schedulerName: my-scheduler
      containers:
        - name: <POD_NAME>
          command: ["<COMMAND>"]
          image: <IMAGE_NAME>
          resources:
            limits:
              cpu: "4"
              memory: 4Gi
              aws.amazon.com/neuroncore: 9
            requests:
              cpu: "1"
              memory: 1Gi
    
    
  • Once the Neuron workload pod is running, make sure the k8s-neuron-scheduler logs show successful filter and bind requests:

    kubectl logs -n kube-system k8s-neuron-scheduler-5d9d9d7988-xcpqm
    
    2022/10/12 15:41:16 POD nrt-test-5038 fits in Node:ip-192-168-2-25.ec2.internal
    2022/10/12 15:41:16 Filtered nodes: [ip-192-168-2-25.ec2.internal]
    2022/10/12 15:41:16 Failed nodes: map[]
    2022/10/12 15:41:16 Finished Processing Filter Request...
    
    2022/10/12 15:41:16 Executing Bind Request!
    2022/10/12 15:41:16 Determine if the pod %v is NeuronDevice podnrt-test-5038
    2022/10/12 15:41:16 Updating POD Annotation with alloc devices!
    2022/10/12 15:41:16 Return aws.amazon.com/neuroncore
    2022/10/12 15:41:16 neuronDevUsageMap for resource:aws.amazon.com/neuroncore in node: ip-192-168-2-25.ec2.internal is [false false false false false false false false false false false false false false false false]
    2022/10/12 15:41:16 Allocated ids for POD nrt-test-5038 are: 0,1,2,3,4,5,6,7,8
    2022/10/12 15:41:16 Try to bind pod nrt-test-5038 in default namespace to node ip-192-168-2-25.ec2.internal with &Binding{ObjectMeta:{nrt-test-5038    8da590b1-30bc-4335-b7e7-fe574f4f5538  0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []},Target:ObjectReference{Kind:Node,Namespace:,Name:ip-192-168-2-25.ec2.internal,UID:,APIVersion:,ResourceVersion:,FieldPath:,},}
    2022/10/12 15:41:16 Updating the DevUsageMap since the bind is successful!
    2022/10/12 15:41:16 Return aws.amazon.com/neuroncore
    2022/10/12 15:41:16 neuronDevUsageMap for resource:aws.amazon.com/neuroncore in node: ip-192-168-2-25.ec2.internal is [false false false false false false false false false false false false false false false false]
    2022/10/12 15:41:16 neuronDevUsageMap for resource:aws.amazon.com/neurondevice in node: ip-192-168-2-25.ec2.internal is [false false false false]
    2022/10/12 15:41:16 Allocated devices list 0,1,2,3,4,5,6,7,8 for resource aws.amazon.com/neuroncore
    2022/10/12 15:41:16 Allocated devices list [0] for other resource aws.amazon.com/neurondevice
    2022/10/12 15:41:16 Allocated devices list [0] for other resource aws.amazon.com/neurondevice
    2022/10/12 15:41:16 Allocated devices list [0] for other resource aws.amazon.com/neurondevice
    2022/10/12 15:41:16 Allocated devices list [0] for other resource aws.amazon.com/neurondevice
    2022/10/12 15:41:16 Allocated devices list [1] for other resource aws.amazon.com/neurondevice
    2022/10/12 15:41:16 Allocated devices list [1] for other resource aws.amazon.com/neurondevice
    2022/10/12 15:41:16 Allocated devices list [1] for other resource aws.amazon.com/neurondevice
    2022/10/12 15:41:16 Allocated devices list [1] for other resource aws.amazon.com/neurondevice
    2022/10/12 15:41:16 Allocated devices list [2] for other resource aws.amazon.com/neurondevice
    2022/10/12 15:41:16 Return aws.amazon.com/neuroncore
    2022/10/12 15:41:16 Succesfully updated the DevUsageMap [true true true true true true true true true false false false false false false false]  and otherDevUsageMap [true true true false] after alloc for node ip-192-168-2-25.ec2.internal
    2022/10/12 15:41:16 Finished executing Bind Request...
    
  • Make sure Neuron device plugin is running

  • Download the scheduler config map k8s-neuron-scheduler-configmap.yml

  • Download the scheduler extension k8s-neuron-scheduler.yml

  • Enable the kube-scheduler with the option to use a ConfigMap for the scheduler policy. In your cluster.yml, update the spec section with the following:

    spec:
      kubeScheduler:
        usePolicyConfigMap: true
    
    
  • Launch the cluster

    kops create -f cluster.yml
    kops create secret --name neuron-test-1.k8s.local sshpublickey admin -i ~/.ssh/id_rsa.pub
    kops update cluster --name neuron-test-1.k8s.local --yes
    
  • Apply the k8s-neuron-scheduler-configmap.yml [Registers neuron-scheduler-extension with kube-scheduler]

    kubectl apply -f k8s-neuron-scheduler-configmap.yml
    
  • Launch the neuron-scheduler-extension

    kubectl apply -f k8s-neuron-scheduler.yml
    
Deploy Neuron Problem Detector And Recovery

The Neuron node problem detector and recovery artifact checks the health of Neuron devices on each Kubernetes node. After detecting an unrecoverable Neuron error, it triggers a node replacement. To get started with the Neuron node problem detector and recovery, make sure that the following requirements are satisfied:

  • The Neuron node problem detector and recovery requires Neuron driver 2.15 or later and a runtime at SDK 2.18 or later.

  • Make sure prerequisites are satisfied. This includes prerequisites for getting started with Kubernetes containers and prerequisites for the Neuron node problem detector and recovery.

  • Download the Neuron node problem detector and recovery YAML file: k8s-neuron-problem-detector-and-recovery.yml.

    Note

    This YAML pulls the container image from the upstream node problem detector repository, registry.k8s.io/node-problem-detector.

  • Download the Neuron node problem detector and recovery configuration file: k8s-neuron-problem-detector-and-recovery-config.yml.

  • Download the Neuron node problem detector and recovery RBAC YAML file. This enables permissions for the Neuron node problem detector and recovery to update the node condition: k8s-neuron-problem-detector-and-recovery-rbac.yml.

  • By default, the Neuron node problem detector and recovery runs in monitor-only mode. To enable the recovery functionality, update the environment variable in the YAML file:

    - name: ENABLE_RECOVERY
      value: "true"
    

Apply the Neuron node problem detector and recovery as a DaemonSet on the cluster with the following command:

kubectl create ns neuron-healthcheck-system
kubectl apply -f k8s-neuron-problem-detector-and-recovery-rbac.yml
kubectl apply -f k8s-neuron-problem-detector-and-recovery-config.yml
kubectl apply -f k8s-neuron-problem-detector-and-recovery.yml

Verify that the Neuron node problem detector and recovery pods are running:

kubectl get pod -n neuron-healthcheck-system

Expected result (with 4 nodes in cluster):

NAME                          READY   STATUS    RESTARTS   AGE
node-problem-detector-7qcrj   1/1     Running   0          59s
node-problem-detector-j45t5   1/1     Running   0          59s
node-problem-detector-mr2cl   1/1     Running   0          59s
node-problem-detector-vpjtk   1/1     Running   0          59s

When an unrecoverable error occurs, the Neuron node problem detector and recovery publishes a metric under the CloudWatch namespace NeuronHealthCheck. The error is also reflected in the node conditions and can be seen with kubectl describe node.
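
For example, you can check for published metrics with the AWS CLI and inspect the node conditions with kubectl (the node name is a placeholder):

aws cloudwatch list-metrics --namespace NeuronHealthCheck
kubectl describe node <node-name>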

Deploy Neuron Monitor Daemonset

Neuron monitor is the primary observability tool for Neuron devices. For details on neuron monitor, please refer to the neuron monitor guide. This tutorial describes deploying neuron monitor as a DaemonSet on the Kubernetes cluster.

  • Download the neuron monitor yaml file. k8s-neuron-monitor-daemonset.yml

  • Apply the Neuron monitor yaml to create a daemonset on the cluster with the following command

    kubectl apply -f k8s-neuron-monitor-daemonset.yml
    
  • Verify that neuron monitor daemonset is running

    kubectl get ds neuron-monitor --namespace neuron-monitor
    

    Expected result (with 2 nodes in cluster):

    NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
    neuron-monitor                     2         2         2       2            2           <none>          27h
    
  • Get the neuron-monitor pod names
    kubectl get pods
    

    Expected result

    NAME                   READY   STATUS    RESTARTS   AGE
    neuron-monitor-slsxf   1/1     Running   0          17m
    neuron-monitor-wc4f5   1/1     Running   0          17m
    
  • Verify the prometheus endpoint is available
    kubectl exec neuron-monitor-wc4f5 -- wget -q --output-document - http://127.0.0.1:8000
    

    Expected result

    # HELP python_gc_objects_collected_total Objects collected during gc
    # TYPE python_gc_objects_collected_total counter
    python_gc_objects_collected_total{generation="0"} 362.0
    python_gc_objects_collected_total{generation="1"} 0.0
    python_gc_objects_collected_total{generation="2"} 0.0
    # HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
    # TYPE python_gc_objects_uncollectable_total counter
    
Tutorials
Developer Flows
FAQ, Troubleshooting and Release Note

This document is relevant for: Inf1, Inf2, Trn1, Trn1n