This document is relevant for: Inf1, Inf2, Trn1, Trn1n

Deploy Containers with Neuron

This section collects resources to help you run deep learning workloads in containers on Inferentia- and Trainium-enabled instances.

The section is organized based on the target deployment environment and use case. In most cases, it is recommended to use a preconfigured Deep Learning Container (DLC) from AWS. Each DLC is pre-configured to have all of the Neuron components installed and is specific to the chosen ML Framework.

Locate Neuron DLC image

Latest Neuron DLC images

Find the latest Neuron DLC images.

Locate specific Neuron DLC release

On the DLC release page, search for "Neuron" to find the ECR repository location of a specific Neuron DLC release.

Getting Started

Training

Launch Trn1 Instance

Install Drivers

# Configure Linux for Neuron repository updates

sudo tee /etc/yum.repos.d/neuron.repo > /dev/null <<EOF
[neuron]
name=Neuron YUM Repository
baseurl=https://yum.repos.neuron.amazonaws.com
enabled=1
metadata_expire=0
EOF
sudo rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

# Update OS packages
sudo yum update -y

# Install OS headers
sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y

# Remove preinstalled packages and Install Neuron Driver and Runtime
sudo yum remove aws-neuron-dkms -y
sudo yum remove aws-neuronx-dkms -y
sudo yum install aws-neuronx-dkms-2.*  -y

# Install EFA Driver (only required for multi-instance training)
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
wget https://efa-installer.amazonaws.com/aws-efa-installer.key && gpg --import aws-efa-installer.key
cat aws-efa-installer.key | gpg --fingerprint
wget https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz.sig && gpg --verify ./aws-efa-installer-latest.tar.gz.sig
tar -xvf aws-efa-installer-latest.tar.gz
cd aws-efa-installer && sudo bash efa_installer.sh --yes
cd
sudo rm -rf aws-efa-installer-latest.tar.gz aws-efa-installer

Install Docker

sudo yum install -y docker
sudo usermod -aG docker $USER

Log out and log back in to refresh group membership.

Verify Docker

docker run hello-world

Expected result:

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
(amd64)
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/

For more examples and ideas, visit:
https://docs.docker.com/get-started/

Verify Neuron Component

Once the environment is set up, start a container with --device=/dev/neuron# (where # is the device index) to specify which Inferentia/Trainium devices are exposed to the container. To list the Neuron devices available on your instance, run ls /dev/neuron*.

When running neuron-ls inside a container, you will only see the set of exposed Trainiums. For example:

docker run --device=/dev/neuron0 neuron-test neuron-ls

On a trn1.32xlarge, this would produce the following output:

   +--------+--------+--------+---------+
   | NEURON | NEURON | NEURON |   PCI   |
   | DEVICE | CORES  | MEMORY |   BDF   |
   +--------+--------+--------+---------+
   | 0      | 2      | 32 GB  | 10:1c.0 |
   +--------+--------+--------+---------+
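
To expose every Neuron device on the instance rather than a single one, the --device flags can be generated from the device nodes. The following is a minimal sketch, not part of the Neuron tooling: neuron_device_flags is a hypothetical helper name, and it assumes every entry matching /dev/neuron* is a device node that should be exposed.

```shell
# neuron_device_flags [DIR]
# Hypothetical helper: prints one --device flag per entry matching DIR/neuron*
# (DIR defaults to /dev). Prints an empty line when nothing matches.
# The body runs in a subshell so its variables do not leak into the caller.
neuron_device_flags() (
    dev_dir="${1:-/dev}"
    flags=""
    for dev in "$dev_dir"/neuron*; do
        # When the glob matches nothing it stays unexpanded; skip it.
        [ -e "$dev" ] || continue
        flags="$flags --device=$dev"
    done
    # Drop the leading space before printing.
    echo "${flags# }"
)
```

For example, docker run $(neuron_device_flags) neuron-test neuron-ls would pass one --device flag per device found; the command substitution is left unquoted on purpose so that each flag becomes a separate argument.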

Build and Run Docker Image

Tutorial: How to Build and Run a Neuron Container

Run Tutorial

Run training in PyTorch Neuron container

Inference

Launch Inf1 Instance

Install Drivers

# Configure Linux for Neuron repository updates
sudo tee /etc/yum.repos.d/neuron.repo > /dev/null <<EOF
[neuron]
name=Neuron YUM Repository
baseurl=https://yum.repos.neuron.amazonaws.com
enabled=1
metadata_expire=0
EOF
sudo rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

# Update OS packages
sudo yum update -y

################################################################################################################
# To install or update to Neuron versions 1.19.1 and newer from previous releases:
# - DO NOT skip 'aws-neuron-dkms' install or upgrade step, you MUST install or upgrade to latest Neuron driver
################################################################################################################

# Install OS headers
sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y

# Install Neuron Driver
sudo yum install aws-neuron-dkms -y

####################################################################################
# Warning: If Linux kernel is updated as a result of OS package update
#          Neuron driver (aws-neuron-dkms) should be re-installed after reboot
####################################################################################

Install Docker

sudo yum install -y docker
sudo usermod -aG docker $USER

Log out and log back in to refresh group membership.

Verify Docker

docker run hello-world

Expected result:

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
(amd64)
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/

For more examples and ideas, visit:
https://docs.docker.com/get-started/

Verify Neuron Component

When running neuron-ls inside a container, you will only see the set of exposed Inferentias. For example:

docker run --device=/dev/neuron0 neuron-test neuron-ls

On an inf1.xlarge, this would produce the following output:

   +--------------+---------+--------+-----------+-----------+------+------+
   |   PCI BDF    | LOGICAL | NEURON |  MEMORY   |  MEMORY   | EAST | WEST |
   |              |   ID    | CORES  | CHANNEL 0 | CHANNEL 1 |      |      |
   +--------------+---------+--------+-----------+-----------+------+------+
   | 0000:00:1f.0 |       0 |      4 | 4096 MB   | 4096 MB   |    0 |    0 |
   +--------------+---------+--------+-----------+-----------+------+------+

Run Tutorial

Run inference in PyTorch Neuron container

Kubernetes Getting Started

Prerequisite

A working Kubernetes cluster.
Inf1/Trn1 instances as worker nodes, with attached roles that allow:
    * ECR read access, to retrieve container images from ECR: arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
The required packages installed on the worker nodes; see the tutorial "Docker environment setup". With EKS, the EKS optimized accelerated AMI has the necessary Neuron components installed.
A Kubernetes node object whose instance-type label is set to an inf1 or trn1 type, for example "node.kubernetes.io/instance-type": "inf1.2xlarge" or "node.kubernetes.io/instance-type": "trn1.2xlarge".

Deploy Neuron Device Plugin

The Neuron device plugin exposes Neuron cores and devices to Kubernetes as resources. It registers three resources with Kubernetes: aws.amazon.com/neuroncore, aws.amazon.com/neurondevice, and aws.amazon.com/neuron.

aws.amazon.com/neuroncore allocates individual Neuron cores to a container.
aws.amazon.com/neurondevice allocates whole Neuron devices to a container; all cores belonging to an allocated device are allocated with it. This is the recommended resource for allocating devices.
aws.amazon.com/neuron also allocates Neuron devices and exists only for backward compatibility with existing installations.
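
As a rough illustration of requesting whole devices instead of cores, the sketch below prints a Pod manifest whose limits name aws.amazon.com/neurondevice. The helper name neuron_device_pod and its arguments are hypothetical, and the manifest contains only the minimum fields; a request for more than one device additionally depends on the Neuron scheduler extension described later in this page.

```shell
# neuron_device_pod NAME IMAGE [DEVICES]
# Hypothetical helper: prints a minimal Pod manifest that requests DEVICES
# whole Neuron devices (default 1) through aws.amazon.com/neurondevice.
neuron_device_pod() {
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  restartPolicy: Never
  containers:
    - name: $1
      image: $2
      resources:
        limits:
          aws.amazon.com/neurondevice: ${3:-1}
EOF
}
```

The output can be reviewed first and then piped to kubectl apply -f - once the device plugin is running.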

Make sure the prerequisites are satisfied.
Download the Neuron device plugin YAML file: k8s-neuron-device-plugin.yml
Download the Neuron device plugin RBAC YAML file, which grants the device plugin permission to update node and Pod annotations: k8s-neuron-device-plugin-rbac.yml

Apply the Neuron device plugin as a DaemonSet on the cluster with the following commands:

kubectl apply -f k8s-neuron-device-plugin-rbac.yml
kubectl apply -f k8s-neuron-device-plugin.yml

Verify that the Neuron device plugin is running:

kubectl get ds neuron-device-plugin-daemonset --namespace kube-system

Expected result (with 2 nodes in cluster):

NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-device-plugin-daemonset   2         2         2       2            2           <none>          27h

Verify that the nodes have allocatable Neuron cores and devices with the following commands:

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronCore:.status.allocatable.aws\.amazon\.com/neuroncore"

Expected result:

NAME                                          NeuronCore
ip-192-168-65-41.us-west-2.compute.internal   32
ip-192-168-87-81.us-west-2.compute.internal   32

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronDevice:.status.allocatable.aws\.amazon\.com/neurondevice"

Expected result:

NAME                                          NeuronDevice
ip-192-168-65-41.us-west-2.compute.internal   16
ip-192-168-87-81.us-west-2.compute.internal   16
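
Both resources can also be shown in a single table by listing both columns in one custom-columns expression. This sketch only combines the two commands above; neuron_allocatable is a hypothetical wrapper name.

```shell
# neuron_allocatable
# Hypothetical wrapper: lists each node with its allocatable Neuron cores
# and Neuron devices side by side. Requires kubectl access to the cluster.
neuron_allocatable() {
    kubectl get nodes "-o=custom-columns=NAME:.metadata.name,NeuronCore:.status.allocatable.aws\.amazon\.com/neuroncore,NeuronDevice:.status.allocatable.aws\.amazon\.com/neurondevice"
}
```

Call neuron_allocatable with no arguments after the device plugin DaemonSet reports ready.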

Deploy Neuron Scheduler Extension

The Neuron scheduler extension is required for scheduling pods that request more than one Neuron core or device. Refer to the Neuron Scheduler Extension Flow Diagram for details on how it works. The extension filters out nodes with non-contiguous core/device IDs and enforces allocation of contiguous core/device IDs for the pods that require them.
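
To illustrate what contiguous allocation means, the sketch below scans a per-node usage map (one true/false flag per core or device ID, true meaning in use) for the first run of free IDs that is long enough. It is an illustration only: first_contiguous_block is a hypothetical helper, and this is not the scheduler extension's actual implementation.

```shell
# first_contiguous_block COUNT USED...
# Hypothetical helper. USED is a list of true/false flags, one per ID
# starting at 0 (true = in use). COUNT must be at least 1.
# Prints the first ID of the earliest run of COUNT free IDs and returns 0;
# prints nothing and returns 1 when no such run exists.
first_contiguous_block() (
    need="$1"
    shift
    id=0
    run=0
    for used in "$@"; do
        if [ "$used" = "false" ]; then
            run=$((run + 1))   # extend the current run of free IDs
        else
            run=0              # a used ID breaks the run
        fi
        if [ "$run" -eq "$need" ]; then
            echo $((id - need + 1))
            exit 0
        fi
        id=$((id + 1))
    done
    exit 1
)
```

With IDs 0 and 2 in use out of five, first_contiguous_block 2 true false true false false prints 3, because IDs 3 and 4 form the first free pair. A node where the call returns 1 has enough free IDs in total but no contiguous block, which is the situation the filter step is described as rejecting.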

Multiple Scheduler Approach

In cluster environments where there is no access to the default scheduler, the Neuron scheduler extension can be used with another scheduler. A new scheduler is added alongside the default scheduler, the Neuron scheduler extension is attached to it, and pods that run Neuron workloads use this new scheduler. EKS does not yet natively support the Neuron scheduler extension, so in an EKS environment this is the only way to add it.

Make sure Neuron device plugin is running
Download the custom scheduler manifest: my-scheduler.yml
Download the scheduler extension: k8s-neuron-scheduler-eks.yml

Apply the neuron-scheduler-extension

kubectl apply -f k8s-neuron-scheduler-eks.yml

Apply my-scheduler.yml

kubectl apply -f my-scheduler.yml

Check that there are no errors in the my-scheduler pod logs and that the k8s-neuron-scheduler pod is bound to a node

kubectl logs -n kube-system my-scheduler-79bd4cb788-hq2sq

I1012 15:30:21.629611       1 scheduler.go:604] "Successfully bound pod to node" pod="kube-system/k8s-neuron-scheduler-5d9d9d7988-xcpqm" node="ip-192-168-2-25.ec2.internal" evaluatedNodes=1 feasibleNodes=1

When running new pods that need the Neuron scheduler extension, make sure they use my-scheduler as the scheduler. A sample pod spec is below:

apiVersion: v1
kind: Pod
metadata:
  name: <POD_NAME>
spec:
  restartPolicy: Never
  schedulerName: my-scheduler
  containers:
    - name: <POD_NAME>
      command: ["<COMMAND>"]
      image: <IMAGE_NAME>
      resources:
        limits:
          cpu: "4"
          memory: 4Gi
          aws.amazon.com/neuroncore: 9
        requests:
          cpu: "1"
          memory: 1Gi

Once the Neuron workload pod is running, make sure the k8s-neuron-scheduler logs show successful filter and bind requests:

kubectl logs -n kube-system k8s-neuron-scheduler-5d9d9d7988-xcpqm

2022/10/12 15:41:16 POD nrt-test-5038 fits in Node:ip-192-168-2-25.ec2.internal
2022/10/12 15:41:16 Filtered nodes: [ip-192-168-2-25.ec2.internal]
2022/10/12 15:41:16 Failed nodes: map[]
2022/10/12 15:41:16 Finished Processing Filter Request...

2022/10/12 15:41:16 Executing Bind Request!
2022/10/12 15:41:16 Determine if the pod %v is NeuronDevice podnrt-test-5038
2022/10/12 15:41:16 Updating POD Annotation with alloc devices!
2022/10/12 15:41:16 Return aws.amazon.com/neuroncore
2022/10/12 15:41:16 neuronDevUsageMap for resource:aws.amazon.com/neuroncore in node: ip-192-168-2-25.ec2.internal is [false false false false false false false false false false false false false false false false]
2022/10/12 15:41:16 Allocated ids for POD nrt-test-5038 are: 0,1,2,3,4,5,6,7,8
2022/10/12 15:41:16 Try to bind pod nrt-test-5038 in default namespace to node ip-192-168-2-25.ec2.internal with &Binding{ObjectMeta:{nrt-test-5038    8da590b1-30bc-4335-b7e7-fe574f4f5538  0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []},Target:ObjectReference{Kind:Node,Namespace:,Name:ip-192-168-2-25.ec2.internal,UID:,APIVersion:,ResourceVersion:,FieldPath:,},}
2022/10/12 15:41:16 Updating the DevUsageMap since the bind is successful!
2022/10/12 15:41:16 Return aws.amazon.com/neuroncore
2022/10/12 15:41:16 neuronDevUsageMap for resource:aws.amazon.com/neuroncore in node: ip-192-168-2-25.ec2.internal is [false false false false false false false false false false false false false false false false]
2022/10/12 15:41:16 neuronDevUsageMap for resource:aws.amazon.com/neurondevice in node: ip-192-168-2-25.ec2.internal is [false false false false]
2022/10/12 15:41:16 Allocated devices list 0,1,2,3,4,5,6,7,8 for resource aws.amazon.com/neuroncore
2022/10/12 15:41:16 Allocated devices list [0] for other resource aws.amazon.com/neurondevice
2022/10/12 15:41:16 Allocated devices list [0] for other resource aws.amazon.com/neurondevice
2022/10/12 15:41:16 Allocated devices list [0] for other resource aws.amazon.com/neurondevice
2022/10/12 15:41:16 Allocated devices list [0] for other resource aws.amazon.com/neurondevice
2022/10/12 15:41:16 Allocated devices list [1] for other resource aws.amazon.com/neurondevice
2022/10/12 15:41:16 Allocated devices list [1] for other resource aws.amazon.com/neurondevice
2022/10/12 15:41:16 Allocated devices list [1] for other resource aws.amazon.com/neurondevice
2022/10/12 15:41:16 Allocated devices list [1] for other resource aws.amazon.com/neurondevice
2022/10/12 15:41:16 Allocated devices list [2] for other resource aws.amazon.com/neurondevice
2022/10/12 15:41:16 Return aws.amazon.com/neuroncore
2022/10/12 15:41:16 Succesfully updated the DevUsageMap [true true true true true true true true true false false false false false false false]  and otherDevUsageMap [true true true false] after alloc for node ip-192-168-2-25.ec2.internal
2022/10/12 15:41:16 Finished executing Bind Request...

Default Scheduler Approach

Make sure Neuron device plugin is running
Download the scheduler config map: k8s-neuron-scheduler-configmap.yml
Download the scheduler extension: k8s-neuron-scheduler.yml
Enable the kube-scheduler option to use a ConfigMap for scheduler policy. In your cluster.yml, update the spec section with the following:

spec:
  kubeScheduler:
    usePolicyConfigMap: true

Launch the cluster

kops create -f cluster.yml
kops create secret --name neuron-test-1.k8s.local sshpublickey admin -i ~/.ssh/id_rsa.pub
kops update cluster --name neuron-test-1.k8s.local --yes

Apply k8s-neuron-scheduler-configmap.yml (registers neuron-scheduler-extension with kube-scheduler)

kubectl apply -f k8s-neuron-scheduler-configmap.yml

Launch the neuron-scheduler-extension

kubectl apply -f k8s-neuron-scheduler.yml

Tutorials

Training

Run training in PyTorch Neuron container
Deploy a simple MLP training script as a Kubernetes job

Inference

Run inference in PyTorch Neuron container
Deploy a TensorFlow Resnet50 model as a Kubernetes service

Developer Flows

Inference

Deploy Neuron Container on EC2
Deploy Neuron Container on Elastic Container Service (ECS)
Deploy Neuron Container on Elastic Kubernetes Service (EKS)
Bring Your Own Neuron Container to SageMaker Hosting

FAQ, Troubleshooting and Release Notes

FAQ
Troubleshooting Neuron Containers
Neuron Containers Release Notes
Neuron K8 Release Notes
