This document is relevant for: Trn1
, Trn1n
Deploy a simple mlp training script as a Kubernetes job#
This tutorial uses mlp train as a teaching example on how to deploy an training application using Kubernetes on the Trn1 instances. For more advanced example, please refer to Tutorial: Launch a Multi-Node PyTorch Neuron Training Job on Trainium Using TorchX and EKS
Prerequisite:#
EKS Setup For Neuron: to setup k8s support on your cluster.
Trn1 instances as worker nodes with attached roles allowing:
ECR read access policy to retrieve container images from ECR: arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
Have a container image that is build using Run Training in PyTorch Neuron Container
Deploy a mlp training image#
Create a file named mlp_train.yaml with the contents below.
Note
In the image: add the appropriate location of the image
apiVersion: v1
kind: Pod
metadata:
name: trn1-mlp
spec:
restartPolicy: Never
schedulerName: default-scheduler
hostNetwork: true
nodeSelector:
beta.kubernetes.io/instance-type: trn1.32xlarge
beta.kubernetes.io/instance-type: trn1.2xlarge
containers:
- name: trn1-mlp
command: ["/usr/local/bin/python3"]
args: ["/opt/ml/mlp_train.py"]
image: 647554078242.dkr.ecr.us-east-1.amazonaws.com/sunda-pt:k8s_mlp_0907
imagePullPolicy: IfNotPresent
env:
- name: NEURON_RT_LOG_LEVEL
value: "INFO"
resources:
limits:
aws.amazon.com/neuron: 2
requests:
aws.amazon.com/neuron: 2
Deploy the pod.
kubectl apply -f mlp_train.yaml
3. Check the logs to make sure training completed
kubectl logs <pod name>
Your log should have the following
Final loss is 0.1977
----------End Training ---------------
This document is relevant for: Trn1
, Trn1n