This document is relevant for: Trn2
How to schedule MPI jobs to run on Neuron UltraServer on EKS#
Overview#
Trn2 UltraServers represent a sophisticated computing infrastructure designed to connect multiple Trainium instances through NeuronLinkV3 (Read more here: Amazon EC2 Trn2 Architecture). For many advanced and complex models, customers can use UltraServers to greatly reduce training and inference times compared to previous distributed job setups.
This page explains the two setups needed to properly schedule and run MPI jobs on the Neuron UltraServer on EKS:
UltraServer init script for the launcher pod
Affinity configuration for the worker pods
How it works#
The UltraServer init script will:
Validate the node config and deployment of the MPI job worker pods
Write the environment variables required by the Neuron runtime to each MPI worker pod
Write a new hostfile to /root/ultraserver_init/new_hostfile
The validation process includes making sure the node config is a valid number (4, 2, or 1) and that the worker pods are deployed correctly to UltraServer nodes. More about how to set the node config can be found below.
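As a rough sketch of what that validation implies (illustrative shell logic only, not the actual script contents):
# Sketch of the node-config check, assuming the script reads
# NEURON_ULTRASERVER_NODE_CONFIG and defaults to 4 when unset.
NODE_CONFIG="${NEURON_ULTRASERVER_NODE_CONFIG:-4}"
case "$NODE_CONFIG" in
  4|2|1) echo "Using ${NODE_CONFIG}-node config" ;;
  *)     echo "Invalid node config: ${NODE_CONFIG} (expected 4, 2, or 1)" >&2
         exit 1 ;;
esac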
The environment variables that are being written are:
NEURON_GLOBAL_TOPOID: The topology ID of the worker pod
NEURON_GLOBAL_TOPOID0_HOST: The FQDN of the worker pod that’s the “leader” (topology ID of 0)
NEURON_RT_ULTRASERVER_MODE: The mode of the UltraServer node that’s passed to the Neuron runtime
NEURON_RT_ULTRASERVER_SERVER_ID: The server ID of the UltraServer node that’s passed to the Neuron runtime
NEURON_RT_ULTRASERVER_NODE_ID: The node ID of the UltraServer node that’s passed to the Neuron runtime
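For illustration, the variables written to one worker pod might look like the following; the exact value formats are determined by the init script, and the bracketed entries are placeholders, not literal values:
# Illustrative values only, for one worker in a 4-node config
NEURON_GLOBAL_TOPOID=2
NEURON_GLOBAL_TOPOID0_HOST=<FQDN of the worker pod with topology ID 0>
NEURON_RT_ULTRASERVER_MODE=<mode passed to the Neuron runtime>
NEURON_RT_ULTRASERVER_SERVER_ID=<server ID of the UltraServer>
NEURON_RT_ULTRASERVER_NODE_ID=2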
The affinity performs two functions:
Prevents worker pods from being scheduled together with worker pods from other jobs
Requires/Encourages worker pods from the same job to be scheduled together
These configurations are needed in order to properly schedule your MPI job worker pods.
The pod anti-affinity prevents scheduling your workload onto UltraServer topologies where worker pods from other jobs already exist. For example, if you have an UltraServer that already has a 2-node job running on it, the pod anti-affinity will prevent scheduling a 4-node job on that UltraServer since 2 of the 4 nodes are already occupied.
The pod affinity makes sure that worker pods of the same job are scheduled together in the same UltraServer topology. For example, if you have two UltraServers with no jobs running on either of them, the pod affinity ensures that the worker pods of a 4-node job are all scheduled on the same UltraServer and not split between the two.
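Both rules key off the UltraServer topology labels carried by the nodes (shown in full under Confirm your work below). A quick way to see which nodes share an UltraServer is to print those labels as columns:
# List nodes with their UltraServer server-id labels as columns
kubectl get nodes \
  -L neuron.amazonaws.com/ultraserver-server-id-2 \
  -L neuron.amazonaws.com/ultraserver-server-id-4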
Prerequisites#
An EKS cluster with Trn2 UltraServers (Using Neuron with Amazon EKS)
Neuron Device Plugin installed on the cluster with version >= 2.26.26.0 (tutorials/k8s-neuron-device-plugin)
MPI operator installed on the cluster
An MPI job spec
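A quick sanity check of the cluster-side prerequisites might look like the following; the kube-system namespace for the device plugin is an assumption and may differ in your cluster:
# Trn2 UltraServer nodes are present
kubectl get nodes -l node.kubernetes.io/instance-type=trn2u.48xlarge

# Neuron Device Plugin is running (assumed to be deployed in kube-system)
kubectl get daemonset -n kube-system | grep -i neuron

# MPI operator CRDs are installed
kubectl get crd mpijobs.kubeflow.org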
Instructions#
UltraServer Init Script#
Download the UltraServer init script k8s-ultraserver-init-script.sh.
To use the script, either:
add it to your MPI job Dockerfile and build the image, OR
create a new Dockerfile and build a new image from your MPI job image
Example:
FROM 123456789012.dkr.ecr.us-west-2.amazonaws.com/ultraserver:mpijob
COPY k8s-ultraserver-init-script.sh /tmp/
RUN chmod +x /tmp/k8s-ultraserver-init-script.sh
ENTRYPOINT ["/tmp/k8s-ultraserver-init-script.sh"]
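Build and push the image as usual; the registry and tag below are placeholders matching the example launcher spec further down:
docker build -t 123456789012.dkr.ecr.us-west-2.amazonaws.com/ultraserver:init-container .
docker push 123456789012.dkr.ecr.us-west-2.amazonaws.com/ultraserver:init-container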
Then add the two required init containers to the launcher pod.
The first init container should use the /etc/mpi/discover_hosts.sh script to ensure that all worker pods are ready before continuing to the UltraServer init script.
The second init container should use the image containing k8s-ultraserver-init-script.sh. You can specify a value for NEURON_ULTRASERVER_NODE_CONFIG, which determines which UltraServer node config your MPI job will use, i.e., how many UltraServer nodes to use. Possible values are 4, 2, and 1; the default is 4.
Example:
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: &job_name <MPI-JOB-NAME>
  namespace: default
spec:
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: mpitest
              image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/ultraserver:mpijob
              ...
          initContainers:
            - name: wait-hostfilename
              image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/ultraserver:mpijob
              command:
                - bash
                - -cx
                - |
                  if [[ $(cat /etc/mpi/discover_hosts.sh | wc -l) != 1 ]]; then
                    date
                    echo "Ready"
                    cat /etc/mpi/discover_hosts.sh
                  else
                    date
                    echo "not ready ..."
                    sleep 10
                    exit 1
                  fi
                  while read host; do
                    while ! ssh $host echo $host; do
                      date
                      echo "Pod $host is not up ..."
                      sleep 10
                    done
                    date
                    echo "Pod $host is ready"
                  done <<< "$(/etc/mpi/discover_hosts.sh)"
              resources: {}
              volumeMounts:
                - mountPath: /etc/mpi
                  name: mpi-job-config
                - mountPath: /root/.ssh
                  name: ssh-auth
            - name: ultraserver-init-container
              image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/ultraserver:init-container
              env:
                - name: NEURON_ULTRASERVER_NODE_CONFIG
                  value: <"4", "2", OR "1">
              volumeMounts:
                - mountPath: /etc/mpi
                  name: mpi-job-config
                - mountPath: /root/.ssh
                  name: ssh-auth
                - mountPath: /root/ultraserver_init
                  name: ultraserver-init
          ...
          volumes:
            - name: ultraserver-init
              emptyDir: {}
MPI Worker Pod Affinity#
Single-UltraServer job#
For a job that fits within a single UltraServer (a 2-node or 4-node config), required pod affinity makes sure all worker pods land on the same UltraServer.
2-node job
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: &job_name <MPI-JOB-NAME>
  namespace: default
  ...
spec:
  mpiReplicaSpecs:
    Launcher:
      ...
    Worker:
      replicas: 2
      template:
        spec:
          nodeSelector:
            node.kubernetes.io/instance-type: trn2u.48xlarge
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: training.kubeflow.org/job-name
                        operator: NotIn
                        values:
                          - *job_name
                    matchLabels:
                      training.kubeflow.org/job-role: worker
                  topologyKey: neuron.amazonaws.com/ultraserver-server-id-2
            podAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchLabels:
                      training.kubeflow.org/job-role: worker
                      training.kubeflow.org/job-name: *job_name
                  topologyKey: neuron.amazonaws.com/ultraserver-server-id-2
          ...
4-node job
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: &job_name <MPI-JOB-NAME>
  namespace: default
  ...
spec:
  mpiReplicaSpecs:
    Launcher:
      ...
    Worker:
      replicas: 4
      template:
        spec:
          nodeSelector:
            node.kubernetes.io/instance-type: trn2u.48xlarge
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: training.kubeflow.org/job-name
                        operator: NotIn
                        values:
                          - *job_name
                    matchLabels:
                      training.kubeflow.org/job-role: worker
                  topologyKey: neuron.amazonaws.com/ultraserver-server-id-4
            podAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchLabels:
                      training.kubeflow.org/job-role: worker
                      training.kubeflow.org/job-name: *job_name
                  topologyKey: neuron.amazonaws.com/ultraserver-server-id-4
          ...
Multi-UltraServer job#
For a job that spans multiple UltraServers (here, 16 nodes), required pod affinity cannot be satisfied, so preferred pod affinity is used instead to encourage worker pods from the same job to be scheduled together.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: &job_name <MPI-JOB-NAME>
  namespace: default
  ...
spec:
  mpiReplicaSpecs:
    Launcher:
      ...
    Worker:
      replicas: 16
      template:
        spec:
          nodeSelector:
            node.kubernetes.io/instance-type: trn2u.48xlarge
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: training.kubeflow.org/job-name
                        operator: NotIn
                        values:
                          - *job_name
                    matchLabels:
                      training.kubeflow.org/job-role: worker
                  topologyKey: neuron.amazonaws.com/ultraserver-server-id-4
            podAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 100
                  podAffinityTerm:
                    labelSelector:
                      matchLabels:
                        training.kubeflow.org/job-role: worker
                        training.kubeflow.org/job-name: *job_name
                    topologyKey: neuron.amazonaws.com/ultraserver-server-id-4
          ...
To use the affinity configuration, replace <MPI-JOB-NAME> with your MPI job name and add the affinity block to your workload YAML spec.
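Then submit the job as usual (the filename below is a placeholder):
kubectl apply -f mpijob.yaml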
Confirm your work#
To validate that the init container is working:
# Find the worker pods associated with your MPI job
kubectl get pods
# Get the logs of the init container
kubectl logs <LAUNCHER-POD-NAME> -c ultraserver-init-container
You should see logs from the init container.
Example:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
demo-launcher-42lh9 0/1 Init:0/2 0 4s
demo-worker-0 1/1 Running 0 4s
demo-worker-1 1/1 Running 0 4s
demo-worker-2 1/1 Running 0 4s
demo-worker-3 1/1 Running 0 4s
$ kubectl logs demo-launcher-42lh9 -c ultraserver-init-container
Using 4-node config
...
To validate that the affinity configuration is working:
# Find the worker pods and the nodes they are scheduled to
kubectl get pods -o=custom-columns='POD_NAME:metadata.name,NODE_NAME:spec.nodeName'
# Compare the nodes used by the worker pods with the UltraServer topology labels on the nodes
kubectl get nodes \
-l neuron.amazonaws.com/ultraserver-mode \
-o=custom-columns='NAME:metadata.name,MODE:metadata.labels.neuron\.amazonaws\.com/ultraserver-mode,ULTRASERVER_SERVER_ID_2:metadata.labels.neuron\.amazonaws\.com/ultraserver-server-id-2,ULTRASERVER_NODE_ID_2:metadata.labels.neuron\.amazonaws\.com/ultraserver-node-id-2,ULTRASERVER_SERVER_ID_4:metadata.labels.neuron\.amazonaws\.com/ultraserver-server-id-4,ULTRASERVER_NODE_ID_4:metadata.labels.neuron\.amazonaws\.com/ultraserver-node-id-4' | awk 'NR==1{print;next}{print | "sort -k3,3 -k4,4"}'
The nodes used by the worker pods should all share the same ULTRASERVER_SERVER_ID_2 or ULTRASERVER_SERVER_ID_4 value, depending on which node config you chose.
Example when choosing a 4-node config:
$ kubectl get pods -o=custom-columns='POD_NAME:metadata.name,NODE_NAME:spec.nodeName'
POD_NAME NODE_NAME
demo-launcher-42lh9 ip-172-32-5-227.ap-southeast-4.compute.internal
demo-worker-0 ip-172-32-5-227.ap-southeast-4.compute.internal
demo-worker-1 ip-172-32-11-17.ap-southeast-4.compute.internal
demo-worker-2 ip-172-32-13-57.ap-southeast-4.compute.internal
demo-worker-3 ip-172-32-9-4.ap-southeast-4.compute.internal
$ kubectl get nodes \
-l neuron.amazonaws.com/ultraserver-mode \
-o=custom-columns='NAME:metadata.name,MODE:metadata.labels.neuron\.amazonaws\.com/ultraserver-mode,ULTRASERVER_SERVER_ID_2:metadata.labels.neuron\.amazonaws\.com/ultraserver-server-id-2,ULTRASERVER_NODE_ID_2:metadata.labels.neuron\.amazonaws\.com/ultraserver-node-id-2,ULTRASERVER_SERVER_ID_4:metadata.labels.neuron\.amazonaws\.com/ultraserver-server-id-4,ULTRASERVER_NODE_ID_4:metadata.labels.neuron\.amazonaws\.com/ultraserver-node-id-4' | awk 'NR==1{print;next}{print | "sort -k3,3 -k4,4"}'
NAME MODE ULTRASERVER_SERVER_ID_2 ULTRASERVER_NODE_ID_2 ULTRASERVER_SERVER_ID_4 ULTRASERVER_NODE_ID_4
ip-172-32-11-17.ap-southeast-4.compute.internal 1_2_4 u5wy80u0o2saugxy 0 bog79p1y8tetj5uu 0
ip-172-32-13-57.ap-southeast-4.compute.internal 1_2_4 u5wy80u0o2saugxy 1 bog79p1y8tetj5uu 1
ip-172-32-5-227.ap-southeast-4.compute.internal 1_2_4 ygml2651y0lwdd46 0 bog79p1y8tetj5uu 2
ip-172-32-9-4.ap-southeast-4.compute.internal 1_2_4 ygml2651y0lwdd46 1 bog79p1y8tetj5uu 3
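In this example, all four worker nodes share the same ULTRASERVER_SERVER_ID_4 value (bog79p1y8tetj5uu), confirming that the 4-node job's worker pods were scheduled onto a single UltraServer.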
Common issues#
Init script fails to start#
If at least one of the worker pods isn’t scheduled to a node, the init script will fail to start.
Example:
$ kubectl get pods -o=custom-columns='POD_NAME:metadata.name,NODE_NAME:spec.nodeName'
POD_NAME NODE_NAME
demo-launcher-96xsl ip-172-32-9-4.ap-southeast-4.compute.internal
demo-worker-0 <none>
demo-worker-1 <none>
demo-worker-2 <none>
demo-worker-3 <none>
$ kubectl logs demo-launcher-96xsl -c ultraserver-init-container
Error from server (BadRequest): container "ultraserver-init-container" in pod "demo-launcher-96xsl" is waiting to start: PodInitializing
Possible solution: Check your pods for affinity/scheduling issues.
$ kubectl describe pod demo-worker-0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m13s default-scheduler 0/4 nodes are available: 4 node(s) didn't match pod affinity rules. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.