This document is relevant for: Inf1
, Inf2
, Trn1
, Trn2
Please refer to EKS instructions to create a cluster. Once the cluster is ACTIVE, please add nodes to the cluster. We recommend using node template for neuron nodes. Following example demonstrates how to add neuron nodes using node template. The example adds managed nodes using eksctl tool. For more details, please refer to EKS User Guide.
As first step, please create a script to capture the parameters for the node template:
#!/bin/bash
CLUSTER_NAME=$1
CLUSTER_SG=$(eksctl get cluster $CLUSTER_NAME -o json|jq -r ".[0].ResourcesVpcConfig.ClusterSecurityGroupId")
VPC_ID=$(eksctl get cluster $CLUSTER_NAME -o json|jq -r ".[0].ResourcesVpcConfig.VpcId")
cat <<EOF > cfn_params.json
[
{
"ParameterKey": "ClusterName",
"ParameterValue": "$CLUSTER_NAME"
},
{
"ParameterKey": "ClusterControlPlaneSecurityGroup",
"ParameterValue": "$CLUSTER_SG"
},
{
"ParameterKey": "VpcId",
"ParameterValue": "$VPC_ID"
}
]
EOF
These parameters include the name of the cluster, the security group the nodes can use to connect to the control plane and the vpcid. Next, get the node group template from tutorial below -
wget https://raw.githubusercontent.com/aws-neuron/aws-neuron-eks-samples/master/dp_bert_hf_pretrain/cfn/eks_trn1_ng_stack.yaml
This template file has a few important config settings -
It places the node in a placement group. This optimizes the network speed between the nodes.
The template installs the EFA driver. Please note that the libfabric version should match between the AMI and the workload containers.
It uses the EKS optimized accelerated AMI which has the necessary neuron components installed. The template uses AMI for Kubernetes version 1.25. Please update to appropriate version.
The template adds trn1.32xlarge nodes to the cluster. Please update to the desired instance type.
Trn2 instance types use a default LNC (Logical NeuronCore Configuration) setting of 2, if you want to change it to 1, update the UserData section of the launch template to a new LNC setting as shown below, and deploy the new/updated version of launch template.
--==BOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/bash
set -ex
config_dir=/opt/aws/neuron
config_file=${config_dir}/logical_nc_config
[ -d "$config_dir" ] || mkdir -p "$config_dir"
[ -f "$config_file" ] || touch "$config_file"
if ! grep -q "^NEURON_LOGICAL_NC_CONFIG=1$" "$config_file" 2>/dev/null; then
printf "NEURON_LOGICAL_NC_CONFIG=1" >> "$config_file"
fi
--==BOUNDARY==--
Finally, run the following command to create cloud formation stack:
aws cloudformation create-stack \
--stack-name eks-trn1-ng-stack \
--template-body file://eks_trn1_ng_stack.yaml \
--parameters file://cfn_params.json \
--capabilities CAPABILITY_IAM
The above command will create a stack named eks-trn1-ng-stack, which will be visible in cloudformation. Please wait for that stack creation to complete before proceeding to next step.
Now we are ready to add the nodes. The example will demonstrate creating node groups using eksctl tool.
Please run following command to determine the AZs:
aws ec2 describe-availability-zones \
--region $REGION_CODE \
--query "AvailabilityZones[]" \
--filters "Name=zone-id,Values=$1" \
--query "AvailabilityZones[].ZoneName" \
--output text
Next, create a script named create_ng_yaml.sh to generate node group yaml. The arguments to the script include the region, AZs, cluster name and name of the cloudformation stack created earlier (eks-trn1-ng-stack in case of this example):
#!/bin/bash
REGION_CODE=$1
EKSAZ1=$2
EKSAZ2=$3
CLUSTER_NAME=$4
STACKNAME=$5
LT_ID_TRN1=$(aws cloudformation describe-stacks --stack-name $STACKNAME \
--query "Stacks[0].Outputs[?OutputKey=='LaunchTemplateIdTrn1'].OutputValue" \
--output text)
cat <<EOF > trn1_nodegroup.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: $CLUSTER_NAME
region: $REGION_CODE
version: "1.28"
iam:
withOIDC: true
availabilityZones: ["$EKSAZ1","$EKSAZ2"]
managedNodeGroups:
- name: trn1-32xl-ng1
launchTemplate:
id: $LT_ID_TRN1
minSize: 1
desiredCapacity: 1
maxSize: 1
availabilityZones: ["$EKSAZ1"]
privateNetworking: true
efaEnabled: true
EOF
Run the above script. It should produce a yaml similar to -
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: nemo2
region: us-west-2
version: "1.25"
iam:
withOIDC: true
availabilityZones: ["us-west-2d","us-west-2c"]
managedNodeGroups:
- name: trn1-32xl-ng1
launchTemplate:
id: lt-093c222b35ea89009
minSize: 1
desiredCapacity: 1
maxSize: 1
availabilityZones: ["us-west-2d"]
privateNetworking: true
efaEnabled: true
The example shows kubernetes version 1.25. Please update the version as needed. This yaml can now be used with eksctl.
eksctl create nodegroup -f trn1_nodegroup.yaml
This will add the nodes to the cluster. Please wait for the nodes to be ‘Ready’. This can be verified using the get node command.
kubectl get node
If you are running a distributed training or inference job, you will need EFA resources. Please install the EFA device plugin using instructions at EFA device plugin repository.
Next, we will install the Neuron Device Plugin.
This document is relevant for: Inf1
, Inf2
, Trn1
, Trn2