EKS prerequisites
Before setting up Neuron components on your EKS cluster, you must create an EKS cluster and add Neuron-enabled nodes. This section guides you through creating an Amazon Elastic Kubernetes Service (EKS) cluster with AWS Trainium-enabled nodes (Trn1 or Trn2 instances) using CloudFormation templates and the eksctl command-line tool. You’ll configure optimized networking with Elastic Fabric Adapter (EFA) support and pre-configured Neuron components for distributed training and inference workloads.
For detailed information, refer to:
Step 1: Download Node Group Template
Download the node group CloudFormation template for your instance type.
# Trn1
wget https://raw.githubusercontent.com/aws-neuron/aws-neuron-eks-samples/master/dp_bert_hf_pretrain/cfn/eks_trn1_ng_stack.yaml
# Trn2
wget https://raw.githubusercontent.com/aws-neuron/aws-neuron-eks-samples/master/dp_bert_hf_pretrain/cfn/eks_trn2_ng_stack_al2023.yaml
Important template configuration information
Placement Group: Optimizes network speed between nodes
EFA Driver: Installed automatically (ensure the libfabric version matches between the AMI and workload containers)
AMI: Uses the EKS optimized accelerated AMI with Neuron components pre-installed
Instance Type: Configured for trn1.32xlarge or trn2.48xlarge (update to your desired instance type)
Kubernetes Version: Trn1 templates use Kubernetes 1.25+, Trn2 templates use Kubernetes 1.34+ (update as needed)
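After editing the instance type or Kubernetes version in a downloaded template, a quick syntax check can catch mistakes before deployment. The sketch below wraps the AWS CLI validate-template call in a small function. The name validate_ng_template is a hypothetical helper, not part of the sample repository, and the check covers template syntax only, not whether the values suit your account.

```shell
#!/bin/bash
# Hypothetical helper: syntax-check a downloaded node group template.
validate_ng_template() {
  local template_file=$1
  if [ ! -f "$template_file" ]; then
    echo "template not found: $template_file" >&2
    return 1
  fi
  # Exits non-zero if CloudFormation rejects the template
  aws cloudformation validate-template \
    --template-body "file://$template_file" > /dev/null
}

# Example: validate_ng_template eks_trn1_ng_stack.yaml
```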
Trn2 LNC configuration (Optional):
Trn2 instances use a default Logical NeuronCore Configuration (LNC) of 2. To change it to 1, update the UserData section of the launch template:
--==BOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
set -ex
config_dir=/opt/aws/neuron
config_file=${config_dir}/logical_nc_config
[ -d "$config_dir" ] || mkdir -p "$config_dir"
[ -f "$config_file" ] || touch "$config_file"
# Append the setting only if it is not already present
if ! grep -q "^NEURON_LOGICAL_NC_CONFIG=1$" "$config_file" 2>/dev/null; then
  printf "NEURON_LOGICAL_NC_CONFIG=1\n" >> "$config_file"
fi
--==BOUNDARY==--
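The idempotent append used in the UserData snippet can be tried out against a scratch directory before it goes into a launch template. The following is a minimal sketch: set_lnc is a hypothetical function name, and it writes to a temporary directory rather than /opt/aws/neuron.

```shell
#!/bin/bash
# Hypothetical helper mirroring the UserData logic, parameterized for testing.
set_lnc() {
  local config_dir=$1 value=$2
  local config_file="${config_dir}/logical_nc_config"
  mkdir -p "$config_dir"
  touch "$config_file"
  # Append only when the exact setting is not already in the file
  if ! grep -q "^NEURON_LOGICAL_NC_CONFIG=${value}\$" "$config_file"; then
    printf 'NEURON_LOGICAL_NC_CONFIG=%s\n' "$value" >> "$config_file"
  fi
}

scratch_dir=$(mktemp -d)
set_lnc "$scratch_dir" 1
set_lnc "$scratch_dir" 1   # second call is a no-op
cat "$scratch_dir/logical_nc_config"
```

Running it prints a single NEURON_LOGICAL_NC_CONFIG=1 line, which shows that the second call did not append a duplicate.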
Step 2: Create Cluster Parameter Script
Create a bash script to capture the parameters needed for the node template:
#!/bin/bash
# Trn1 variant: writes cfn_params.json for eks_trn1_ng_stack.yaml
CLUSTER_NAME=$1
CLUSTER_SG=$(eksctl get cluster "$CLUSTER_NAME" -o json | jq -r ".[0].ResourcesVpcConfig.ClusterSecurityGroupId")
VPC_ID=$(eksctl get cluster "$CLUSTER_NAME" -o json | jq -r ".[0].ResourcesVpcConfig.VpcId")
cat <<EOF > cfn_params.json
[
  {
    "ParameterKey": "ClusterName",
    "ParameterValue": "$CLUSTER_NAME"
  },
  {
    "ParameterKey": "ClusterControlPlaneSecurityGroup",
    "ParameterValue": "$CLUSTER_SG"
  },
  {
    "ParameterKey": "VpcId",
    "ParameterValue": "$VPC_ID"
  }
]
EOF
#!/bin/bash
# Trn2 variant: writes cfn_params.json for eks_trn2_ng_stack_al2023.yaml
CLUSTER_NAME=$1
CLUSTER_SG=$(eksctl get cluster "$CLUSTER_NAME" -o json | jq -r ".[0].ResourcesVpcConfig.ClusterSecurityGroupId")
VPC_ID=$(eksctl get cluster "$CLUSTER_NAME" -o json | jq -r ".[0].ResourcesVpcConfig.VpcId")
CLUSTER_ENDPOINT=$(eksctl get cluster "$CLUSTER_NAME" -o json | jq -r ".[0].Endpoint")
CLUSTER_SERVICE_CIDR=$(eksctl get cluster "$CLUSTER_NAME" -o json | jq -r ".[0].KubernetesNetworkConfig.ServiceIpv4Cidr")
CLUSTER_CA=$(eksctl get cluster "$CLUSTER_NAME" -o json | jq -r ".[0].CertificateAuthority.Data")
cat <<EOF > cfn_params.json
[
  {
    "ParameterKey": "ClusterName",
    "ParameterValue": "$CLUSTER_NAME"
  },
  {
    "ParameterKey": "ClusterControlPlaneSecurityGroup",
    "ParameterValue": "$CLUSTER_SG"
  },
  {
    "ParameterKey": "VpcId",
    "ParameterValue": "$VPC_ID"
  },
  {
    "ParameterKey": "ClusterEndpoint",
    "ParameterValue": "$CLUSTER_ENDPOINT"
  },
  {
    "ParameterKey": "ClusterServiceCidr",
    "ParameterValue": "$CLUSTER_SERVICE_CIDR"
  },
  {
    "ParameterKey": "ClusterCertificateAuthority",
    "ParameterValue": "$CLUSTER_CA"
  }
]
EOF
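Both scripts capture the cluster name, the security group for control plane connectivity, and the VPC ID. The second, longer script additionally captures the cluster endpoint, the service IPv4 CIDR, and the certificate authority data.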
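Because jq -r prints the literal string null when a field is missing, a failed lookup can silently end up in cfn_params.json. The sketch below is one way to catch that before creating the stack. The name check_cfn_params is a hypothetical helper, and the check relies on the "ParameterValue": "..." layout that the parameter scripts write.

```shell
#!/bin/bash
# Hypothetical helper: reject a parameter file with empty or null values.
check_cfn_params() {
  local params_file=${1:-cfn_params.json}
  if [ ! -s "$params_file" ]; then
    echo "missing or empty file: $params_file" >&2
    return 1
  fi
  # Matches "ParameterValue": "" and "ParameterValue": "null"
  if grep -Eq '"ParameterValue":[[:space:]]*"(null)?"' "$params_file"; then
    echo "empty or null parameter value in $params_file" >&2
    return 1
  fi
}

# Example: check_cfn_params cfn_params.json && echo "parameters look complete"
```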
Step 3: Create CloudFormation Stack
Create the CloudFormation stack for the node group.
# Trn1
aws cloudformation create-stack \
  --stack-name eks-trn1-ng-stack \
  --template-body file://eks_trn1_ng_stack.yaml \
  --parameters file://cfn_params.json \
  --capabilities CAPABILITY_IAM
# Trn2
aws cloudformation create-stack \
  --stack-name eks-trn2-ng-stack \
  --template-body file://eks_trn2_ng_stack_al2023.yaml \
  --parameters file://cfn_params.json \
  --capabilities CAPABILITY_IAM
Wait for the stack creation to complete before proceeding. You can monitor the progress in the AWS CloudFormation console.
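As an alternative to watching the console, the AWS CLI can block until the stack finishes. This is a minimal sketch: wait_for_stack is a hypothetical wrapper name, and the stack name in the example is the one used in the Trn1 create-stack command.

```shell
#!/bin/bash
# Hypothetical wrapper: block until the stack is created, then print its status.
wait_for_stack() {
  local stack_name=$1
  # Returns non-zero if creation fails or the waiter times out
  aws cloudformation wait stack-create-complete --stack-name "$stack_name" || return 1
  aws cloudformation describe-stacks --stack-name "$stack_name" \
    --query "Stacks[0].StackStatus" --output text
}

# Example: wait_for_stack eks-trn1-ng-stack
```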
Step 4: Determine Availability Zones
Identify the availability zones for your cluster. The following command resolves a zone ID, passed as $1, to its zone name in $REGION_CODE:
aws ec2 describe-availability-zones \
  --region $REGION_CODE \
  --filters "Name=zone-id,Values=$1" \
  --query "AvailabilityZones[].ZoneName" \
  --output text
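Not every availability zone offers Trainium instances, so it can help to check where the instance type is offered before choosing zones. The sketch below uses aws ec2 describe-instance-type-offerings. The function name zones_offering is hypothetical, and an offering in a zone does not guarantee capacity at launch time.

```shell
#!/bin/bash
# Hypothetical helper: list zone names in a region that offer an instance type.
zones_offering() {
  local region=$1 instance_type=$2
  aws ec2 describe-instance-type-offerings \
    --region "$region" \
    --location-type availability-zone \
    --filters "Name=instance-type,Values=$instance_type" \
    --query "InstanceTypeOfferings[].Location" \
    --output text
}

# Example: zones_offering us-west-2 trn1.32xlarge
```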
Step 5: Generate Node Group Configuration
Create a script named create_ng_yaml.sh to generate the node group YAML configuration. The script takes five positional arguments: region, two availability zones, cluster name, and CloudFormation stack name.
#!/bin/bash
# Trn1 variant: writes trn1_nodegroup.yaml
REGION_CODE=$1
EKSAZ1=$2
EKSAZ2=$3
CLUSTER_NAME=$4
STACKNAME=$5
LT_ID_TRN1=$(aws cloudformation describe-stacks --stack-name "$STACKNAME" \
  --query "Stacks[0].Outputs[?OutputKey=='LaunchTemplateIdTrn1'].OutputValue" \
  --output text)
cat <<EOF > trn1_nodegroup.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: $CLUSTER_NAME
  region: $REGION_CODE
  version: "1.28"
iam:
  withOIDC: true
availabilityZones: ["$EKSAZ1","$EKSAZ2"]
managedNodeGroups:
  - name: trn1-32xl-ng1
    launchTemplate:
      id: $LT_ID_TRN1
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
    availabilityZones: ["$EKSAZ1"]
    privateNetworking: true
    efaEnabled: true
EOF
#!/bin/bash
# Trn2 variant: writes trn2_nodegroup.yaml
REGION_CODE=$1
EKSAZ1=$2
EKSAZ2=$3
CLUSTER_NAME=$4
STACKNAME=$5
LT_ID_TRN2=$(aws cloudformation describe-stacks --stack-name "$STACKNAME" \
  --query "Stacks[0].Outputs[?OutputKey=='LaunchTemplateIdTrn2'].OutputValue" \
  --output text)
cat <<EOF > trn2_nodegroup.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: $CLUSTER_NAME
  region: $REGION_CODE
  version: "1.34"
iam:
  withOIDC: true
availabilityZones: ["$EKSAZ1","$EKSAZ2"]
managedNodeGroups:
  - name: trn2-48xl-ng1
    launchTemplate:
      id: $LT_ID_TRN2
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
    availabilityZones: ["$EKSAZ1"]
    privateNetworking: true
    efaEnabled: true
EOF
Run the script to generate the configuration file. Update the Kubernetes version as needed for your environment.
Example output:
# Trn1
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: nemo2
  region: us-west-2
  version: "1.28"
iam:
  withOIDC: true
availabilityZones: ["us-west-2d","us-west-2c"]
managedNodeGroups:
  - name: trn1-32xl-ng1
    launchTemplate:
      id: lt-093c222b35ea89009
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
    availabilityZones: ["us-west-2d"]
    privateNetworking: true
    efaEnabled: true
# Trn2
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: nemo2
  region: us-west-2
  version: "1.34"
iam:
  withOIDC: true
availabilityZones: ["us-west-2d","us-west-2c"]
managedNodeGroups:
  - name: trn2-48xl-ng1
    launchTemplate:
      id: lt-093c222b35ea89010
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
    availabilityZones: ["us-west-2d"]
    privateNetworking: true
    efaEnabled: true
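If the stack name is wrong or the stack has no matching output, the launch template lookup in create_ng_yaml.sh returns an empty or None value and the generated YAML is unusable. A guard like the following could be added to the script after the lookup. The name require_launch_template_id is hypothetical, and the check assumes launch template IDs begin with lt-, as in the example output.

```shell
#!/bin/bash
# Hypothetical guard: accept only values that look like a launch template ID.
require_launch_template_id() {
  local lt_id=$1
  case "$lt_id" in
    lt-?*) return 0 ;;
    *)
      echo "unexpected launch template ID: '${lt_id}'" >&2
      return 1
      ;;
  esac
}

# Inside create_ng_yaml.sh, after the describe-stacks lookup:
#   require_launch_template_id "$LT_ID_TRN1" || exit 1
#
# Example invocation of the script, using the values from the example output
# and the Trn1 stack name from Step 3:
#   ./create_ng_yaml.sh us-west-2 us-west-2d us-west-2c nemo2 eks-trn1-ng-stack
```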
Step 6: Create Node Group
Create the node group using the generated configuration.
# Trn1
eksctl create nodegroup -f trn1_nodegroup.yaml
# Trn2
eksctl create nodegroup -f trn2_nodegroup.yaml
Wait for the nodes to reach the Ready state. Verify using:
kubectl get nodes
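Rather than polling by hand, kubectl can block until the nodes report Ready. The following is a minimal sketch: wait_for_nodes is a hypothetical wrapper name, the 600s default timeout is an arbitrary assumption, and the extra column uses the standard node.kubernetes.io/instance-type node label.

```shell
#!/bin/bash
# Hypothetical wrapper: wait for all nodes to be Ready, then list them.
wait_for_nodes() {
  local timeout=${1:-600s}
  # Returns non-zero if any node is not Ready before the timeout
  kubectl wait --for=condition=Ready nodes --all --timeout="$timeout" || return 1
  # -L adds the instance type label as an extra column
  kubectl get nodes -L node.kubernetes.io/instance-type
}

# Example: wait_for_nodes 900s
```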
Step 7: Install EFA Device Plugin (Optional)
If you plan to run distributed training or inference jobs, install the EFA device plugin following the instructions at the EFA device plugin repository.