This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3

Before setting up Neuron components on your EKS cluster, you must create an EKS cluster and add Neuron-enabled nodes. This section guides you through creating an Amazon Elastic Kubernetes Service (EKS) cluster with AWS Trainium-enabled nodes (Trn1 or Trn2 instances) using CloudFormation templates and the eksctl command-line tool. You’ll configure optimized networking with Elastic Fabric Adapter (EFA) support and pre-configured Neuron components for distributed training and inference workloads.

For detailed information, refer to:

Step 1: Download Node Group Template

Download the node group CloudFormation template for your instance type.

wget https://raw.githubusercontent.com/aws-neuron/aws-neuron-eks-samples/master/dp_bert_hf_pretrain/cfn/eks_trn1_ng_stack.yaml
wget https://raw.githubusercontent.com/aws-neuron/aws-neuron-eks-samples/master/dp_bert_hf_pretrain/cfn/eks_trn2_ng_stack_al2023.yaml

Important template configuration information

  • Placement Group: Optimizes network speed between nodes

  • EFA Driver: Installed automatically (ensure libfabric version matches between AMI and workload containers)

  • AMI: Uses EKS optimized accelerated AMI with Neuron components pre-installed

  • Instance Type: Configured for trn1.32xlarge or trn2.48xlarge (update to your desired instance type)

  • Kubernetes Version: Trn1 templates use Kubernetes 1.25+, Trn2 templates use Kubernetes 1.34+ (update as needed)

Trn2 LNC configuration (Optional):

Trn2 instances use a default Logical NeuronCore Configuration (LNC) of 2. To change it to 1, update the UserData section of the launch template:

--==BOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
set -ex
config_dir=/opt/aws/neuron
config_file=${config_dir}/logical_nc_config
[ -d "$config_dir" ] || mkdir -p "$config_dir"
[ -f "$config_file" ] || touch "$config_file"
if ! grep -q "^NEURON_LOGICAL_NC_CONFIG=1$" "$config_file" 2>/dev/null; then
    printf "NEURON_LOGICAL_NC_CONFIG=1" >> "$config_file"
fi
--==BOUNDARY==--

Step 2: Create Cluster Parameter Script

Create a bash script to capture the parameters needed for the node template:

#!/bin/bash

CLUSTER_NAME=$1
CLUSTER_SG=$(eksctl get cluster $CLUSTER_NAME -o json | jq -r ".[0].ResourcesVpcConfig.ClusterSecurityGroupId")
VPC_ID=$(eksctl get cluster $CLUSTER_NAME -o json | jq -r ".[0].ResourcesVpcConfig.VpcId")

cat <<EOF > cfn_params.json
[
    {
        "ParameterKey": "ClusterName",
        "ParameterValue": "$CLUSTER_NAME"
    },
    {
        "ParameterKey": "ClusterControlPlaneSecurityGroup",
        "ParameterValue": "$CLUSTER_SG"
    },
    {
        "ParameterKey": "VpcId",
        "ParameterValue": "$VPC_ID"
    }
]
EOF
#!/bin/bash

CLUSTER_NAME=$1
CLUSTER_SG=$(eksctl get cluster $CLUSTER_NAME -o json | jq -r ".[0].ResourcesVpcConfig.ClusterSecurityGroupId")
VPC_ID=$(eksctl get cluster $CLUSTER_NAME -o json | jq -r ".[0].ResourcesVpcConfig.VpcId")
CLUSTER_ENDPOINT=$(eksctl get cluster $CLUSTER_NAME -o json | jq -r ".[0].Endpoint")
CLUSTER_SERVICE_CIDR=$(eksctl get cluster $CLUSTER_NAME -o json | jq -r ".[0].KubernetesNetworkConfig.ServiceIpv4Cidr")
CLUSTER_CA=$(eksctl get cluster $CLUSTER_NAME -o json | jq -r ".[0].CertificateAuthority.Data")

cat <<EOF > cfn_params.json
[
    {
        "ParameterKey": "ClusterName",
        "ParameterValue": "$CLUSTER_NAME"
    },
    {
        "ParameterKey": "ClusterControlPlaneSecurityGroup",
        "ParameterValue": "$CLUSTER_SG"
    },
    {
        "ParameterKey": "VpcId",
        "ParameterValue": "$VPC_ID"
    },
    {
        "ParameterKey": "ClusterEndpoint",
        "ParameterValue": "$CLUSTER_ENDPOINT"
    },
    {
        "ParameterKey": "ClusterServiceCidr",
        "ParameterValue": "$CLUSTER_SERVICE_CIDR"
    },
    {
        "ParameterKey": "ClusterCertificateAuthority",
        "ParameterValue": "$CLUSTER_CA"
    }
]
EOF

This script captures the cluster name, security group for control plane connectivity, and VPC ID.

Step 3: Create CloudFormation Stack

Create the CloudFormation stack for the node group.

aws cloudformation create-stack \
    --stack-name eks-trn1-ng-stack \
    --template-body file://eks_trn1_ng_stack.yaml \
    --parameters file://cfn_params.json \
    --capabilities CAPABILITY_IAM
aws cloudformation create-stack \
    --stack-name eks-trn2-ng-stack \
    --template-body file://eks_trn2_ng_stack_al2023.yaml \
    --parameters file://cfn_params.json \
    --capabilities CAPABILITY_IAM

Wait for the stack creation to complete before proceeding. You can monitor the progress in the AWS CloudFormation console.

Step 4: Determine Availability Zones

Identify the availability zones for your cluster:

aws ec2 describe-availability-zones \
    --region $REGION_CODE \
    --query "AvailabilityZones[]" \
    --filters "Name=zone-id,Values=$1" \
    --query "AvailabilityZones[].ZoneName" \
    --output text

Step 5: Generate Node Group Configuration

Create a script named create_ng_yaml.sh to generate the node group YAML configuration. The script requires: region, availability zones, cluster name, and CloudFormation stack name.

#!/bin/bash

REGION_CODE=$1
EKSAZ1=$2
EKSAZ2=$3
CLUSTER_NAME=$4
STACKNAME=$5

LT_ID_TRN1=$(aws cloudformation describe-stacks --stack-name $STACKNAME \
        --query "Stacks[0].Outputs[?OutputKey=='LaunchTemplateIdTrn1'].OutputValue" \
        --output text)

cat <<EOF > trn1_nodegroup.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: $CLUSTER_NAME
  region: $REGION_CODE
  version: "1.28"

iam:
  withOIDC: true

availabilityZones: ["$EKSAZ1","$EKSAZ2"]

managedNodeGroups:
  - name: trn1-32xl-ng1
    launchTemplate:
      id: $LT_ID_TRN1
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
    availabilityZones: ["$EKSAZ1"]
    privateNetworking: true
    efaEnabled: true
EOF
#!/bin/bash

REGION_CODE=$1
EKSAZ1=$2
EKSAZ2=$3
CLUSTER_NAME=$4
STACKNAME=$5

LT_ID_TRN2=$(aws cloudformation describe-stacks --stack-name $STACKNAME \
        --query "Stacks[0].Outputs[?OutputKey=='LaunchTemplateIdTrn2'].OutputValue" \
        --output text)

cat <<EOF > trn2_nodegroup.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: $CLUSTER_NAME
  region: $REGION_CODE
  version: "1.34"

iam:
  withOIDC: true

availabilityZones: ["$EKSAZ1","$EKSAZ2"]

managedNodeGroups:
  - name: trn2-48xl-ng1
    launchTemplate:
      id: $LT_ID_TRN2
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
    availabilityZones: ["$EKSAZ1"]
    privateNetworking: true
    efaEnabled: true
EOF

Run the script to generate the configuration file. Update the Kubernetes version as needed for your environment.

Example output:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: nemo2
  region: us-west-2
  version: "1.28"

iam:
  withOIDC: true

availabilityZones: ["us-west-2d","us-west-2c"]

managedNodeGroups:
  - name: trn1-32xl-ng1
    launchTemplate:
      id: lt-093c222b35ea89009
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
    availabilityZones: ["us-west-2d"]
    privateNetworking: true
    efaEnabled: true
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: nemo2
  region: us-west-2
  version: "1.34"

iam:
  withOIDC: true

availabilityZones: ["us-west-2d","us-west-2c"]

managedNodeGroups:
  - name: trn2-48xl-ng1
    launchTemplate:
      id: lt-093c222b35ea89010
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
    availabilityZones: ["us-west-2d"]
    privateNetworking: true
    efaEnabled: true

Step 6: Create Node Group

Create the node group using the generated configuration.

eksctl create nodegroup -f trn1_nodegroup.yaml
eksctl create nodegroup -f trn2_nodegroup.yaml

Wait for the nodes to reach the Ready state. Verify using:

kubectl get nodes

Step 7: Install EFA Device Plugin (Optional)

If you plan to run distributed training or inference jobs, install the EFA device plugin following the instructions at the EFA device plugin repository.

This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3