This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3

AWS Neuron Dynamic Resource Allocation (DRA) Beta#

What is DRA?#

Prior to Kubernetes 1.33, Kubernetes used device plugins for resource management. The Neuron device plugin implements the device plugin interface so that the Kubernetes scheduler can manage Neuron resources. However, the device plugin framework only tracks device counts; the scheduler cannot see device attributes. Because of this limitation, the framework cannot natively support attribute-based filtering during device selection. For example, without additional mechanisms such as a scheduler extension, the default Kubernetes scheduler cannot allocate connected devices.

Dynamic Resource Allocation (DRA) is a new framework for advanced resource management that addresses this limitation. DRA lets the scheduler see device attributes, allowing workloads to select devices based on specific attributes and achieve topology-aware allocation. Hardware vendors determine which attributes are published for their hardware. The AWS Neuron DRA driver implements the DRA kubelet plugin for AWS Trainium instances.

For more information on DRA, please refer to Kubernetes Dynamic Resource Allocation.
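
For context, workloads using the Neuron DRA driver reference a device class by name (the examples below use deviceClassName: neuron.aws.com). A minimal sketch of such a DeviceClass is shown here; the actual DeviceClass installed by the Neuron DRA driver may differ and is provided by the installation manifests in the source package.

# Minimal sketch of a DeviceClass, assuming the driver name neuron.aws.com
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: neuron.aws.com
spec:
  selectors:
  - cel:
      # Match only devices published by the Neuron DRA driver
      expression: device.driver == "neuron.aws.com"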

What are the benefits of using DRA over device plugin?#

Reduced developer complexity

Device plugin-based workloads use node labels along with resource requests and limits to allocate the right resources. Example:

Worker:
  replicas: 4
  template:
    spec:
      containers:
      - image: <aws-account-id>.dkr.ecr.us-west-2.amazonaws.com/neuronx_nemo:latest
        name: mpitest
        imagePullPolicy: Always
        resources:
          limits:
            aws.amazon.com/neuron: "16"
            vpc.amazonaws.com/efa: "16"
          requests:
            aws.amazon.com/neuron: "16"
            vpc.amazonaws.com/efa: "16"
        volumeMounts:
        - name: dshm
          mountPath: /dev/shm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory

DRA introduces ResourceClaims and ResourceClaimTemplates, which provide an abstraction:

Worker:
  replicas: 4
  template:
    spec:
      containers:
      - image: <aws-account-id>.dkr.ecr.us-west-2.amazonaws.com/neuronx_nemo:latest
        name: mpitest
        imagePullPolicy: Always
        resources:
          claims:
          - name: neurons
        volumeMounts:
        - name: dshm
          mountPath: /dev/shm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      resourceClaims:
      - name: neurons
        resourceClaimTemplateName: efa-neurons-4-devices

The ResourceClaimTemplate (RCT) name can be chosen by ML infra operators to be friendly to their developers. The RCT definition translates that name into the underlying allocation details, which are abstracted away from ML developers.
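
For illustration, a hypothetical definition behind the friendly name efa-neurons-4-devices might look like the sketch below; it mirrors the connected-device example later in this document. Only the Neuron device request is shown, since EFA allocation is outside the scope of the Neuron DRA driver.

# Hypothetical RCT an ML infra operator could define behind the friendly name
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: efa-neurons-4-devices
spec:
  spec:
    devices:
      requests:
      - name: neurons
        deviceClassName: neuron.aws.com
        allocationMode: ExactCount
        count: 4
      constraints:
      # Require the 4 devices to belong to the same connected group
      - requests: ["neurons"]
        matchAttribute: "resource.aws.com/devicegroup4_id"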

Rich interface for resource requests

With DRA, resource requests can specify attribute-based selection. For example, an RCT can express requests such as the ones below, which was not possible with device plugins without additional node labeling and scheduler extensions. This interface enables topology-aware scheduling.

  • Allocate connected Neuron devices from a trn2 instance type, where every device in the set must be running a specified Neuron driver version.

  • Allocate a specific set of Neuron devices for my pod - I want the pod to use devices in row 1 of the topology.

Dynamic configuration

DRA allows end users to specify additional configuration for the device via the RCT. The Neuron DRA driver leverages this capability to allow ResourceClaimTemplates to specify the Logical NeuronCore (LNC) size to be used for the allocation. An example is shown below. The end user does not need to configure LNC via a launch template when using Neuron devices with the Neuron DRA driver.

#Template will be vended by Neuron via documentation/code repo
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  namespace: neuron-test7
  name: lnc-neurons
spec:
  spec:
    devices:
      requests:
      - name: neurons
        deviceClassName: neuron.aws.com
        selectors:
        - cel:
            expression: device.attributes['neuron.aws.com'].instanceType == "trn2.48xlarge"
        allocationMode: All
      config:
      - opaque:
          driver: neuron.aws.com
          parameters:
            apiVersion: resource.neuron.aws.com/v1alphav1
            kind: NeuronConfig
            logicalNeuronCore: 1
        requests: ["neurons"]

Beta Access#

Note

The Neuron DRA driver is in private beta. Contact the Neuron product team to get access.

Once allow-listing is complete, you will have access to the container image for the DRA driver and to the source code, which contains installation manifests and example workloads.

Prerequisites for the preview#

  • Kubernetes version - For the preview, use a 1.34 control plane and a 1.33 node AMI.

  • Instance type - trn2.48xlarge

Installation#

Install the DRA driver using the script provided with the source code. Connect to your cluster from your local machine; the cluster should have at least one trn2.48xlarge node. Do not install the Neuron device plugin on the cluster. Check out the source code package, then confirm which cluster you are connected to:

kubectl config current-context

Then install the DRA driver:

./examples/scripts/install-dra-driver.sh <image uri>
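
After the script completes, you can optionally verify the installation. The checks below are illustrative; the actual namespace and resource names come from the installation manifests in the source package.

kubectl get pods -A | grep -i dra     # DRA driver kubelet plugin pods should be Running
kubectl get deviceclasses             # should list the Neuron device class (neuron.aws.com)
kubectl get resourceslices            # devices published by the driver for each trn2 node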

Example 1 – Connected Neuron Devices#

This section demonstrates how to run a workload that requests a subset of connected Neuron devices from a trn2.48xlarge instance. Before DRA, this use case required the Neuron Scheduler Extension. With DRA, this allocation is supported natively.

The supported subset sizes are 1, 4, 8, and 16, identified by the attributes resource.aws.com/devicegroup1_id, resource.aws.com/devicegroup4_id, resource.aws.com/devicegroup8_id, and resource.aws.com/devicegroup16_id respectively.

The sets of 4 and 8 are selected as shown in the diagram below:

(Figure: Connected Neuron Devices)

To enable a workload to consume a connected subset of Neuron devices, first create a ResourceClaimTemplate that requests a connected set. From the package, run:

kubectl apply -f examples/specs/1x4-connected-devices.yaml

This workload definition (which includes the ResourceClaimTemplate) is shown below for quick reference:

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: 1x4-connected-neurons
spec:
  spec:
    devices:
      requests:
      - name: neurons
        deviceClassName: neuron.aws.com
        allocationMode: ExactCount
        count: 4
        selectors:
        - cel:
            expression: "device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge'"
      constraints:
      - requests: ["neurons"]
        matchAttribute: "resource.aws.com/devicegroup4_id"

The next step is to reference the ResourceClaimTemplate in a pod definition, as shown below:

---
apiVersion: v1
kind: Pod
metadata:
  name: pod0
  labels:
    app: pod
spec:
  containers:
  - name: ctr0
    image: public.ecr.aws/ubuntu/ubuntu:22.04
    command: ["bash", "-c"]
    args: ["export; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: neurons
  resourceClaims:
  - name: neurons
    resourceClaimTemplateName: 1x4-connected-neurons

Deploy the above workload using kubectl apply. When the pod is running, examine the related ResourceClaim using:

kubectl get resourceclaim -o yaml

The ResourceClaim output shows the 4 Neuron devices that were allocated to the pod; these are connected Neuron devices. An example is shown below.

[devbox]$ kubectl get pod

NAME   READY   STATUS    RESTARTS   AGE
pod0   1/1     Running   0          3s

[devbox]$ kubectl get resourceclaim

NAME                 STATE                AGE
pod0-neurons-zdk76   allocated,reserved   9s

[devbox]$ kubectl get resourceclaim pod0-neurons-zdk76 -o yaml

Status shown below:

status:
  allocation:
    devices:
      results:
      - adminAccess: null
        device: neurondevice2
        driver: neuron.aws.com
        pool: ip-1-1-1-1.region.compute.internal
        request: neurons
      - adminAccess: null
        device: neurondevice3
        driver: neuron.aws.com
        pool: ip-1-1-1-1.region.compute.internal
        request: neurons
      - adminAccess: null
        device: neurondevice1
        driver: neuron.aws.com
        pool: ip-1-1-1-1.region.compute.internal
        request: neurons
      - adminAccess: null
        device: neurondevice0
        driver: neuron.aws.com
        pool: ip-1-1-1-1.region.compute.internal
        request: neurons
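
Optionally, you can also check which devices are visible inside the container. This assumes the DRA driver exposes allocated devices as /dev/neuron* inside the container, mirroring the device plugin behavior; adjust the check if your driver version differs.

# List Neuron device nodes visible to the container
kubectl exec pod0 -- sh -c 'ls /dev | grep -i neuron'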

Note

The RCT name can be chosen to communicate the intent of the allocation and to abstract the allocation details away from ML developers.

Example RCT1 - “xl” - Allocate all 16 devices

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: xl-trn2
spec:
  spec:
    devices:
      requests:
      - allocationMode: ExactCount
        count: 16
        deviceClassName: neuron.aws.com
        name: neurons
        selectors:
        - cel:
            expression: device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge'

Example RCT2 - large - Allocate 8 devices

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: l-trn2
spec:
  spec:
    devices:
      constraints:
      - matchAttribute: resource.aws.com/devicegroup8_id
        requests:
        - neurons
      requests:
      - allocationMode: ExactCount
        count: 8
        deviceClassName: neuron.aws.com
        name: neurons
        selectors:
        - cel:
            expression: device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge'

Example RCT3 - “2.26-driver” - Allocate 8 devices running the Neuron driver version published with the Neuron 2.26.0 runtime

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: 2.26-driver-trn2
spec:
  spec:
    devices:
      constraints:
      - matchAttribute: resource.aws.com/devicegroup8_id
        requests:
        - neurons
      requests:
      - allocationMode: ExactCount
        count: 8
        deviceClassName: neuron.aws.com
        name: neurons
        selectors:
        - cel:
            expression: device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge' &&
                       device.attributes['neuron.aws.com'].neuronDriverVersion == '2.24.7.0'

The attributes published by the Neuron DRA driver are described in the source code at pkg/consts/neuron_attributes.go. An RCT can use CEL expressions to filter devices based on these attributes.
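
To see the attribute values that are actually published for your nodes (and therefore what a CEL expression can match on), inspect the ResourceSlices the driver publishes:

# Shows each node's devices and their published attributes
kubectl get resourceslices -o yaml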

Example 2 - Dynamic LNC config#

This example shows how to set LNC per workload. Previously, overriding LNC on a node required a custom launch template. With DRA, workloads can override the default LNC via a ResourceClaim.

Apply the following workload definition:

kubectl apply -f examples/specs/lnc-setting-trn2.yaml

This workload definition (which includes the ResourceClaimTemplate) is shown below for quick reference:

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: all-neurons-lnc-1
spec:
  spec:
    devices:
      requests:
      - name: neurons
        deviceClassName: neuron.aws.com
        selectors:
        - cel:
            expression: "device.attributes['neuron.aws.com'].instanceType == 'trn2.48xlarge'"
        allocationMode: All
      config:
      - requests: ["neurons"]
        opaque:
          driver: neuron.aws.com
          parameters:
            apiVersion: neuron.aws.com/v1alphav1
            kind: NeuronConfig
            logicalNeuronCore: 1

Then deploy a pod that references the above ResourceClaimTemplate as shown below:

apiVersion: v1
kind: Pod
metadata:
  name: pod0
  labels:
    app: pod
spec:
  containers:
  - name: ctr0
    image: public.ecr.aws/ubuntu/ubuntu:22.04
    command: ["bash", "-c"]
    args: ["export; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: neurons
  resourceClaims:
  - name: neurons
    resourceClaimTemplateName: all-neurons-lnc-1
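
Once the pod is running, you can optionally inspect what the driver injected into the container for this allocation. The exact mechanism and variable names are driver-specific; the command below simply lists anything Neuron-related in the container environment.

# List Neuron-related environment variables injected into the container
kubectl exec pod0 -- sh -c 'env | grep -i neuron'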

Example 3 – Four Node Inference on trn2u.48xlarge#

A Trn2 UltraServer consists of 4 trn2u.48xlarge nodes interconnected by NeuronLink.

trn2u.48xlarge instances can be allocated in sets of 1, 2, or 4. The Neuron DRA driver can use one or more ResourceClaimTemplate definitions to convey the desired set size. The ResourceClaimTemplate allows end users to specify “UltraServerConfig” to declare their intent to use all 4 nodes of the UltraServer. The Neuron DRA driver passes this configuration value to the Neuron runtime and collectives inside the container.

Example yaml for 4-node inference on trn2u.48xlarge:

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: us-4-node-config
spec:
  spec:
    devices:
      requests:
      - name: neurons
        deviceClassName: neuron.aws.com
        selectors:
        - cel:
            expression: "device.attributes['neuron.aws.com'].resourceType == 'neuron_node'"
        allocationMode: ExactCount
        count: 1
      config:
      - requests: ["neurons"]
        opaque:
          driver: neuron.aws.com
          parameters:
            apiVersion: neuron.aws.com/v1alphav1
            kind: UltraServerConfig
            ultraserverMode: 4
---
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
  annotations:
    leaderworkerset.sigs.k8s.io/exclusive-topology: neuron.amazonaws.com/ultraserver-server-id-4
spec:
  rolloutStrategy:
    type: RollingUpdate
    rollingUpdateConfiguration:
      maxUnavailable: 1
      maxSurge: 1
  # Two replica groups of 4 nodes each, i.e. two ultraservers
  replicas: 2
  leaderWorkerTemplate:
    size: 4
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
        - name: vllm-leader
          image: public.ecr.aws/ubuntu/ubuntu:22.04
          command:
          - sh
          - -c
          - "sleep infinity"
          resources:
            claims:
            - name: one-node-from-ultraserver
        resourceClaims:
        - name: one-node-from-ultraserver
          resourceClaimTemplateName: us-4-node-config
    workerTemplate:
      metadata:
        labels:
          role: worker
      spec:
        containers:
        - name: vllm-worker
          image: public.ecr.aws/ubuntu/ubuntu:22.04
          command:
          - sh
          - -c
          - "sleep infinity"
          resources:
            claims:
            - name: one-node-from-ultraserver
        resourceClaims:
        - name: one-node-from-ultraserver
          resourceClaimTemplateName: us-4-node-config
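
After applying the LeaderWorkerSet, you can check that both replica groups were scheduled and observe node placement. The label selector below assumes the default labels LeaderWorkerSet applies to its pods; verify against your LWS version.

# Show the leader and worker pods for the "vllm" LeaderWorkerSet with node placement
kubectl get pods -l leaderworkerset.sigs.k8s.io/name=vllm -o wide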

FAQs#

Can DRA plugin co-exist with other device plugins?#

Device plugins and the DRA plugin can coexist in the same cluster, but not on the same node. As of now, the two mechanisms act independently. Neuron is preparing an upcoming feature that will allow device plugin-based allocations to work with DRA, but the feature is still in alpha and not enabled on EKS. Ref: Extended Resource.

Is DRA replacing Neuron Device Plugin and Scheduler Extension?#

We will continue to support the Neuron Device Plugin and the Scheduler Extension as long as:

  1. Upstream Kubernetes continues to support device plugins.

  2. EKS continues to support Kubernetes versions below 1.34 (which do not support DRA).

What Kubernetes versions are supported?#

The Kubernetes control plane must be on 1.34. For the node AMI, we support 1.33 and 1.34.2. We do not support node AMIs for 1.34.0 or 1.34.1, since those versions had a regression in DRA. Upstream issue: Kubernetes Issue #133920

Where can I learn more about how to put together RCT using CEL expressions?#

To learn more about RCTs, please visit Kubernetes Dynamic Resource Allocation. To learn more about CEL expressions, please visit CEL Language. Send us feedback on the beta and let us know which additional RCT examples you would like us to provide in the source code.

This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3