This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3

Permissions for Neuron Node Problem Detector and Recovery

The Neuron Node Problem Detector and Recovery requires IAM roles for service accounts (IRSA) for authorization. For more information, see IAM roles for service accounts in the Amazon EKS User Guide.

This section shows how to configure an IAM role for service accounts using the eksctl command-line tool.

Step 1: Install eksctl

Install the eksctl CLI using the instructions at https://eksctl.io/installation/.
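
Later steps in this section also call the aws and kubectl command-line tools, so it can help to confirm that all three are on the PATH before continuing. The following is a minimal sketch; the helper name check_tools is hypothetical and is not part of eksctl, the AWS CLI, or kubectl.

```shell
# Hypothetical helper: report every named command that is not on the PATH.
# Returns 0 when all commands are found, 1 otherwise.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example use:
#   check_tools eksctl aws kubectl && echo "all tools found"
```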

Step 2: Create IAM Policy

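Create an IAM policy that lets the Neuron Node Problem Detector set instance health in the Auto Scaling group, describe EC2 instances, and publish CloudWatch metrics to the NeuronHealthCheck namespace.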

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "autoscaling:SetInstanceHealth",
                "autoscaling:DescribeAutoScalingInstances"
            ],
            "Effect": "Allow",
            "Resource": "<arn of the Auto Scaling group corresponding to the Neuron nodes for the cluster>"
        },
        {
            "Action": [
                "ec2:DescribeInstances"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Condition": {
                "ForAllValues:StringEquals": {
                    "ec2:ResourceTag/aws:autoscaling:groupName": "<name of the Auto Scaling group corresponding to the Neuron nodes for the cluster>"
                }
            }
        },
        {
            "Action": [
                "cloudwatch:PutMetricData"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "cloudwatch:Namespace": "NeuronHealthCheck"
                }
            }
        }
    ]
}

Save the policy template above to a file named npd-policy.json (replacing the placeholder values), then run:

aws iam create-policy \
    --policy-name NeuronProblemDetectorPolicy \
    --policy-document file://npd-policy.json
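
The placeholder replacement can also be scripted before running the create-policy command. The sketch below writes the policy from this step to a file with the two Auto Scaling group values filled in. The function name write_npd_policy and its argument order are hypothetical; the policy body is copied from the template above.

```shell
# Hypothetical helper: write the Step 2 policy to a file.
#   $1  ARN of the Auto Scaling group for the Neuron nodes
#   $2  name of that Auto Scaling group
#   $3  output path (optional, defaults to npd-policy.json)
write_npd_policy() {
    local asg_arn="$1"
    local asg_name="$2"
    local out="${3:-npd-policy.json}"
    cat > "$out" <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "autoscaling:SetInstanceHealth",
                "autoscaling:DescribeAutoScalingInstances"
            ],
            "Effect": "Allow",
            "Resource": "$asg_arn"
        },
        {
            "Action": [
                "ec2:DescribeInstances"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Condition": {
                "ForAllValues:StringEquals": {
                    "ec2:ResourceTag/aws:autoscaling:groupName": "$asg_name"
                }
            }
        },
        {
            "Action": [
                "cloudwatch:PutMetricData"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "cloudwatch:Namespace": "NeuronHealthCheck"
                }
            }
        }
    ]
}
EOF
}

# Example use (requires AWS credentials; ASG_NAME is a value you supply):
#   ASG_ARN=$(aws autoscaling describe-auto-scaling-groups \
#       --auto-scaling-group-names "$ASG_NAME" \
#       --query 'AutoScalingGroups[0].AutoScalingGroupARN' --output text)
#   write_npd_policy "$ASG_ARN" "$ASG_NAME"
```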

Step 3: Create Namespace and Service Account

Create a dedicated namespace for the Neuron Node Problem Detector. The service account itself is created in this namespace by eksctl in Step 4:

kubectl create ns neuron-healthcheck-system

Step 4: Associate IAM Role with Service Account

Use the following script to create the IAM role and the service account, and to associate the two:

#!/bin/bash
# Replace the placeholders before running.
CLUSTER_NAME=<eks cluster name>
# Uses the default region from the AWS CLI configuration.
REGION_CODE=$(aws configure get region)
# The ARN is printed as Policy.Arn by the create-policy command in Step 2.
POLICY_ARN=<policy arn for NeuronProblemDetectorPolicy>

eksctl create iamserviceaccount \
    --name node-problem-detector \
    --namespace neuron-healthcheck-system \
    --cluster "$CLUSTER_NAME" \
    --attach-policy-arn "$POLICY_ARN" \
    --approve \
    --role-name "neuron-problem-detector-role-$CLUSTER_NAME" \
    --region "$REGION_CODE" \
    --override-existing-serviceaccounts
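
If the policy ARN from Step 2 was not recorded, it can be reconstructed from the account ID and the policy name. The sketch below assumes the standard "aws" partition and the default policy path "/", which is what the create-policy command in Step 2 produces when no --path is given. The function name policy_arn is hypothetical.

```shell
# Hypothetical helper: build the ARN of a customer managed IAM policy.
#   $1  12-digit AWS account ID
#   $2  policy name
# Assumes the "aws" partition and the default policy path "/".
policy_arn() {
    printf 'arn:aws:iam::%s:policy/%s\n' "$1" "$2"
}

# Example use (requires AWS credentials):
#   ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
#   POLICY_ARN=$(policy_arn "$ACCOUNT_ID" NeuronProblemDetectorPolicy)
```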

Step 5: Verify Service Account Configuration

Verify that the service account is annotated correctly with the IAM role:

kubectl describe sa node-problem-detector -n neuron-healthcheck-system

Expected output (the account ID and role name will reflect your account and cluster):

Name:                node-problem-detector
Namespace:           neuron-healthcheck-system
Labels:              app.kubernetes.io/managed-by=eksctl
Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::111111111111:role/neuron-problem-detector-role-cluster1
Image pull secrets:  <none>
Mountable secrets:   <none>
Tokens:              <none>
Events:              <none>
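
The same check can be done in a script by reading only the annotation and comparing it with the role name chosen in Step 4. This is a minimal sketch; the helper name is_expected_role_arn is hypothetical, and it checks only the shape of the ARN and the role name, not whether the role exists in IAM.

```shell
# Hypothetical helper: succeed if $1 is an IAM role ARN whose role name is $2.
is_expected_role_arn() {
    case "$1" in
        arn:*:iam::*:role/"$2") return 0 ;;
        *) return 1 ;;
    esac
}

# Example use (requires access to the cluster; CLUSTER_NAME as in Step 4):
#   ROLE_ARN=$(kubectl get sa node-problem-detector \
#       -n neuron-healthcheck-system \
#       -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}')
#   is_expected_role_arn "$ROLE_ARN" "neuron-problem-detector-role-$CLUSTER_NAME" \
#       || echo "service account is not annotated with the expected role" >&2
```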

Cleanup

To remove the service account and its associated IAM role, use the following script:

#!/bin/bash
# Replace the placeholder before running.
CLUSTER_NAME=<eks cluster name>
# Uses the default region from the AWS CLI configuration.
REGION_CODE=$(aws configure get region)

eksctl delete iamserviceaccount \
    --name node-problem-detector \
    --namespace neuron-healthcheck-system \
    --cluster "$CLUSTER_NAME" \
    --approve \
    --region "$REGION_CODE"
