This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3

Containers - Tutorials#

Learn how to deploy and manage AWS Neuron workloads in containerized environments. These tutorials cover everything from building Docker containers with Neuron support to deploying production-ready Kubernetes clusters with device plugins, schedulers, and monitoring solutions. Whether you’re running inference or training workloads on AWS Trainium or Inferentia instances, these step-by-step guides will help you configure your container infrastructure for optimal performance and reliability.

General Container Tutorials#

Docker Environment Setup

Configure Docker on Amazon Linux 2023 to expose Inferentia and Trainium devices to containers. Install the Neuron driver and runtime, and configure the Docker daemon for Neuron device access.
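The flow above can be sketched as follows. This is a minimal sketch assuming the Neuron package repository is already configured on the instance; the image name is illustrative.

```shell
# Install the Neuron kernel driver and tools (AL2023 uses dnf).
sudo dnf install -y aws-neuronx-dkms aws-neuronx-tools

# Verify that Neuron devices are visible on the host.
neuron-ls

# Expose a single Neuron device to a container explicitly.
docker run --device=/dev/neuron0 my-neuron-image
```

Each Neuron device appears as a `/dev/neuronN` character device; pass one `--device` flag per device the container should see.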

Build and Run Neuron Containers

Learn how to build Docker images with Neuron support using provided Dockerfiles and run containerized applications on Inf1 and Trn1 instances with proper device exposure.
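The build-and-run pattern looks like the following sketch; the image tag is illustrative, and the Dockerfile is assumed to install the Neuron runtime libraries and tools.

```shell
# Build an image from a Neuron-enabled Dockerfile in the current directory.
docker build -t my-neuron-app .

# Run it with explicit device exposure on an Inf1/Trn1 instance;
# neuron-ls inside the container confirms the device is visible
# (assumes aws-neuronx-tools is installed in the image).
docker run --device=/dev/neuron0 my-neuron-app neuron-ls
```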

Docker Neuron OCI Hook Setup

Install and configure the Neuron OCI hook to enable the AWS_NEURON_VISIBLE_DEVICES environment variable for exposing all Neuron devices to containers without explicit device flags.
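With the hook installed, device selection moves from `--device` flags to the environment variable, roughly as follows (image name illustrative):

```shell
# Expose all Neuron devices on the host to the container.
docker run -e AWS_NEURON_VISIBLE_DEVICES=all my-neuron-image

# Or expose only a subset, by device index.
docker run -e AWS_NEURON_VISIBLE_DEVICES=0,1 my-neuron-image
```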

Kubernetes Setup and Configuration#

Kubernetes Environment Setup

Complete guide to setting up Kubernetes for Neuron, including EKS cluster creation with Trainium nodes, device plugin installation, scheduler extension setup, and resource allocation configuration.
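Once the device plugin is running, Pods request Neuron devices through the `aws.amazon.com/neuron` extended resource. A minimal sketch, with a hypothetical Pod and image name:

```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: neuron-test
spec:
  containers:
  - name: app
    image: my-neuron-image   # hypothetical Neuron-enabled image
    resources:
      limits:
        aws.amazon.com/neuron: 1   # one Neuron device, allocated by the device plugin
EOF
```

The scheduler extension uses this resource request to place the Pod on a node with a free Neuron device.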

Neuron Helm Chart

Simplify Neuron infrastructure deployment with the unified Helm chart that installs device plugins, scheduler extensions, node problem detector, and DRA driver in a single command.
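The single-command install looks roughly like this; the OCI chart location shown is an assumption based on the Neuron public ECR registry and may differ between releases:

```shell
# Install the unified chart (device plugin, scheduler extension, NPD, DRA driver).
helm install neuron oci://public.ecr.aws/neuron/neuron-helm-chart

# Verify the Neuron components were deployed.
kubectl get pods -A | grep neuron
```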

Kubernetes Device Management#

Scheduler Flow Diagram

A diagram showing how the Neuron Scheduler Extension integrates with the Kubernetes scheduler to place Pods that request Neuron resources.

Kubernetes Monitoring and Recovery#

Neuron Monitor

Deploy Neuron Monitor to collect and expose metrics from Neuron devices and runtime. Integrate with Prometheus for observability, performance tracking, and troubleshooting.
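On a node, the monitor-to-Prometheus pipeline is a simple pipe: `neuron-monitor` streams JSON metrics on stdout, and the companion exporter script republishes them on an HTTP endpoint for Prometheus to scrape (the port shown is illustrative):

```shell
# Collect device/runtime metrics and expose them for Prometheus scraping.
neuron-monitor | neuron-monitor-prometheus.py --port 8000
```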

Node Problem Detector and Recovery

Monitor Neuron device health and automatically remediate issues by detecting hardware failures, driver problems, and runtime errors. Enable automatic node replacement for faulty hardware.

NPD Permissions (IRSA)

Configure IAM roles for service accounts (IRSA) to grant the Neuron Node Problem Detector necessary permissions for Auto Scaling group operations and CloudWatch metrics.
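The IRSA association can be created with `eksctl`, roughly as below. The cluster name, namespace, service-account name, and policy ARN are all placeholders; the attached policy is assumed to allow the Auto Scaling and CloudWatch actions the detector uses (e.g. `autoscaling:SetInstanceHealth`, `cloudwatch:PutMetricData`).

```shell
# Bind an IAM role to the detector's Kubernetes service account (names are illustrative).
eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace neuron-healthcheck-system \
  --name node-problem-detector \
  --attach-policy-arn arn:aws:iam::111122223333:policy/NeuronProblemDetectorPolicy \
  --approve
```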

Training and Inference Container Tutorials#