This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3

Quickstart: Train a Model on Trainium

This quickstart guides you through training your first PyTorch model on AWS Trainium. You’ll launch a Trn1 instance, install the Neuron SDK packages for PyTorch, and run a simple training script. By the end, you’ll understand the basic workflow for training models with Neuron.

This quickstart is for: ML engineers and data scientists new to AWS Trainium

Time to complete: ~15 minutes

Prerequisites

Before you begin, ensure you have:

An AWS account with EC2 launch permissions
AWS CLI configured with your credentials
SSH key pair for EC2 access
Basic familiarity with PyTorch
Terminal access (Linux, macOS, or WSL on Windows)

Step 1: Launch a Trainium instance

In this step, you will launch a Trn1 instance using the AWS Deep Learning AMI.

Launch a trn1.2xlarge instance with the latest Deep Learning AMI:

aws ec2 run-instances \
    --image-id resolve:ssm:/aws/service/deep-learning-base-neuron/ubuntu-22-04/latest \
    --instance-type trn1.2xlarge \
    --key-name YOUR_KEY_NAME \
    --security-group-ids YOUR_SECURITY_GROUP \
    --subnet-id YOUR_SUBNET_ID

Note

Replace YOUR_KEY_NAME, YOUR_SECURITY_GROUP, and YOUR_SUBNET_ID with your values.
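
Before running the command, it can help to confirm that no placeholder was left unreplaced. The following Python sketch assembles the same aws ec2 run-instances arguments shown above and raises an error if a YOUR_... placeholder remains. The helper name build_run_instances_cmd is hypothetical (it is not part of the AWS CLI or the Neuron SDK), the example IDs are made up, and the code only builds the argument list without calling AWS.

```python
import shlex

# SSM parameter path copied from the launch command in this step.
AMI_PARAMETER = "resolve:ssm:/aws/service/deep-learning-base-neuron/ubuntu-22-04/latest"


def build_run_instances_cmd(key_name, security_group, subnet_id,
                            instance_type="trn1.2xlarge"):
    """Return the argv list for the launch command (hypothetical helper).

    It only assembles arguments; it never contacts AWS.
    """
    values = {
        "--key-name": key_name,
        "--security-group-ids": security_group,
        "--subnet-id": subnet_id,
    }
    # Reject empty values and the literal placeholders from the quickstart.
    for flag, value in values.items():
        if not value or value.startswith("YOUR_"):
            raise ValueError(f"{flag} still holds a placeholder: {value!r}")

    cmd = ["aws", "ec2", "run-instances",
           "--image-id", AMI_PARAMETER,
           "--instance-type", instance_type]
    for flag, value in values.items():
        cmd += [flag, value]
    return cmd


if __name__ == "__main__":
    # Made-up example values; substitute your own.
    argv = build_run_instances_cmd("my-key", "sg-0123456789abcdef0",
                                   "subnet-0123456789abcdef0")
    print(shlex.join(argv))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run if you prefer to launch from Python.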

Alternatively, launch the instance through the EC2 Console.

Once the instance is running, connect via SSH:

ssh -i YOUR_KEY.pem ubuntu@YOUR_INSTANCE_IP

Verify Neuron devices are available:

neuron-ls

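You should see output listing the available Neuron devices and their NeuronCores, similar to the following (the device count, memory, and PCI addresses depend on the instance type):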

+--------+--------+--------+---------+
| NEURON | NEURON | NEURON |   PCI   |
| DEVICE | CORES  | MEMORY |   BDF   |
+--------+--------+--------+---------+
| 0      | 2      | 32 GB  | 00:1e.0 |
| 1      | 2      | 32 GB  | 00:1f.0 |
+--------+--------+--------+---------+
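
If you want to check the device list from a script, the table can be parsed as plain text. The sketch below assumes the exact four-column layout shown above; real neuron-ls output may contain additional columns or differ between SDK versions, so treat the parser (and the name parse_neuron_ls, which is illustrative) as a starting point only.

```python
def parse_neuron_ls(table_text):
    """Parse the four-column ASCII table shown above into a list of dicts."""
    devices = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ border lines and blanks
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) < 4 or not cells[0].isdigit():
            continue  # skip the two header rows
        devices.append({
            "device": int(cells[0]),
            "cores": int(cells[1]),
            "memory": cells[2],
            "pci_bdf": cells[3],
        })
    return devices


# Sample text copied from the output above.
SAMPLE = """\
+--------+--------+--------+---------+
| NEURON | NEURON | NEURON |   PCI   |
| DEVICE | CORES  | MEMORY |   BDF   |
+--------+--------+--------+---------+
| 0      | 2      | 32 GB  | 00:1e.0 |
| 1      | 2      | 32 GB  | 00:1f.0 |
+--------+--------+--------+---------+
"""

devices = parse_neuron_ls(SAMPLE)
total_cores = sum(d["cores"] for d in devices)
print(f"{len(devices)} device(s), {total_cores} NeuronCore(s)")
```

On the instance, you could feed the function the captured output of neuron-ls instead of SAMPLE.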

Step 2: Set up your environment

In this step, you will create a Python virtual environment and install PyTorch with Neuron support.

Create and activate a virtual environment:

python3 -m venv neuron_env
source neuron_env/bin/activate

Install PyTorch Neuron and dependencies:

pip install torch-neuronx neuronx-cc --extra-index-url=https://pip.repos.neuron.amazonaws.com

Verify the installation:

python -c "import torch; import torch_neuronx; print(f'PyTorch: {torch.__version__}')"

You should see output reporting the installed PyTorch version (the version on your instance may differ):

PyTorch: 2.9.0+cpu

Step 3: Create a training script

In this step, you will create a simple PyTorch training script that uses Neuron acceleration.

Create a file named train_simple.py:

import torch
import torch.nn as nn
import torch.optim as optim
import torch_neuronx

# Simple neural network
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        return self.fc2(x)

# Create model and move to Neuron device
model = SimpleNet().to('neuron')
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Generate dummy training data
batch_size = 32
num_batches = 100

print("Starting training...")
model.train()

for batch_idx in range(num_batches):
    # Create dummy batch
    inputs = torch.randn(batch_size, 784).to('neuron')
    targets = torch.randint(0, 10, (batch_size,)).to('neuron')

    # Training step
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()

    if batch_idx % 10 == 0:
        print(f"Batch {batch_idx}/{num_batches}, Loss: {loss.item():.4f}")

print("Training complete!")

This script creates a simple neural network, moves it to the Neuron device, and trains it on synthetic data.
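
To get a feel for how small this network is, the parameter count can be worked out by hand. The sketch below repeats that arithmetic in plain Python, assuming the default nn.Linear configuration (a bias term per output) and float32 parameters; linear_params is an illustrative helper, not a PyTorch function.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


fc1 = linear_params(784, 128)  # 784*128 weights + 128 biases
fc2 = linear_params(128, 10)   # 128*10 weights + 10 biases
total = fc1 + fc2

print(f"fc1={fc1:,} fc2={fc2:,} total={total:,}")
# float32 parameters occupy 4 bytes each.
print(f"approx. size: {total * 4 / 1024:.1f} KiB")
```

At roughly 100 thousand parameters, the model occupies well under a megabyte, so it exercises the workflow rather than the hardware.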

Step 4: Run training

In the final step, you will run the training script and monitor its progress.

Execute the training script:

python train_simple.py

You should see training progress similar to the following (exact loss values vary between runs):

Starting training...
Batch 0/100, Loss: 2.3156
Batch 10/100, Loss: 2.2845
Batch 20/100, Loss: 2.2534
...
Training complete!
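
A quick sanity check for these numbers: the script draws its targets at random from 10 classes, so a model that predicts every class with equal probability has a cross-entropy loss of ln(10). The short calculation below shows that baseline. Because the data carries no learnable pattern, it is reasonable to expect losses in your run to stay in the neighborhood of this value.

```python
import math

num_classes = 10  # matches the output size of fc2 in train_simple.py

# Cross-entropy of a uniform prediction over num_classes labels: -ln(1/n) = ln(n).
baseline = math.log(num_classes)
print(f"Expected loss for uniform guessing: {baseline:.4f}")
```

A starting loss far from this baseline (for example, NaN or a value in the hundreds) would suggest a problem with the script or the environment rather than with the data.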

Monitor Neuron device utilization in another terminal:

neuron-top

This shows real-time NeuronCore utilization, memory usage, and other metrics.

Confirmation

Congratulations! You’ve successfully trained your first model on AWS Trainium. You should have:

✅ Launched a Trn1 instance with the Neuron SDK
✅ Installed PyTorch with Neuron support
✅ Created and ran a training script on Neuron devices
✅ Monitored training with Neuron tools

If you encountered any issues, see the Common issues section below.

Common issues

Issue: ModuleNotFoundError: No module named 'torch_neuronx'

Solution: Make sure the virtual environment is activated and the packages are installed in it:

source neuron_env/bin/activate
pip install torch-neuronx neuronx-cc --extra-index-url=https://pip.repos.neuron.amazonaws.com
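
If the error persists, a short diagnostic can tell you whether the interpreter is actually running inside the virtual environment and which modules it cannot find. This is a minimal sketch using only the Python standard library; the function names are illustrative.

```python
import importlib.util
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv.

    Inside a venv, sys.prefix points at the environment while
    sys.base_prefix still points at the base installation.
    """
    return sys.prefix != sys.base_prefix


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [n for n in names if importlib.util.find_spec(n) is None]


print("virtual environment active:", in_virtualenv())
missing = missing_modules(["torch", "torch_neuronx"])
print("missing modules:", ", ".join(missing) if missing else "none")
```

If the first line prints False, activate neuron_env and run the check again before reinstalling anything.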

Issue: RuntimeError: No Neuron devices found

Solution: Verify you’re on a Trainium instance and devices are visible:

neuron-ls

If no devices appear, check the instance type and the Neuron driver installation.
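
One way to check for the driver from Python is to look for device nodes. The sketch below assumes the Neuron driver exposes devices as /dev/neuron0, /dev/neuron1, and so on; that path pattern is an assumption to verify against your driver version, and find_neuron_device_nodes is a hypothetical helper.

```python
import glob


def find_neuron_device_nodes(pattern="/dev/neuron*"):
    """Return device node paths matching the pattern (assumed naming scheme)."""
    return sorted(glob.glob(pattern))


nodes = find_neuron_device_nodes()
if nodes:
    print("Found device nodes:", ", ".join(nodes))
else:
    print("No /dev/neuron* nodes found; check the instance type and driver.")
```

An empty result on a Trainium instance points toward a missing or unloaded driver rather than a problem in the training script.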

Issue: Training is slower than expected

Solution: This quickstart uses a small model for demonstration. For production workloads:

Use larger batch sizes
Enable XLA compilation with torch.compile()
See Developer Guide for Training with PyTorch NeuronX for optimization techniques
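
When judging speed, it helps to measure steady-state throughput separately from start-up cost. On accelerators that compile code before executing it, the first few steps can include one-time compilation; assuming that applies to your setup, a timing loop with a warm-up phase gives a fairer number. The sketch below is generic Python, and dummy_step is a stand-in for a real training step.

```python
import time


def time_steps(step_fn, num_steps=100, warmup=10):
    """Return average seconds per step, ignoring the first `warmup` calls."""
    for _ in range(warmup):
        step_fn()  # not timed: may include one-time setup or compilation
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return elapsed / num_steps


def dummy_step():
    # Stand-in for one training step (forward, backward, optimizer update).
    sum(i * i for i in range(1000))


avg = time_steps(dummy_step, num_steps=50, warmup=5)
batch_size = 32  # same batch size as train_simple.py
print(f"{avg * 1e3:.3f} ms/step")
if avg > 0:
    print(f"{batch_size / avg:.0f} samples/s")
```

To time the quickstart script, wrap the body of its training loop in a function and pass it as step_fn. Keeping the loss.item() call inside the step is a reasonable way to make sure each step has finished before the clock is read.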

Clean up

To avoid ongoing charges, terminate your instance when finished:

# From your local machine
aws ec2 terminate-instances --instance-ids YOUR_INSTANCE_ID

Or use the EC2 Console to terminate the instance.

Next steps

Now that you’ve completed this quickstart, explore more advanced training topics:

Developer Guide for Training with PyTorch NeuronX - Comprehensive training guide
NxD Training - Distributed training with NeuronX Distributed
Model samples and tutorials - Pre-tested model samples
Neuron Explorer - Profile and optimize training performance
