This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3

Quickstart: Train a Model on Trainium#

This quickstart guides you through training your first PyTorch model on AWS Trainium. You’ll launch a Trn1 instance, install the Neuron SDK packages for PyTorch, and run a simple training script. When you complete this quickstart, you’ll understand the basic workflow for training models with Neuron.

This quickstart is for: ML engineers and data scientists new to AWS Trainium

Time to complete: ~15 minutes

Prerequisites#

Before you begin, ensure you have:

  • An AWS account with EC2 launch permissions

  • AWS CLI configured with your credentials

  • SSH key pair for EC2 access

  • Basic familiarity with PyTorch

  • Terminal access (Linux, macOS, or WSL on Windows)

Step 1: Launch a Trainium instance#

In this step, you will launch a Trn1 instance using the AWS Deep Learning AMI.

First, launch a trn1.2xlarge instance with the latest Deep Learning AMI:

aws ec2 run-instances \
    --image-id resolve:ssm:/aws/service/deep-learning-base-neuron/ubuntu-22-04/latest \
    --instance-type trn1.2xlarge \
    --key-name YOUR_KEY_NAME \
    --security-group-ids YOUR_SECURITY_GROUP \
    --subnet-id YOUR_SUBNET_ID

Note

Replace YOUR_KEY_NAME, YOUR_SECURITY_GROUP, and YOUR_SUBNET_ID with your values.

Alternatively, launch the instance through the EC2 Console.

Once the instance is running, connect via SSH:

ssh -i YOUR_KEY.pem ubuntu@YOUR_INSTANCE_IP

Verify Neuron devices are available:

neuron-ls

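You should see output similar to the following, listing the available Neuron devices and their NeuronCores (the number of rows depends on the instance type):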

+--------+--------+--------+---------+
| NEURON | NEURON | NEURON |   PCI   |
| DEVICE | CORES  | MEMORY |   BDF   |
+--------+--------+--------+---------+
| 0      | 2      | 32 GB  | 00:1e.0 |
| 1      | 2      | 32 GB  | 00:1f.0 |
+--------+--------+--------+---------+
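If you would rather check the device count from a script than by eye, the table can be parsed with a few lines of Python. This is only an illustrative sketch: `parse_neuron_ls` is a hypothetical helper, not part of the Neuron SDK, and it assumes the four-column layout shown above, which may differ between Neuron tool versions.

```python
# Sample text copied from the neuron-ls output shown in this quickstart.
SAMPLE = """\
+--------+--------+--------+---------+
| NEURON | NEURON | NEURON |   PCI   |
| DEVICE | CORES  | MEMORY |   BDF   |
+--------+--------+--------+---------+
| 0      | 2      | 32 GB  | 00:1e.0 |
| 1      | 2      | 32 GB  | 00:1f.0 |
+--------+--------+--------+---------+
"""


def parse_neuron_ls(text):
    """Return one dict per device row found in neuron-ls table output.

    Hypothetical helper: assumes the 4-column table layout shown above.
    """
    devices = []
    for line in text.splitlines():
        if not line.startswith("|"):
            continue  # skip the +----+ border lines
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 4 or not cells[0].isdigit():
            continue  # skip the two header rows
        devices.append(
            {
                "device": int(cells[0]),
                "cores": int(cells[1]),
                "memory": cells[2],
                "bdf": cells[3],
            }
        )
    return devices


if __name__ == "__main__":
    found = parse_neuron_ls(SAMPLE)
    total_cores = sum(d["cores"] for d in found)
    print(f"{len(found)} device(s), {total_cores} NeuronCore(s)")
```

On a real instance you would pass the captured output of `neuron-ls` to the function instead of `SAMPLE`.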

Step 2: Set up your environment#

In this step, you will create a Python virtual environment and install PyTorch with Neuron support.

Create and activate a virtual environment:

python3 -m venv neuron_env
source neuron_env/bin/activate

Install PyTorch Neuron and dependencies:

pip install torch-neuronx neuronx-cc --extra-index-url=https://pip.repos.neuron.amazonaws.com

Verify the installation:

python -c "import torch; import torch_neuronx; print(f'PyTorch: {torch.__version__}')"

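You should see output similar to the following, confirming that PyTorch is installed and that torch_neuronx imports without error (the exact version string depends on the release you installed):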

PyTorch: 2.9.0+cpu
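The one-line check above stops at the first failing import. If you want to see every missing module at once, a small standard-library script can look each one up without importing it. This is a sketch, not an official Neuron tool: `missing_modules` is a hypothetical helper, and the `REQUIRED` tuple lists only the two modules this quickstart imports.

```python
import importlib.util

# The modules imported by the verification command in this quickstart.
REQUIRED = ("torch", "torch_neuronx")


def missing_modules(names=REQUIRED):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found")
```

Run it with the virtual environment activated so that it inspects the same interpreter the training script will use.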

Step 3: Create a training script#

In this step, you will create a simple PyTorch training script that uses Neuron acceleration.

Create a file named train_simple.py:

import torch
import torch.nn as nn
import torch.optim as optim
import torch_neuronx

# Simple neural network
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        return self.fc2(x)

# Create model and move to Neuron device
model = SimpleNet().to('neuron')
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Generate dummy training data
batch_size = 32
num_batches = 100

print("Starting training...")
model.train()

for batch_idx in range(num_batches):
    # Create dummy batch
    inputs = torch.randn(batch_size, 784).to('neuron')
    targets = torch.randint(0, 10, (batch_size,)).to('neuron')

    # Training step
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()

    if batch_idx % 10 == 0:
        print(f"Batch {batch_idx}/{num_batches}, Loss: {loss.item():.4f}")

print("Training complete!")

This script creates a simple neural network, moves it to the Neuron device, and trains it on synthetic data.
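Because the inputs and targets are random, it helps to know what loss value to expect before running the script. For a freshly initialized classifier whose predictions are roughly uniform over the 10 output classes, cross-entropy is close to ln(10). The calculation below is a sanity check, not part of the training script, and uses only the standard library.

```python
import math

num_classes = 10  # matches the output size of SimpleNet.fc2

# Cross-entropy of a uniform prediction over num_classes: -log(1 / num_classes)
baseline = -math.log(1.0 / num_classes)

print(f"Expected loss for a uniform guess: {baseline:.4f}")
```

Early loss values far from this baseline (about 2.30) would suggest that something other than the model, such as the data shapes or the device setup, needs a closer look.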

Step 4: Run training#

In the final step, you will run the training script and monitor its progress.

Execute the training script:

python train_simple.py

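You should see training progress output similar to the following (the exact loss values will differ from run to run because the data is random):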

Starting training...
Batch 0/100, Loss: 2.3156
Batch 10/100, Loss: 2.2845
Batch 20/100, Loss: 2.2534
...
Training complete!
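If you save the progress output to a file, the loss values can be pulled out for plotting or comparison between runs. The sketch below assumes the exact `Batch i/n, Loss: x` line format printed by train_simple.py; `parse_losses` is a hypothetical helper introduced here for illustration.

```python
import re

# Matches the progress lines printed by train_simple.py.
PROGRESS_LINE = re.compile(r"Batch (\d+)/(\d+), Loss: ([0-9.]+)")

# Sample text copied from the output shown in this quickstart.
SAMPLE_LOG = """\
Starting training...
Batch 0/100, Loss: 2.3156
Batch 10/100, Loss: 2.2845
Batch 20/100, Loss: 2.2534
Training complete!
"""


def parse_losses(log_text):
    """Return (batch_index, loss) pairs found in the script's output."""
    return [
        (int(m.group(1)), float(m.group(3)))
        for m in PROGRESS_LINE.finditer(log_text)
    ]


if __name__ == "__main__":
    for batch_idx, loss in parse_losses(SAMPLE_LOG):
        print(batch_idx, loss)
```

To capture a real run, redirect the script's output to a file (for example `python train_simple.py > train.log`) and pass the file contents to `parse_losses`.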

Monitor Neuron device utilization in another terminal:

neuron-top

This shows real-time NeuronCore utilization, memory usage, and other metrics.

Confirmation#

Congratulations! You’ve successfully trained your first model on AWS Trainium. You should have:

  • ✅ Launched a Trn1 instance with the Neuron SDK

  • ✅ Installed PyTorch with Neuron support

  • ✅ Created and ran a training script on Neuron devices

  • ✅ Monitored training with Neuron tools

If you encountered any issues, see the Common issues section below.

Common issues#

Issue: ModuleNotFoundError: No module named 'torch_neuronx'

Solution: Make sure the virtual environment is activated and the packages are installed in it:

source neuron_env/bin/activate
pip install torch-neuronx neuronx-cc --extra-index-url=https://pip.repos.neuron.amazonaws.com

Issue: RuntimeError: No Neuron devices found

Solution: Verify you’re on a Trainium instance and devices are visible:

neuron-ls

If no devices appear, check that the instance type is a Trainium type (for example, trn1.2xlarge) and that the Neuron driver is installed.
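When working through these checks, it can be useful to gather the basic facts in one place. The sketch below reports whether `neuron-ls` is on the `PATH` and which device nodes exist. It rests on an assumption that is not stated in this quickstart: that the Neuron driver exposes devices as `/dev/neuron*` nodes. `neuron_diagnostics` is a hypothetical helper, not a Neuron SDK function.

```python
import glob
import shutil


def neuron_diagnostics():
    """Collect basic facts about Neuron tooling and device nodes on this host.

    Assumption: the Neuron driver creates device nodes matching /dev/neuron*.
    """
    return {
        # Full path to neuron-ls, or None if it is not on PATH.
        "neuron_ls_path": shutil.which("neuron-ls"),
        # Device nodes found; an empty list suggests a missing driver
        # or a non-Neuron instance type.
        "device_nodes": sorted(glob.glob("/dev/neuron*")),
    }


if __name__ == "__main__":
    for key, value in neuron_diagnostics().items():
        print(f"{key}: {value}")
```

An empty `device_nodes` list together with a valid `neuron_ls_path` would point toward the driver or instance type rather than the Python environment.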

Issue: Training is slower than expected

Solution: This quickstart uses a small model for demonstration. For production workloads:

Clean up#

To avoid ongoing charges, terminate your instance when finished:

# From your local machine
aws ec2 terminate-instances --instance-ids YOUR_INSTANCE_ID

Or use the EC2 Console to terminate the instance.

Next steps#

Now that you’ve completed this quickstart, explore more advanced training topics:

Further reading#
