This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3
Quickstart: Train a Model on Trainium#
This quickstart guides you through training your first PyTorch model on AWS Trainium. You’ll launch a Trn1 instance, install the Neuron SDK packages for PyTorch, and run a simple training script. When you complete this quickstart, you’ll understand the basic workflow for training models with Neuron.
This quickstart is for: ML engineers and data scientists new to AWS Trainium
Time to complete: ~15 minutes
Prerequisites#
Before you begin, ensure you have:
An AWS account with EC2 launch permissions
AWS CLI configured with your credentials
SSH key pair for EC2 access
Basic familiarity with PyTorch
Terminal access (Linux, macOS, or WSL on Windows)
Step 1: Launch a Trainium instance#
In this step, you will launch a Trn1 instance using the AWS Deep Learning AMI.
Launch a trn1.2xlarge instance with the latest Deep Learning AMI:
aws ec2 run-instances \
--image-id resolve:ssm:/aws/service/deep-learning-base-neuron/ubuntu-22-04/latest \
--instance-type trn1.2xlarge \
--key-name YOUR_KEY_NAME \
--security-group-ids YOUR_SECURITY_GROUP \
--subnet-id YOUR_SUBNET_ID
Note
Replace YOUR_KEY_NAME, YOUR_SECURITY_GROUP, and YOUR_SUBNET_ID with your values.
Alternatively, launch the instance through the EC2 Console.
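For scripted launches, the same command can be assembled in Python before it is run. The sketch below is illustrative only: `build_run_instances_cmd` is a hypothetical helper, not part of the AWS CLI or the Neuron SDK. It builds the argument list from the flags shown above and never contacts AWS.

```python
import shlex

# SSM parameter alias taken from the command above.
DLAMI_SSM_ALIAS = (
    "resolve:ssm:/aws/service/deep-learning-base-neuron/ubuntu-22-04/latest"
)


def build_run_instances_cmd(key_name, security_group, subnet_id,
                            instance_type="trn1.2xlarge"):
    """Return the `aws ec2 run-instances` call as an argument list.

    Hypothetical helper: it only assembles arguments and never calls AWS.
    """
    return [
        "aws", "ec2", "run-instances",
        "--image-id", DLAMI_SSM_ALIAS,
        "--instance-type", instance_type,
        "--key-name", key_name,
        "--security-group-ids", security_group,
        "--subnet-id", subnet_id,
    ]


# Placeholder values, exactly as in the command above.
cmd = build_run_instances_cmd(
    "YOUR_KEY_NAME", "YOUR_SECURITY_GROUP", "YOUR_SUBNET_ID"
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it. The sketch stops at printing so the command can be reviewed first.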
Once the instance is running, connect via SSH:
ssh -i YOUR_KEY.pem ubuntu@YOUR_INSTANCE_IP
Verify Neuron devices are available:
neuron-ls
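You should see output listing the Neuron devices and their NeuronCores, similar to the following. The exact rows and columns depend on the instance type and the Neuron SDK version: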
+--------+--------+--------+---------+
| NEURON | NEURON | NEURON | PCI     |
| DEVICE | CORES  | MEMORY | BDF     |
+--------+--------+--------+---------+
| 0      | 2      | 32 GB  | 00:1e.0 |
| 1      | 2      | 32 GB  | 00:1f.0 |
+--------+--------+--------+---------+
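If you want to check the device count from a script, the table can be parsed with a few lines of Python. This is a rough sketch that assumes the four-column layout of the sample output above. `parse_neuron_ls` is a hypothetical helper, and real `neuron-ls` output may contain additional columns that this parser would need to account for.

```python
# Sample text copied from the expected output above.
SAMPLE_OUTPUT = """\
+--------+--------+--------+---------+
| NEURON | NEURON | NEURON | PCI     |
| DEVICE | CORES  | MEMORY | BDF     |
+--------+--------+--------+---------+
| 0      | 2      | 32 GB  | 00:1e.0 |
| 1      | 2      | 32 GB  | 00:1f.0 |
+--------+--------+--------+---------+
"""


def parse_neuron_ls(text):
    """Return one dict per device row of the four-column table."""
    devices = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +----+ border lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Data rows start with a numeric device index; header rows do not.
        if len(cells) == 4 and cells[0].isdigit():
            devices.append({
                "device": int(cells[0]),
                "cores": int(cells[1]),
                "memory": cells[2],
                "pci_bdf": cells[3],
            })
    return devices


devices = parse_neuron_ls(SAMPLE_OUTPUT)
total_cores = sum(d["cores"] for d in devices)
print(f"{len(devices)} device(s), {total_cores} NeuronCore(s)")
```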
Step 2: Set up your environment#
In this step, you will create a Python virtual environment and install PyTorch with Neuron support.
Create and activate a virtual environment:
python3 -m venv neuron_env
source neuron_env/bin/activate
Install PyTorch Neuron and dependencies:
pip install torch-neuronx neuronx-cc --extra-index-url=https://pip.repos.neuron.amazonaws.com
Verify the installation:
python -c "import torch; import torch_neuronx; print(f'PyTorch: {torch.__version__}')"
You should see output confirming PyTorch is installed:
PyTorch: 2.9.0+cpu
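A slightly more defensive check reports which modules are missing instead of failing on the first import. The sketch below uses only the standard library. `missing_modules` is a hypothetical helper, and the module list assumes the import names used in this quickstart (`torch` and `torch_neuronx`).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names used in this quickstart. The pip package torch-neuronx
# is imported as torch_neuronx.
required = ["torch", "torch_neuronx"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

`find_spec` only locates a module without importing it, so this check is fast but does not prove that the module loads correctly.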
Step 3: Create a training script#
In this step, you will create a simple PyTorch training script that uses Neuron acceleration.
Create a file named train_simple.py:
import torch
import torch.nn as nn
import torch.optim as optim
import torch_neuronx


# Simple neural network
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        return self.fc2(x)


# Create model and move to Neuron device
model = SimpleNet().to('neuron')
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Generate dummy training data
batch_size = 32
num_batches = 100

print("Starting training...")
model.train()
for batch_idx in range(num_batches):
    # Create dummy batch
    inputs = torch.randn(batch_size, 784).to('neuron')
    targets = torch.randint(0, 10, (batch_size,)).to('neuron')

    # Training step
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()

    if batch_idx % 10 == 0:
        print(f"Batch {batch_idx}/{num_batches}, Loss: {loss.item():.4f}")

print("Training complete!")
This script creates a simple neural network, moves it to the Neuron device, and trains it on synthetic data.
Step 4: Run training#
In the final step, you will run the training script and monitor its progress.
Execute the training script:
python train_simple.py
You should see training progress output:
Starting training...
Batch 0/100, Loss: 2.3156
Batch 10/100, Loss: 2.2845
Batch 20/100, Loss: 2.2534
...
Training complete!
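The first loss values sit near 2.3 because an untrained 10-class classifier predicts roughly uniform probabilities, and the cross-entropy of a uniform prediction is ln(10). The pure-Python sketch below works through that arithmetic. It is not part of the training script, and `cross_entropy` is an illustrative helper rather than the PyTorch implementation.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target])."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


num_classes = 10  # matches the output size of fc2 in train_simple.py
uniform_logits = [0.0] * num_classes  # equal score for every class
baseline = cross_entropy(uniform_logits, target=3)

print(f"Loss for a uniform prediction: {baseline:.4f}")
print(f"ln({num_classes}) = {math.log(num_classes):.4f}")
```

A starting loss far from this baseline usually points to a problem with the model's output size or with the labels.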
Monitor Neuron device utilization in another terminal:
neuron-top
This shows real-time NeuronCore utilization, memory usage, and other metrics.
Confirmation#
Congratulations! You’ve successfully trained your first model on AWS Trainium. You should have:
✅ Launched a Trn1 instance with the Neuron SDK
✅ Installed PyTorch with Neuron support
✅ Created and ran a training script on Neuron devices
✅ Monitored training with Neuron tools
If you encountered any issues, see the Common issues section below.
Common issues#
Issue: ModuleNotFoundError: No module named 'torch_neuronx'
Solution: Ensure you activated the virtual environment and installed packages:
source neuron_env/bin/activate
pip install torch-neuronx neuronx-cc --extra-index-url=https://pip.repos.neuron.amazonaws.com
Issue: RuntimeError: No Neuron devices found
Solution: Verify you’re on a Trainium instance and devices are visible:
neuron-ls
If no devices appear, check the instance type and the Neuron driver installation.
Issue: Training is slower than expected
Solution: This quickstart uses a small model for demonstration. For production workloads:
Use larger batch sizes
Enable XLA compilation with torch.compile()
See Developer Guide for Training with PyTorch NeuronX for optimization techniques
Clean up#
To avoid ongoing charges, terminate your instance when finished:
# From your local machine
aws ec2 terminate-instances --instance-ids YOUR_INSTANCE_ID
Or use the EC2 Console to terminate the instance.
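The instance ID needed for termination is included in the JSON that `aws ec2 run-instances` prints at launch. Below is a minimal sketch of extracting it, assuming you saved that output. The sample response is trimmed, the ID in it is a made-up placeholder, and `instance_ids` is a hypothetical helper.

```python
import json

# Trimmed, made-up example of the JSON printed by `aws ec2 run-instances`.
SAMPLE_RESPONSE = """
{
  "Instances": [
    {"InstanceId": "i-0123456789abcdef0", "InstanceType": "trn1.2xlarge"}
  ]
}
"""


def instance_ids(response_text):
    """Return the InstanceId of every instance in a run-instances response."""
    response = json.loads(response_text)
    return [inst["InstanceId"] for inst in response.get("Instances", [])]


ids = instance_ids(SAMPLE_RESPONSE)
# Print the matching cleanup command instead of running it.
print("aws ec2 terminate-instances --instance-ids " + " ".join(ids))
```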
Next steps#
Now that you’ve completed this quickstart, explore more advanced training topics:
Developer Guide for Training with PyTorch NeuronX - Comprehensive training guide
NxD Training - Distributed training with NeuronX Distributed
Model samples and tutorials - Pre-tested model samples
Neuron Explorer - Profile and optimize training performance
Further reading#
Install PyTorch for Neuron - Detailed PyTorch installation options
Amazon EC2 - EC2 deployment workflows
PyTorch Support on Neuron - Complete PyTorch Neuron documentation