This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3

Quickstart: Run Inference on Inferentia

This quickstart guides you through running your first PyTorch inference workload on AWS Inferentia. You’ll launch an Inf2 instance, compile a model for Neuron, and run predictions. When you complete this quickstart, you’ll understand the basic workflow for deploying models on Inferentia.

This quickstart is for: ML engineers and developers deploying inference workloads

Time to complete: ~10 minutes

Prerequisites

Before you begin, ensure you have:

  • An AWS account with EC2 launch permissions

  • The AWS CLI, configured with your credentials

  • An SSH key pair for EC2 access

  • Basic familiarity with PyTorch

  • Terminal access (Linux, macOS, or WSL on Windows)

Step 1: Launch an Inferentia instance

In this step, you will launch an Inf2 instance using the AWS Deep Learning AMI.

Launch an inf2.xlarge instance with the latest Deep Learning AMI. The resolve:ssm: prefix tells EC2 to look up the AMI ID from the named Systems Manager parameter:

aws ec2 run-instances \
    --image-id resolve:ssm:/aws/service/deep-learning-base-neuron/ubuntu-22-04/latest \
    --instance-type inf2.xlarge \
    --key-name YOUR_KEY_NAME \
    --security-group-ids YOUR_SECURITY_GROUP \
    --subnet-id YOUR_SUBNET_ID

Note

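Replace YOUR_KEY_NAME, YOUR_SECURITY_GROUP, and YOUR_SUBNET_ID with your values. The security group must allow inbound SSH (port 22) from your IP address so that you can connect to the instance.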

Alternatively, launch the instance through the EC2 Console.
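If you script the launch, it can help to catch placeholders that were never replaced before the command runs. The following is a minimal Python sketch under that assumption: it only assembles the argument list for the aws ec2 run-instances command shown above and does not call AWS. The helper name build_run_instances_command is hypothetical and not part of the AWS CLI or the Neuron SDK.

```python
import shlex

# Default values copied from the command in this quickstart.
DEFAULT_IMAGE_ID = (
    "resolve:ssm:/aws/service/deep-learning-base-neuron/ubuntu-22-04/latest"
)
DEFAULT_INSTANCE_TYPE = "inf2.xlarge"


def build_run_instances_command(key_name, security_group, subnet_id,
                                image_id=DEFAULT_IMAGE_ID,
                                instance_type=DEFAULT_INSTANCE_TYPE):
    """Return the run-instances argument list, rejecting unreplaced placeholders."""
    values = {
        "key_name": key_name,
        "security_group": security_group,
        "subnet_id": subnet_id,
    }
    for name, value in values.items():
        # The quickstart uses YOUR_* as placeholder text.
        if not value or value.startswith("YOUR_"):
            raise ValueError(f"{name} still looks like a placeholder: {value!r}")
    return [
        "aws", "ec2", "run-instances",
        "--image-id", image_id,
        "--instance-type", instance_type,
        "--key-name", key_name,
        "--security-group-ids", security_group,
        "--subnet-id", subnet_id,
    ]


if __name__ == "__main__":
    # Example values are made up for illustration.
    command = build_run_instances_command("my-key", "sg-0123", "subnet-0456")
    print(shlex.join(command))
```

The returned list can be passed to subprocess.run once the values are real; printing it first makes it easy to review what will be executed.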

Connect to your instance via SSH:

ssh -i YOUR_KEY.pem ubuntu@YOUR_INSTANCE_IP

Verify Neuron devices are available:

neuron-ls

You should see a table listing each Neuron device and its NeuronCores. On an inf2.xlarge, expect one device with two NeuronCores:

+--------+--------+--------+---------+
| NEURON | NEURON | NEURON |   PCI   |
| DEVICE | CORES  | MEMORY |   BDF   |
+--------+--------+--------+---------+
| 0      | 2      | 32 GB  | 00:1e.0 |
+--------+--------+--------+---------+
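To check the device count from a script instead of by eye, you could parse the table. The sketch below assumes the text layout matches the sample above (four columns: device, cores, memory, PCI BDF); real neuron-ls output may include additional columns, so treat the column positions as an assumption. The function name parse_neuron_ls is hypothetical.

```python
SAMPLE_OUTPUT = """\
+--------+--------+--------+---------+
| NEURON | NEURON | NEURON |   PCI   |
| DEVICE | CORES  | MEMORY |   BDF   |
+--------+--------+--------+---------+
| 0      | 2      | 32 GB  | 00:1e.0 |
+--------+--------+--------+---------+
"""


def parse_neuron_ls(table_text):
    """Extract device rows from a neuron-ls style text table.

    Assumes the four-column layout shown in this quickstart.
    """
    devices = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # Skip borders and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) < 4 or not cells[0].isdigit():
            continue  # Skip the two header rows
        devices.append({
            "device": int(cells[0]),
            "cores": int(cells[1]),
            "memory": cells[2],
            "pci_bdf": cells[3],
        })
    return devices


if __name__ == "__main__":
    found = parse_neuron_ls(SAMPLE_OUTPUT)
    total_cores = sum(d["cores"] for d in found)
    print(f"{len(found)} device(s), {total_cores} NeuronCore(s)")
```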

Step 2: Set up your environment

In this step, you will create a Python virtual environment and install PyTorch with Neuron support.

Create and activate a virtual environment:

python3 -m venv neuron_env
source neuron_env/bin/activate

Install PyTorch Neuron and dependencies:

pip install torch-neuronx neuronx-cc --extra-index-url=https://pip.repos.neuron.amazonaws.com

Verify the installation:

python -c "import torch; import torch_neuronx; print(f'PyTorch: {torch.__version__}')"

You should see the installed PyTorch version. The exact version depends on the release you installed, for example:

PyTorch: 2.9.0+cpu
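The one-liner above fails with a traceback if either package is missing. As a gentler alternative, the following sketch reports which modules cannot be found without importing them. It uses only the Python standard library; the helper name missing_modules is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Module names taken from the import statements used in this quickstart.
    missing = missing_modules(["torch", "torch_neuronx"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Run it with the virtual environment active. A module listed as missing usually means the pip install step did not run in this environment.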

Step 3: Compile a model for Neuron

In this step, you will create a simple model and compile it for Neuron inference.

Create a file named compile_model.py:

import torch
import torch.nn as nn
import torch_neuronx

# Simple neural network
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        return self.fc2(x)

# Create model and set to eval mode
model = SimpleNet()
model.eval()

# Create example input
example_input = torch.randn(1, 784)

# Trace and compile for Neuron
print("Compiling model for Neuron...")
neuron_model = torch_neuronx.trace(model, example_input)

# Save compiled model
neuron_model.save('simple_net_neuron.pt')
print("Model compiled and saved to simple_net_neuron.pt")

Run the compilation script:

python compile_model.py

You should see compilation progress followed by a success message:

Compiling model for Neuron...
INFO:Neuron:Compiling function _NeuronGraph$1 with neuronx-cc
INFO:Neuron:Compilation successful
Model compiled and saved to simple_net_neuron.pt

Note

Model compilation happens once. The compiled model (simple_net_neuron.pt) can be reused for inference without recompiling.
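One way to take advantage of this is to compile only when the saved file is missing. The sketch below shows that pattern with the compile and load steps passed in as callables, so it runs without Neuron hardware. The helper load_or_compile is hypothetical; in this quickstart, compile_fn would wrap the torch_neuronx.trace and save calls from compile_model.py, and load_fn would wrap torch.jit.load.

```python
import tempfile
from pathlib import Path


def load_or_compile(artifact_path, compile_fn, load_fn):
    """Compile only if the artifact is missing, then load it.

    compile_fn(path) must write the compiled model to path.
    load_fn(path) must return the loaded model.
    """
    path = Path(artifact_path)
    if not path.exists():
        compile_fn(path)
    return load_fn(path)


if __name__ == "__main__":
    # Stand-ins for the real compile and load steps, for illustration only.
    def fake_compile(path):
        print("Compiling...")
        path.write_text("compiled-model")

    def fake_load(path):
        return path.read_text()

    with tempfile.TemporaryDirectory() as tmp:
        artifact = Path(tmp) / "simple_net_neuron.pt"
        load_or_compile(artifact, fake_compile, fake_load)  # Compiles, then loads
        load_or_compile(artifact, fake_compile, fake_load)  # Loads only
```

This check is based on the file name alone. If you change the model code or the example input, delete the saved file so that it is compiled again.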

Step 4: Run inference

In the final step, you will load the compiled model and run predictions.

Create a file named run_inference.py:

import time

import torch
import torch_neuronx  # Must be imported so the compiled Neuron model can be loaded

NUM_RUNS = 100

# Load compiled model
print("Loading compiled model...")
neuron_model = torch.jit.load('simple_net_neuron.pt')

# Create sample input with the same shape used at compile time
sample_input = torch.randn(1, 784)

# Run inference
print("Running inference...")
with torch.no_grad():
    output = neuron_model(sample_input)

# Get prediction
predicted_class = output.argmax(dim=1).item()
print(f"Predicted class: {predicted_class}")
print(f"Output logits: {output[0][:5].tolist()}")  # Show first 5 logits

# Run multiple inferences to measure throughput
print(f"\nRunning {NUM_RUNS} inferences...")
start = time.perf_counter()  # Monotonic timer intended for measuring intervals

with torch.no_grad():
    for _ in range(NUM_RUNS):
        output = neuron_model(sample_input)

elapsed = time.perf_counter() - start
throughput = NUM_RUNS / elapsed
print(f"Throughput: {throughput:.2f} inferences/second")
print(f"Latency: {elapsed / NUM_RUNS * 1000:.2f} ms per inference")

Run the inference script:

python run_inference.py

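You should see output similar to the following. The exact values will differ because the model weights and the input are random, and timings vary between runs: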

Loading compiled model...
Running inference...
Predicted class: 7
Output logits: [0.123, -0.456, 0.789, -0.234, 0.567]

Running 100 inferences...
Throughput: 245.67 inferences/second
Latency: 4.07 ms per inference
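The script reports average throughput and latency. If you also want tail latency, a small helper like the following could be used. It is a sketch, not part of the Neuron SDK: the function name benchmark, the warm-up count, and the nearest-rank percentile method are all choices made for illustration. It accepts any zero-argument callable, so it can be tried without Neuron hardware.

```python
import math
import statistics
import time


def benchmark(fn, iterations=100, warmup=10):
    """Time repeated calls to fn and summarize per-call latency in milliseconds."""
    for _ in range(warmup):
        fn()  # Untimed calls so one-time setup costs do not skew the results

    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    ordered = sorted(latencies_ms)
    # Nearest-rank method: the smallest value with at least 99% of samples at or below it
    p99_index = max(0, math.ceil(0.99 * iterations) - 1)
    total_seconds = sum(latencies_ms) / 1000.0
    return {
        "throughput_per_s": (
            iterations / total_seconds if total_seconds > 0 else float("inf")
        ),
        "mean_ms": statistics.mean(latencies_ms),
        "p50_ms": statistics.median(ordered),
        "p99_ms": ordered[p99_index],
    }


if __name__ == "__main__":
    # A CPU-only stand-in workload; replace with your own callable.
    stats = benchmark(lambda: sum(range(10_000)))
    for name, value in stats.items():
        print(f"{name}: {value:.3f}")
```

In run_inference.py, the callable would be something like lambda: neuron_model(sample_input), using the names defined in that script.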

Monitor Neuron device utilization in another terminal:

neuron-top

This shows real-time NeuronCore utilization and inference metrics.

Confirmation

Congratulations! You’ve successfully run inference on AWS Inferentia. You should have:

  • ✅ Launched an Inf2 instance with Neuron SDK

  • ✅ Installed PyTorch with Neuron support

  • ✅ Compiled a model for Neuron inference

  • ✅ Ran predictions and measured throughput

  • ✅ Monitored inference with Neuron tools

If you encountered any issues, see the Common issues section below.

Common issues

Issue: ModuleNotFoundError: No module named 'torch_neuronx'

Solution: Ensure you activated the virtual environment and installed packages:

source neuron_env/bin/activate
pip install torch-neuronx neuronx-cc --extra-index-url=https://pip.repos.neuron.amazonaws.com
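If the error persists, it is worth confirming that the Python interpreter you are running is the one inside the virtual environment. The sketch below relies on standard Python behavior: inside a venv, sys.prefix differs from sys.base_prefix. The function name in_virtualenv is hypothetical.

```python
import sys


def in_virtualenv():
    """Return True when the current interpreter runs inside a venv."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print(f"Interpreter: {sys.executable}")
    print(f"Virtual environment active: {in_virtualenv()}")
```

If the interpreter path does not point into neuron_env, activate the environment again in the current shell before running your scripts.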

Issue: RuntimeError: No Neuron devices found

Solution: Verify you’re on an Inferentia instance and devices are visible:

neuron-ls

If no devices appear, check that the instance type is an Inferentia type (for example, inf2.xlarge) and that the Neuron driver is installed.
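As an additional check from Python, you could look for the device nodes that the driver creates. The sketch below assumes the Neuron driver exposes devices as /dev/neuron0, /dev/neuron1, and so on; treat that naming as an assumption and prefer neuron-ls as the authoritative check. The function name list_neuron_device_nodes is hypothetical.

```python
import re
from pathlib import Path

# Assumed naming scheme for Neuron device nodes: neuron followed by an index.
DEVICE_NAME = re.compile(r"neuron(\d+)")


def list_neuron_device_nodes(dev_dir="/dev"):
    """Return device node names such as 'neuron0', sorted by device index."""
    root = Path(dev_dir)
    if not root.is_dir():
        return []
    matches = []
    for entry in root.iterdir():
        match = DEVICE_NAME.fullmatch(entry.name)
        if match:
            matches.append((int(match.group(1)), entry.name))
    return [name for _, name in sorted(matches)]


if __name__ == "__main__":
    nodes = list_neuron_device_nodes()
    if nodes:
        print("Found device nodes:", ", ".join(nodes))
    else:
        print("No Neuron device nodes found under /dev.")
```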

Issue: Compilation takes a long time

Solution: Compilation is a one-time cost. For this simple model, it should take 1-2 minutes. Larger models take longer, but you can save the compiled model and reuse it without recompiling.

Issue: Lower throughput than expected

Solution: This quickstart uses a small model and batch size for demonstration. For production workloads:

  • Use larger batch sizes (e.g., 4, 8, 16)

  • Enable dynamic batching

  • Use multiple NeuronCores in parallel

  • See the Developer Guide (torch-neuronx) for optimization techniques
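To illustrate the first suggestion, the sketch below groups samples into batches of one fixed size and pads the final batch. It assumes that a model traced with a fixed example input expects that same batch size at inference time, which is why the last partial batch is padded instead of sent as is. It uses plain Python lists so it runs anywhere; the function name make_fixed_batches is hypothetical.

```python
def make_fixed_batches(samples, batch_size, pad_value):
    """Split samples into batches of exactly batch_size items.

    Returns a list of (batch, real_count) pairs. The last batch is padded
    with pad_value, and real_count says how many leading items are real.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches = []
    for start in range(0, len(samples), batch_size):
        chunk = list(samples[start:start + batch_size])
        real_count = len(chunk)
        chunk.extend([pad_value] * (batch_size - real_count))
        batches.append((chunk, real_count))
    return batches


if __name__ == "__main__":
    # Ten dummy samples grouped into batches of four.
    for batch, real_count in make_fixed_batches(list(range(10)), 4, pad_value=None):
        print(batch, "real items:", real_count)
```

After running a padded batch through the model, keep only the first real_count outputs and discard the rest.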

Clean up

To avoid ongoing charges, terminate your instance when finished:

# From your local machine
aws ec2 terminate-instances --instance-ids YOUR_INSTANCE_ID

Or use the EC2 Console to terminate the instance.

Next steps

Now that you’ve completed this quickstart, explore more advanced inference topics:

Further reading
