This document is relevant for: Inf2, Trn1, Trn2, Trn3
NeuronX Runtime
The NeuronX Runtime is a high-performance execution engine that enables deep learning models to run on AWS Inferentia and Trainium accelerators. It consists of a kernel driver and C/C++ libraries that provide low-level APIs for accessing Neuron devices, managing model execution, and coordinating collective communications across NeuronCores.
The Neuron Runtime serves as the foundation for all ML framework integrations (TensorFlow, PyTorch, JAX, and Apache MXNet), loading compiled models in Neuron Executable File Format (NEFF) and orchestrating their execution on Neuron hardware. It is optimized for high-throughput and low-latency inference and training workloads, with features including:
Efficient model execution: Loads and executes NEFF files on NeuronCores with optimized memory management
Multi-model support: Manages multiple models across multiple NeuronCores with flexible allocation strategies
Collective communications: Provides high-performance collective operations for distributed training and inference
Device management: Handles NeuronCore allocation, device discovery, and resource management
Debugging support: Offers core dump generation, debug streams, and detailed logging for troubleshooting
Configuration flexibility: Exposes an extensive set of environment variables for fine-tuning runtime behavior
The Neuron Runtime is typically used transparently through ML framework plugins, but also provides direct C/C++ APIs for developers building custom frameworks or requiring low-level device control.
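As an illustration of the environment-variable configuration listed among the features above, the following is a minimal sketch of a C program applying default settings before the runtime library is initialized. It assumes the variables `NEURON_RT_VISIBLE_CORES` and `NEURON_RT_LOG_LEVEL` and the example values `0-1` and `INFO`; check the runtime configuration guide for the names and values your release supports. The helper names `set_default_env`, `configure_runtime_env`, and `runtime_env_is` are hypothetical and not part of libnrt.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes setenv() under strict C modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Set NAME to VALUE only when the environment has not already set it,
 * so values exported in the shell keep precedence. Returns 0 on success. */
static int set_default_env(const char *name, const char *value) {
    assert(name != NULL && value != NULL);
    return setenv(name, value, 0 /* do not overwrite */);
}

/* Apply illustrative defaults. Variable names and values are assumptions
 * for this sketch; call this before the runtime is initialized. */
int configure_runtime_env(void) {
    if (set_default_env("NEURON_RT_VISIBLE_CORES", "0-1") != 0) return -1;
    if (set_default_env("NEURON_RT_LOG_LEVEL", "INFO") != 0) return -1;
    return 0;
}

/* Return 1 when NAME is set to exactly EXPECTED, otherwise 0. */
int runtime_env_is(const char *name, const char *expected) {
    const char *actual = getenv(name);
    return actual != NULL && strcmp(actual, expected) == 0;
}
```

Because `setenv` is called with its overwrite flag set to 0, a value exported by the user in the shell is left untouched, which keeps command-line overrides working when an application ships its own defaults.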
Get Started
About the NeuronX Runtime
Learn about the AWS Neuron Runtime, its features, and capabilities for accessing Inferentia and Trainium Neuron devices.
Quickstart: Generate a Core Dump
Learn how to generate a Neuron Runtime core dump for debugging runtime failures and analyzing device state.
How-to guides
Runtime developer guide
Build a C/C++ application against libnrt directly. Covers runtime architecture, driver and library installation, NEFF loading, tensor staging, execution, and the collective communication library.
Migrate to the explicit async APIs
Move a C/C++ application off the legacy NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS implicit async mode and onto the nrta_* explicit async APIs for fine-grained scheduling and completion tracking.
Runtime configuration guide
Configure the Neuron Runtime through environment variables. Covers NeuronCore visibility and allocation, execution timeouts, logging verbosity, and other runtime knobs.
All how-to guides
Browse the full set of Neuron Runtime how-to guides for developers working directly with libnrt.
Reference
Runtime API Reference Documentation
Documentation of the APIs in the public headers for the Neuron Runtime.
Troubleshooting on Inf1 and Trn1
Solutions for common issues encountered when using the Neuron Runtime on Inferentia and Trainium instances.
Frequently Asked Questions
Answers to common questions about the Neuron Runtime, including compatibility, configuration, and usage.
Learn More
Explore the Neuron Runtime
Deep dives into the Neuron Runtime, including NEFF files, compute-communication overlap, device memory, and core dumps.
Collectives
About Collectives
Learn about Neuron Runtime collectives.
Deep Dive: Inter-node Collective Communication
Explore and understand techniques for communication across nodes in the Neuron Runtime.
Deep Dive: Intra-node Collective Communication
Explore and understand techniques for communication within nodes in the Neuron Runtime.
Release Notes
Runtime Release Notes
Latest updates, improvements, and bug fixes for the Neuron Runtime library, driver, and collectives.