What is AWS Neuron?#
AWS Neuron is the software stack for running deep learning and generative AI workloads on AWS Trainium and AWS Inferentia. Built on an open source foundation, Neuron enables developers to build, deploy, and explore natively with the PyTorch and JAX frameworks, and with ML libraries such as Hugging Face, vLLM, PyTorch Lightning, and others, without modifying their code. It includes a compiler, runtime, training and inference libraries, and developer tools for monitoring, profiling, and debugging. Neuron supports the end-to-end machine learning (ML) development lifecycle: building and deploying deep learning and AI models, optimizing them for the highest performance and lowest cost, and gaining deeper insights into model behavior.
Neuron enables rapid experimentation, production-scale training of frontier models, low-level performance optimization with custom kernels through the Neuron Kernel Interface (NKI), cost-optimized inference deployment for agentic AI and reinforcement learning workloads, and comprehensive profiling and debugging with Neuron Explorer.
For more details, see the documentation under About the AWS Neuron SDK.
Who is AWS Neuron for?#
ML engineers can use Neuron’s vLLM integration to migrate their models to Trainium for improved performance, without code modifications.
Performance engineers can use NKI and our Developer Tools to create new ML kernels and optimize existing ones.
ML researchers can use their existing PyTorch experience and ecosystem tools to experiment freely on Trainium using our native PyTorch implementation, without having to learn new frameworks or APIs.
What is AWS Neuron used for?#
Research and Development: Neuron provides native PyTorch execution on Trainium with full Eager mode compatibility. The stack supports standard distributed training patterns including FSDP, DDP, and DTensor for model sharding across devices and nodes. torch.compile integration enables graph optimization, while existing frameworks like TorchTitan and HuggingFace Transformers run without code modifications. JAX support includes XLA compilation targeting Inferentia and Trainium hardware.
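To make the distributed-training claim concrete, here is a minimal FSDP sketch. It assumes the native `'neuron'` device type described under Native PyTorch below; the process-group backend, launcher setup, and model are illustrative placeholders rather than a verified Trainium configuration.

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Standard PyTorch distributed setup. The appropriate backend for a
# Trainium cluster is environment-specific (assumption: defaults suffice
# under a torchrun-style launcher).
dist.init_process_group()

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# 'neuron' as a device string follows the native-device claim in this
# document; it is not shown here as a verified API.
model = model.to("neuron")

# Wrap with FSDP to shard parameters across devices and nodes.
model = FSDP(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
inputs = torch.randn(8, 1024, device="neuron")
loss = model(inputs).square().mean()  # toy loss for illustration
loss.backward()
optimizer.step()
```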
Production Inference: Neuron implements vLLM V1 API compatibility on Trainium and Inferentia with optimizations for large-scale inference workloads. The runtime supports Expert Parallelism for MoE models, disaggregated inference architectures, and speculative decoding. Optimized kernels from the NKI Library provide hardware-specific implementations. Training workflows integrate with HuggingFace Optimum Neuron, PyTorch Lightning, and TorchTitan, with seamless deployment through standard vLLM interfaces.
Performance Engineering: Neuron Kernel Interface (NKI) provides direct access to Trainium instruction set architecture with APIs for memory management, execution scheduling, and low-level kernel development. The NKI Compiler, built on MLIR, offers full visibility into the compilation pipeline from high-level operations to hardware instructions. The NKI Library contains optimized kernel implementations with source code and performance benchmarks. Neuron Explorer enables comprehensive profiling from application code to hardware execution, supporting both single-node and distributed workload analysis with detailed performance metrics and optimization recommendations.
AWS Neuron Core Components#
- vLLM
Neuron enables production inference deployment with standard frameworks and APIs on Trainium and Inferentia. Use Neuron’s vLLM integration with standard APIs to deliver high-performance model serving with optimized kernels from the NKI Library.
It provides:
- Standard vLLM APIs: Full compatibility with vLLM V1 APIs, enabling customers to use familiar vLLM interfaces on Neuron hardware without code changes (see the sketch after this list)
- Advanced Inference Features: Support for Expert Parallelism for MoE models, disaggregated inference for flexible deployment architectures, and speculative decoding for improved latency
- Optimized Performance: Pre-optimized kernels from the NKI Library for peak performance across dense, MoE, and multimodal models
- Open Source: Source code released under the vLLM project organization on GitHub, enabling community contributions
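As a hedged illustration of “familiar vLLM interfaces,” the sketch below uses the standard vLLM Python API. The model name is a placeholder, and any Neuron-specific installation or configuration flags are omitted; consult the Neuron vLLM documentation for supported options.

```python
from vllm import LLM, SamplingParams

# Standard vLLM entry point; per the list above, Neuron's integration
# targets these same interfaces. The model name is a placeholder.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
outputs = llm.generate(["What is AWS Neuron?"], params)

for output in outputs:
    print(output.outputs[0].text)
```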
- Native PyTorch
Neuron provides native integration with PyTorch, enabling researchers and ML developers to run existing code unchanged on Trainium. Train models with familiar workflows and tools, from pre-training to post-training with reinforcement learning, while leveraging Trainium’s performance and cost advantages for both experimentation and production scale training.
It provides:
- Native Device Support: Neuron registers as a native device type in PyTorch, with standard device APIs such as `torch.tensor([1, 2, 3], device='neuron')` and `.to('neuron')` (see the sketch after this list)
- Standard Distributed Training APIs: Support for FSDP, DTensor, DDP, tensor parallelism, context parallelism, and distributed checkpointing
- Eager Mode Execution: Immediate operation execution for interactive development and debugging in notebook environments
- torch.compile Integration: Support for `torch.compile` for optimized performance
- Open Source: Released as an open source package on GitHub under Apache 2.0, enabling community contributions
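The sketch below combines the eager-mode and torch.compile items from this list. It assumes the Neuron PyTorch package is installed and has registered the `'neuron'` device; the device string is taken from the list above, not independently verified.

```python
import torch

# Eager mode: each operation executes immediately on the device, which is
# what makes interactive, notebook-style debugging possible.
x = torch.tensor([1, 2, 3], device="neuron")
y = (x * 2).to("cpu")
print(y)  # expected: tensor([2, 4, 6])

# torch.compile: standard PyTorch graph compilation, which the list above
# says Neuron supports for optimized performance.
@torch.compile
def fused_op(a, b):
    return torch.sin(a) + torch.cos(b)

a = torch.randn(16, device="neuron")
b = torch.randn(16, device="neuron")
out = fused_op(a, b)
```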
- Neuron Kernel Interface (NKI)
For performance engineers seeking maximum hardware efficiency, Neuron provides complete control through the Neuron Kernel Interface (NKI), with direct access to the NeuronISA (NISA) instruction set, memory allocation, and execution scheduling. Developers can create new operations not available in standard frameworks and optimize performance critical code with custom kernels.
It includes:
- The NKI Compiler, built on MLIR, which provides greater transparency into the kernel compilation process
- The NKI Library, which provides pre-built kernels you can use to optimize the performance of your models (see the kernel sketch after this list)
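For a flavor of what an NKI kernel looks like, here is a minimal element-wise addition sketch modeled on NKI’s getting-started pattern. The module paths (`neuronxcc.nki`), the `nki.jit` decorator, and tile-size limits (the partition dimension is capped, typically at 128) should be checked against the current NKI documentation.

```python
import numpy as np
from neuronxcc import nki
import neuronxcc.nki.language as nl

@nki.jit
def nki_tensor_add(a_input, b_input):
    # Allocate the output tensor in device HBM.
    c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)

    # Load inputs from HBM into on-chip memory, compute, and store back.
    a_tile = nl.load(a_input)
    b_tile = nl.load(b_input)
    nl.store(c_output, value=a_tile + b_tile)
    return c_output

# Shapes are kept within the assumed 128-partition limit.
a = np.ones((128, 512), dtype=np.float32)
b = np.ones((128, 512), dtype=np.float32)
c = nki_tensor_add(a, b)
```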
- Neuron Tools
Debugging and profiling utilities, including:
- Neuron Monitor for real-time performance monitoring
- Neuron Explorer, built on the Neuron Profiler (`neuron-profile`), for detailed performance analysis
Neuron Explorer provides:
- Hierarchical Profiling: Top-down visualization from framework layers through HLO operators to hardware instructions, enabling developers to understand execution at any level of the stack
- Code Linking: Direct navigation between PyTorch, JAX, and NKI source code and the performance timeline, with automatic annotations showing metrics for specific code lines
- IDE Integration: A VS Code extension for profile visualization and analysis directly within the development environment
- Device Profiling: A unified interface giving a comprehensive view of system-wide metrics and device-specific execution details
- Neuron Compiler
Optimizes machine learning models for AWS Inferentia and Trainium chips, converting models from popular frameworks into efficient executable formats.
- Neuron Runtime
Manages model execution on Neuron devices, handling memory allocation, scheduling, and inter-chip communication for maximum throughput.
- AWS DLAMIs and DLCs
Orchestrate and deploy your models using AWS Deep Learning AMIs (DLAMIs) and Deep Learning Containers (DLCs).
Neuron DLAMIs come pre-configured with the Neuron SDK, popular frameworks, and helpful libraries, allowing you to quickly begin training and running inference on AWS Trainium and Inferentia. Or, quickly deploy models using pre-configured AWS Neuron Deep Learning Containers (Neuron DLCs) with optimized frameworks for AWS Trainium and Inferentia.
Supported Hardware#
- AWS Inferentia
Purpose-built for high-performance inference workloads:
- Inf1 instances - First-generation Inferentia chips
- Inf2 instances - Second-generation with improved performance and efficiency
- AWS Trainium
Designed for distributed training of large models:
- Trn1 instances - High-performance training acceleration
- Trn1n instances - Enhanced networking for large-scale distributed training
- Trn2 instances - Next-generation Trainium with superior performance
- Trn2 UltraServer - High-density Trainium servers for massive training workloads
- Trn3 UltraServer - The next generation of Trainium servers for massive training workloads
How do I get more information?#
- Review the comprehensive documentation and follow the tutorials on this site
- Check the Neuron GitHub repositories for code examples
- Visit the AWS Neuron support forum for community assistance