This document is relevant for: Inf1, Inf2, Trn1, Trn2
What’s New#
Neuron 2.24.0 (06/24/2025)#
Neuron version 2.24 introduces new inference capabilities including prefix caching, disaggregated inference (Beta), and context parallelization support (Beta). This release also includes NKI language enhancements and enhanced profiling visualizations for improved debugging and performance analysis. Neuron 2.24 adds support for PyTorch 2.7 and JAX 0.6, updates existing DLAMIs and DLCs, and introduces a new vLLM inference container.
Inference#
NxD Inference (NxDI) includes the following enhancements:
Prefix caching: Improves Time To First Token (TTFT) by up to 3x when processing common shared prompts across requests.
Disaggregated inference (Beta): Uses 1P1D (1 Prefill, 1 Decode) architecture to reduce prefill-decode interference and improve goodput.
Context parallelism (Beta): Improves TTFT for longer sequence lengths by processing context encoding in parallel across multiple NeuronCores.
Model support: Added beta support for Qwen 2.5 text models.
NxD Inference Library: Upgraded to support PyTorch 2.7 and Transformers 4.48.
Hugging Face Optimum Neuron 0.2.0 now supports PyTorch-based NxD Core backend for LLM inference, simplifying the implementation of new PyTorch model architectures. Models including Llama 3.1-8B and Llama-3.3-70B have migrated from Transformers NeuronX to the NxD backend.
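To illustrate why prefix caching improves TTFT, the following is a minimal Python sketch of the general idea: encoded states for a shared prompt prefix are cached and reused across requests, so only the unseen suffix pays the context-encoding cost. This is an illustrative model only, not the NxD Inference implementation; all names here (`PrefixCache`, `expensive_encode`) are hypothetical.

```python
# Illustrative sketch of prefix caching (NOT the NxD Inference implementation).
# Encoded states for shared prompt prefixes are cached; a new request only
# encodes the tokens past its longest cached prefix.

def expensive_encode(tokens):
    """Stand-in for context encoding; returns one 'state' per token."""
    return [f"state({t})" for t in tokens]

class PrefixCache:
    def __init__(self):
        self._cache = {}  # tuple of tokens (a prefix) -> encoded states

    def encode(self, tokens):
        # Find the longest cached prefix of this request.
        best = 0
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self._cache:
                best = n
                break
        cached = self._cache.get(tuple(tokens[:best]), [])
        # Only the remaining suffix pays the encoding cost.
        states = cached + expensive_encode(tokens[best:])
        # Cache every prefix so future requests can hit partial matches.
        for n in range(1, len(tokens) + 1):
            self._cache[tuple(tokens[:n])] = states[:n]
        return states, best  # best = number of tokens served from cache

cache = PrefixCache()
shared = ["You", "are", "a", "helpful", "assistant", "."]
_, hit1 = cache.encode(shared + ["What", "is", "AWS", "?"])    # cold: 0 cached
_, hit2 = cache.encode(shared + ["Summarize", "this", "."])    # warm: 6 cached
```

Requests sharing the six-token system prompt reuse its cached states, which is the effect behind the reported TTFT improvement for common shared prompts.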
Training#
Library Upgrades
NxD Training (NxDT) Library: Upgraded to support PyTorch 2.7 and Transformers 4.48.
JAX Training Support: Upgraded to JAX 0.6.0.
Neuron Kernel Interface (NKI)#
Language Enhancements
Dynamic while loop support: Enables on-chip execution of complex models like MoE.
New nki.language.gather_flattened: Provides efficient parallel tensor element gathering.
Enhanced accuracy: Improved nki.language.sqrt (equivalent to numpy.sqrt).
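The gather operation above can be sketched in plain Python. This is an illustrative model of a flattened gather, assuming semantics similar to numpy.take on a raveled (row-major) array; it is not the actual nki.language.gather_flattened API or signature.

```python
# Pure-Python sketch of a flattened gather, assuming numpy.take-like semantics
# on a raveled array: each output element is flat_data[index].
# Illustrative only -- not the nki.language.gather_flattened API.

def gather_flattened(data, indices):
    """Gather elements from a 2-D 'tensor' by flat (row-major) index."""
    rows, cols = len(data), len(data[0])
    flat = [x for row in data for x in row]  # row-major flatten
    out = []
    for i in indices:
        if not 0 <= i < rows * cols:
            raise IndexError(f"flat index {i} out of range")
        out.append(flat[i])
    return out

data = [[10, 11, 12],
        [13, 14, 15]]
print(gather_flattened(data, [0, 4, 5]))  # flat positions 0, 4, 5
```

On hardware the point of such an instruction is that all indexed elements are fetched in parallel rather than in a Python-style loop.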
Instruction Set Architecture (ISA) Improvements
Advanced indexing: Improved performance for nki.isa.nc_match_replace8.
Neuron Tools#
Neuron Profiler Enhancements
Framework stack traces: Maps device instructions to model source code.
Scratchpad memory usage visualization: Shows tensor-level memory usage over time with HLO name association.
On-device collectives barriers: Identifies synchronization overhead.
HBM throughput visualization: Tracks data movement involving High Bandwidth Memory (HBM) over time.
NCCOM-TEST Improvements
Added --report-to-json-file flag: Outputs results in JSON format.
Added --show-input-output-size flag: Explicitly displays input and output sizes based on operations.
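A JSON report is straightforward to post-process with standard tooling. The sketch below parses a report such as --report-to-json-file might produce and derives bandwidth per data point; note that the field names (`collective`, `results`, `size_bytes`, `latency_us`) are hypothetical, chosen only for illustration, so consult the actual tool output for the real schema.

```python
# Sketch of consuming a JSON report from nccom-test's --report-to-json-file.
# The schema below is HYPOTHETICAL (field names invented for illustration).
import json

report_text = """
{
  "collective": "all_reduce",
  "results": [
    {"size_bytes": 1048576, "latency_us": 120.5},
    {"size_bytes": 4194304, "latency_us": 410.0}
  ]
}
"""

report = json.loads(report_text)
# Derive rough bandwidth per data point: bytes/us divided by 1e3 gives GB/s.
for entry in report["results"]:
    gbps = entry["size_bytes"] / entry["latency_us"] / 1e3
    print(f'{entry["size_bytes"]:>9} B: {gbps:.2f} GB/s')
```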
Neuron Deep Learning Containers (DLCs)#
Updated containers with PyTorch 2.7 support for inference and training.
Added new inference container with NxD Inference and vLLM with FastAPI.
JAX DLCs now support JAX 0.6.0 training.
Neuron Deep Learning AMIs (DLAMIs)#
Updated MultiFramework DLAMIs to include PyTorch 2.7 and JAX 0.6.0.
Added new Single Framework DLAMIs for PyTorch 2.7 and JAX 0.6.0.
Neuron 2.24 Feature Release Notes#
What’s New | Details | Instances
---|---|---
NxD Core (neuronx-distributed) | |
NxD Inference (neuronx-distributed-inference) | |
NxD Training (neuronx-distributed-training) | |
PyTorch NeuronX (torch-neuronx) | |
Neuron Compiler (neuronx-cc) | |
Neuron Kernel Interface (NKI) | |
Neuron Tools | |
Neuron Runtime | |
Transformers NeuronX (transformers-neuronx) for Inference | |
Neuron Deep Learning AMIs (DLAMIs) | |
Neuron Deep Learning Containers (DLCs) | |
Release Announcements | |
For detailed release artifacts, see Release Artifacts.
Previous Releases#