This document is relevant for: Inf1, Inf2, Trn1, Trn2
AWS Neuron SDK 2.26.0: NxD Inference release notes#
Date of release: September 18, 2025
Version: 0.6.10598
Improvements#
Llama 4 model support (beta)#
Added beta support for Llama 4, a family of multimodal, open-weight MoE LLMs from Meta that accept text and image inputs. Llama 4 is tested on Trn2. Compatible models include:
In this beta release, Llama 4 model support has the following limitations:
Model accuracy is verified up to a sequence length of 8192.
Model performance on Trn2 isn’t fully optimized.
To use Llama 4 with vLLM, you must compile the model outside of vLLM and specify the compiled model path using the NEURON_COMPILED_ARTIFACTS environment variable (see the sketch after this list).
These limitations will be addressed in a future release.
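For illustration, here is a minimal sketch of serving a precompiled Llama 4 checkpoint through vLLM. The artifacts path, model ID, and engine arguments are assumptions to adapt for your deployment.

```python
import os

# Path to artifacts compiled outside of vLLM (hypothetical location).
# Set this before the vLLM engine is created.
os.environ["NEURON_COMPILED_ARTIFACTS"] = "/home/ubuntu/llama4_compiled"

from vllm import LLM, SamplingParams

# Engine arguments are illustrative; tune them for your Trn2 instance.
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    max_model_len=8192,  # accuracy is verified up to 8192 in this release
    tensor_parallel_size=64,
)

outputs = llm.generate(["Describe the Llama 4 architecture."],
                       SamplingParams(temperature=0))
print(outputs[0].outputs[0].text)
```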
For more information, see the Llama 4 tutorial (/libraries/nxd-inference/tutorials/llama4-tutorial.ipynb) and NxD Inference - Production Ready Models.
FLUX.1 model support (beta)#
Added beta support for FLUX.1-dev, an open-weight image generation model by Black Forest Labs. FLUX.1-dev is tested on Trn2. Compatible models include:
In this beta release, the model’s performance isn’t optimized.
For more information, see Generating Images with Black Forest Labs Flux.1-Dev on TRN1/TRN2 and NxD Inference - Production Ready Models.
Expert parallelism support (beta)#
Added support for expert parallelism, which distributes expert processing across multiple NeuronCores. Expert parallelism improves performance for mixture-of-experts (MoE) models, particularly for models with a large number of experts, such as Llama 4 Maverick. For more information, see MoENeuronConfig.
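As a rough illustration, the sketch below requests expert parallelism alongside tensor parallelism. The ep_degree name and the import path are assumptions, not confirmed API; consult the MoENeuronConfig reference for the actual parameter names.

```python
# Sketch only: parameter names below are assumed, not confirmed.
from neuronx_distributed_inference.models.config import MoENeuronConfig

neuron_config = MoENeuronConfig(
    tp_degree=64,   # tensor parallelism across NeuronCores
    ep_degree=16,   # hypothetical knob: distribute experts across 16 groups
    batch_size=1,
    seq_len=8192,
)
```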
Context parallelism improvements#
With this release, context parallelism is out of beta and includes several improvements.
Added support for sliding window attention (SWA) with context parallelism.
Added a strided context parallel flash attention kernel that includes compute elimination. This kernel is more performant than the existing context parallel flash attention kernel, especially at long sequence lengths. To use the kernel, enable strided_context_parallel_kernel_enabled in NeuronConfig (see the sketch after this list).
Fixed an accuracy issue in hybrid sharding configurations that use context parallelism and attention bias. Hybrid sharding refers to models with different sharding strategies for the context encoding and token generation submodels, such as a configuration that uses context parallelism for context encoding and data parallelism for token generation.
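A minimal sketch of enabling the strided kernel follows, assuming NeuronConfig accepts a cp_degree-style context parallelism setting; all values are illustrative.

```python
from neuronx_distributed_inference.models.config import NeuronConfig

neuron_config = NeuronConfig(
    tp_degree=32,
    cp_degree=4,    # context parallelism degree (assumed parameter name)
    strided_context_parallel_kernel_enabled=True,
    seq_len=32768,  # the strided kernel helps most at long sequence lengths
)
```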
On-device forward pipeline execution (beta)#
Added support for a model-forward function that accepts both on-device and on-CPU input tensors. This feature improves performance in pipeline models by eliminating data transfer between device and CPU. For example, you can use this feature with Llama 4 (which accepts image and text inputs) to keep the vision encoder outputs on device for the context encoding model to process.
To use pipeline execution, specify pipeline_execution=True when you initialize a ModelWrapper. For more information, see how-to-use-fpem.
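The sketch below shows the flag in context, assuming a pipeline with a vision encoder stage; the config object, model class, and tag are placeholders rather than NxD Inference requirements.

```python
from neuronx_distributed_inference.models.model_wrapper import ModelWrapper

# `config` is your InferenceConfig and `VisionEncoderModel` a hypothetical
# submodel class; both are placeholders in this sketch.
vision_encoder = ModelWrapper(
    config=config,
    model_cls=VisionEncoderModel,
    tag="vision_encoder",
    pipeline_execution=True,  # keep outputs on device for the next stage
)
```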
Other improvements#
Added support for PyTorch 2.8 and Python 3.11.
Added support for sequence parallelism in mixture-of-experts (MoE) routers. This change improves context encoding latency for MoE models that use sequence parallelism.
Enabled temperature=0 as a valid option in dynamic on-device sampling. This temperature value specifies greedy sampling.
Enabled top_k values of 0 and -1 as valid options in dynamic on-device sampling. These top_k values specify that a token is picked uniformly at random from the vocabulary. (Both options are shown in the sketch below.)
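For example, when serving through vLLM with dynamic on-device sampling, per-request values like these now pass validation. This is a sketch of the request side only; the mapping to on-device behavior follows the note above.

```python
from vllm import SamplingParams

# temperature=0 now selects greedy sampling on device.
greedy = SamplingParams(temperature=0)

# Per the note above, top_k of 0 or -1 picks a token uniformly
# at random from the vocabulary.
uniform = SamplingParams(top_k=-1)
```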
Bug fixes#
Fixed an issue where HuggingFaceGenerationAdapter performed redundant CPU sampling for models that use on-device sampling with output_logits=True. This fix improves the performance of models with this configuration (see the sketch below).
Other minor fixes and improvements.
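A minimal sketch of the configuration the HuggingFaceGenerationAdapter fix targets, assuming the hf_adapter import path; `model` and `inputs` stand in for a compiled NxD Inference model with on-device sampling and its tokenized prompt.

```python
from neuronx_distributed_inference.utils.hf_adapter import (
    HuggingFaceGenerationAdapter,
)

generation_model = HuggingFaceGenerationAdapter(model)  # compiled NxDI model
outputs = generation_model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=64,
    output_logits=True,            # no longer triggers redundant CPU sampling
    return_dict_in_generate=True,  # HF requires this to return logits
)
```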
Known issues#
spmd_mode = True does not work when provided to the parallel_model_trace API.
parallel_model_trace will be deprecated in the next Neuron SDK release.