This document is relevant for: Inf1, Inf2, Trn1, Trn2

AWS Neuron SDK 2.26.0: NxD Inference release notes#

Date of release: September 18, 2025

Version: 0.6.10598

Improvements#

Llama 4 model support (beta)#

Added beta support for Llama 4, which is a family of multimodal, open-weight MoE LLMs by Meta that support text and image inputs. Llama 4 is tested on Trn2. Compatible models include:

  • Llama 4 Scout

  • Llama 4 Maverick

In this beta release, Llama 4 model support has the following limitations:

  • The model is tested for accuracy up to a sequence length of 8192.

  • Model performance on Trn2 isn’t fully optimized.

  • To use Llama 4 with vLLM, you must compile the model outside of vLLM and specify the compiled model path in the NEURON_COMPILED_ARTIFACTS environment variable (see the sketch after this list).

These limitations will be addressed in a future release.
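As a minimal sketch, the following shows one way to point vLLM at a model compiled outside of vLLM. The artifact path and model ID are illustrative, and the exact vLLM launch arguments for Neuron may vary by version:

    import os

    # Point the vLLM Neuron integration at artifacts compiled ahead of time.
    # The path below is illustrative.
    os.environ["NEURON_COMPILED_ARTIFACTS"] = "/home/ubuntu/traced_models/llama4"

    from vllm import LLM, SamplingParams

    # Model ID and parallelism settings are illustrative; they must match
    # the configuration the artifacts were compiled with.
    llm = LLM(
        model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        max_model_len=8192,  # this beta is accuracy-tested up to 8K
        tensor_parallel_size=32,
    )

    outputs = llm.generate(
        ["Describe this deployment in one sentence."],
        SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128),
    )
    print(outputs[0].outputs[0].text)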

For more information, see the Llama 4 tutorial (/libraries/nxd-inference/tutorials/llama4-tutorial.ipynb) and NxD Inference - Production Ready Models.

FLUX.1 model support (beta)#

Added beta support for FLUX.1-dev, which is an open-weight image generation model by Black Forest Labs. FLUX.1-dev is tested on Trn2. Compatible models include:

  • black-forest-labs/FLUX.1-dev

In this beta release, the model’s performance isn’t optimized.

For more information, see Generating Images with Black Forest Labs Flux.1-Dev on TRN1/TRN2 and NxD Inference - Production Ready Models.

Expert parallelism support (beta)#

Added support for expert parallelism, which distributes expert processing across multiple NeuronCores. Expert parallelism improves performance for mixture-of-experts (MoE) models, particularly for models with a large number of experts, such as Llama 4 Maverick. For more information, see MoENeuronConfig.
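As a rough sketch, expert parallelism is configured on the MoE model configuration. The import path follows the NxD Inference package layout, and the ep_degree parameter name is an assumption; check MoENeuronConfig in your version for the exact option:

    # Hedged sketch: distribute MoE experts across NeuronCores.
    # ep_degree is an assumed parameter name for the expert-parallel degree.
    from neuronx_distributed_inference.models.config import MoENeuronConfig

    neuron_config = MoENeuronConfig(
        tp_degree=16,  # tensor parallelism across NeuronCores
        ep_degree=4,   # assumption: shard experts across 4 expert-parallel groups
    )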

Context parallelism improvements#

With this release, context parallelism is out of beta and includes several improvements.

  • Added support for sliding window attention (SWA) with context parallelism.

  • Added a strided context parallel flash attention kernel that includes compute elimination. This kernel is more performant than the existing context parallel flash attention kernel, especially at long sequence lengths. To use the kernel, enable strided_context_parallel_kernel_enabled in NeuronConfig (see the sketch after this list).

  • Fixed an accuracy issue in hybrid sharding configurations that use context parallelism and attention bias. Hybrid sharding refers to models with different sharding strategies for context encoding and token generation submodels, such as a configuration that uses context parallelism for context encoding and data parallelism for token generation.
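The following sketch shows the opt-in; strided_context_parallel_kernel_enabled is the flag named above, while cp_degree is an assumed name for the context-parallel degree:

    # Sketch: enable the strided context-parallel flash attention kernel.
    from neuronx_distributed_inference.models.config import NeuronConfig

    neuron_config = NeuronConfig(
        tp_degree=16,
        cp_degree=4,  # assumption: option name for the context-parallel degree
        strided_context_parallel_kernel_enabled=True,
    )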

On-device forward pipeline execution (beta)#

Added support for a model-forward function that accepts both on-device and on-CPU input tensors. This feature improves performance for pipeline models by eliminating data transfer between device and CPU. For example, you can use this feature with Llama 4 (which accepts image and text inputs) to keep the vision encoder outputs on device for the context encoding model to process.

To use pipeline execution, specify pipeline_execution=True when you initialize a ModelWrapper. For more information, see how-to-use-fpem.
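A minimal sketch follows; the import path and all arguments other than pipeline_execution are placeholders that depend on your model class and configuration:

    # Hedged sketch: keep intermediate tensors (such as vision encoder
    # outputs) on device between pipeline stages.
    from neuronx_distributed_inference.models.model_wrapper import ModelWrapper

    model_wrapper = ModelWrapper(
        config=model_config,      # placeholder: your model's inference config
        model_cls=MyNeuronModel,  # placeholder: your model class
        pipeline_execution=True,  # accept on-device inputs; skip device<->CPU copies
    )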

Other improvements#

  • Added support for PyTorch 2.8 and Python 3.11.

  • Added support for sequence parallelism in mixture-of-experts (MoE) routers. This change improves context encoding latency for MoE models that use sequence parallelism.

  • Enabled temperature=0 as a valid option in dynamic on-device sampling. A temperature of 0 specifies greedy sampling.

  • Enabled top_k values of 0 and -1 as valid options in dynamic on-device sampling. These values specify sampling a token uniformly at random from the vocabulary; see the sketch after this list.
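The sketch below enables dynamic on-device sampling and notes how the newly valid values behave. The config class names follow the NxD Inference documentation, but verify them against your version:

    # Sketch: dynamic on-device sampling with the newly valid parameter values.
    from neuronx_distributed_inference.models.config import (
        NeuronConfig,
        OnDeviceSamplingConfig,
    )

    neuron_config = NeuronConfig(
        on_device_sampling_config=OnDeviceSamplingConfig(dynamic=True),
    )

    # Per-request behavior with dynamic sampling:
    #   temperature=0         -> greedy sampling (argmax over logits)
    #   top_k=0 or top_k=-1   -> sample uniformly at random from the vocabulary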

Bug fixes#

  • Fixed an issue where HuggingFaceGenerationAdapter performed redundant CPU sampling for models that use on-device sampling with output_logits=True. This fix improves the performance of models with this configuration.

  • Other minor fixes and improvements.

Known issues#

  • spmd_mode = True does not work when provided to the parallel_model_trace API. parallel_model_trace will be deprecated in the next Neuron SDK release.

Previous release notes#
