This document is relevant for: Inf1, Inf2, Trn1, Trn2
NxD Inference Release Notes (neuronx-distributed-inference)#
This document lists the release notes for the NeuronX Distributed Inference library.
Neuronx Distributed Inference [0.2.0] (Beta) (Neuron 2.22.0 Release)#
Date: 04/03/2025
Models in this Release#
Llama 3.2 11B (Multimodal)
Features in this Release#
Multi-LoRA serving. This release adds support for multi-LoRA serving through vLLM by loading LoRA adapters at server startup. Multi-LoRA serving is currently supported for Llama 3.1 8B, Llama 3.3 70B, and other models that use the Llama architecture.
Custom quantization. You can now specify which layers or modules in NxDI to quantize or keep in full precision during inference. To configure which layers or modules to skip during quantization, use the modules_to_not_convert and draft_model_modules_to_not_convert attributes in NeuronConfig.
Models quantized through external libraries. NxDI now supports inference of models that are quantized externally using quantization libraries such as LLMCompressor.
Async mode. This release adds support for async mode, which improves performance by asynchronously preparing the next forward call to a model. To use async mode, enable the async_mode flag in NeuronConfig.
CPU inference. You can now run models on CPU and compare against output on Neuron to debug accuracy issues. To use this feature, enable the on_cpu flag in NeuronConfig.
Unit/module testing utilities. These common utilities include build_module, build_function, and validate_accuracy, which enable you to build a module or function and validate its accuracy on Neuron. You can use these utilities in unit/integration tests to verify that your modeling code works correctly.
Support for models that use a custom head_dim value from InferenceConfig. This change enables support for models where head_dim isn't equal to hidden_size divided by num_attention_heads.
Input capture hooks. When you call the NeuronBaseForCausalLM forward function, you can provide an input_capture_hook function that is called with the model inputs as arguments.
Runtime warmup. To improve the performance of the first request sent to a model, NxD Inference now warms up the model during load. You can disable this behavior with the skip_warmup flag in NeuronConfig. A configuration sketch that sets several of these new flags follows this list.
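The sketch below shows how several of the options introduced in this release might be set together. It assumes NeuronConfig accepts these options as constructor keyword arguments and that the import path matches your installed version of neuronx-distributed-inference; the module name passed to modules_to_not_convert is a hypothetical example. Check the NxD Inference configuration guide for the exact API in your release.

```python
# Hedged sketch: configure several NeuronConfig options introduced in this release.
# The import path and keyword arguments are assumptions based on the feature
# descriptions above and may differ between versions.
from neuronx_distributed_inference.models.config import NeuronConfig

neuron_config = NeuronConfig(
    async_mode=True,                      # asynchronously prepare the next forward call
    on_cpu=False,                         # set to True to run on CPU for accuracy debugging
    skip_warmup=False,                    # set to True to disable runtime warmup during load
    modules_to_not_convert=["lm_head"],   # hypothetical: keep this module in full precision
)
```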
Backward Incompatible Changes#
Fix the behavior of the do_sample sampling flag. Previously, NxDI used greedy sampling when do_sample=True, which was a bug because do_sample=True should result in multinomial sampling. If you use do_sample=True in a config where you intend to use greedy sampling, you must change it to do_sample=False. As part of this change, the default value for do_sample is now False.
Enforce that tensors in a model's state_dict don't share memory with other tensors. This change can cause models to fail to load if their tensors share memory, which now results in an error: RuntimeError: Error while trying to find names to remove to save state dict. To fix this issue, apply .clone().detach().contiguous() to the tensors in the model's state_dict and re-shard the weights (see the sketch after this list).
Change the quantization state_dict keys from weight_scale to scale to match the NxD quantization scale keys and avoid confusion. If you use quantization and have sharded weights from earlier versions of NxDI, you must re-shard the weights.
If you use a model that skips quantization for certain modules (such as in Llama 3.1 405B FP8), you must now specify modules_to_not_convert to configure the modules that skip quantization.
Validate when input size exceeds the model's maximum length (max_context_length or max_length). NxD Inference now throws a ValueError if given an input that's too large. To enable the previous behavior, where input is truncated to the maximum length, enable the allow_input_truncation flag in NeuronConfig.
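The following is a minimal sketch of the state_dict workaround described above. The checkpoint file names are placeholders; after saving the copy, re-shard the weights with NxD Inference as usual.

```python
import torch

# Hedged sketch: make every tensor in the state_dict own its memory so that no
# tensors share storage, as required by the stricter check described above.
state_dict = torch.load("checkpoint.pt", map_location="cpu")  # placeholder path
state_dict = {
    name: tensor.clone().detach().contiguous()  # break memory sharing between tensors
    for name, tensor in state_dict.items()
}
torch.save(state_dict, "checkpoint_detached.pt")  # placeholder path
```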
Other Changes#
Improve model performance by up to 50% (5-20% in most cases) by eliminating overheads in logging.
Upgrade transformers from v4.45 to v4.48.
Deprecate NeuronConfig's logical_neuron_cores attribute and replace it with logical_nc_config. The LNC config is now automatically set from the NEURON_LOGICAL_NC_CONFIG environment variable if that variable is set (see the sketch after this list).
Deprecate NeuronConfig's trace_tokengen_model attribute. This attribute is now determined dynamically based on other configuration attributes.
Improve the performance of on-device sampling.
When running Llama models with LNC2, the sharded flash attention kernel is now automatically enabled when context length is 256 or greater. Previously, this kernel was enabled for context length of 1024 or greater. This change improves performance at smaller context lengths.
NeuronConfig now includes a skip_sharding flag that you can enable to skip weight sharding during model compilation. This option is useful in cases where you have already sharded weights, such as during iterative development, so you can iterate without re-sharding the weights each time you compile the model.
NeuronApplicationBase now includes a shard_weights function that you can use to shard weights independently of compiling the model.
Fix vanilla speculative decoding support for models with multiple EOS tokens.
Other minor fixes and improvements.
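A hedged sketch of the renamed LNC setting follows. It assumes logical_nc_config is accepted as a NeuronConfig keyword argument, replacing the deprecated logical_neuron_cores attribute; check the configuration guide for your version.

```python
import os

# Option 1: let NxD Inference pick up the LNC value from the environment.
# Set this before the process creates its NeuronConfig.
os.environ["NEURON_LOGICAL_NC_CONFIG"] = "2"

# Option 2 (assumption): set it explicitly with the new attribute name, which
# replaces the deprecated logical_neuron_cores attribute.
# neuron_config = NeuronConfig(logical_nc_config=2, ...)
```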
Known Issues and Limitations#
For some configurations that use continuous batching or vLLM, model warmup can cause a Numerical Error during inference. If you encounter this error, set skip_warmup=True in NeuronConfig to disable warmup and avoid this issue. To disable warmup in vLLM, pass "skip_warmup": true in override_neuron_config (see the sketch below). For more information about how to configure vLLM, see vLLM Model Configuration. The error appears as:
RuntimeError: Failed to execute the model status=1003 message=Numerical Error
Neuronx Distributed Inference [0.1.1] (Beta) (Neuron 2.21.1 Release)#
Date: 01/14/2025
Bug Fixes#
Fix minor issues with sampling params and add validation for sampling params.
Neuronx Distributed Inference [0.1.0] (Beta) (Neuron 2.21 Release)#
Date: 12/20/2024
Features in this Release#
NeuronX Distributed (NxD) Inference (neuronx-distributed-inference) is
an open-source PyTorch-based inference library that simplifies deep learning
model deployment on AWS Inferentia and Trainium instances. Neuronx Distributed
Inference includes a model hub and modules that users can reference to
implement their own models on Neuron.
This is the first release of NxD Inference (Beta) that includes:
Support for Trn2, Inf2, and Trn1 instances
Support for the following model architectures. For more information, including links to specific supported model checkpoints, see NxD Inference - Production Ready Models.
Llama (Text), including Llama 2, Llama 3, Llama 3.1, Llama 3.2, and Llama 3.3
Llama (Multimodal), including Llama 3.2 multimodal
Mistral (using Llama architecture)
Mixtral
DBRX
Support for onboarding additional models.
Compatibility with HuggingFace checkpoints and the generate() API (see the sketch below)
vLLM integration
Model compilation and serialization
Tensor parallelism
Speculative decoding
EAGLE speculative decoding
Medusa speculative decoding
Vanilla speculative decoding
Quantization
Dynamic sampling
Llama3.1 405B Inference Example on Trn2
Open-source GitHub repository: aws-neuron/neuronx-distributed-inference
For more information about the features supported by NxDI, see NxD Inference Features Configuration Guide.
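The sketch below illustrates the HuggingFace generate() compatibility listed above. The import paths, class names, and helper functions follow the patterns used in the NxD Inference documentation but are assumptions here and may differ between releases; the model and artifact paths are placeholders. Consult the NxD Inference model hub and tutorials for the exact API.

```python
# Hedged sketch of HuggingFace-style generation on Neuron with NxD Inference.
# Import paths, class names, and file paths are assumptions for illustration only.
from transformers import AutoTokenizer, GenerationConfig

from neuronx_distributed_inference.models.config import NeuronConfig
from neuronx_distributed_inference.models.llama.modeling_llama import (
    LlamaInferenceConfig,
    NeuronLlamaForCausalLM,
)
from neuronx_distributed_inference.utils.hf_adapter import (
    HuggingFaceGenerationAdapter,
    load_pretrained_config,
)

model_path = "/path/to/llama-checkpoint"       # placeholder HuggingFace checkpoint
compiled_path = "/path/to/compiled-artifacts"  # placeholder output directory

# Build the Neuron configuration and compile/load the model.
neuron_config = NeuronConfig(tp_degree=32, batch_size=1, seq_len=2048)
config = LlamaInferenceConfig(neuron_config, load_config=load_pretrained_config(model_path))
model = NeuronLlamaForCausalLM(model_path, config)
model.compile(compiled_path)
model.load(compiled_path)

# Generate with the HuggingFace-style generate() API.
tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer("Hello, Neuron!", return_tensors="pt")
generation_model = HuggingFaceGenerationAdapter(model)
outputs = generation_model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    generation_config=GenerationConfig(max_new_tokens=32),
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```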
Known Issues and Limitations#
Longer Load Times for Large Models#
Issue: Users may experience extended load times when working with large models, particularly during weight sharding and initial model load. This is especially noticeable with models like Llama 3.1 405B.
Root Cause: These delays are primarily due to storage performance limitations.
Recommended Workaround: To mitigate this issue, we recommend that you store model checkpoints in high-performance storage options:
Instance store volumes: On supported instances, instance store volumes offer fast, temporary block-level storage.
Optimized EBS volumes: For persistent storage with enhanced performance.
By using these storage optimizations, you can reduce model load times and improve your overall workflow efficiency.
Note: Load times may still vary depending on model size and specific hardware configurations.
Other Issues and Limitations#
Llama 3.2 11B (Multimodal) is not yet supported with PyTorch 2.5.
The following model architectures are tested only on Trn1 and Inf2:
Llama (Multimodal)
The following model architectures are tested only on Trn1:
Mixtral
DBRX
The following kernels are tested only on Trn2:
MLP
QKV
If you run inference with a prompt that is larger than the model's max_context_length, the model will generate incorrect output. In a future release, NxD Inference will throw an error in this scenario.
Continuous batching (including through vLLM) supports batch sizes up to 4. Static batching supports larger batch sizes.
To use greedy on-device sampling, you must set do_sample to True.
To use FP8 quantization or KV cache quantization, you must set the XLA_HANDLE_SPECIAL_SCALAR environment variable to 1 (see the sketch after this list).
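A minimal sketch of the environment variable requirement above; set it in the process before compiling or loading the quantized model (an equivalent shell export works as well).

```python
import os

# Required when using FP8 quantization or KV cache quantization, as noted above.
os.environ["XLA_HANDLE_SPECIAL_SCALAR"] = "1"
```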
Neuronx Distributed Inference [0.1.0] (Beta) (Trn2)#
Date: 12/03/2024
Features in this release#
NeuronX Distributed (NxD) Inference (neuronx-distributed-inference) is
an open-source PyTorch-based inference library that simplifies deep learning
model deployment on AWS Inferentia and Trainium instances. Neuronx Distributed
Inference includes a model hub and modules that users can reference to
implement their own models on Neuron.
This is the first release of NxD Inference (Beta) that includes:
Support for Trn2 instances
Compatibility with HuggingFace checkpoints and the generate() API
vLLM integration
Model compilation and serialization
Tensor parallelism
Speculative decoding
EAGLE speculative decoding
Medusa speculative decoding
Vanilla speculative decoding
Quantization
Dynamic sampling
Llama3.1 405B Inference Example on Trn2
Open-source GitHub repository: aws-neuron/neuronx-distributed-inference
For more information about the features supported by NxDI, see NxD Inference Features Configuration Guide.
This document is relevant for: Inf1, Inf2, Trn1, Trn2