This document is relevant for: Inf1, Inf2, Trn1, Trn2

AWS Neuron SDK 2.25.0: PyTorch support release notes

Date of release: July 31, 2025

Released versions

  • 2.7.0.2.9.*

  • 2.6.0.2.9.*

Improvements

  • The Core Placement API is no longer beta/experimental, and the instructions for using it have been updated.

    To migrate, replace any function in the torch_neuronx.experimental namespace with its torch_neuronx equivalent; the change has no effect on behavior or performance. For example, replace torch_neuronx.experimental.set_neuron_cores with torch_neuronx.set_neuron_cores. Calls through the torch_neuronx.experimental namespace continue to work as before but now also emit this warning: “In a future version torch_neuronx.experimental.<func> will be removed. Call torch_neuronx.<func> instead.” A minimal migration sketch follows this item.
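    A minimal migration sketch, assuming a model traced with torch_neuronx.trace() and the start_nc/nc_count parameters documented for the Core Placement API:

      import torch
      import torch_neuronx

      # Trace a trivial model so there is something to place on a NeuronCore.
      model = torch.nn.Linear(4, 4).eval()
      traced = torch_neuronx.trace(model, torch.rand(1, 4))

      # Before (still works, but emits the deprecation warning quoted above):
      # torch_neuronx.experimental.set_neuron_cores(traced, start_nc=0, nc_count=1)

      # After (same behavior and performance):
      torch_neuronx.set_neuron_cores(traced, start_nc=0, nc_count=1)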

Known issues

  • [v2.7] Using the latest torch-xla v2.7 may result in increased host memory usage compared to torch-xla v2.6. In one example, Llama2 pretraining with ZeRO1 and a sequence length of 16k showed an increase of about 1.6% in host memory usage.

  • Currently, when switching the Ubuntu OS kernel version from 5.15 to 6.8, you may see performance differences due to the change in the default kernel scheduler (CFS to EEVDF). For example, BERT pretraining performance can be up to 10% lower. You can try an OS with an older kernel (e.g., Amazon Linux 2023) or experiment with the kernel's real-time scheduler by prefixing your command with sudo chrt --fifo 99 (e.g., sudo chrt --fifo 99 <script>) to improve performance. Note that adjusting the real-time scheduler can also result in lower performance. See https://www.kernel.org/doc/html/latest/scheduler/sched-eevdf.html for more information.

  • Currently, when using the tensor split operation on a 2D array along the second dimension, the resulting tensors do not contain the expected data (pytorch/xla#8640). One workaround is to set XLA_DISABLE_FUNCTIONALIZATION=0; another is to use torch.tensor_split, as in the sketch below.
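    A minimal sketch of the affected pattern and the torch.tensor_split workaround (note that torch.split takes a chunk size while torch.tensor_split takes a number of sections); the mismatch only manifests on an XLA device affected by pytorch/xla#8640:

      import torch

      x = torch.arange(12).reshape(3, 4)

      # Affected pattern: splitting a 2D tensor along the second dimension.
      halves = torch.split(x, 2, dim=1)  # two (3, 2) chunks of size 2

      # Workaround: torch.tensor_split returns the expected data.
      halves = torch.tensor_split(x, 2, dim=1)  # two (3, 2) sections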

  • [v2.6] BERT pretraining performance is ~10% lower with torch-neuronx 2.6 compared to torch-neuronx 2.5. This is due to a known regression in torch-xla (pytorch/xla#9037) and can affect other models with high graph-tracing overhead. It is fixed in torch-xla v2.7. To work around the issue in torch-xla v2.6, build the r2.6_aws_neuron branch of torch-xla as follows (see Install with support for C++11 ABI for the C++11 ABI version):

      # Set up the build environment (make sure you are in a Python virtual env).
      # Replace "apt" with "yum" on AL2023.
      sudo apt install cmake
      pip install yapf==0.30.0
      wget https://github.com/bazelbuild/bazelisk/releases/download/v1.20.0/bazelisk-linux-amd64
      sudo cp bazelisk-linux-amd64 /usr/local/bin/bazel
      sudo chmod +x /usr/local/bin/bazel

      # Clone the repos; torch-xla is built from a checkout nested inside pytorch/
      git clone --recursive https://github.com/pytorch/pytorch --branch v2.6.0
      cd pytorch/
      git clone --recursive https://github.com/pytorch/xla.git --branch r2.6_aws_neuron

      # Build torch; the wheel will be in ./dist
      _GLIBCXX_USE_CXX11_ABI=0 python setup.py bdist_wheel

      # Build torch-xla; the wheel will be in ./dist and can be installed
      # in place of the torch-xla release on pypi.org
      cd xla/
      CXX_ABI=0 python setup.py bdist_wheel
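    As a quick sanity check after installing the locally built wheels (a sketch; exact version strings depend on the build):

      import torch
      import torch_xla

      # Source builds report dev-style versions (e.g. "2.6.0a0+git...")
      # rather than the released 2.6 binaries from pypi.org.
      print(torch.__version__)
      print(torch_xla.__version__)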
    
  • Currently, BERT pretraining performance is ~11% lower when switching to model.to(torch.bfloat16) as part of the migration away from the deprecated environment variable XLA_DOWNCAST_BF16, due to pytorch/xla#8545. As a workaround to recover the performance, you can set XLA_DOWNCAST_BF16=1, which still works in torch-neuronx 2.5 and 2.6, although it emits the deprecation warning noted below.

  • The environment variables XLA_DOWNCAST_BF16 and XLA_USE_BF16 are deprecated (see the warning below). Switch to automatic mixed precision or use model.to(torch.bfloat16) to cast the model to BF16 (see Migration From XLA_USE_BF16/XLA_DOWNCAST_BF16); a minimal sketch follows the warning.

    Warning: XLA_DOWNCAST_BF16 will be deprecated after the 2.5 release, please downcast your model directly
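    A minimal sketch of the two suggested replacements; the autocast device type is shown as "cpu" to keep the example self-contained, whereas on Neuron the model and inputs would live on the XLA device:

      import torch
      import torch.nn as nn

      model = nn.Linear(16, 16)
      inputs = torch.rand(4, 16)

      # Option 1: automatic mixed precision; parameters stay FP32 while
      # supported ops run in BF16.
      with torch.autocast("cpu", dtype=torch.bfloat16):
          output = model(inputs)

      # Option 2: cast the model and inputs to BF16 explicitly,
      # replacing XLA_DOWNCAST_BF16=1.
      model = model.to(torch.bfloat16)
      output = model(inputs.to(torch.bfloat16))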

  • [v2.6] AttributeError: module 'torch_xla.core.xla_model' ... does not have the attribute 'xrt_world_size'. This error indicates that torch_xla.core.xla_model.xrt_world_size() has been removed in torch-xla v2.7. Switch to torch_xla.runtime.world_size() instead.

  • [v2.6] AttributeError: module 'torch_xla.core.xla_model' ... does not have the attribute 'get_ordinal'. This error indicates that torch_xla.core.xla_model.get_ordinal() has been removed in torch-xla v2.7. Switch to torch_xla.runtime.global_ordinal() instead, as in the sketch below.
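    A minimal sketch covering both renames, assuming a training script that previously used the xla_model helpers:

      import torch_xla.runtime as xr

      # Before (removed in torch-xla v2.7):
      #   import torch_xla.core.xla_model as xm
      #   world_size = xm.xrt_world_size()
      #   rank = xm.get_ordinal()

      # After:
      world_size = xr.world_size()
      rank = xr.global_ordinal()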

  • AttributeError: module 'torch_xla.runtime' has no attribute 'using_pjrt'. In Torch-XLA 2.5+, torch_xla.runtime.using_pjrt is removed because PJRT is the sole Torch-XLA runtime. See the corresponding PyTorch/XLA pull request on GitHub.
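    Because PJRT is always in use, the runtime check can simply be dropped; a sketch of the before and after, assuming code that previously gated on the runtime:

      import torch_xla.runtime as xr

      # Before (removed in Torch-XLA 2.5+):
      #   if xr.using_pjrt():
      #       devices = xr.global_runtime_device_count()

      # After: PJRT is the only runtime, so call the runtime API directly.
      devices = xr.global_runtime_device_count()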

This document is relevant for: Inf1, Inf2, Trn1, Trn2