This document is relevant for: Inf1, Inf2, Trn1, Trn1n

Introducing PyTorch 2.0 Support (End of Support)#

Note

Neuron no longer supports PyTorch 2.0. Please migrate to PyTorch 2.1 by following the migration guide.

What are we introducing?#

Starting with the Neuron 2.15 release, customers can upgrade to a beta version of PyTorch Neuron (torch-neuronx) that supports PyTorch 2.0. PyTorch/XLA 2.0 uses a new default runtime, PJRT, which is also used by PyTorch Neuron 2.0 Beta. Neuron plans to support the torch.compile (TorchDynamo) feature in a future release of the Neuron SDK.

We have updated the PyTorch Neuron (torch-neuronx) Setup instructions to include installation of PyTorch Neuron 2.0 Beta for Ubuntu 20 and Ubuntu 22. Users may also need to make training and inference script changes, which are shown in the migration guide below.

How is PyTorch Neuron 2.0 different from PyTorch Neuron 1.13?#

By upgrading to PyTorch Neuron 2.0, the previous XRT runtime and the XRT server that managed your program are removed; applications are now managed by individual PJRT clients instead. For more details on the changes between XRT and PJRT in PyTorch/XLA, see this documentation.

In addition, the behavior of the init_process_group() and xm.rendezvous() APIs has changed in PyTorch 2.0. Users might need to make code changes in their training and inference scripts, as discussed in the migration guide below.

How can I install PyTorch Neuron 2.0 (Beta)?#

To install PyTorch Neuron 2.0 Beta, please follow the PyTorch Neuron (torch-neuronx) Setup guides for the Ubuntu 20 AMI and the Ubuntu 22 AMI. PyTorch Neuron 2.0 Beta can be installed using the following:

python -m pip install --upgrade neuronx-cc==2.* --pre torch-neuronx==2.0.* torchvision

Note

The PyTorch Neuron DLAMIs for Ubuntu 20 do not yet include a pre-installed PyTorch 2.0 Beta. Please use the Ubuntu 20 AMI and Ubuntu 22 AMI setup guide instructions.
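
After installing, a quick sanity check (a minimal sketch; the exact patch version printed will vary) is to confirm that the 2.0 packages were picked up:

import torch
import torch_neuronx  # raises ImportError if torch-neuronx is not installed

# torch should report a 2.0.x version after a successful install.
print(torch.__version__)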

Migrate your application to PyTorch 2.0 and PJRT#

Please make sure you have first installed PyTorch Neuron 2.0 Beta as described in the installation section above.

Migrating Training scripts#

The following changes need to be made to migrate training scripts.

Changes to init_process_group()

Because the PJRT backend is invoked alongside the PyTorch/XLA backend, we need to initialize the backend for PJRT. The following code changes need to be made wherever init_process_group is called.

Old:

torch.distributed.init_process_group('xla')

New:

# Import pjrt_backend so that the pjrt:// init_method is registered
import torch_xla.experimental.pjrt_backend
# Optional: import pjrt to use PJRT helper functions after init_process_group
import torch_xla.experimental.pjrt as pjrt
# Call init_process_group with the new pjrt:// init_method
torch.distributed.init_process_group('xla', init_method='pjrt://')

Changes to xm.rendezvous() behavior

Because xm.rendezvous() behavior has changed in PyTorch/XLA 2.0, PyTorch Neuron 2.0 implements a synchronization API that is compatible with the change. No code changes related to xm.rendezvous() are required. However, users may see performance drops and memory issues when calling xm.rendezvous() with a payload on large XLA graphs. These issues will be addressed in a future Neuron release.
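For illustration, here is a minimal sketch of rendezvous usage that works unchanged under PJRT (the tag names are arbitrary examples):

import torch_xla.core.xla_model as xm

# A plain barrier: all workers block here until everyone arrives.
# This call is unchanged from PyTorch Neuron 1.13.
xm.rendezvous('epoch-end')

# Passing a payload exchanges data across workers; with large XLA graphs
# this is where the performance and memory caveats above apply.
payloads = xm.rendezvous('share-metrics', payload=b'worker-data')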

Please see this BERT tutorial for an example of the changes needed to migrate a training script to PyTorch Neuron 2.0.

Migrating Inference scripts#

In most cases, no code changes should be needed in inference scripts unless init_process_group() is being called. If it is, update the code as outlined in the Changes to init_process_group() section above.
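
As a reference point, here is a minimal single-device inference sketch (the model and input shapes are placeholders; substitute your own) that runs without any PJRT-specific changes because it never calls init_process_group():

import torch
import torch_neuronx

# Placeholder model and example input.
model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU()).eval()
example = torch.rand(1, 4)

# Compile the model for Neuron and run inference.
traced = torch_neuronx.trace(model, example)
output = traced(example)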

Troubleshooting#

init_process_group() Failure#

Description#

PyTorch/XLA has changed the init_method for init_process_group(), so make sure you are passing the correct parameters. A common error is:

RuntimeError: No rendezvous handler for pjrt://

This error means you have not imported the rendezvous handler from torch_xla.experimental.pjrt_backend.

Solution#

Make sure you import torch_xla.experimental.pjrt_backend before calling init_process_group(), like so:

import torch_xla.experimental.pjrt_backend
torch.distributed.init_process_group('xla', init_method='pjrt://')

Socket Error: Socket failed to bind#

Description#

In PyTorch 2.0, both torchrun and init_process_group() need an available socket to bind to. By default, both are assigned unused ports. If you set the MASTER_PORT environment variable to a port that is already in use, this error may occur:

[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
RuntimeError: The server socket has failed to listen on any local network address.
The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).

Solution#

If you set MASTER_PORT, please ensure that the port is not used anywhere else in your scripts. Otherwise, leave MASTER_PORT unset and torchrun will choose the port for you.
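
If your workflow does require setting MASTER_PORT, one option (a sketch for single-node jobs; the port must be chosen once per job, not once per worker) is to ask the OS for a free port before initialization:

import os
import socket

# Bind to port 0 so the OS assigns an unused port, then export it
# before torch.distributed.init_process_group() is called.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(('', 0))
    free_port = s.getsockname()[1]

os.environ['MASTER_PORT'] = str(free_port)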

Frequently Asked Questions (FAQ)#

What is the difference between PJRT and Neuron Runtime?#

PJRT is a separate runtime from the Neuron Runtime. The Neuron SDK uses both runtimes to support PyTorch Neuron 2.0 Beta.

Do I need to update my scripts for PyTorch 2.0?#

Yes, script changes might be needed with the Beta support. Please see the migration guide.

What environment variables will be changed with PJRT?#

Any previous XRT or libtpu.so environment variables that start with XRT or TPU (for example, TPU_NUM_DEVICES) can be removed from scripts. PJRT_DEVICE is the new environment variable that controls your compute device; by default it is set to NEURON.
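
For example, a launcher script could clear the legacy variables before torch_xla is imported (a minimal sketch; PJRT_DEVICE is the only PJRT variable here, and the loop simply removes anything left over from XRT-era setups):

import os

# Drop leftover XRT/libtpu settings from PyTorch Neuron 1.13 environments.
for var in list(os.environ):
    if var.startswith('XRT') or var.startswith('TPU'):
        del os.environ[var]

# PJRT_DEVICE selects the compute device; Neuron sets it to NEURON by
# default, so this line only makes the default explicit.
os.environ.setdefault('PJRT_DEVICE', 'NEURON')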

What features will be missing with PyTorch Neuron 2.0 Beta?#

Because Neuron support for PyTorch 2.0 is still in beta, some features from PyTorch Neuron 1.13 are missing; we expect to make them available in a future Neuron release. The following features are not currently available in PyTorch Neuron 2.0 Beta:

Can I use Neuron Distributed and Transformers Neuron libraries with PyTorch Neuron 2.0 Beta?#

Yes. Please note that they are considered Beta when used with PyTorch Neuron 2.0 Beta.

Can I still use PyTorch 1.13 version?#

Yes, PyTorch 1.13 will continue to be supported.

This document is relevant for: Inf1, Inf2, Trn1, Trn1n