This document is relevant for: Trn1
PyTorch Neuron (torch-neuronx) release notes#
PyTorch Neuron for Trainium is a software package that enables PyTorch users to train their models on Trainium.
Release [1.12.0.1.4.0]#
Date: 12/12/2022
Summary#
What’s new in this release#
Added support for PyTorch 1.12.
Setting XLA_DOWNCAST_BF16=1 now also enables stochastic rounding by default, as was already the case with XLA_USE_BF16=1 (see the sketch after this list).
Added support for capturing snapshots of inputs, outputs and graph HLO for debug.
Fixed an issue with a parallel compile error when both training and evaluation are enabled in the HuggingFace fine-tuning tutorial.
Added support for LAMB optimizer in FP32 mode.
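As a minimal sketch of the XLA_DOWNCAST_BF16=1 behavior (the toy tensor below is purely illustrative; the variable can equally be exported in the shell that launches the script, as long as it is set before XLA initializes):

```python
import os

# Illustrative assumption: set the variable before torch_xla is imported so the
# XLA runtime sees it. With this release, XLA_DOWNCAST_BF16=1 also enables
# stochastic rounding by default, matching the existing XLA_USE_BF16=1 behavior.
os.environ["XLA_DOWNCAST_BF16"] = "1"

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(4, 4, device=device)  # FP32 tensors are downcast to BF16 on the device
```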
Resolved Issues#
NaNs seen with transformers version >= 4.21.0 when running HF BERT fine-tuning or pretraining with XLA_USE_BF16=1 or XLA_DOWNCAST_BF16=1#
When running the HuggingFace BERT (any size) fine-tuning or pretraining tutorial with transformers version >= 4.21.0 and XLA_USE_BF16=1 or XLA_DOWNCAST_BF16=1 set, you will see NaNs in the loss immediately at the first step. More details on the issue can be found at pytorch/xla#4152. The workaround is to use transformers version 4.20.0 or earlier (the tutorials currently recommend version 4.15.0), or to add the line transformers.modeling_utils.get_parameter_dtype = lambda x: torch.bfloat16 to your Python training script (as is now done in the latest tutorials). A permanent fix will be part of an upcoming HuggingFace transformers release.
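For reference, the workaround in the context of a training script might look like the following minimal sketch (only the patched line comes from this release note; the surrounding imports are illustrative):

```python
import torch
import transformers
import transformers.modeling_utils

# Workaround for NaN loss with transformers >= 4.21.0 under XLA_USE_BF16=1 or
# XLA_DOWNCAST_BF16=1 (see pytorch/xla#4152): make get_parameter_dtype report
# bfloat16 instead of probing the parameters. Apply before building the model.
transformers.modeling_utils.get_parameter_dtype = lambda x: torch.bfloat16
```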
Known Issues and Limitations#
Convolution is not supported#
In this release, convolution is not supported.
Number of data parallel training workers on one Trn1 instance#
The number of workers used in single-instance data parallel training can be one of the following values: 1 or 2 for trn1.2xlarge and 1, 2, 8 or 32 for trn1.32xlarge.
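As a purely illustrative sketch (this check is not part of torch-neuronx; WORLD_SIZE is the variable torchrun sets from --nproc_per_node on a single-instance launch), a script can validate its worker count at startup, assuming a trn1.32xlarge instance:

```python
import os

# Hypothetical sanity check, assuming a trn1.32xlarge instance; on trn1.2xlarge
# the supported set would be {1, 2}. Launched, for example, with
# `torchrun --nproc_per_node=2 train.py`.
SUPPORTED_WORKER_COUNTS = {1, 2, 8, 32}

world_size = int(os.environ.get("WORLD_SIZE", "1"))
if world_size not in SUPPORTED_WORKER_COUNTS:
    raise ValueError(
        f"Unsupported data parallel worker count {world_size}; "
        "use 1 or 2 on trn1.2xlarge, or 1, 2, 8 or 32 on trn1.32xlarge."
    )
```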
Release [1.11.0.1.2.0]#
Date: 10/27/2022
Summary#
What’s new in this release#
Added support for argmax (see the sketch after this list).
Clarified error messages for the runtime errors NRT_UNINITIALIZED and NRT_CLOSED.
When multi-worker training is launched using torchrun on one instance, the framework now handles runtime state cleanup at the end of training.
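A minimal sketch of the newly supported argmax on an XLA device (the random inputs are for illustration only):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
logits = torch.randn(8, 10, device=device)

# argmax now lowers through the Neuron compiler
predictions = torch.argmax(logits, dim=-1)
xm.mark_step()           # execute the lazily recorded XLA graph
print(predictions.cpu())
```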
Resolved Issues#
Drop-out rate ignored in dropout operation#
A known issue in the compiler’s implementation of dropout caused the drop rate to be ignored in the previous release. This is fixed in the current release.
Runtime error “invalid offset in Coalesced_memloc_…” followed by “Failed to process dma block: 1703”#
Previously, when running the MRPC fine-tuning tutorial with a bert-base-* model, you would encounter the runtime error “invalid offset in Coalesced_memloc_…” followed by “Failed to process dma block: 1703”. This is fixed in the current release.
Compilation error: “TongaSBTensor[0x7fb2a46e0830]:TongaSB partitions[0] uint8 %138392[128, 512]”#
Previously, when compiling the MRPC fine-tuning tutorial with bert-large-* and FP32 (no XLA_USE_BF16=1) for two or more workers, you would encounter a compiler error that looks like Error message: TongaSBTensor[0x7fb2a46e0830]:TongaSB partitions[0] uint8 %138392[128, 512] followed by Error class: KeyError. Single-worker fine-tuning is not affected. This is fixed in the current release.
Known Issues and Limitations#
Convolution is not supported#
In this release, convolution is not supported.
Number of data parallel training workers on one Trn1 instance#
The number of workers used in single-instance data parallel training can be one of the following values: 1 or 2 for trn1.2xlarge and 1, 2, 8 or 32 for trn1.32xlarge.
Release [1.11.0.1.1.1]#
Date: 10/10/2022
Summary#
This is the initial release of PyTorch Neuron that supports Trainium for users to train their models on the new EC2 Trn1 instances.
What’s new in this release#
Announcing the first PyTorch Neuron release for training.
XLA device support for Trainium
PyTorch 1.11 with XLA backend support in torch.distributed (see the training sketch after this list)
torch-xla distributed support
Single-instance and multi-instance distributed training using torchrun
Support for ParallelCluster and SLURM with node-level scheduling granularity
Persistent cache for compiled graph
neuron_parallel_compile utility to help speed up compilation
Optimizer support: SGD, AdamW
Loss functions supported: NLLLoss
Python versions supported: 3.7, 3.8
Multi-instance training support with EFA
Support for PyTorch’s BF16 automatic mixed precision
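Several of the items above come together in the following minimal training sketch (the toy model, random data, and hyperparameters are illustrative assumptions, not part of the release; the script is assumed to be launched with torchrun so that the usual rendezvous environment variables are set):

```python
import torch
import torch.nn as nn
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the "xla" torch.distributed backend

dist.init_process_group("xla")
device = xm.xla_device()

# Toy model and data, for illustration only
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10), nn.LogSoftmax(dim=-1)
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # SGD and AdamW are supported
loss_fn = nn.NLLLoss()                                      # supported loss function

for step in range(10):
    inputs = torch.randn(32, 784, device=device)
    targets = torch.randint(0, 10, (32,), device=device)
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    xm.optimizer_step(optimizer)  # all-reduces gradients across workers, then steps
    xm.mark_step()                # execute the pending XLA graph
```

Launching with, for example, torchrun --nproc_per_node=2 train.py (the script name is illustrative) starts the data parallel workers; prefixing that command with the neuron_parallel_compile utility can be used to pre-populate the persistent compilation cache before the actual training run.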
Known Issues and Limitations#
Convolution is not supported#
In this release, convolution is not supported.
Number of data parallel training workers on one Trn1 instance#
The number of workers used in single-instance data parallel training can be one of the following values: 1 or 2 for trn1.2xlarge and 1, 2, 8 or 32 for trn1.32xlarge.
Drop-out rate ignored in dropout operation#
A known issue in the compiler’s implementation of dropout caused the drop rate to be ignored. This will be fixed in a follow-on release.
Runtime error “invalid offset in Coalesced_memloc_…” followed by “Failed to process dma block: 1703”#
Currently, when running the MRPC fine-tuning tutorial with a bert-base-* model, you will encounter the runtime error “invalid offset in Coalesced_memloc_…” followed by “Failed to process dma block: 1703”. This issue will be fixed in an upcoming release.
Compilation error: “TongaSBTensor[0x7fb2a46e0830]:TongaSB partitions[0] uint8 %138392[128, 512]”#
When compiling the MRPC fine-tuning tutorial with bert-large-* and FP32 (no XLA_USE_BF16=1) for two or more workers, you will encounter a compiler error that looks like Error message: TongaSBTensor[0x7fb2a46e0830]:TongaSB partitions[0] uint8 %138392[128, 512] followed by Error class: KeyError. Single-worker fine-tuning is not affected. This issue will be fixed in an upcoming release.