This document is relevant for: Trn1

PyTorch Neuron (torch-neuronx) release notes#

PyTorch Neuron for Trainium (torch-neuronx) is a software package that enables PyTorch users to train their models on AWS Trainium instances.

Release [1.12.0.1.4.0]#

Date: 12/12/2022

Summary#

What’s new in this release#

  • Added support for PyTorch 1.12.

  • Setting XLA_DOWNCAST_BF16=1 now also enables stochastic rounding by default (as done with XLA_USE_BF16=1); see the sketch after this list.

  • Added support for capturing snapshots of inputs, outputs, and graph HLO for debugging.

  • Fixed a parallel-compile error that occurred when both training and evaluation are enabled in the HuggingFace fine-tuning tutorial.

  • Added support for LAMB optimizer in FP32 mode.
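
The snippet below is a minimal sketch, not taken from the release itself, of how the XLA_DOWNCAST_BF16 setting described above is typically applied. The environment variable is usually exported in the shell before launching training; setting it from Python only takes effect if done before torch_xla is initialized. The model and training code are placeholders.

    import os

    # Assumption: this must be set before torch_xla is imported/initialized.
    # Downcasts FP32 operations to BF16; as of this release it also enables
    # stochastic rounding by default (matching XLA_USE_BF16=1 behavior).
    os.environ["XLA_DOWNCAST_BF16"] = "1"

    import torch
    import torch_xla.core.xla_model as xm

    device = xm.xla_device()  # acquire the XLA (Trainium) device
    # ... build the model and optimizer, then run the training loop on `device` ...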

Resolved Issues#

NaNs seen with transformers version >= 4.21.0 when running HF BERT fine-tuning or pretraining with XLA_USE_BF16=1 or XLA_DOWNCAST_BF16=1#

When running the HuggingFace BERT (any size) fine-tuning or pretraining tutorial with transformers version >= 4.21.0 and XLA_USE_BF16=1 or XLA_DOWNCAST_BF16=1, you will see NaNs in the loss immediately at the first step. More details on the issue can be found at pytorch/xla#4152. The workaround is to use transformers 4.20.0 or earlier (the tutorials currently recommend version 4.15.0) or to add the line transformers.modeling_utils.get_parameter_dtype = lambda x: torch.bfloat16 to your Python training script (as is now done in the latest tutorials). A permanent fix will become part of an upcoming HuggingFace transformers release.
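
For convenience, a minimal sketch of the monkey-patch workaround described above; it assumes an otherwise standard HuggingFace training script and simply forces parameter-dtype detection to report bfloat16:

    import torch
    import transformers
    import transformers.modeling_utils

    # Workaround for NaNs with transformers >= 4.21.0 under XLA_USE_BF16=1 or
    # XLA_DOWNCAST_BF16=1 (see pytorch/xla#4152): force get_parameter_dtype to
    # report bfloat16.
    transformers.modeling_utils.get_parameter_dtype = lambda x: torch.bfloat16

    # ... rest of the training script (model creation, Trainer/loop, etc.) ...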

Known Issues and Limitations#

Convolution is not supported#

In this release, convolution is not supported.

Number of data parallel training workers on one Trn1 instance#

The number of workers used in single-instance data parallel training can be one of the following values: 1 or 2 for trn1.2xlarge and 1, 2, 8 or 32 for trn1.32xlarge.

Release [1.11.0.1.2.0]#

Date: 10/27/2022

Summary#

What’s new in this release#

  • Added support for argmax.

  • Clarified error messages for runtime errors NRT_UNINITIALIZED and NRT_CLOSED.

  • When multi-worker training is launched using torchrun on one instance, the framework now handles runtime state cleanup at the end of training.

Resolved Issues#

Drop-out rate ignored in dropout operation#

A known issue in the compiler’s implementation of dropout caused the dropout rate to be ignored in the previous release. This is fixed in the current release.

Runtime error “invalid offset in Coalesced_memloc_…” followed by “Failed to process dma block: 1703”#

Previously, when running the MRPC fine-tuning tutorial with a bert-base-* model, you would encounter the runtime error “invalid offset in Coalesced_memloc_…” followed by “Failed to process dma block: 1703”. This is fixed in the current release.

Compilation error: “TongaSBTensor[0x7fb2a46e0830]:TongaSB partitions[0] uint8 %138392[128, 512]”#

Previously, when compiling the MRPC fine-tuning tutorial with a bert-large-* model in FP32 (without XLA_USE_BF16=1) for two or more workers, you would encounter a compiler error that looks like Error message: TongaSBTensor[0x7fb2a46e0830]:TongaSB partitions[0] uint8 %138392[128, 512] followed by Error class: KeyError. Single-worker fine-tuning is not affected. This is fixed in the current release.

Known Issues and Limitations#

Convolution is not supported#

In this release, convolution is not supported.

Number of data parallel training workers on one Trn1 instance#

The number of workers used in single-instance data parallel training can be one of the following values: 1 or 2 for trn1.2xlarge and 1, 2, 8 or 32 for trn1.32xlarge.

Release [1.11.0.1.1.1]#

Date: 10/10/2022

Summary#

This is the initial release of PyTorch Neuron with Trainium support, enabling users to train their models on the new EC2 Trn1 instances.

What’s new in this release#

Announcing the first PyTorch Neuron release for training.

  • XLA device support for Trainium

  • PyTorch 1.11 with XLA backend support in torch.distributed

  • torch-xla distributed support

  • Single-instance and multi-instance distributed training using torchrun

  • Support for ParallelCluster and SLURM with node-level scheduling granularity

  • Persistent cache for compiled graph

  • neuron_parallel_compile utility to help speed up compilation

  • Optimizer support: SGD, AdamW (see the training-step sketch after this list)

  • Loss functions supported: NLLLoss

  • Python versions supported: 3.7, 3.8

  • Multi-instance training support with EFA

  • Support for PyTorch’s BF16 automatic mixed precision
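
The sketch below is a minimal, single-device illustration (not taken from the release itself) of a training step on the XLA device using the supported AdamW optimizer and NLLLoss loss function; the model, tensor shapes, and data are placeholders.

    import torch
    import torch.nn as nn
    import torch_xla.core.xla_model as xm

    device = xm.xla_device()  # Trainium exposed as an XLA device

    # Placeholder model and data; only the optimizer and loss choices
    # come from the supported-features list above.
    model = nn.Sequential(nn.Linear(16, 4), nn.LogSoftmax(dim=1)).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # supported optimizer
    loss_fn = nn.NLLLoss()                                       # supported loss function

    inputs = torch.randn(8, 16).to(device)
    targets = torch.randint(0, 4, (8,)).to(device)

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    xm.mark_step()  # flush the lazily recorded graph and execute it on the device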

Known Issues and Limitations#

Convolution is not supported#

In this release, convolution is not supported.

Number of data parallel training workers on one Trn1 instance#

The number of workers used in single-instance data parallel training can be one of the following values: 1 or 2 for trn1.2xlarge and 1, 2, 8 or 32 for trn1.32xlarge.

Drop-out rate ignored in dropout operation#

A known issue in the compiler’s implementation of dropout causes the dropout rate to be ignored. This will be fixed in a follow-on release.

Runtime error “invalid offset in Coalesced_memloc_…” followed by “Failed to process dma block: 1703”#

Currently, when running the MRPC fine-tuning tutorial with a bert-base-* model, you will encounter the runtime error “invalid offset in Coalesced_memloc_…” followed by “Failed to process dma block: 1703”. This issue will be fixed in an upcoming release.

Compilation error: “TongaSBTensor[0x7fb2a46e0830]:TongaSB partitions[0] uint8 %138392[128, 512]”#

When compiling the MRPC fine-tuning tutorial with a bert-large-* model in FP32 (without XLA_USE_BF16=1) for two or more workers, you will encounter a compiler error that looks like Error message: TongaSBTensor[0x7fb2a46e0830]:TongaSB partitions[0] uint8 %138392[128, 512] followed by Error class: KeyError. Single-worker fine-tuning is not affected. This issue will be fixed in an upcoming release.

This document is relevant for: Trn1