This document is relevant for: Inf1, Trn1

Data Types#

Introduction#

Inferentia and Trainium NeuronDevices include different NeuronCore versions, which support different data-types. This section describes the data-types supported by each NeuronCore version. For details about NeuronCore versions, see neuron_hw_arch.

NeuronCore v1 Data Types#

Neuron Data-Types#

Neuron enables developers to choose from multiple data-types. The supported data-types are FP32, FP16, and BF16. Developers can train their models on their platform of choice (e.g. EC2 P3 instances), and then easily move their trained models to EC2 Inf1 for execution.

Data Type   Sign (S)   Range (exponent)   Precision (mantissa)
FP32        1 bit      8 bits             23 bits
BF16        1 bit      8 bits             7 bits
FP16        1 bit      5 bits             10 bits
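
The Range and Precision columns are the exponent and mantissa widths of each format. As a rough illustration, the sketch below derives the largest finite value and the spacing just above 1.0 from those widths. It assumes the usual IEEE-754-style conventions (biased exponent, implicit leading 1, all-ones exponent reserved for infinity and NaN). The helper name `describe` is hypothetical and is not part of the Neuron SDK.

```python
# Exponent ("Range") and mantissa ("Precision") widths taken from the table.
FORMATS = {
    "FP32": (8, 23),
    "BF16": (8, 7),
    "FP16": (5, 10),
}


def describe(exp_bits, man_bits):
    """Return (largest finite value, spacing just above 1.0) for a layout.

    Assumes IEEE-754-style encoding; this is an illustration, not Neuron code.
    """
    bias = 2 ** (exp_bits - 1) - 1
    # The all-ones exponent is reserved, so the largest usable unbiased
    # exponent equals the bias.
    largest = (2 - 2.0 ** -man_bits) * 2.0 ** bias
    epsilon = 2.0 ** -man_bits
    return largest, epsilon


for name, (exp_bits, man_bits) in FORMATS.items():
    largest, epsilon = describe(exp_bits, man_bits)
    print(f"{name}: largest ~ {largest:.4e}, epsilon = {epsilon:.4e}")
```

Under these assumptions BF16 keeps the range of FP32 (same exponent width) while giving up mantissa bits, whereas FP16 has more mantissa bits than BF16 but a much smaller range.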

FP16/BF16 models#

Models natively trained in FP16/BF16 will be executed in their trained data-types. This is a straightforward migration from the training platform to Inf1.

FP32 models#

Neuron SDK supports automatic model conversion from FP32 to BF16 by default. This capability allows developers to train their models using FP32 format for the highest accuracy, and achieve performance benefits without having to worry about low-precision training (e.g. no need for loss-scaling during training). ML models are typically robust to FP32 to BF16 conversion, with minimal to no impact on accuracy. The conversion accuracy is model dependent; therefore, users are encouraged to benchmark the accuracy of the auto-converted model against the original FP32 trained model.

When the compiler is supplied with an unmodified FP32 model, it automatically compiles the model to run as BF16 on Inferentia. During inference, Inferentia converts the FP32 input data to BF16 internally and converts the output back to FP32. For explicit FP16 inference, either use an FP16 trained model, or use an external tool (like AMP) to make the explicit conversions.
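
To get a feel for the numeric effect of an FP32 to BF16 conversion, the following sketch emulates it on the host using only the Python standard library. It relies on BF16 being the upper 16 bits of the FP32 bit pattern and assumes round-to-nearest with ties to even. It does not call Neuron or reproduce how Inferentia performs the conversion, and the function names are hypothetical.

```python
import math
import struct


def fp32_to_bits(x):
    """Return the 32-bit pattern of x after rounding it to FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_fp32(bits):
    """Interpret a 32-bit pattern as an FP32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_bf16(x):
    """Round x to the nearest BF16 value (ties to even), as a Python float."""
    if math.isnan(x) or math.isinf(x):
        return x  # special values are passed through unchanged in this sketch
    bits = fp32_to_bits(x)
    keep_lsb = (bits >> 16) & 1  # lowest bit that survives the cut
    # Adding 0x7FFF plus the surviving LSB rounds halfway cases to even.
    rounded = (bits + 0x7FFF + keep_lsb) & 0xFFFF0000
    return bits_to_fp32(rounded)


for value in (1.0, math.pi, 0.001):
    converted = to_bf16(value)
    rel_err = abs(converted - value) / abs(value)
    print(f"{value!r} -> {converted!r} (relative error {rel_err:.2e})")
```

With 8 significant bits (7 stored plus the implicit leading 1), the relative rounding error for normal values stays at or below 2^-8, roughly 0.4% per value. Whether that matters for a given model is what the accuracy benchmark is meant to establish.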

NeuronCore v2 Data Types#

The NeuronCore v2 supports the following data types:

  • 32-bit and 16-bit Floating Point (FP32 / FP16)

  • TensorFloat-32 (TF32)

  • Brain Floating Point (BFloat16)

  • Unsigned 8-bit integer (UINT8)

Note

Neuron Compiler support for UINT8 is planned for a future Neuron SDK release. See Neuron Compiler: Enable Neuron INT8 support for details.

The layout for these is as follows:

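Data Type   Sign (S)   Range (exponent)   Precision (mantissa)
FP32        1 bit      8 bits             23 bits
TF32        1 bit      8 bits             10 bits
BF16        1 bit      8 bits             7 bits
FP16        1 bit      5 bits             10 bits
UINT8       none       8 bits (unsigned integer value; no exponent/mantissa split)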

Model Type Conversion#

The Neuron SDK supports automatic model conversion from FP32 to BF16 by default. This capability allows developers to train their models using FP32 format for the highest accuracy, and then achieve run-time performance benefits without having to worry about low-precision training (e.g. no need for loss-scaling during training). ML models are typically robust to FP32 to BF16 conversion, with minimal to no impact on accuracy. Since conversion accuracy is model dependent, users are encouraged to benchmark the accuracy of the auto-converted model against the original FP32 trained model.
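
One simple way to run such a benchmark is to feed the same inputs to both models and compare the outputs. The sketch below shows only the comparison step. `compare_outputs` and the sample numbers are hypothetical and purely illustrative, and the metrics (maximum error and top-1 agreement) are one possible choice among many.

```python
def compare_outputs(reference, candidate):
    """Summarise how far candidate outputs drift from FP32 reference outputs."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("outputs must be non-empty and of equal length")
    max_abs = max(abs(r - c) for r, c in zip(reference, candidate))
    # Normalise by the largest reference magnitude; fall back to 1.0 if all zero.
    scale = max(abs(r) for r in reference) or 1.0
    ref_top = max(range(len(reference)), key=reference.__getitem__)
    cand_top = max(range(len(candidate)), key=candidate.__getitem__)
    return {
        "max_abs_error": max_abs,
        "max_rel_error": max_abs / scale,
        "top1_match": ref_top == cand_top,
    }


# Made-up logits standing in for one sample's outputs from each model.
fp32_logits = [0.10, 2.50, -1.00]
bf16_logits = [0.1001, 2.4902, -1.0]

report = compare_outputs(fp32_logits, bf16_logits)
print(report)
```

In practice the comparison would be aggregated over a representative validation set, using the task's own accuracy metric where one exists.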

See Mixed Precision and Performance-accuracy Tuning for Training for more details on supported data types and their properties.

The Neuron compiler offers the --auto-cast and --auto-cast-type options to specify automatic casting of FP32 tensors to other data types to address performance and accuracy tradeoffs. See the Neuron Compiler CLI Reference Guide for a description of these options.
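
As an illustration, the two options can be assembled into an argument list before being handed to the compiler. The helper `auto_cast_flags` is hypothetical, and the values `all` and `bf16` are assumptions that should be checked against the Neuron Compiler CLI Reference Guide for the SDK release in use. The sketch only builds the flags and does not invoke the compiler.

```python
import shlex


def auto_cast_flags(mode, cast_type):
    """Build the two auto-cast options as an argument list (hypothetical helper)."""
    return ["--auto-cast", mode, "--auto-cast-type", cast_type]


# Assumed values: cast all FP32 operations, using BF16 as the target type.
flags = auto_cast_flags("all", "bf16")
print(shlex.join(flags))
```

Keeping the flags in a list avoids shell-quoting mistakes and makes it easy to try different cast settings when tuning the performance and accuracy tradeoff.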

NeuronCore v2 Rounding Modes#

Because floating point values are represented by a finite number of bits, they cannot represent all real numbers exactly. When the result of a floating point calculation needs more precision than its data type provides, the result is rounded. By default, the NeuronCore v2 uses Round-to-Nearest with ties to Even (RNE). It also provides a Stochastic Rounding mode. When Stochastic Rounding is enabled, the hardware rounds the value up or down at random, with a probability proportional to how close the value is to each of the two nearest representable values. This could lead to improved model convergence. Use the environment variable NEURON_RT_STOCHASTIC_ROUNDING_EN to select a rounding mode.
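
The difference between the two modes is easiest to see when many small updates are added to a larger value, as happens with weight updates during training. The sketch below is a host-side emulation in plain Python that rounds FP32 values to BF16 both ways. It is not the hardware implementation, all function names are hypothetical, and it handles finite inputs only.

```python
import random
import struct


def _bits(x):
    """32-bit pattern of x after rounding to FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def _value(bits):
    """FP32 value encoded by a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def round_nearest_even(x):
    """Round to the nearest BF16 value, ties to even."""
    bits = _bits(x)
    keep_lsb = (bits >> 16) & 1
    return _value((bits + 0x7FFF + keep_lsb) & 0xFFFF0000)


def round_stochastic(x, rng):
    """Round to one of the two neighbouring BF16 values at random.

    The probability of moving away from zero equals the fraction of the gap
    already covered by x, so the result is unbiased on average.
    """
    bits = _bits(x)
    truncated = bits & 0xFFFF0000
    fraction = (bits & 0xFFFF) / 0x10000
    if rng.random() < fraction:
        truncated += 0x10000
    return _value(truncated)


def accumulate(rounder, steps=1000, increment=0.001):
    """Add a small increment repeatedly, rounding to BF16 after every step."""
    total = 1.0
    for _ in range(steps):
        total = rounder(total + increment)
    return total


rng = random.Random(0)  # fixed seed so the run is repeatable
rne_total = accumulate(round_nearest_even)
sr_total = accumulate(lambda v: round_stochastic(v, rng))
print(f"exact: 2.0  RNE: {rne_total}  stochastic: {sr_total}")
```

With RNE, every increment of 0.001 is smaller than half the BF16 spacing near 1.0, so it is rounded away and the total never moves. Stochastic rounding occasionally rounds up instead, so the total tracks the exact sum on average. On Neuron hardware the mode is chosen through NEURON_RT_STOCHASTIC_ROUNDING_EN rather than in model code; the values it accepts are not listed in this section.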
