Mixed precision and performance-accuracy tuning (neuron-cc)#
The Neuron Compiler supports machine learning models with FP32, FP16, and BF16 (Bfloat16) tensors and operators. The Neuron hardware supports a mix of 32-bit and 16-bit data types. The available auto-cast methods and their performance/accuracy trade-offs are explained in this document.
Neuron Hardware#
The Neuron hardware supports matrix multiplication using FP16 or BF16 on its Matmult Engine, and accumulation using FP32. Similarly, operators such as activations and vector operations are supported in FP16, BF16, and FP32. Neuron supports tensor transpose in two ways: by fast matrix multiplication in FP16/BF16, or by slower byte-by-byte data movement.
Performance-accuracy tradeoffs for models trained in FP32#
Models that are trained using FP32 data types can be deployed on Neuron through ahead of time compilation using the Neuron Compiler.
Important
By default, the Neuron Compiler casts the FP32 weights and operations to BF16. Only partial sums are left in FP32. This default casting will generate the highest performance for an FP32 trained model, but not the best accuracy.
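To make the accuracy side of this trade-off concrete, the short standalone sketch below shows how much precision an FP32 value loses when narrowed to BF16. It is illustrative only: it truncates the mantissa, whereas the actual hardware cast may round, and it does not involve the Neuron Compiler at all.

```python
import numpy as np

def to_bf16(x: np.ndarray) -> np.ndarray:
    """Narrow FP32 to BF16 by zeroing the low 16 bits (illustrative
    truncation; the hardware cast may round instead)."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

w = np.array([0.1234567], dtype=np.float32)
# BF16 keeps FP32's exponent range but only 7 mantissa bits,
# so roughly 2-3 significant decimal digits survive the cast.
print(w[0], "->", to_bf16(w)[0])
```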
Using the --fast-math CLI option, you can choose the right tradeoff between performance and accuracy. The tradeoff is usually between achieving high performance and optimal accuracy, and the decision about which settings to use is application specific.
It is recommended that you start by compiling the model for the highest performance (the default); you can then test the accuracy of the application and, if needed, try the next higher-precision casting option until the desired accuracy and performance are achieved. A typical flow, illustrated in the sketch after the following list, can be:
1. Compile without options (the default) or with --fast-math all, which will optimize for performance.
2. If accuracy is not sufficient, try --fast-math fp32-cast-matmult.
3. If accuracy is not sufficient, try --fast-math fp32-cast-matmult no-fast-relayout.
4. If accuracy is not sufficient, try --fast-math none, which will optimize for accuracy.
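As a concrete illustration of this flow, the sketch below passes each setting through the compiler_args parameter of the PyTorch-Neuron trace API referenced later in this section. The model and input are placeholders, and the exact argument spelling should be checked against your framework's documentation:

```python
import torch
import torch_neuron  # registers the torch.neuron.trace API

# Placeholder model and example input, for illustration only.
model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU()).eval()
example = torch.rand(1, 128)

# Step 1: default (equivalent to --fast-math all) optimizes for performance.
step1 = torch.neuron.trace(model, example)

# Step 2: higher precision if accuracy is not sufficient.
step2 = torch.neuron.trace(
    model, example, compiler_args=['--fast-math', 'fp32-cast-matmult'])

# Step 3: additionally request the lossless transpose.
step3 = torch.neuron.trace(
    model, example,
    compiler_args=['--fast-math', 'fp32-cast-matmult', 'no-fast-relayout'])

# Step 4: optimize for accuracy.
step4 = torch.neuron.trace(
    model, example, compiler_args=['--fast-math', 'none'])
```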
Between steps 2 and 3, and between steps 3 and 4, you have additional options that provide different levels of accuracy; these are explained in the section below.
Note that the compiler has to preserve the input/output (i/o) tensor types requested by the framework, therefore no casting is done on the i/o tensors. Additional speedup can be obtained by casting them in the framework prior to compilation.
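For example, in a PyTorch-Neuron workflow the i/o dtypes follow the example inputs given at trace time, so casting the model and inputs to a 16-bit type in the framework before compilation removes the FP32 i/o requirement. This is a hedged sketch; verify 16-bit i/o support for your framework and model before relying on it:

```python
import torch
import torch_neuron

model = torch.nn.Linear(128, 64).eval()

# Tracing with FP32 inputs forces FP32 i/o tensors. Casting the model
# and example input to FP16 first lets the compiled graph use 16-bit i/o.
model_fp16 = model.half()
example_fp16 = torch.rand(1, 128).half()

neuron_fp16 = torch.neuron.trace(model_fp16, example_fp16)
```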
To learn how to use compiler command line interface (CLI) options with your application’s framework, please see PyTorch-Neuron trace python API, TensorFlow 1.x (tensorflow-neuron) Compilation API and TensorFlow 2.x (tensorflow-neuron) Tracing API.
Compiler casting options#
--fast-math option#
The --fast-math option is intended to replace the --fp32-cast option. It is recommended to start using, or migrate to, the --fast-math option. The --fast-math option provides the same level of functionality as the --fp32-cast option, in addition to the following:
- The --fast-math option introduces the no-fast-relayout option to enable a lossless transpose operation. This was not possible with the --fp32-cast option.
- The --fast-math option provides finer control than the --fp32-cast option. The transpose operation and the cast operation are controlled independently, as shown in the sketch below:
  - no-fast-relayout and fast-relayout control the transpose operation.
  - fp32-cast-* options control the casting.
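To illustrate this independent control, the sketch below combines a cast option with a transpose option in a single --fast-math argument list. The option spellings are taken from this page; their exact semantics should be confirmed in the CLI reference linked below:

```python
# Cast control (fp32-cast-*) and transpose control (fast-relayout /
# no-fast-relayout) can be combined in one --fast-math argument list.
# These lists can be passed as compiler_args to the framework trace APIs.
fast_cast_fast_transpose = ['--fast-math', 'fp32-cast-all', 'fast-relayout']
fast_cast_lossless       = ['--fast-math', 'fp32-cast-all', 'no-fast-relayout']
safer_cast_lossless      = ['--fast-math', 'fp32-cast-matmult', 'no-fast-relayout']
```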
See the detailed list of these options in the Neuron compiler CLI Reference Guide (neuron-cc).