Neuron Compiler FAQ (neuronx-cc)
The one-time compilation step from a standard framework-level model to a NEFF binary may be performed on any EC2 instance, or even on-premises.
We recommend using a high-performance compute server (such as a C5 or z1d instance type) for the fastest compile times, and a prebuilt DLAMI for ease of use. Developers can also install Neuron in their own environments. This approach can work well, for example, when building a large fleet for inference: model creation, training, and compilation are done on the training fleet, and the resulting NEFF files are distributed to the inference fleet by a configuration management application.
Developers who want to train their models in FP32 for best accuracy can compile and deploy them with Neuron. The Neuron compiler automatically converts FP32 to internally supported datatypes, such as FP16 or BF16. You can find more details about FP32 data type support and performance and accuracy tuning in Mixed Precision and Performance-accuracy Tuning (neuronx-cc) or Mixed precision and performance-accuracy tuning (neuron-cc). The Neuron compiler preserves the application interface: FP32 inputs and outputs. However, transferring such large tensors may become a bottleneck for your application, so you can improve execution time by casting the inputs and outputs to FP16 or BF16 in the ML framework prior to compilation.
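As a rough illustration of the transfer savings from casting inputs, the sketch below uses NumPy; since NumPy has no native BF16 type, FP16 stands in for either reduced-precision datatype. The tensor shape is hypothetical.

```python
import numpy as np

# Hypothetical FP32 input tensor destined for a compiled model.
inputs_fp32 = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Casting to FP16 in the framework before compilation halves the bytes
# moved across the host-device boundary. (FP16 stands in for BF16 here,
# since NumPy has no native BF16 dtype.)
inputs_fp16 = inputs_fp32.astype(np.float16)

print(inputs_fp32.nbytes)  # 602112 bytes
print(inputs_fp16.nbytes)  # 301056 bytes: half the transfer cost
```

The same idea applies to outputs: declaring them in a reduced-precision type avoids a full-width FP32 copy back to the host.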
You can list the supported operators with the neuronx-cc list-operators command on the CLI. See the Neuron Compiler CLI Reference Guide (neuronx-cc).
To request support for new operators, open an issue on our GitHub forum.
Models with control flow and dynamic shapes are not currently supported. You will need to partition the model using the framework prior to compilation.
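A minimal sketch of what such partitioning looks like, in plain Python rather than any Neuron API: the data-dependent control flow stays in host/framework code, and only the fixed-shape core is what you would trace and compile. All names here are illustrative.

```python
def core_block(x):
    """Static-shape computation: the part you would trace and
    compile to a NEFF (illustrative stand-in)."""
    return [v * 2.0 for v in x]

def full_model(x, steps):
    """Data-dependent control flow (the loop bound varies with input)
    stays on the host and invokes the compiled core per iteration."""
    for _ in range(steps):
        x = core_block(x)
    return x

print(full_model([1.0, 2.0], steps=3))  # [8.0, 16.0]
```

In a real framework, core_block would be the traced submodule, while full_model remains ordinary framework code.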
The compiler and runtime are committed to maintaining compatibility with each other across major version releases. Versioning is defined as major.minor, with compatibility guaranteed for all versions that share the same major number. If the versions mismatch, an error notification is logged and the load will fail; the model must then be recompiled.
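The compatibility rule above can be sketched as a simple check; this is an illustration of the major.minor scheme, not the actual Neuron runtime logic, and the version strings are made up.

```python
def versions_compatible(compiler_ver: str, runtime_ver: str) -> bool:
    """Illustrative check of the major.minor scheme: versions are
    compatible whenever the major numbers match."""
    compiler_major = int(compiler_ver.split(".")[0])
    runtime_major = int(runtime_ver.split(".")[0])
    return compiler_major == runtime_major

# Same major number: the NEFF loads.
assert versions_compatible("2.5", "2.9")
# Major mismatch: the load fails and the model must be recompiled.
assert not versions_compatible("1.8", "2.0")
```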
A utility to help with this will be released soon.
It depends on the model and its size and complexity, but this generally takes a few minutes.
The NeuronCore-v2 architecture supports multiple casting modes for floating-point numbers, each with implications for performance and accuracy. The default casting mode is a pragmatic balance between performance and accuracy; however, on some models it may result in a loss of precision.
See the --auto-cast and --auto-cast-type options in the Neuron Compiler CLI Reference Guide (neuronx-cc) for details on how to adjust the casting mode.
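The precision cost of down-casting can be seen with a round trip through a reduced-precision type. FP16 stands in for the compiler's internal datatypes here, since NumPy has no native BF16.

```python
import numpy as np

# An FP32 value round-tripped through FP16 (a stand-in for the
# compiler's internal reduced-precision datatypes).
x = np.float32(0.1)
x_cast = np.float32(np.float16(x))

# The cast introduces a small rounding error: the accuracy cost that
# the casting modes trade against performance.
error = abs(float(x) - float(x_cast))
print(error)  # small but nonzero
```

Whether an error of this magnitude matters is model-dependent, which is why the casting mode is adjustable.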
neuronx-cc has explicit support for select model families via the --model-type option, though many other model types are also supported. You can inspect the supported operators using the list-operators sub-command. See the Neuron Compiler CLI Reference Guide (neuronx-cc) for details.
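One way to use the operator listing is to diff it against the operators your model needs. The sketch below assumes, purely for illustration, that the command's output contains one operator name per line; the operator names and model are hypothetical.

```python
# Stand-in for captured "neuronx-cc list-operators" output
# (assumed format: one operator name per line).
supported_output = """Add
Conv
MatMul
Relu"""

supported = set(supported_output.splitlines())

# Hypothetical set of operators used by your model.
model_ops = {"Conv", "Relu", "TopK"}

unsupported = model_ops - supported
print(sorted(unsupported))  # ['TopK'] -> candidates for a feature request
```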
More generally, support for new operators and models is continually being added. See our Roadmap for details.