Neuron Compiler CLI Reference Guide (neuronx-cc)
This document describes the command line interface of the Neuron Compiler.
This reference is not relevant for applications that run the Neuron Compiler from within a machine learning framework (PyTorch-Neuron, for example), since those frameworks pass options directly to the compiler. Using the compiler command line may be desirable for applications that do not use a framework or that customize existing frameworks. It is also possible to specify compiler options within the framework, which forwards them to the compiler via the NEURON_CC_FLAGS environment variable.
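As a sketch of the framework path, compiler options can be forwarded through the NEURON_CC_FLAGS environment variable before the framework triggers compilation. The specific flag values below are illustrative, not required:

```python
import os

# Forward compiler options through the framework; the flag values below are
# illustrative, not required.
os.environ["NEURON_CC_FLAGS"] = "--model-type transformer --auto-cast matmult"

# Frameworks such as PyTorch-Neuron read this variable when they invoke the
# compiler, so it must be set before compilation is triggered.
print(os.environ["NEURON_CC_FLAGS"])
```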
Optional parameters are shown in square brackets.
Neuron Compiler Command-Line Interface
- neuronx-cc <command> [parameters]
Common parameters for the Neuron CLI:
--verbose: Specify the level of output produced by the compiler. (Default:
info: Informational messages regarding the progress of model compilation (written to stdout).
warning: Diagnostic messages that report model code that is not inherently erroneous but may be risky or suggest there may have been an error (written to stderr).
error: The compiler detected a condition that prevented it from completing the compilation successfully (written to stderr).
critical: The compiler encountered an unrecoverable error and terminates immediately (written to stderr).
debug: Extensive information regarding the compiler’s internal execution phases (written to stdout).
--help: Display a usage message of compiler options.
Use neuronx-cc <command> --help for information on a specific command.
- neuronx-cc compile [parameters]
Compile a model for use on the AWS Machine Learning Accelerator.
neuronx-cc compile <model_files> --framework <framework_name> --target <instance_family> [--model-type <model>] [--auto-cast <cast_mode>] [--auto-cast-type <data_type>] [--distribution-strategy <distribution_type>] [--enable-fast-context-switch] [--enable-fast-loading-neuron-binaries] [--logfile <filename>] [--output <filename>]
<model_files>: Input containing model specification.
The number of arguments required varies between frameworks:
XLA: A local filename of a HLO file (hlo.pb) generated via XLA. See hlo.proto for the .proto description and inspect-compiled-programs for more information on how to generate such files.
--framework: Framework used to generate the training model.
--target: Name of the Neuron instance family on which the compiled model will be run.
--model-type: Permit the compiler to attempt model-specific optimizations based upon the type of model being compiled. (Default:
generic: Perform optimizations applicable to all types of inference and training models.
transformer: Perform optimizations specific to Transformer <https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)> models.
unet-inference: Perform optimizations specific to certain U-Net <https://en.wikipedia.org/wiki/U-Net> model architectures when performing inference. U-Net models often have certain structures that result in excessive performance-impacting data transfers; this option allows the compiler to apply additional memory optimizations to prevent these data transfers and also allows the compiler to map larger normalization operators which would otherwise not successfully execute.
--distribution-strategy: Specify the type of data parallelism sharding being used in the model.
FSDP: Enable the compiler to perform optimizations applicable to models that use the PyTorch 2.x Fully Sharded Data Parallel (FSDP) APIs <https://pytorch.org/docs/stable/fsdp.html> to shard parameters, gradients, and optimizer states across data-parallel workers. These optimizations can also apply to models that use DeepSpeed <https://www.deepspeed.ai/> to implement distributed execution.
--auto-cast: Controls how the compiler makes tradeoffs between performance and accuracy for FP32 operations. (Default:
matmult: Only cast FP32 operations that use the Neuron matrix-multiplication engine.
all: Cast all FP32 operations to achieve highest performance. This option can potentially lower precision/accuracy.
none: Leave all data types as defined in the model. Do not apply auto-casting data type optimizations.
A more complete discussion on how to use this option and its arguments is in Mixed Precision and Performance-accuracy Tuning for Training.
If the --auto-cast option is specified, the --auto-cast-type compiler flag can optionally be set to define which lower-precision data type the compiler should use.
--auto-cast-type: When auto-cast mode is enabled, cast the FP32 operators to the lower-precision data type specified by this option. (Default:
bf16: Cast the FP32 operations selected via the --auto-cast option to BF16 to achieve highest performance and preserve dynamic range.
fp16: Cast the FP32 operations selected via the --auto-cast option to FP16 to achieve improved performance relative to FP32 and increased precision relative to BF16.
tf32: Cast the FP32 operations selected via the --auto-cast option to TensorFloat-32.
fp8_e4m3: Cast the FP32 operations selected via the --auto-cast option to a signed 8-bit floating point representation with a 4-bit exponent and 3-bit mantissa.
If multiple competing options are specified, the option appearing later on the command line supersedes the earlier ones.
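The "later option wins" rule matches standard command-line parsing semantics, which a small sketch with Python's argparse illustrates (argparse keeps the last value of a repeated option; the defaults shown are illustrative, not the compiler's documented ones):

```python
import argparse

# Sketch of the "later option supersedes earlier" rule using standard
# argparse semantics, which keep the last value of a repeated option.
# The defaults here are illustrative, not the compiler's documented ones.
parser = argparse.ArgumentParser()
parser.add_argument("--auto-cast", default="matmult")
parser.add_argument("--auto-cast-type", default="bf16")

# --auto-cast appears twice; the later value, "none", is kept.
args = parser.parse_args(["--auto-cast", "all", "--auto-cast", "none"])
print(args.auto_cast)
```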
--enable-fast-context-switch: Optimize for faster model switching rather than execution latency.
This option will defer loading some weight constants until the start of model execution. This results in overall faster system performance when your application switches between models frequently on the same Neuron Core (or set of cores).
--enable-fast-loading-neuron-binaries: Save the compilation output file in an uncompressed format.
This creates executable files which are larger in size but faster for the Neuron Runtime to load into memory during model execution.
--logfile: Filename where compiler writes log messages. (Default: “log-neuron-cc.txt”).
--output: Filename where compilation output (NEFF archive) will be recorded. (Default: “file.neff”)
Compiling an XLA HLO:
neuronx-cc compile bert-model.hlo --framework XLA --target trn1 --model-type transformer --output bert.neff
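An invocation like the one above can also be assembled programmatically, for example from a build script. This sketch uses hypothetical file names and only prints the command rather than running the compiler:

```python
import shlex

# Assemble a `neuronx-cc compile` invocation; "model.hlo.pb" and
# "model.neff" are hypothetical file names.
cmd = [
    "neuronx-cc", "compile", "model.hlo.pb",
    "--framework", "XLA",
    "--target", "trn1",
    "--output", "model.neff",
]

print(shlex.join(cmd))
# To actually run it (requires neuronx-cc on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```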
- neuronx-cc list-operators [parameters]
Returns a newline (‘\n’) separated list of operators supported by the Neuron Compiler.
neuronx-cc list-operators --framework <value>
--framework: Framework in which the operators were registered.
XLA: Operator names will be formatted according to the value used by the XLA compiler in XlaBuilder.
neuronx-cc list-operators --framework XLA ...
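Because the output is newline-separated, checking whether a particular operator is supported is straightforward. The operator names in this sketch are hypothetical sample output, not the compiler's actual list:

```python
# Parse the newline-separated output of
# `neuronx-cc list-operators --framework XLA`; the operator names below are
# hypothetical sample output, not the compiler's actual list.
sample_output = "Add\nConvolution\nDot\n"
supported = set(sample_output.splitlines())

print("Dot" in supported)
```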
Exit status:
0: Compilation succeeded.
Non-zero: An error occurred during compilation.
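A wrapper script can branch on the exit status in the usual way. In this sketch a Python interpreter call stands in for the compiler, since neuronx-cc may not be installed where this runs:

```python
import subprocess
import sys

# Branch on the compiler's exit status. A Python interpreter call stands in
# for neuronx-cc here, which may not be installed.
result = subprocess.run([sys.executable, "-c", "import sys; sys.exit(0)"])

if result.returncode == 0:
    print("compilation succeeded")
else:
    print(f"compilation failed with status {result.returncode}")
```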