This document is relevant for: Inf2, Trn1, Trn1n

Neuron Compiler CLI Reference Guide (neuronx-cc)#

This document describes the command line interface of the Neuron Compiler.

This reference is not relevant for applications that run the Neuron Compiler from within a machine learning framework (PyTorch-Neuron, for example), since the framework passes these options directly to the compiler. Using the compiler command line may be desirable for applications that do not use a framework or that customize an existing framework. It is also possible to specify compiler options from within the framework; these are forwarded to the compiler via the NEURON_CC_FLAGS environment variable.
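
For example, a framework-based application might forward options through the NEURON_CC_FLAGS environment variable as follows (a minimal sketch; the flag values are illustrative and train.py is a placeholder for your framework script):

export NEURON_CC_FLAGS="--model-type transformer --optlevel 2"
python train.py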

Usage#

Optional parameters are shown in square brackets.

Neuron Compiler Command-Line Interface

neuronx-cc <command> [parameters]#

Common parameters for the Neuron CLI:

  • --verbose: Specify the level of output produced by the compiler. (Default: warning) A usage example follows this list.

    Valid values:

    • info: Informational messages regarding the progress of model compilation (written to stdout).

    • warning: Diagnostic messages that report model code that is not inherently erroneous but may be risky or suggest there may have been an error (written to stderr).

    • error: The compiler detected a condition that prevented it from completing the compilation successfully (written to stderr).

    • critical: The compiler encountered an unrecoverable error and terminates immediately (written to stderr).

    • debug: Extensive information regarding the compiler’s internal execution phases (written to stdout).

  • --help: Display a usage message of compiler options.

    Use neuronx-cc <command> --help for information on a specific command.
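
For example, to raise the verbosity of a compilation run (a sketch; model.hlo is a placeholder input file):

neuronx-cc compile model.hlo --framework XLA --target trn1 --verbose info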

Available Commands:#

  • compile

  • list-operators

neuronx-cc compile [parameters]#

Compile a model for use on the AWS Machine Learning Accelerator.

neuronx-cc compile <model_files>
--framework <framework_name>
--target <instance_family>
[--model-type <model>]
[--auto-cast <cast_mode>]
[--auto-cast-type <data_type>]
[--distribution-strategy <distribution_type>]
[--optlevel <opt-level> | -O <opt-level>]
[--enable-saturate-infinity]
[--enable-fast-context-switch]
[--enable-fast-loading-neuron-binaries]
[--logfile <filename>]
[--output <filename>]

Compile Parameters:

  • <model_files>: Input containing model specification.

    The number of arguments required varies between frameworks:

    • XLA: The local filename of an HLO file (hlo.pb) generated via XLA. See hlo.proto for the .proto description and inspect-compiled-programs for more information on how to generate such files; an illustrative dump command also appears under Example below.

  • --framework: Framework used to generate training model.

    Valid values:

    • XLA

  • --target: Name of the Neuron instance family on which the compiled model will be run.

    Valid values:

    • inf2

    • trn1

    • trn1n

  • --model-type: Permit the compiler to attempt model-specific optimizations based upon the type of model being compiled. (Default: generic)

    Valid values:

    • generic: Perform optimizations applicable to all types of inference and training models.

    • transformer: Perform optimizations specific to Transformer models.

    • unet-inference: Perform optimizations specific to certain U-Net model architectures when performing inference. U-Net models often contain structures that cause excessive, performance-impacting data transfers; this option allows the compiler to apply additional memory optimizations that prevent these transfers, and to map larger normalization operators that would otherwise fail to execute.

  • --auto-cast: Controls how the compiler makes tradeoffs between performance and accuracy for FP32 operations. (Default: matmult)

    Valid values:

    • matmult: Only cast FP32 operations that use the Neuron matrix-multiplication engine.

    • all: Cast all FP32 operations to achieve highest performance. This option can potentially lower precision/accuracy.

    • none: Leave all data types as defined in the model. Do not apply auto-casting data type optimizations.

    A more complete discussion on how to use this option and its arguments is in Mixed Precision and Performance-accuracy Tuning for Training.

    Note

    If the --auto-cast option is specified, the --auto-cast-type compiler flag can be optionally set to define which lower-precision data type the compiler should use.

  • --auto-cast-type: When auto-cast mode is enabled, cast the FP32 operators to the lower-precision data type specified by this option. (Default: bf16)

    Valid values:

    • bf16: Cast the FP32 operations selected via the --auto-cast option to BF16 to achieve highest performance and preserve dynamic range.

    • fp16: Cast the FP32 operations selected via the --auto-cast option to FP16 to achieve improved performance relative to FP32 and increased precision relative to BF16.

    • tf32: Cast the FP32 operations selected via the --auto-cast option to TensorFloat-32.

    • fp8_e4m3: Cast the FP32 operations selected via the --auto-cast option to a signed 8-bit floating-point value represented with a 4-bit exponent and a 3-bit mantissa.

    Note

    If multiple competing options are specified, the right-most option on the command line supersedes the previous ones.

  • --distribution-strategy: Permit the compiler to attempt optimizations applicable to the type of distributed training being performed. (Default: generic)

    Valid values:

    • llm-training: Enable the compiler to perform optimizations applicable to large language model (LLM) training runs that shard parameters, gradients, and optimizer states across data-parallel workers. This is equivalent to the previously documented option argument value of NEMO, which will be deprecated in a future release.

  • --optlevel: Specify the level of optimization the compiler should perform. Possible numeric values are {1, 2, 3}. (Default: 2)

    Valid values:

    • 1: Enables the core performance optimizations in the compiler while minimizing compile time.

    • 2: [default] Provides the best balance between model performance and compile time.

    • 3: May provide additional model execution performance but may incur longer compile times and higher host memory usage during model compilation.

    Note

    This option supersedes, and deprecates, the --enable-experimental-O1 option introduced in an earlier release.

  • --enable-saturate-infinity: Convert +/- infinity values to MAX/MIN_FLOAT for certain computations that have a high risk of generating Not-a-Number (NaN) values. There is a potential performance impact during model execution when this conversion is enabled.

  • --enable-fast-context-switch: Optimize for faster model switching rather than execution latency.

    This option will defer loading some weight constants until the start of model execution. This results in overall faster system performance when your application switches between models frequently on the same Neuron Core (or set of cores).

  • --enable-fast-loading-neuron-binaries: Save the compilation output file in an uncompressed format.

    This creates executable files which are larger in size but faster for the Neuron Runtime to load into memory during model execution.

  • --logfile: Filename where compiler writes log messages. (Default: “log-neuron-cc.txt”).

  • --output: Filename where compilation output (NEFF archive) will be recorded. (Default: “file.neff”)

Example:

Compiling an XLA HLO:

neuronx-cc compile bert-model.hlo --framework XLA --target trn1 --model-type transformer --output bert.neff
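
Generating an HLO protobuf to compile is framework-specific; one possible approach is to have XLA dump the HLO during a framework run. The sketch below assumes the standard XLA debug flags --xla_dump_hlo_as_proto and --xla_dump_to, with train.py as a placeholder script:

XLA_FLAGS="--xla_dump_hlo_as_proto --xla_dump_to=/tmp/hlo_dump" python train.py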
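
Compiling with explicit casting and optimization settings (the values shown are illustrative, not recommendations):

neuronx-cc compile bert-model.hlo --framework XLA --target trn1 --auto-cast all --auto-cast-type bf16 --optlevel 3 --output bert.neff
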
neuronx-cc list-operators [parameters]#

Returns a newline (‘\n’)-separated list of the operators supported by the Neuron Compiler.

neuronx-cc list-operators
--framework <value>

List-Operators Parameters:

  • --framework: Framework in which the operators were registered.

    Valid values:

    • XLA: Operator names will be formatted according to the values used by the XLA compiler in XlaBuilder.

Example:

neuronx-cc list-operators --framework XLA
...

Exit Statuses:

  • 0: Compilation succeeded.

  • Non-zero: An error occurred during compilation.
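
A minimal shell sketch for acting on the exit status (model.hlo is a placeholder):

neuronx-cc compile model.hlo --framework XLA --target trn1 || echo "compilation failed with status $?"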

This document is relevant for: Inf2, Trn1, Trn1n