This document is relevant for: Inf1

Neuron compiler CLI Reference Guide (neuron-cc)#

This document describes the command line interface of the Neuron compiler. This reference is not relevant for applications that run neuron-cc from within a machine learning framework (TensorFlow-Neuron, for example), since the framework passes these options directly to neuron-cc.

Using neuron-cc on the command line may be desirable for applications that do not use a framework, or that customize an existing framework. It is also possible to supply these CLI options to the framework as arguments to be passed through to the compiler.

Usage#

Optional parameters are shown in square brackets. See the individual framework guides for the correct syntax.

Neuron Compiler CLI

neuron-cc [options] <command> [parameters]#

Common options for the Neuron CLI:

  • --verbose (string) default=“WARN”:

    Valid values:

    • DEBUG

    • INFO

    • WARN

    • ERROR

Use neuron-cc <command> --help for information on a specific command.
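
For example, the global --verbose option precedes the command. A minimal illustration (model.pb and config.json are placeholder file names):

neuron-cc --verbose INFO compile model.pb --framework TENSORFLOW --io-config config.json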

Available Commands:#

  • compile

  • list-operators

neuron-cc compile [parameters]#

Compile a model for use on the AWS Inferentia Machine Learning Accelerator.

neuron-cc compile <file names> --framework <value> --io-config <value> [--neuroncore-pipeline-cores <value>] [--enable-saturate-infinity] [--enable-fast-loading-neuron-binaries] [--enable-fast-context-switch] [--fp32-cast cast-method] [--fast-math cast-method] [--output <value>]

Compile Parameters:

  • <file names>: Input file(s) containing the model specification. The number of arguments required varies between frameworks; see the examples below.

  • --framework (string): Framework in which the model was trained.

    Valid values:

    • TENSORFLOW

    • MXNET

    • XLA

  • --neuroncore-pipeline-cores (int) (default=1): Number of NeuronCores to be used in “NeuronCore Pipeline” mode. Compile for the given number of NeuronCores so as to leverage NeuronCore Pipeline mode.

    Note

    This is not used to define the number of NeuronCores for a data parallel deployment (i.e. the same model on multiple NeuronCores); that is a runtime/framework configuration choice. Refer to the Runtime/Framework documentation for data parallel deployment options.

  • --output (string) (default=“out.neff”): Filename where compilation output (NEFF archive) will be recorded.

  • --io-config (string): Configuration containing the names and shapes of input and output tensors.

    The io-config can be specified as a local filename, a URI, or a string containing the io-config itself (an inline example is shown under Examples below).

    The io-config must be formatted as a JSON object with two members, “inputs” and “outputs”. “inputs” is an object mapping input tensor names to an array of shape and data type. “outputs” is an array of output tensor names. Consider the following example:

    {
     "inputs": {
        "input0:0": [[1,100,100,3], "float16"],
        "input1:0": [[1,100,100,3], "float16"]
     },
     "outputs": ["output:0"]
    }
    
  • --enable-saturate-infinity : Convert +/- infinity values to MAX/MIN_FLOAT for certain computations that have a high risk of generating Not-a-Number (NaN) values. There is a potential performance impact during model execution when this conversion is enabled.

  • --enable-fast-loading-neuron-binaries : Write the compilation output (NEFF archive) in uncompressed format which results in faster loading of the archive during inference.

  • --enable-fast-context-switch : Optimize for faster model switching rather than inference latency. This results in overall faster system performance when your application switches between models frequently on the same NeuronCore (or set of cores). For example, the optimization triggered by this option defers loading some weight constants until the start of inference.

  • --fast-math : Controls the tradeoff between performance and accuracy for fp32 operators. See Mixed precision and performance-accuracy tuning (neuron-cc) for suggestions on how to use the arguments below. A combined example is shown after this parameter list.

    • all (Default): enables all optimizations that improve performance. This option can potentially lower precision/accuracy.

    • none : Disables all optimizations that improve performance. This option will provide best precision/accuracy.

    • Tensor transpose options

      • fast-relayout: Only enables the fast relayout optimization, which improves performance by using the matrix multiplier for tensor transpose. The data type used for the transpose is either FP16 or BF16, controlled by the fp32-cast-* options below.

      • no-fast-relayout: Disables the fast relayout optimization; tensor transpose is then bit-accurate (lossless) but slightly slower.

    • Casting options

      • fp32-cast-all (Default): Cast all FP32 operators to BF16 to achieve highest performance and preserve dynamic range. Same as setting --fp32-cast all.

      • fp32-cast-all-fp16: Cast all FP32 operators to FP16 to achieve speed up and increase precision versus BF16. Same setting as --fp32-cast all-fp16.

      • fp32-cast-matmult: Only cast FP32 operators that use Neuron Matmult engine to BF16 while using FP16 for matmult-based transpose to get better accuracy. Same as setting --fp32-cast matmult.

      • fp32-cast-matmult-bf16: Cast only FP32 operators that use Neuron Matmult engine (including matmult-based transpose) to BF16 to preserve dynamic range. Same as setting --fp32-cast matmult-bf16.

      • fp32-cast-matmult-fp16: Cast only FP32 operators that use Neuron Matmult engine (including matmult-based transpose) to FP16 to better preserve precision. Same as setting --fp32-cast matmult-fp16.

    Important

    • all and none are mutually exclusive

    • all is equivalent to using fp32-cast-all fast-relayout (best performance)

    • none is equivalent to using fp32-cast-matmult-bf16 no-fast-relayout (best accuracy)

    • fp32-cast-* options are mutually exclusive

    • fast-relayout and no-fast-relayout are mutually exclusive

    • The fp32-cast-* and *-fast-relayout options override the corresponding default behavior of all and none.

    • For backward compatibility, the --fp32-cast option takes priority over --fast-math: if --fp32-cast is present explicitly, it overrides the FP32 casting behavior selected by any --fast-math option.

  • --fp32-cast : Refine the automatic casting of fp32 tensors. This option is being replaced by the newer --fast-math.

    Important

    • The --fp32-cast option is being deprecated; --fast-math will replace it in future releases.

    • --fast-math introduces the no-fast-relayout option to enable a lossless transpose operation.

    The --fp32-cast option is an interface for controlling the performance and accuracy tradeoffs. Many of the --fast-math values invoke (override) it.

    • all (default): Cast all FP32 operators to BF16 to achieve speed up and preserve dynamic range.

    • matmult: Cast only FP32 operators that use Neuron Matmult engine to BF16 while using FP16 for matmult-based transpose to get better accuracy.

    • matmult-fp16: Cast only FP32 operators that use Neuron Matmult engine (including matmult-based transpose) to FP16 to better preserve precision.

    • matmult-bf16: Cast only FP32 operators that use Neuron Matmult engine (including matmult-based transpose) to BF16 to preserve dynamic range.

    • all-fp16: Cast all FP32 operators to FP16 to achieve speed up and better preserve precision.
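
To illustrate the combinations above, a --fast-math casting option can be paired with a relayout option in a single invocation. A minimal sketch (model.pb and config.json are placeholder file names):

neuron-cc compile model.pb --framework TENSORFLOW --io-config config.json --fast-math fp32-cast-matmult no-fast-relayout --output model.neff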

Log Levels:

Logs at levels “trace”, “debug”, and “info” will be written to STDOUT.

Logs at levels “warn”, “error”, and “fatal” will be written to STDERR.
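
Because the streams are split this way, compiler output and diagnostics can be captured separately; for example (placeholder file names):

neuron-cc --verbose DEBUG compile model.pb --framework TENSORFLOW --io-config config.json > compile.log 2> compile.err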

Exit Status

0 - Compilation succeeded

>0 - An error occurred during compilation.
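
This convention makes neuron-cc easy to script; a minimal sketch (placeholder file names):

neuron-cc compile model.pb --framework TENSORFLOW --io-config config.json --output model.neff
if [ $? -ne 0 ]; then
    echo "neuron-cc: compilation failed" >&2
    exit 1
fi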

Examples

Compiling a saved TensorFlow model:

neuron-cc compile test_graph_tfmatmul.pb --framework TENSORFLOW --io-config test_graph_tfmatmul.config

Compiling an MXNet model:

neuron-cc compile lenet-symbol.json lenet-0001.params --framework MXNET --neuroncore-pipeline-cores 2 --output file.neff

Compiling an XLA HLO:

neuron-cc compile bert-model.hlo --framework XLA  --output file.neff
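
Passing the io-config as an inline JSON string instead of a file (tensor names are illustrative, taken from the io-config example above):

neuron-cc compile test_graph_tfmatmul.pb --framework TENSORFLOW --io-config '{"inputs": {"input0:0": [[1,100,100,3], "float16"]}, "outputs": ["output:0"]}'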

neuron-cc list-operators [parameters]#

Returns a newline ('\n') separated list of operators supported by the NeuronCore.

  • TENSORFLOW: Operators will be formatted according to the value passed to the associated REGISTER_OP(“OperatorName”) macro.

    See https://www.tensorflow.org/guide/create_op#define_the_op_interface for more information regarding operator registration in TensorFlow.

  • MXNET: Operator names will be formatted according to the value passed to the associated NNVM_REGISTER_OP(operator_name) macro.

  • XLA: Operator names will be formatted according to the value used by the XLA compiler in XlaBuilder.

    See https://www.tensorflow.org/xla/operation_semantics for more information regarding XLA operation semantics.

neuron-cc list-operators --framework <value>

  • --framework (string): Framework in which the operators were registered.

    Valid values:

    • TENSORFLOW

    • MXNET

    • XLA

Exit Status

0 - Call succeeded

>0 - An error occurred

Example

$ neuron-cc list-operators --framework TENSORFLOW
AddN
AdjustContrastv2
CheckNumerics
...
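
Since the output is newline-separated, it can be filtered with standard tools; for example, to check whether a specific operator is supported:

$ neuron-cc list-operators --framework TENSORFLOW | grep -x AddN
AddN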

This document is relevant for: Inf1