This document is relevant for: Inf1

Neuron Compiler (neuron-cc) for Inf1 Release Notes#

Introduction#

This document lists the release notes for the AWS Neuron compiler (neuron-cc). The Neuron Compiler is an ahead-of-time compiler that ensures Neuron will optimally utilize the Inferentia chips.

Operator support for each input format can be queried directly from the compiler:

neuron-cc list-operators --framework {TENSORFLOW | MXNET | XLA}

The supported operators are also listed here:

TensorFlow: TensorFlow Neuron (tensorflow-neuron (TF1.x)) Supported operators

PyTorch: PyTorch Neuron (torch-neuron) Supported operators

XLA: TensorFlow Neuron (tensorflow-neuron (TF1.x)) Supported operators [XLA]

Apache MXNet: Neuron Apache MXNet Supported operators

Known issues and limitations - updated 11/23/2022#

  • There is a known issue of increased latency and lower throughput when the MLM head is compiled together with the BERT model. The workaround is to compile them separately and feed the raw BERT output into the head.

  • TensorFlow 2.x - In this release, supported operators are limited to those used in BERT-like models; in particular, no conv2d or reduce-window operators are available.

  • Control flow - Neuron only supports control-flow operators that are static at compile time, for example static-length RNN, top-k, and sort.

  • Data layout - The Neuron compiler supports multiple data layout formats (NCHW, NHWC, …). Non-CNHW input/output data layouts require Neuron to insert additional transpose operations, causing a degradation in performance.

  • Primary inputs in NeuronCore Pipeline mode - When a neural network is executed in NeuronCore Pipeline mode, only the first operator in the neural network can receive primary inputs from the host.

  • Reduce data type - The INT8 data type is not currently supported by the Neuron compiler.

  • NeuronCore Pipeline: NeuronCore Pipeline mode provides low latency and high throughput for small batch sizes. We recommend starting with batch=1 and gradually increasing the batch size to fine-tune your model's throughput and latency.

  • Large input tensors - Support varies by model. On some models, large input tensors (e.g. 1024x1024) may result in lower performance or may exceed hardware or compile-time limits, especially on models where the large input tensor is used by many downstream operators. Workarounds may include using a smaller batch size; see Neuron Batching.

  • The Conv2d operator is mapped to Inferentia except for specific cases of extremely large tensors and specific parameter combinations.

  • Conv3d operator performance is limited when the operator has a small number of input channels (< 64).

  • FP64 and INT64 input and output tensors are not supported. Please cast to FP32/INT32 in the machine learning framework prior to compiling for Neuron, as shown in the sketch following this list.
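
The following is a minimal, hypothetical sketch of this cast-before-compile workaround using torch-neuron. The toy model, tensor shape, and output file name are placeholders, and the torch.neuron.trace API (with its example_inputs parameter) is assumed to be available from an installed torch-neuron package; the same idea applies to the other frameworks.

import torch
import torch_neuron  # registers the torch.neuron namespace (assumed installed)

# Toy stand-in for a real model; its weights default to FP32.
model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU()).eval()

# An FP64 example input; cast it to FP32 before compiling, since FP64 tensors are not supported.
example = torch.rand(1, 128, dtype=torch.float64).to(torch.float32)

# Ahead-of-time compile (trace) the model for Inferentia using the supported dtype.
model_neuron = torch.neuron.trace(model, example_inputs=[example])
model_neuron.save('model_neuron.pt')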

Neuron Compiler release [1.21.0.0]#

Date: 12/21/2023

  • Minor bug fixes.

Neuron Compiler release [1.20.3.0]#

Date: 10/26/2023

  • Minor bug fixes.

Neuron Compiler release [1.19.0.0]#

Date: 09/15/2023

  • Minor bug fixes.

Neuron Compiler release [1.17.0.0]#

Date: 7/19/2023

New in this release#

  • This release introduces a new --enable-saturate-infinity compiler option. A computation that can generate +/-Infinity is at high risk of producing Not-a-Number (NaN) values when the infinity value is used in subsequent computations. This option helps avoid that by converting +Inf/-Inf values to MAX/MIN_FLOAT before operations that could produce NaN for +Inf/-Inf inputs on the target architecture. While this option helps avoid NaN values, enabling the conversion can degrade performance during model execution. A usage sketch follows this list.

  • Minor bug fixes.
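
Below is a minimal sketch of how the new option could be passed when compiling from a framework. It assumes the torch-neuron trace API and its compiler_args parameter; the toy model and input shape are placeholders.

import torch
import torch_neuron  # registers the torch.neuron namespace (assumed installed)

# Toy model whose softmax math could overflow to +/-Inf for extreme inputs.
model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.Softmax(dim=-1)).eval()
example = torch.rand(1, 128)

# Forward --enable-saturate-infinity to neuron-cc; expect a possible performance cost.
model_neuron = torch.neuron.trace(
    model,
    example_inputs=[example],
    compiler_args=['--enable-saturate-infinity'],
)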

Neuron Compiler release [1.16.2.0]#

Date: 6/14/2023

  • Minor bug fixes.

Neuron Compiler release [1.15.0.0]#

Date: 05/01/2023

  • Minor bug fixes.

Neuron Compiler release [1.14.3.0]#

Date: 04/19/2023

  • Minor bug fixes.

Neuron Compiler release [1.13.3.0]#

Date: 11/23/2022

  • Resolved long compile times when compiling the YOLOv5 and YOLOv6 models (aws-neuron-sdk GitHub issue #434).

  • Improved the layout algorithm to resolve an issue compiling a transformer-based text recognition model (aws-neuron-sdk GitHub issue #410).

  • Support was added for additional XLA operators

Neuron Compiler release [1.11.7.0]#

Date: 08/02/2022

  • Fixed a bug in the handling of the MXNet dropout operator when its mode is set to ‘training’ while performing inference.

Neuron Compiler release [1.11.4.0]#

Date: 04/29/2022

  • Resolved an issue that caused false-positive reporting of a data race due to address overlap.

  • Minor bug fixes.

Neuron Compiler release [1.10.3.0]#

Date: 03/25/2022

  • Minor bug fixes.

Neuron Compiler release [1.9.1.0]#

Date: 01/20/2022

  • Fixed an issue in the frontend compiler for fused operators that was reported in GitHub issue #362.

Neuron Compiler release [1.8.5.0]#

Date: 01/05/2022

New in this release#

  • Minor bug fixes.

Neuron Compiler release [1.8.2.0]#

Date: 12/15/2021

New in this release#

  • Performance enhancements as a result of improved layout and DMA optimizations.

  • Minor bug fixes.

Neuron Compiler release [1.7.3.0]#

Date: 10/27/2021

New in this release#

  • The compiler’s list-operators command can now display the supported TensorFlow 2.x operators.

  • Support added for new operators in TensorFlow 1.x - ArgMax and ArgMin.

  • Introducing the --fast-math option for better fine-tuning of the accuracy/performance trade-off. See Mixed precision and performance-accuracy tuning (neuron-cc). A usage sketch follows this list.
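
A minimal sketch of passing --fast-math through a framework compile call, assuming the torch-neuron compiler_args mechanism. The 'none' value shown here (intended as the most conservative, accuracy-preserving setting) is an assumption; consult the linked mixed-precision guide for the accepted values.

import torch
import torch_neuron  # registers the torch.neuron namespace (assumed installed)

model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.Tanh()).eval()
example = torch.rand(1, 128)

# Trade performance for accuracy by disabling fast-math transformations (value assumed).
model_neuron = torch.neuron.trace(
    model,
    example_inputs=[example],
    compiler_args=['--fast-math', 'none'],
)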

[1.6.13.0]#

Date 08/12/2021

New in this release#

  • TensorFlow 2.x - First support for TensorFlow 2.x. Support is limited to operators in BERT-like models and was tested with Hugging Face BERT small, base, large, and DistilBERT.

Resolved issues#

  • Fixed a compiler backend issue in Tensor_tensor argument distance (GitHub #269)

[1.5.5.0]#

Date 07/02/2021

Summary#

  • Robustness and performance improvements.

New in this release#

  • Added --enable-fast-context-switch option to optimize for faster model switching rather than inference latency.

  • Deprecated support for ONNX

  • Improved robustness of Conv3d

  • Corrected compilation error “too many instructions” in DLRM model

[1.4.0.0]#

Date 5/28/2021

Summary#

  • Performance improvements, and usability improvements.

New in this release#

  • Added an uncompressed NEFF format for faster model loading prior to inference. Enable it with --enable-fast-loading-neuron-binaries (see the sketch after this list). Some large models may be detrimentally impacted because the NEFF will not be compressed, but many cases will benefit.

  • Corrected compilation error in specific arguments of ResizeBilinear operator
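
A minimal sketch of requesting the uncompressed NEFF format from a framework compile call, again assuming the torch-neuron compiler_args mechanism; the toy model and file name are placeholders.

import torch
import torch_neuron  # registers the torch.neuron namespace (assumed installed)

model = torch.nn.Sequential(torch.nn.Linear(128, 64)).eval()
example = torch.rand(1, 128)

# Emit an uncompressed NEFF for faster loading; the saved artifact may be larger on disk.
model_neuron = torch.neuron.trace(
    model,
    example_inputs=[example],
    compiler_args=['--enable-fast-loading-neuron-binaries'],
)
model_neuron.save('model_neuron_fast_load.pt')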

[1.3.0.0]#

Date 4/30/2021

Summary#

  • Performance improvements, new operators, and usability improvements.

New in this release#

  • Improved performance of batched CNN models like ResNet50 by 10% with the default compiler options.

  • Improved performance of BERT base (sequence 128, batch 6) by up to 16%.

  • Added support for group and depthwise convolution (with limited performance when the number of input channels is small).

  • Added more detailed debug names to support TensorBoard.

Resolved Issues#

  • Corrected potential race condition in overwriting tiles of output tensors.

  • Fixed various issues in pipelined inference by enabling fine grain partitioning by default.

[1.2.7.0]#

Date 2/24/2021

Summary#

Fix for CVE-2021-3177.

[1.2.2.0]#

Date 1/30/2021

Summary#

Added support for multiple new operators (see the operators list) for TensorFlow and MXNet. Improved inference performance of language and object-recognition models on single as well as multiple pipelined cores using NeuronCore Pipeline.

New in this release#

  • Improved inference performance of YOLO v4, BERT base sequence 64 (on 16 pipelined cores), and OpenPose 184.

Resolved Issues#

  • Corrected a random failure to compile ResNet50 batch 5

  • Corrected numerical inaccuracy in RSQRT and related operators for tensors with very large values (> 1e20)

[1.1.7.0]#

Date 12/23/2020

Summary#

Added support for PyTorch YOLO v4, a new framework-visible progress bar, and improved inference performance. We continue to streamline compiler usability by reducing the need for options passed to control behavior, and we aim to remove the need for such options entirely. Some tutorials have been updated to reflect this, but ResNet50 still requires these options to achieve maximum performance. Other usability improvements have been added, such as the compiler progress bar. As always, please let us know if there are other areas that we can improve.

New in this release#

  • PyTorch YOLO v4 is now supported.

  • Added a compiler progress bar when compilation is invoked from the Framework. This allows the user to see that progress continues as compilation proceeds, which is useful when compilation takes several minutes. A dot is printed every 20 seconds.

  • Improved inference performance of TensorFlow BERT base (sequence 256, batch 3) by 10%.

Resolved Issues#

  • Resolved issue with depthwise convolution that manifests as a type check error

[1.0.24045.0]#

Date 11/17/2020

Summary#

Improved performance for pipelined execution (NeuronCore Pipeline).

New in this release#

  • NeuronCore Pipeline: improved partitioning to enable better static weights loading to cache.

Resolved Issues#

  • --static-weights: No longer needed. Although this option is shown in some examples, please remove it; the compiler now performs this detection automatically by default.

  • --num-neuroncores renamed to --neuroncore-pipeline-cores (see the sketch after this list). The prior option form is still functional (backwards compatible) and will be removed in a future release.

  • --batching_en: Resolved a compilation failure of ResNet50 FP32 batch 1 on Ubuntu 16 when “--batching_en” was used.
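
A minimal sketch using the renamed option from a framework compile call, assuming the torch-neuron compiler_args mechanism; the core count of 4 and the toy model are placeholders.

import torch
import torch_neuron  # registers the torch.neuron namespace (assumed installed)

model = torch.nn.Sequential(torch.nn.Linear(128, 64)).eval()
example = torch.rand(1, 128)

# Compile for NeuronCore Pipeline mode across 4 NeuronCores (replaces --num-neuroncores).
model_neuron = torch.neuron.trace(
    model,
    example_inputs=[example],
    compiler_args=['--neuroncore-pipeline-cores', '4'],
)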

[1.0.20600.0]#

Date 9/22/2020

Summary#

Various performance improvements to both compilation time and inference speed of object-recognition models.

  • Compiler optimization ‘-O2’ option is now enabled by default.

New in this release#

  • Improved inference performance of YOLO v3, YOLO v4, VGG16, SSD300. BERT models were improved by an additional 10%.

  • Modified the defaults such that -O2 is now the default behavior and does not need to be specified. Note: some tutorials still explicitly specify “-O2”; these will be updated in forthcoming releases.

Resolved Issues#

  • Sped up compilation of large models that previously took hours to under 40 minutes.

[1.0.18001.0]#

Date 8/08/2020

Summary#

Various performance improvements.

New in this release#

Improved performance of BERT base with -O2

Resolved Issues#

  • n/a

[1.0.17937.0]#

Date 8/05/2020

Summary#

Various improvements.

[1.0.16861.0]#

Date 7/16/2020

Summary#

This release has some bug fixes and some functional and performance improvements to support compilation of several neural networks.

New in this release#

This release

  • Supports compilation of PoseNet, tested for images of specific resolutions up to 736.

  • Updated -O2 with a new memory allocator to reduce spilling to DRAM.

  • Improved performance of ‘-O2’ on BERT base and the OpenPose network.

Resolved Issues#

  • Resolved a compilation error in VGG16 batch 1

Other Notes#

  • Some versions of the Inception network may fail to compile with TensorFlow on Ubuntu 16 in a Conda environment. The symptom is a neuron-cc backend data-race error. As a workaround, use Ubuntu 18, Amazon Linux 2, or a virtual environment, or use neuron-cc with the -O2 flag.

[1.0.15275.0]#

Date 6/11/2020

Summary#

This release has some bug fixes and some functional and performance improvements to support compilation of several neural networks.

New in this release#

This release

  • Supports compilation of PoseNet for images of specific resolutions up to 400x400.

  • Improves performance of ResNet152.

  • Supports a new command line option ‘-O2’ that can help with handling of large tensor inputs for certain models.

  • Increases the NEFF version to 1.0. New NEFFs compiled from this release forward are not compatible with Neuron Runtime versions older than the May 2020 (1.0.6905.0) release. Please update the Neuron Runtime when using NEFF version 1.0.

Resolved Issues#

  • Resolved compilation issues on prosotron encoder and decoder neural networks.

Other Notes#

Dependencies#

  • This version creates NEFF 1.0 thus may require update of neuron-rtd if older than May 2020 release.

dmlc_nnvm==1.0.2574.0
dmlc_topi==1.0.2574.0
dmlc_tvm==1.0.2574.0
inferentia_hwm==1.0.1362.0
islpy==2018.2

[1.0.12696.0]#

Date 5/11/2020

Summary#

Bug fixes and some functional and performance improvements to several neural networks.

New in this release#

  • This version supports compilation of unmodified TensorFlow BERT with batch sizes 1, 4, and 6 for input sequence 128.

  • Improved TensorFlow BERT batch 4 sequence 128 performance to 45% of the accelerator peak (from 34%).

  • Support for MXNet BERT base batch 8 compilation

  • Support for TF ResNet152 batch 2 compilation

  • Most compiler messages were migrated from cout to logging mechanisms with verbosity control

Resolved Issues#

  • Fixed a failure to compile the unmodified TensorFlow BERT model for small batches

  • Fixed run-to-run-variability in OneHot operator implementation

  • Robustness improvements for ParallelWavenet and transformer decoder networks

Other Notes#

Dependencies#

dmlc_nnvm==1.0.2356.0
dmlc_topi==1.0.2356.0
dmlc_tvm==1.0.2356.0
inferentia_hwm==1.0.1294.0
islpy==2018.2

[1.0.9410.0]#

Date 3/26/2020

Summary#

Bug fixes and some functional and performance improvements to several neural networks.

New in this release#

  • Support compilation of modified SSD-300 (Running SSD300 with AWS Neuron)

  • Improved inference performance in natural language processing networks (such as prosotron encoder) by 45%

Resolved Issues#

  • Eliminated redundant fp32 to bfloat16 cast on input and output tensors

Known issues and limitations#

  • See previous releases.

Other Notes#

  • Added support for faster iteration on recurrent networks (aka auto-loop)

Dependencies#

dmlc_nnvm==1.0.2049.0
dmlc_topi==1.0.2049.0
dmlc_tvm==1.0.2049.0
inferentia_hwm==1.0.897.0
islpy==2018.2

[1.0.7878.0]#

Date 2/27/2020

Summary#

Bug fixes and minor performance improvements.

New in this release#

None

Resolved Issues#

  • Corrected image resize operator functionality

  • Made compiler-internal enhancements that will benefit models such as BERT

Known issues and limitations#

  • See previous releases.

Other Notes#

Dependencies#

dmlc_nnvm-1.0.1826.0
dmlc_topi-1.0.1826.0
dmlc_tvm-1.0.1826.0
inferentia_hwm-1.0.897.0
islpy-2018.2

[1.0.6801.0]#

Date 1/27/2020

Summary#

Bug fixes and some performance enhancement related to data movement for BERT-type neural networks.

New in this release#

None

Resolved Issues#

  • Improved throughput for operators processed in the Neuron Runtime CPU. As an example: execution of 4 single NeuronCore NEFF models of ResNet50 v2 float16 batch = 5 in parallel on an inf1.1xlarge sped up by 30%.

  • Corrected shape handling in Gather(TensorFlow)/Take(MXNet) operators that are processed by the Neuron Runtime in the Neuron Runtime vCPU, which resolves a possible crash in Neuron Compiler when compiling models with these operators with some shapes.

  • Added support for TensorFlow OneHot operator (as a Neuron Runtime CPU operator).

  • Added more internal checking for compiler correctness, with a newly defined error message for this case:

“Internal ERROR: Data race between Op1 'Name1(...) [...]' and Op2 'Name2(...) [...]'”

  • Fixed an out-of-memory issue introduced in 1.0.5939.0 in which some large models (BERT) compiled on instances with insufficient host memory would cause the runtime to crash with an invalid NEFF. This is actually a compiler error, but because additional script layers wrap it in the Running TensorFlow BERT-Large with AWS Neuron tutorial, it would likely have been seen as a runtime error like this:

2020-01-09 13:40:26.002594: E tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Invalid argument: neff is invalid
2020-01-09 13:40:26.002637: E tensorflow/core/common_runtime/executor.cc:642] Executor failed to create kernel. Invalid argument: neff is invalid
[[{{node bert/NeuronOp}}]]

Known issues and limitations#

See previous release notes. Some tutorials show the use of specific compiler options and flags; these are needed to help guide the compiler to achieve the best performance in specific cases. Please do not use these options in cases other than as shown in the specific tutorial, as the results may not be defined. These options should be considered beta and will be removed over time.

Other Notes#

Dependencies#

dmlc_nnvm-1.0.1619.0
dmlc_topi-1.0.1619.0
dmlc_tvm-1.0.1619.0
inferentia_hwm-1.0.839.0
islpy-2018.2

[1.0.5939.0]#

Date 12/20/2019

Summary#

Bug fixes and some performance enhancement for NeuronCore Pipeline.

New in this release#

Resolved Issues#

  • Fixed pipeline execution on more than 10 NeuronCores

  • Improved NeuronCores Pipeline execution by improving data exchange efficiency between NeuronCores

  • Added warning for unaligned memory access

  • Fixed handling of cast on input FP32 tensor

  • Improved handling of data layouts and transpose

  • Improved dead-code elimination

  • Improved efficiency of compute engine synchronization

  • Improved efficiency of data transfers within the Neuron code

Known issues and limitations#

See previous release notes. Some tutorials show the use of specific compiler options and flags; these are needed to help guide the compiler to achieve the best performance in specific cases. Please do not use these options in cases other than as shown in the specific tutorial, as the results may not be defined. These options should be considered beta and will be removed over time.

Other Notes#

Dependencies#

  • dmlc_nnvm-1.0.1416.0

  • dmlc_topi-1.0.1416.0

  • dmlc_tvm-1.0.1416.0

  • inferentia_hwm-1.0.720.0

  • islpy-2018.2

[1.0.5301.0]#

Date 12/1/2019

Summary#

New in this release#

Resolved Issues#

  • Added warning for unsupported operators and convolution sizes

  • Added warning for unsupported layout / upsampling

  • Added support for Relu6, AddV2, BatchMatmulV2 operators

  • Added support for default MXNet outputs in --io-config

  • Improved performance of batched inference for convolutional networks

  • Fixed MatMult column size 1

  • Fixed bf16 constant loading

  • Fixed Conv2D tile accumulation

Known Issues and Limitations#

See previous release notes. Resolved issues are shown in Resolved Issues.

Other Notes#

Please install g++ on AMIs without g++ pre-installed (i.e. server AMIs):

# Ubuntu
sudo apt-get install -y g++
# Amazon Linux
sudo yum install -y gcc-c++

Supported Python versions:

  • 3.5, 3.6, 3.7

Supported Linux distributions:

  • Ubuntu 16, Ubuntu 18, Amazon Linux 2

Dependencies#

  • dmlc_nnvm-1.0.1328.0

  • dmlc_topi-1.0.1328.0

  • dmlc_tvm-1.0.1328.0

  • inferentia_hwm-1.0.674.0

  • islpy-2018.2

[1.0.4680.0]#

Date: 11/25/2019

New in this release#

N/A, this is the first release.

Resolved issues#

N/A, this is the first release.

Known issues and limitations#

  1. Control flow - Inferentia has limited support for control flow. In general, Neuron can only support control-flow operators that are static at compile time, e.g. static-length RNN, top-k, sort, …

  2. Size of neural network - The size of a neural network is influenced by (a) the type of neural network (CNN, LSTM, MLP), (b) the number of layers, and (c) the input sizes (tensor dimensions, batch size, …). The current Neuron compiler release is limited in the size of neural network it can effectively optimize. As a result, CNN models (e.g. ResNet) are limited to an input size of up to 480x480 in FP16 with batch size 4; LSTM models (e.g. GNMT) are limited to up to 900 time steps; and MLP models (like BERT) are limited to sequence length 128 with batch size 8.

  3. Data layout - The Neuron compiler supports multiple data layout formats (NCHW, NHWC, …). Non-CNHW input/output data layouts require Neuron to insert additional transpose operations, causing a degradation in performance.

  4. Object detection models - Computer-vision object detection and segmentation models are not supported by the current release.

  5. Reduce data type - The INT8 data type is not currently supported by the Neuron compiler.

  6. Tensor residency - When a sub-graph executed on the host communicates with a sub-graph executing on NeuronCores, tensors are copied via the communication queues between host and Inferentia memory for each inference, which may result in end-to-end performance degradation.

  7. Primary inputs in NeuronCore Pipeline mode - When a neural network is executed in NeuronCore Pipeline mode, only the first operator in the neural network can receive primary inputs from the host.

Other Notes#

Dependencies#

  • nnvm: dmlc_nnvm-1.0.1219.0

  • topi: dmlc_topi-1.0.1219.0

  • tvm: dmlc_tvm-1.0.1219.0

  • hwm: inferentia_hwm-1.0.602.0

  • islpy: islpy-2018.2+aws2018.x.73.0

This document is relevant for: Inf1