General Neuron FAQs

What ML model types and operators are supported by AWS Neuron?

AWS Neuron includes a compiler that converts your trained machine learning models into a binary object for execution, called a Neuron Executable File Format (NEFF) file. The Neuron compiler supports many commonly used machine learning operators from computer vision, natural language processing, recommender engines, and more. The list of supported ML operators and supported inputs is in Neuron Supported operators.

It’s important to note that good performance doesn’t require all of the model operators to run on the chip. In many cases, some operators, such as embeddings or image pre-processing, continue to run on the instance CPUs while the model still delivers compelling end-to-end performance. We call this approach auto-partitioning: the Neuron compiler optimizes model execution by assigning each operator to the CPU or the chip, whichever is most suitable.
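To make the idea of auto-partitioning concrete, here is a minimal toy sketch in Python. It is not the Neuron compiler's actual algorithm: the operator names, the SUPPORTED_ON_CHIP set, and the partition function are all hypothetical, and a real compiler weighs suitability rather than using a simple allow-list. The sketch only shows how a linear operator graph could be split into consecutive CPU and chip segments.

```python
from itertools import groupby

# Hypothetical allow-list of operators that can run on the chip (illustrative only).
SUPPORTED_ON_CHIP = {"conv2d", "relu", "maxpool", "matmul", "softmax"}


def partition(ops, supported=SUPPORTED_ON_CHIP):
    """Group consecutive operators by target device, preserving graph order."""
    def target(op):
        return "neuroncore" if op in supported else "cpu"

    return [(device, list(group)) for device, group in groupby(ops, key=target)]


# A made-up linear model: pre-processing, a small network, post-processing.
model_ops = ["decode_jpeg", "resize", "conv2d", "relu", "maxpool",
             "matmul", "softmax", "top_k"]
segments = partition(model_ops)
for device, group in segments:
    print(f"{device:>10}: {group}")
```

In this toy graph the image pre-processing and the final top_k stay on the CPU, while the contiguous block of supported operators forms a single segment for the chip.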

We constantly add more operators based on customers’ feedback.

Why is a compiler needed, and how do I use it?

The Neuron compiler converts a framework-level neural network graph, with operators like convolution and pooling, into a hardware-specific instruction set, builds the schedule for executing these instructions, and converts the model parameters into a format that the chip can consume. The supported input formats include TensorFlow, PyTorch, and MXNet. The output of the compiler is a Neuron Executable File Format (NEFF) artifact. The NEFF contains a combination of binary code, the model parameters, and additional metadata needed by the Neuron runtime and profiler.

I am using an ML framework today. What will change for me to use this?

To use Inferentia within Inf1 instances, developers need to perform a one-time compilation of the pre-trained model to generate a NEFF, and then use it as the inference model across a fleet of Inf1 instances.
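The compile-once, deploy-many workflow might look like the following sketch for PyTorch. It assumes the torch-neuron package from the Neuron SDK is installed, and it uses that package's torch.neuron.trace call. The function names compile_once and load_for_inference and the file name model_neuron.pt are hypothetical. The imports are deferred into the functions so the file can still be loaded on a machine without the SDK.

```python
def compile_once(model, example_input, output_path="model_neuron.pt"):
    """One-time step: compile a trained PyTorch model for Inferentia and save it."""
    import torch
    import torch_neuron  # noqa: F401  (importing registers the torch.neuron namespace)

    model.eval()
    # Trace and compile the model; the result is a TorchScript module
    # that wraps the compiled Neuron artifact.
    neuron_model = torch.neuron.trace(model, example_inputs=[example_input])
    neuron_model.save(output_path)
    return output_path


def load_for_inference(path="model_neuron.pt"):
    """Per-instance step: load the saved artifact on an Inf1 instance."""
    import torch
    import torch_neuron  # noqa: F401  (needed so the Neuron ops can be deserialized)

    return torch.jit.load(path)
```

With this split, compile_once would run once (for example on a build machine), and each instance in the fleet would only call load_for_inference and then invoke the returned module on its inputs.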

What is a NeuronCore Pipeline, and how do I take advantage of it?

A NeuronCore Pipeline is a technique to shard a specific neural network across multiple NeuronCores, taking advantage of the large on-chip cache instead of moving data in and out of external memory. The result is increased throughput and reduced latency, which are typically important for real-time inference applications. All Inf1 instances support it, and Inf1 instances with multiple Inferentia accelerators, such as inf1.6xlarge or inf1.24xlarge, also support it across chips thanks to the fast chip-to-chip interconnect.

Developers can opt in to NeuronCore Pipeline mode at compile time with a compiler flag. See Neuron Compiler for further details.
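As a sketch of what the opt-in might look like from PyTorch, the snippet below passes the neuron-cc option --neuroncore-pipeline-cores through the compiler_args parameter of torch.neuron.trace. The helper names pipeline_compiler_args and trace_with_pipeline are hypothetical, and the torch-neuron package is assumed to be installed for the second function. Check the flag against the compiler documentation for your SDK version.

```python
def pipeline_compiler_args(num_cores):
    """Build the compiler arguments that request a pipeline over num_cores NeuronCores."""
    if not isinstance(num_cores, int) or num_cores < 1:
        raise ValueError("num_cores must be a positive integer")
    return ["--neuroncore-pipeline-cores", str(num_cores)]


def trace_with_pipeline(model, example_input, num_cores=4):
    """Compile a PyTorch model so that it is sharded across num_cores NeuronCores."""
    import torch
    import torch_neuron  # noqa: F401  (importing registers the torch.neuron namespace)

    return torch.neuron.trace(
        model,
        example_inputs=[example_input],
        compiler_args=pipeline_compiler_args(num_cores),
    )


print(pipeline_compiler_args(4))
```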

NeuronCores, NeuronCore Groups and NeuronCore Pipelines: What do they do?

Each Inferentia chip has four compute engines called NeuronCores. A NeuronCore Group is a way to aggregate NeuronCores to increase hardware utilization and to give each model the right amount of compute for a specific application. If you want to run multiple models in parallel, you can assign different models to separate NeuronCore Groups. A model compiled to use multiple NeuronCores in a NeuronCore Pipeline can be assigned to a NeuronCore Group with enough NeuronCores to load it. Finally, it is also possible to map sets of Inferentia devices to separate Neuron Runtimes. The Neuron Features section has more information and examples.
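The sizing rule above can be sketched as a small check that a set of NeuronCore Groups fits on the available chips. The helper group_sizes_env is hypothetical. The NEURONCORE_GROUP_SIZES environment variable is how earlier Neuron releases for Inf1 configured groups, so treat the variable name as an assumption and confirm it against the documentation for your SDK version.

```python
import os

CORES_PER_CHIP = 4  # each Inferentia chip has four NeuronCores


def group_sizes_env(sizes, num_chips=1):
    """Validate NeuronCore Group sizes and format them as a comma-separated string."""
    if not sizes or any(size < 1 for size in sizes):
        raise ValueError("every group needs at least one NeuronCore")
    available = CORES_PER_CHIP * num_chips
    if sum(sizes) > available:
        raise ValueError(f"groups need {sum(sizes)} cores, only {available} available")
    return ",".join(str(size) for size in sizes)


# One 2-core group (e.g. for a model compiled as a 2-core pipeline) plus two
# single-core groups for smaller models, all on a single chip.
# Assumption: the runtime reads this variable before any model is loaded.
os.environ["NEURONCORE_GROUP_SIZES"] = group_sizes_env([2, 1, 1])
print(os.environ["NEURONCORE_GROUP_SIZES"])
```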

Can I use TensorFlow networks from as-is? If not, what should I do?

Yes. The model format can be imported into TensorFlow, either through a standard model server, in which case it appears as a simple command-line utility, or via the Python-based TensorFlow environment. The primary additional step is to compile the model into the Inferentia NEFF format.
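For the Python route, the extra compilation step could look like the sketch below. It assumes the tensorflow-neuron package (TensorFlow 1.x flavor) is installed and uses its tfn.saved_model.compile call. The wrapper name compile_saved_model and the directory arguments are hypothetical, and newer package versions may expose a different entry point.

```python
def compile_saved_model(saved_model_dir, compiled_model_dir):
    """Compile an existing TensorFlow SavedModel into a Neuron-optimized SavedModel."""
    # Deferred import so this file can be loaded without the Neuron SDK installed.
    import tensorflow.neuron as tfn

    # Writes a new SavedModel whose supported operators have been compiled for Inferentia.
    tfn.saved_model.compile(saved_model_dir, compiled_model_dir)
    return compiled_model_dir
```

The compiled directory is then used in place of the original SavedModel when loading the model for inference.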