Inference with Neuron - FAQ
AWS Neuron includes a compiler that converts your trained machine learning models into a binary object for execution. The Neuron compiler supports many machine learning operators commonly used in computer vision, natural language processing, recommender engines, and more. A list of supported ML operators and supported inputs is in Neuron Supported operators.
It’s important to note that good performance does not require all of the model operators to run on the chip. In many cases, some operators, such as embeddings or image pre-processing, will continue to run on the instance CPUs and still provide compelling end-to-end performance. We call this approach auto-partitioning: the Neuron compiler optimizes the model execution based on which operators are most suitable to run on the CPU and which on the chip.
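As a minimal sketch of this workflow, assuming the torch-neuron package from the Neuron pip repository and a torchvision ResNet-50 purely as an illustrative model, compilation with auto-partitioning looks roughly like this; operators the compiler supports are placed on the NeuronCores, while the remaining subgraphs stay on the instance CPUs:

```python
import torch
import torch_neuron  # assumed: torch-neuron installed from the Neuron pip repository
import torchvision.models as models

model = models.resnet50(pretrained=True).eval()
example = torch.rand(1, 3, 224, 224)

# trace() compiles the supported subgraphs for the NeuronCores and leaves the
# unsupported operators running on the instance CPUs (auto-partitioning).
model_neuron = torch.neuron.trace(model, example_inputs=[example])
```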
For the latest model architecture support, please refer to the model architecture fit and performance pages.
The Neuron compiler converts a model from a framework-level Neural Network graph, with operators like convolution and pooling, into a Neuron Device-specific instruction set, builds the schedule for execution of these instructions, and converts the model parameters into a format that the Neuron device can consume. The supported input formats include TensorFlow, PyTorch, and MXNet. The output from the compiler is a Neuron Executable File Format (NEFF) artifact. The NEFF contains a combination of binary code, the model parameters, and additional metadata needed by the Neuron runtime and profiler.
To use Inferentia within Inf1 instances, the developer performs a one-time compilation of the pre-trained model to generate a NEFF, and uses it as the inference model across a fleet of Inf1 instances.
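A minimal sketch of this one-time-compile, many-time-deploy workflow, assuming the torch-neuron flow above (the filename is hypothetical): the compiled TorchScript with the embedded NEFF is saved once and then loaded for inference on each Inf1 instance.

```python
import torch
import torch_neuron  # must be imported so the Neuron operators are registered before loading

# On the compilation machine (does not need Inferentia hardware):
#   model_neuron = torch.neuron.trace(model, example_inputs=[example])
#   model_neuron.save("resnet50_neuron.pt")

# On each Inf1 instance in the fleet:
model_neuron = torch.jit.load("resnet50_neuron.pt")      # hypothetical artifact name
output = model_neuron(torch.rand(1, 3, 224, 224))         # inference runs on the NeuronCores
```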
A NeuronCore Pipeline is a technique to shard a specific Neural Network across multiple NeuronCores, taking advantage of the large on-chip cache instead of moving data in and out of external memory. The result is increased throughput and reduced latency, which is typically important for real-time inference applications. All Inf1 instances support it, and Inf1 instances with multiple Inferentia accelerators, such as inf1.6xlarge or inf1.24xlarge, support pipelining across chips thanks to the fast chip-to-chip interconnect.
Developers can choose to use NeuronCore Pipeline mode at the compile stage with an opt-in flag. The neuron-cc documentation provides further details.
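As a hedged sketch, assuming the torch-neuron trace API and that the opt-in flag is neuron-cc's --neuroncore-pipeline-cores option (verify the flag name and value against the neuron-cc documentation for your SDK version), enabling the pipeline at compile time could look like this:

```python
import torch
import torch_neuron
import torchvision.models as models

model = models.resnet50(pretrained=True).eval()
example = torch.rand(1, 3, 224, 224)

# Pass the opt-in flag through to neuron-cc to shard the model across
# 4 NeuronCores (one Inferentia chip); flag name assumed, check your SDK docs.
model_neuron = torch.neuron.trace(
    model,
    example_inputs=[example],
    compiler_args=["--neuroncore-pipeline-cores", "4"],
)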
Each Inferentia chip has four compute engines called NeuronCores. A NeuronCore Group is a way to aggregate NeuronCores to increase hardware utilization and assign models the right compute sizing for a specific application. If you want to run multiple models in parallel, you can assign different models to separate NeuronCore Groups. A model compiled to use multiple NeuronCores in a NeuronCore Pipeline can be assigned to a NeuronCore Group with enough NeuronCores to load into. Finally, it is also possible for sets of Inferentia devices to be mapped to separate Neuron Runtimes. The Neuron Features section has more information and examples.
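A hedged sketch of running two models in parallel on separate NeuronCore Groups. The NEURONCORE_GROUP_SIZES environment variable and the placement behaviour shown here are assumptions based on the legacy torch-neuron runtime; newer SDK releases configure core placement through NEURON_RT_* variables instead, so check the Neuron Features documentation for your version. The model filenames are hypothetical.

```python
import os
# Define two groups of two NeuronCores each; must be set before the runtime initializes.
os.environ["NEURONCORE_GROUP_SIZES"] = "2,2"

import torch
import torch_neuron

# Each loaded model is assigned to the next available NeuronCore Group,
# so the two models can serve requests in parallel.
model_a = torch.jit.load("model_a_neuron.pt")
model_b = torch.jit.load("model_b_neuron.pt")
```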
Yes. Models can be imported into TensorFlow, either through the standard model server, in which case it appears as a simple command-line utility, or via the Python-based TensorFlow environment. The primary additional step needed is to compile the model into the Inferentia NEFF format.
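A minimal sketch of that additional compilation step, assuming the TensorFlow 1.x tensorflow-neuron API; the directory names are hypothetical and the exact API differs for TensorFlow 2.x (tfn.trace).

```python
import tensorflow.neuron as tfn

# Compile an existing SavedModel into a Neuron-optimized SavedModel containing the NEFF.
tfn.saved_model.compile(
    "resnet50_saved_model",          # original SavedModel directory (hypothetical)
    "resnet50_saved_model_neuron",   # output directory with the compiled model
)
# The compiled SavedModel can then be served with the Neuron-enabled TensorFlow
# model server or loaded from the Python-based TensorFlow environment.
```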