This document is relevant for: Inf1, Inf2, Trn1, Trn2
Comparison of torch-neuron (Inf1) versus torch-neuronx (Inf2 & Trn1) for Inference#
Neuron now supports multiple instance types for inference. The choice of instance should be motivated primarily by the performance needs of the application, the instance pricing, and model compatibility.
In prior releases, torch-neuron only supported inference and torch-neuronx only supported training. While torch-neuron will never be updated to support training, torch-neuronx now supports both inference and training.
Note
Recommendation: Continue using torch-neuron (Inf1) for existing inference applications.
torch-neuronx (Inf2 & Trn1) should be used for inference applications that require very low latency, distributed inference, or large models that would not otherwise work on Inf1. See: Neuron performance.
Framework Comparison#
Example#
The following scripts perform an identical model compilation using each framework. The only differences between them are the imported Neuron package and the trace call.
# torch-neuron compilation (Inf1)
import torch
import torchvision
import torch_neuron
model = torchvision.models.resnet50(pretrained=True).eval()
image = torch.rand(1, 3, 224, 224)
trace = torch_neuron.trace(model, image)
# torch-neuronx compilation (Inf2 & Trn1)
import torch
import torchvision
import torch_neuronx
model = torchvision.models.resnet50(pretrained=True).eval()
image = torch.rand(1, 3, 224, 224)
trace = torch_neuronx.trace(model, image)
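The trace returned by either API is a standard TorchScript module. As a minimal sketch (continuing from the torch-neuronx example above; the file name is illustrative), it can be saved to disk, reloaded, and executed for inference:
import torch
import torch_neuronx  # import first so the Neuron runtime ops are registered before loading

# Persist the compiled artifact (the NEFF is embedded in the TorchScript module)
torch.jit.save(trace, 'resnet50_neuronx.pt')

# Reload and run inference on an Inf2 or Trn1 instance
loaded = torch.jit.load('resnet50_neuronx.pt')
with torch.no_grad():
    output = loaded(image)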
Hardware Features#
The torch-neuron framework supports Inf1 instances and the torch-neuronx framework supports Inf2 & Trn1 instances. These instances have different architectures, networking configurations, and capabilities due to the NeuronCore versions used.
Models compiled with torch-neuron produce artifacts which are only compatible with NeuronCore-v1. Models compiled with torch-neuronx produce artifacts which are only compatible with NeuronCore-v2. This also means that models that were previously compiled with torch-neuron for Inf1 are not forwards compatible with Inf2 & Trn1 instances. Likewise, models compiled with torch-neuronx for Inf2 & Trn1 are not backwards compatible with Inf1.
NeuronCore-v2 is capable of higher throughput and lower latency than NeuronCore-v1 due to more powerful compute engines and improved memory bandwidth. NeuronCore-v2 can also support larger models since more memory is available per NeuronCore. These hardware differences mean that models compiled with torch-neuronx will usually outperform models compiled with torch-neuron.
Even in cases where throughput is similar across instance types, instances using NeuronCore-v2 tend to achieve significantly lower latency than instances using NeuronCore-v1. This can enable applications that require extremely fast response times.
See the Neuron performance page for the most up-to-date performance metrics.
Besides performance benefits, NeuronCore-v2 also has more hardware capabilities compared to NeuronCore-v1. For example, NeuronCore-v2 supports a greater variety of data types and introduces a new fully programmable GPSIMD-Engine.
Note that Trn instance-types are optimized for training purposes. Some Trn features (such as inter-chip networking) may be unnecessary for inference applications that do not require distribution across multiple NeuronCores.
Software Features#
The torch-neuron framework uses torch_neuron.trace() to create a TensorFlow GraphDef protobuf intermediate representation (IR) of the model compute graph. This is compiled to a binary Neuron Executable File Format (NEFF) with the neuron-cc compiler.
The torch-neuronx framework uses torch_neuronx.trace() with torch-xla to create an HloModule protobuf IR of the model compute graph. This is compiled to a binary executable NEFF with the neuronx-cc compiler.
The use of different compiler versions means that separate flags are supported by each framework. For example:
NeuronCore Pipeline is supported in neuron-cc but is not supported in neuronx-cc. However, this feature is much less useful when using the NeuronCore-v2 architecture due to significant memory improvements.
Mixed precision flags also differ between the compilers; neuronx-cc streamlines these flags and makes the casting behavior more explicit.
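As a hedged sketch of how such options are applied (the exact flag names and accepted values depend on the installed compiler release), both trace APIs forward options to their respective compilers through the compiler_args argument:
import torch
import torchvision
import torch_neuron

model = torchvision.models.resnet50(pretrained=True).eval()
image = torch.rand(1, 3, 224, 224)

# neuron-cc (Inf1): e.g. request NeuronCore Pipeline across 4 cores
trace = torch_neuron.trace(model, image, compiler_args=['--neuroncore-pipeline-cores', '4'])

# neuronx-cc (Inf2 & Trn1): e.g. control automatic mixed-precision casting
# (shown commented out because torch-neuron and torch-neuronx target different instances)
# trace = torch_neuronx.trace(model, image, compiler_args=['--auto-cast', 'matmult', '--auto-cast-type', 'bf16'])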
Since the Python graph recording methods used by the frameworks are quite different, this may lead to different levels of model support. To view the models which are known to work, many compilation samples are provided for each framework (see the References below).
Framework model support may also be affected by the graph partitioning feature.
In torch-neuron, the torch_neuron.trace() API provides the ability to fall back to CPU for operations that are not supported directly by Neuron. This fallback behavior is currently not supported by torch_neuronx.trace(); however, certain operations that were previously not well-supported in torch-neuron are now supported in torch-neuronx by default (e.g. torch.nn.Embedding).
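As a minimal sketch of that difference (assuming an Inf1 instance with torch-neuron installed; the fallback keyword belongs to the torch_neuron.trace() API and its behavior may vary across releases), CPU fallback can be disabled so that compilation surfaces unsupported operators instead of partitioning the graph:
import torch
import torchvision
import torch_neuron

model = torchvision.models.resnet50(pretrained=True).eval()
image = torch.rand(1, 3, 224, 224)

# Default behavior: operators without Neuron support are partitioned to run on CPU
partitioned = torch_neuron.trace(model, image, fallback=True)

# Strict behavior: the entire graph must run on Neuron, otherwise tracing raises an error
neuron_only = torch_neuron.trace(model, image, fallback=False)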
Feature Summary#
| | torch-neuron | torch-neuronx |
|---|---|---|
| Supported Instances | Inf1 | Inf2 & Trn1 |
| Inference Support | Yes | Yes |
| Training Support | No | Yes |
| Architecture | NeuronCore-v1 | NeuronCore-v2 |
| Model Support | Architecture Fit NeuronCore-v1 | Architecture Fit NeuronCore-v2 |
| Trace API | torch_neuron.trace() | torch_neuronx.trace() |
| NeuronCore Pipeline | Yes | No |
| Partitioning | Yes | No |
| IR | GraphDef | HLO |
| Compiler | neuron-cc | neuronx-cc |
| Samples | See References | See References |
References#
To determine if a model is already supported in a given framework, it is recommended to check the existing documentation for specific models. In order of reference quality, the following pages can be checked prior to compiling a model:
Neuron performance: Models that are available here have been optimized to maximize throughput and/or latency. These metrics are updated frequently as improvements are made. Since metrics are published for different instance types, this can provide a direct performance comparison between instances. Note that the exact models and configurations may differ across instances.
Neuron GitHub Samples: Provides simple examples of compiling and executing models. Compared to the benchmarks, this reference is only intended to show how to run a particular model on Neuron; it only validates that a framework supports a given model.
model_architecture_fit: If a model is not listed on the prior pages, it may be that the model has not been tested or may not be well-supported. The architecture fit page provides high-level guidelines for which kinds of models will work well based on the hardware capabilities.
If a model does not appear in any of these references, the last option is to attempt to compile the model to see how it performs. In the case that an error occurs during compilation, please file a ticket in the Neuron SDK Github Issues.
This document is relevant for: Inf1, Inf2, Trn1, Trn2