This document is relevant for: Inf1, Inf2, Trn1, Trn1n
Model Architecture Fit Guidelines#
Introduction#
AWS Neuron SDK enables you to train and deploy a wide range of deep learning models on EC2 Inf1, EC2 Inf2, EC2 Trn1, and EC2 Trn1n instances, which are powered by Inferentia, Inferentia2, and Trainium devices. The table below details the NeuronDevices and NeuronCores that power each instance:
| Instance | NeuronDevices | NeuronCores | # NeuronCores in a NeuronDevice |
|---|---|---|---|
| Trn1 | 16 x Trainium | 32 x NeuronCore-v2 | 2 |
| Trn1n | 16 x Trainium | 32 x NeuronCore-v2 | 2 |
| Inf2 | 12 x Inferentia2 | 24 x NeuronCore-v2 | 2 |
| Inf1 | 16 x Inferentia | 64 x NeuronCore-v1 | 4 |
This document describes which deep learning model architectures are a good fit for instances powered by Inferentia, Inferentia2, and Trainium.
Model Support Overview#
AWS Trainium and AWS Inferentia2 (NeuronCore-v2)#
Last update - 02/25/2023
| Model Family / Neural Network Architecture | Category | Hardware Architecture | Training with PyTorch Neuron (torch-neuronx) | Inference with PyTorch Neuron (torch-neuronx) | Inference with TensorFlow Neuron (tensorflow-neuronx) |
|---|---|---|---|---|---|
| Transformer Encoders | NLP | Good Fit | Supported | Supported | Supported |
| Transformer Decoders | NLP | Good Fit | Supported | Supported | |
| Transformer Encoder-Decoder (Sequence-to-sequence) | NLP | Good Fit | Supported | | |
| LSTMs | NLP and Computer Vision | Good Fit | | | |
| Vision Transformer | Computer Vision | Good Fit | Supported | | |
| Diffusion models | Computer Vision | Good Fit | | | |
| Convolutional Neural Network (CNN) models | Computer Vision | Good Fit | Supported | | |
| R-CNNs | Computer Vision | Good Fit | | | |
Note
Supported means that at least one model from the model family or neural-network architecture has already been enabled.
AWS Inferentia (NeuronCore v1)#
Last update - 10/10/2022
| Model Family / Neural Network Architecture | Category | Hardware Architecture | PyTorch Neuron (torch-neuron) | TensorFlow Neuron (tensorflow-neuron TF1.x) | TensorFlow Neuron (tensorflow-neuron TF2.x) |
|---|---|---|---|---|---|
| Transformer Encoders | NLP | Good Fit | Supported | Supported | Supported |
| Transformer Decoders | NLP | Not a Good Fit | NA | NA | NA |
| Transformer Encoder-Decoder (Sequence-to-sequence) | NLP | Not a Good Fit | NA | NA | NA |
| LSTMs | NLP and Computer Vision | Good Fit | Supported | NA | NA |
| Vision Transformer | Computer Vision | Good Fit | | | |
| Diffusion models | Computer Vision | Good Fit | NA | NA | |
| Convolutional Neural Network (CNN) models | Computer Vision | Good Fit | Supported | Supported | |
| R-CNNs | Computer Vision | Supported with limitations | Supported with limitations | NA | NA |
Note
Supported means that at least one model from the model family or neural-network architecture has already been enabled.
Clarifications on Inferentia (1st generation) Model Architecture#
Natural Language Processing (NLP) Models with Transformer#
Transformer Encoders#
Autoencoding models use only the encoder part of the Transformer architecture. Representatives of this family include models like BERT, DistilBERT, XLM-BERT, RoBERTa, and BioBERT. Since the encoding process in these models can be parallelized, you can expect them to run well on both Inferentia and Trainium.
Architecture Fit - Autoencoding models are a good fit for Inferentia.
Neuron Support - The Neuron SDK supports running autoencoding models for inference on Inferentia; a minimal compilation sketch follows. Please see the benchmark results for these models. To get started with NLP models, you can refer to the Neuron PyTorch, TensorFlow, and MXNet NLP tutorials.
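A minimal sketch of that workflow (the checkpoint name, sequence length, and inputs below are illustrative assumptions, following the pattern of the Neuron BERT tutorials):

```python
import torch
import torch_neuron  # registers the torch.neuron namespace
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative checkpoint; any BERT-family encoder follows the same flow.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased-finetuned-mrpc", torchscript=True)
model.eval()

# Neuron compiles for the exact shapes seen at trace time, so pad to a
# fixed sequence length.
inputs = tokenizer.encode_plus(
    "The company is based in Seattle.",
    "The company's headquarters are in Seattle.",
    max_length=128, padding="max_length", truncation=True,
    return_tensors="pt")
example = (inputs["input_ids"], inputs["attention_mask"], inputs["token_type_ids"])

# Compile for Inferentia and save the compiled artifact.
model_neuron = torch.neuron.trace(model, example)
model_neuron.save("bert_neuron.pt")
```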
Decoder models, or autoregressive models with Transformer#
Autoregressive models keep only the decoder part of the Transformer architecture. Representatives of this family include models like GPT-3, GPT-2, etc.
Architecture Fit - Autoregressive models are not a good fit for Inferentia. The decoder is usually the most significant performance bottleneck, since it must be executed once per output token and therefore causes frequent memory accesses, as the sketch below illustrates. Because of this, these models typically perform best only when the decoder's maximum sequence length is short (e.g., 128).
Neuron Support - The Neuron SDK does not support autoregressive model inference on Inferentia.
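To make the bottleneck concrete, here is an illustrative, framework-agnostic greedy-decoding loop (`decoder` is a hypothetical callable that maps token ids to logits). The full decoder runs once per generated token, so its weights are re-read from memory for every output position:

```python
import torch

def greedy_decode(decoder, input_ids, max_new_tokens=128):
    """`decoder` is a hypothetical callable: token ids -> logits."""
    for _ in range(max_new_tokens):
        logits = decoder(input_ids)               # one full forward pass per token
        next_id = logits[:, -1:].argmax(dim=-1)   # most likely next token
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    return input_ids
```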
Encoder-decoder models, or sequence-to-sequence models with Transformer#
Sequence-to-sequence models use both the encoder and decoder of the Transformer architecture. Representatives of this family include models like T5, BART, and Marian MT.
Architecture Fit - Sequence-to-sequence models are not a good fit for Inferentia. As with the decoder-only models described above, the decoder is usually the most significant performance bottleneck, since it must be executed once per output token and therefore causes frequent memory accesses. Because of this, even when you enable these models to run on Inferentia by wrapping the decoder, they typically perform best only when the decoder's maximum sequence length is short (e.g., 128).
Neuron Support - The Neuron SDK does not support sequence-to-sequence model inference on Inferentia out of the box. However, you can run a model by defining wrappers around its encoder and decoder portions, as sketched below. For a complete example, please refer to the MarianMT tutorial on Inferentia.
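A minimal sketch of the wrapper idea, assuming a Hugging Face-style model with a `get_encoder()` method (`model`, `input_ids`, and `attention_mask` are placeholders; the MarianMT tutorial shows the complete pattern, including the decoder wrapper):

```python
import torch
import torch_neuron  # registers the torch.neuron namespace

class NeuronEncoder(torch.nn.Module):
    """Wrap the encoder so its inputs and outputs are plain tensors."""
    def __init__(self, model):
        super().__init__()
        self.encoder = model.get_encoder()

    def forward(self, input_ids, attention_mask):
        # return_dict=False yields a tuple of tensors, which
        # torch.jit.trace (and hence torch.neuron.trace) can handle.
        return self.encoder(input_ids=input_ids,
                            attention_mask=attention_mask,
                            return_dict=False)

# `model`, `input_ids`, and `attention_mask` are assumed to exist with
# fixed shapes; the decoder would be wrapped and traced the same way.
encoder_neuron = torch.neuron.trace(NeuronEncoder(model),
                                    (input_ids, attention_mask))
```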
Computer Vision Models#
Convolutional Neural Network (CNN) based models#
CNN-based models are used for applications in image classification and object detection. Representatives of this family include models like ResNet, ResNeXt, VGG, YOLO, and SSD.
Architecture Fit - CNN-based models are a good fit for Inferentia.
Neuron Support - The Neuron SDK supports CNN-based model inference on Inferentia; a minimal sketch follows. Please see the benchmark results for these models. To get started with these models, you can refer to the Neuron PyTorch, TensorFlow, and MXNet tutorials.
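For example, compiling torchvision's ResNet-50 follows the standard trace-and-save flow (a minimal sketch in the spirit of the Neuron ResNet-50 tutorial):

```python
import torch
import torch_neuron  # registers the torch.neuron namespace
from torchvision import models

model = models.resnet50(pretrained=True)
model.eval()

# Fixed input shape; Neuron compiles for the shapes seen at trace time.
image = torch.zeros([1, 3, 224, 224], dtype=torch.float32)

model_neuron = torch.neuron.trace(model, example_inputs=[image])
model_neuron.save("resnet50_neuron.pt")
```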
Region-based CNN (R-CNN) models#
Region-based CNN (R-CNN) models are commonly used for object detection and image segmentation tasks. Popular variants of the R-CNN model include R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN.
Architecture Fit - R-CNN models can have a few limitations and considerations on Inferentia:
- RoI Align operators: At this time, RoI Align operators typically cannot run efficiently on NeuronCore v1. As a result, RoI Align operators are mapped directly to CPU during compilation. R-CNN models that predict a low number of bounding boxes (<100) experience the best performance on Inferentia.
- Large ResNet backbone: R-CNNs that have a large ResNet backbone (such as ResNet-50 or ResNet-101) experience the greatest performance improvement on Inferentia, because a larger portion of the R-CNN compute is accelerated.
Neuron Support - Torch models must be traceable using torch.jit.trace() for compilation on Inferentia. Most Detectron2-based R-CNNs are not jit-traceable by default, so they cannot readily be compiled for optimized inference on Inferentia. The Running R-CNNs on Inf1 application note demonstrates how to compile and improve the performance of R-CNN models on Inferentia, and provides an end-to-end example of running a Detectron2 R-CNN on Inferentia.
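As a quick pre-check before attempting Neuron compilation, you can verify traceability directly (a sketch; `model` and `example_inputs` are placeholders for your module and its fixed-shape inputs):

```python
import torch

def is_jit_traceable(model, example_inputs) -> bool:
    """Return True if torch.jit.trace succeeds on the given inputs."""
    try:
        torch.jit.trace(model, example_inputs)
        return True
    except Exception as err:
        print(f"Not traceable as-is: {err}")
        return False
```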
Models with Long Short-Term Memory (LSTM) networks#
LSTMs use an internal state to process sequential data. LSTMs are commonly used to model temporal sequences of data in language processing and computer vision applications.
Architecture Fit - Models with LSTM cells are a good fit for Inferentia.
Neuron Support - Models with LSTM networks are supported on Inferentia; please see Developer Guide - PyTorch Neuron (torch-neuron) LSTM Support.
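As an illustration, a small LSTM traces like any other torch-neuron model once its input shapes are fixed (a minimal sketch with arbitrary sizes; see the LSTM developer guide for the supported usage patterns):

```python
import torch
import torch_neuron  # registers the torch.neuron namespace

class LSTMModel(torch.nn.Module):
    def __init__(self, input_size=64, hidden_size=128):
        super().__init__()
        self.lstm = torch.nn.LSTM(input_size, hidden_size)

    def forward(self, inputs):
        output, (h_n, c_n) = self.lstm(inputs)
        return output

model = LSTMModel().eval()
inputs = torch.rand(25, 1, 64)  # (seq_len, batch, input_size), fixed at trace time
model_neuron = torch.neuron.trace(model, example_inputs=[inputs])
```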
Diffusion Models#
Architecture Fit - Diffusion models are a good fit for Inferentia.
Neuron Support - Diffusion models are not supported on Inferentia as of the latest Neuron release. Please track the Neuron Roadmap for details.
Known Issues on Inferentia (NeuronCore v1)#
Support of large models (impacts torch-neuron and tensorflow-neuron (TF1.x))#
During compilation on Inferentia (NeuronCore v1), torch-neuron and tensorflow-neuron (TF1.x) export a protobuf that contains the model's graph structure and weights. This causes an issue when the total size of the model's weights exceeds the 2 GB protobuf limit. As a result, customers who want to run large models such as RegNet, Stable Diffusion, and t5-11b might run into protobuf errors during compilation.
This is a known issue related to the compilation process, not a hardware limitation. Enabling compilation of such large models for inference on Inferentia (NeuronCore v1) is a feature we intend to address in a future release. Please track the Neuron Roadmap for details.
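One rough way to anticipate the issue is to sum a model's parameter bytes before compiling (a sketch; `model` is a placeholder for your torch.nn.Module):

```python
import torch

def weight_bytes(model: torch.nn.Module) -> int:
    # Total size of all parameters in bytes (weights only, excluding buffers).
    return sum(p.numel() * p.element_size() for p in model.parameters())

# Example: flag models approaching the 2 GB protobuf limit.
if weight_bytes(model) > 1.8 * 2**30:
    print("Warning: weights may exceed the 2 GB protobuf limit during compilation.")
```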
Note
Neuron release 2.5.0 added experimental support for tracing models larger than 2 GB in `tensorflow-neuron (TF2.x)`; please see the extract-weights flag in the TensorFlow 2.x (tensorflow-neuron) Tracing API. A minimal tracing sketch follows.
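For reference, a minimal TF2.x tracing sketch (the model choice is illustrative, and the extract-weights option itself is documented in the Tracing API reference; its exact form is not guessed here):

```python
import tensorflow as tf
import tensorflow.neuron as tfn

# Illustrative Keras model; any TF2.x model traces the same way.
model = tf.keras.applications.ResNet50(weights=None)
example = tf.random.uniform([1, 224, 224, 3])

# Compile for Inferentia with the TF2.x tracing API. For models whose
# weights exceed 2 GB, enable the experimental extract-weights option
# described in the Tracing API reference.
model_neuron = tfn.trace(model, example)
model_neuron.save("resnet50_neuron")
```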