Inferentia Model Architecture Fit
This section describes what types of deep learning architectures perform well out of the box on Inferentia. It provides guidance on how Neuron maps operations to Inferentia and discusses techniques you can use to optimize your deep learning models for Inferentia.
AWS Neuron, the SDK for Inferentia, enables you to deploy a wide range of pretrained deep learning models on AWS machine learning (ML) chips. Neuron includes a deep learning compiler, a runtime, and tools natively integrated into popular ML frameworks like TensorFlow, PyTorch and Apache MXNet (Incubating).
Many popular models used in today’s leading AI applications run out of the box on Inferentia. The following are examples of model types that perform well on Inferentia:
Transformer-based Natural Language Processing/Understanding (NLP/NLU) models, such as HuggingFace Transformers BERT, DistilBERT, XLM-BERT, RoBERTa and BioBERT. To get started with NLP models, refer to the Neuron PyTorch, TensorFlow and MXNet NLP tutorials.
Generative language models like MarianMT, Pegasus and BART.
Computer vision models.
Recommender engine models that include embedding and MLP layers.
The following points provide guidelines for deploying a model that doesn’t fit into one of the above categories, including your own custom models. We encourage you to compile and run the model on Inferentia and to contact us for support if needed.
Neuron supports a wide range of operators for popular model types. That said, with the Neuron auto-partition feature, not every operator in a model has to be supported by Neuron for the model to deploy successfully on Inferentia.
Prior to compilation, the Neuron extension in the given framework examines the operators in the model and partitions the model graph into subgraphs. Subgraphs that contain unsupported operators execute within the framework on the instance's CPU, and subgraphs that contain supported operators execute on the Inferentia accelerator.
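The partitioning idea can be illustrated with a minimal sketch in plain Python. This is not the Neuron implementation: the operator names, the `SUPPORTED_OPS` set, and the `partition_ops` helper are hypothetical, and the model is simplified to a linear sequence of operators rather than a full graph. The sketch only shows how consecutive supported and unsupported operators could be grouped into subgraphs with different execution targets.

```python
from itertools import groupby

# Hypothetical set of operators the accelerator supports (illustrative only,
# not the actual Neuron supported-operator list).
SUPPORTED_OPS = {"conv2d", "relu", "matmul", "add", "softmax"}


def partition_ops(ops, supported=SUPPORTED_OPS):
    """Group a linear sequence of operator names into contiguous subgraphs.

    Returns a list of (target, [ops]) pairs, where target is "inferentia"
    for a run of supported operators and "cpu" for a run of unsupported ones.
    """
    subgraphs = []
    for is_supported, run in groupby(ops, key=lambda op: op in supported):
        target = "inferentia" if is_supported else "cpu"
        subgraphs.append((target, list(run)))
    return subgraphs


# Example: one unsupported operator splits the model into three subgraphs.
model_ops = ["conv2d", "relu", "custom_nms", "matmul", "add", "softmax"]
plan = partition_ops(model_ops)
for target, run in plan:
    print(f"{target}: {run}")
```

In this toy example, the single unsupported operator forces two transitions between the accelerator and the CPU, which hints at why a model with many scattered unsupported operators may need further optimization.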
While many models perform very well with subgraphs running on CPU, especially if the operations map well to CPU execution, it is possible that the performance will not meet your application needs. In such cases, we encourage you to contact us for further optimization.
With Neuron, the input shape is fixed at compile time. If your application requires multiple input sizes, we recommend using padding or bucketing techniques. With padding, you compile your model for the largest expected input size, pad smaller inputs up to that size, and test your application performance. If performance is not within your targets, consider implementing a bucketing scheme. With bucketing, you compile your model for a few input size categories that represent the range of possible input sizes. For some applications, bucketing improves compute utilization compared to padding, especially if small input sizes are more frequent than large ones. If the varying input dimension is the batch size, dynamic batching can be used in TensorFlow-Neuron and PyTorch-Neuron to run inference with a larger batch size than the compiled batch size (see Neuron Batching).
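As a rough illustration of bucketing combined with padding, the sketch below selects the smallest bucket that fits an input and pads the input up to that size. The bucket sizes, the padding token id, and the helper names `pick_bucket` and `pad_to_bucket` are assumptions made for this example and are not part of the Neuron API. In a real application, each bucket size would correspond to a separately compiled model.

```python
import bisect

# Hypothetical bucket sizes: one compiled model per sequence length.
BUCKETS = [32, 64, 128]
PAD_ID = 0  # assumed padding token id


def pick_bucket(length, buckets=BUCKETS):
    """Return the smallest bucket size that can hold `length` tokens."""
    i = bisect.bisect_left(buckets, length)
    if i == len(buckets):
        raise ValueError(
            f"input length {length} exceeds largest bucket {buckets[-1]}"
        )
    return buckets[i]


def pad_to_bucket(token_ids, buckets=BUCKETS, pad_id=PAD_ID):
    """Pad `token_ids` on the right up to the size of its bucket."""
    bucket = pick_bucket(len(token_ids), buckets)
    padded = list(token_ids) + [pad_id] * (bucket - len(token_ids))
    return bucket, padded


# A 3-token input is routed to the 32-token model instead of the 128-token one.
bucket, padded = pad_to_bucket([101, 2023, 102])
print(bucket, len(padded))
```

With padding alone, every input would be padded to the largest size (128 in this sketch), so short inputs would spend most of their compute on padding tokens. Bucketing trades a few extra compiled models for less wasted work.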
Models that contain control flow operators (see Transformers MarianMT Tutorial) may require specific handling to ensure successful compilation with Neuron.
Currently, all tensor shapes (dimension sizes) in the compute graph must be known at compilation time. Compilation of a model with shapes that cannot be determined at compile time will fail.
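One way to catch this early is to check the tensor shapes you intend to compile with before invoking the compiler. The sketch below is a hypothetical pre-check, not a Neuron utility. It assumes shapes are given as tuples and that an unknown dimension is represented as `None` or a negative number, a convention several frameworks use.

```python
def is_static_shape(shape):
    """True if every dimension is a non-negative integer known ahead of time."""
    return all(isinstance(d, int) and d >= 0 for d in shape)


def find_dynamic_tensors(shapes):
    """Return the names of tensors whose shape has an unknown dimension.

    `shapes` maps a tensor name to its shape tuple (hypothetical format).
    """
    return [name for name, shape in shapes.items() if not is_static_shape(shape)]


# Example: the batch dimension of "attention_mask" is unknown.
shapes = {
    "input_ids": (1, 128),
    "attention_mask": (None, 128),
    "logits": (1, 2),
}
dynamic = find_dynamic_tensors(shapes)
print(dynamic)
```

Any tensor reported by such a check would need its shape fixed, for example by choosing a concrete batch size or sequence length, before compilation.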
For additional resources see: