This document is relevant for: Inf1

# Inf1 Inference Performance

The following tables show reference inference performance for models covered in the Neuron tutorials. Follow the links in each row to replicate similar results in your own environment. Refer to the Setup Environment documentation to create a new environment based on the latest Neuron release.

Last update: November 23, 2022

## Natural Language Processing

| Model | Scripts | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT base (bert-base-cased) | Compile + Benchmark | PyTorch 1.11.0 | inf1.xlarge | 1179 | 41 | 42 | $0.054 | Batch | 2.5.0 | Data Parallel | 6 | fp32, sequence-length=128 |
| BERT base (bert-base-uncased) | Compile + Benchmark | PyTorch 1.11.0 | inf1.xlarge | 1180 | 41 | 41 | $0.054 | Batch | 2.5.0 | Data Parallel | 6 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased) | Compile + Benchmark | PyTorch 1.11.0 | inf1.xlarge | 1948 | 25 | 25 | $0.033 | Batch | 2.5.0 | Data Parallel | 6 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased-finetuned-sst-2-english) | Compile + Benchmark | PyTorch 1.11.0 | inf1.xlarge | 1953 | 29 | 29 | $0.032 | Batch | 2.5.0 | Data Parallel | 7 | fp32, sequence-length=128 |
| DistilRoBERTa base (distilroberta-base) | Compile + Benchmark | PyTorch 1.11.0 | inf1.xlarge | 2134 | 22 | 25 | $0.030 | Batch | 2.5.0 | Data Parallel | 6 | fp32, sequence-length=128 |

*Throughput and latency numbers in this table were computed using NeuronPerf. To reproduce these results, install NeuronPerf and run the provided scripts.
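For reference, a compile-and-benchmark script for the first row might look like the following. This is a minimal sketch, assuming NeuronPerf's documented `compile`/`benchmark` interface for PyTorch (verify exact signatures against your installed version); the model name, sequence length, and batch size come from the table, while the input text and file name are illustrative.

```python
import neuronperf as npf
import neuronperf.torch  # framework-specific NeuronPerf submodule
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Settings taken from the first table row
model_name = "bert-base-cased"
seq_len = 128
batch_size = 6

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, torchscript=True).eval()

# Build one example input at the benchmark batch size (text is illustrative)
enc = tokenizer(["NeuronPerf benchmark example."] * batch_size,
                max_length=seq_len, padding="max_length",
                truncation=True, return_tensors="pt")
example = (enc["input_ids"], enc["attention_mask"])

# Compile for Inferentia, then benchmark the saved artifact
filename = npf.torch.compile(model, [example], batch_sizes=[batch_size],
                             filename="bert_base_cased_neuron.pt")
reports = npf.torch.benchmark(filename, [example], batch_sizes=[batch_size])

# Prints throughput and p50/p99 latency, the same metrics as the columns above
npf.print_reports(reports)
```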

| Model | Tutorial | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT base | HuggingFace Pretrained BERT | PyTorch 1.11.0 | inf1.xlarge | 966 | 21 | 23 | $0.066 | Batch | 2.5.0 | Data Parallel | 6 | fp32, bert-base-cased-finetuned-mrpc, sequence-length=128 |
| DistilBERT base | HuggingFace distilBERT with Tensorflow2 | TensorFlow 2.8 | inf1.6xlarge | 3036 | 42 | 43 | $0.108 | Batch | 2.5.0 | Data Parallel | 128 | fp32, distilbert-base-uncased-finetuned-sst-2-english, sequence-length=128 |
| BERT base | Using NeuronCore Pipeline | PyTorch 1.11.0 | inf1.6xlarge | 1996 | 6 | 6 | $0.164 | Real Time | 2.5.0 | Model Pipeline | 1 | fp32, bert-base-uncased, sequence-length=128 |

*Throughput and latency numbers in this table were generated using Neuron Tutorials.
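The Run Mode column distinguishes two ways of using the NeuronCores on an instance: Data Parallel loads a replica of the compiled model on each core and splits batches across them, while Model Pipeline shards one model across several cores so each core holds a slice of the layers. Below is a minimal sketch of both with torch-neuron; the model matches the tables, but the pipeline core count of 4 and the input text are illustrative assumptions.

```python
import torch
import torch_neuron
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", torchscript=True).eval()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("Run mode example.", max_length=128, padding="max_length",
                truncation=True, return_tensors="pt")
example = (enc["input_ids"], enc["attention_mask"])

# Data Parallel: compile for a single core, then replicate across all
# visible NeuronCores; incoming batches are split across the replicas.
neuron_model = torch.neuron.trace(model, example_inputs=example)
data_parallel_model = torch.neuron.DataParallel(neuron_model)
outputs = data_parallel_model(*example)

# Model Pipeline: shard one model across several cores at compile time
# (4 cores is illustrative; the pipeline row above uses an inf1.6xlarge).
pipeline_model = torch.neuron.trace(
    model, example_inputs=example,
    compiler_args=["--neuroncore-pipeline-cores", "4"])
```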

## Computer Vision

| Model | Tutorial | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenPose | Running OpenPose on Inferentia | TensorFlow 1.15 | inf1.xlarge | 57 | 60 | 67 | $1.105 | Real Time | 2.5.0 | Data Parallel | 1 | fp16 |
| ResNet-50 | ResNet-50 optimization example | TensorFlow 1.15 | inf1.xlarge | 2162 | 18 | 23 | $0.029 | Batch | 2.5.0 | Data Parallel | 10 | fp16 |
| ResNet-50 | Resnet50 model for Inferentia | PyTorch 1.11.0 | inf1.xlarge | 947 | 21 | 22 | $0.067 | Batch | 2.5.0 | Data Parallel | 5 | fp32 |
| YOLO v4 | Evaluate YOLO v4 on Inferentia | PyTorch 1.11.0 | inf1.2xlarge | 177 | 41 | 55 | $0.569 | Real Time | 2.5.0 | Data Parallel | 1 | fp32 |

*Throughput and latency numbers in this table were generated using Neuron Tutorials.
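For the PyTorch ResNet-50 row, the compile step reduces to a single trace call. This is a minimal sketch, assuming torchvision's pretrained weights; the batch size of 5 matches the table row, and the file name is illustrative.

```python
import torch
import torch_neuron
from torchvision import models

# Trace a pretrained FP32 ResNet-50 for Inferentia at the table's batch size
model = models.resnet50(pretrained=True).eval()
batch_size = 5  # batch size from the table row above
example = torch.rand(batch_size, 3, 224, 224)

neuron_model = torch.neuron.trace(model, example_inputs=[example])
neuron_model.save("resnet50_neuron_b5.pt")  # illustrative file name

# On an Inf1 instance, reload the compiled artifact and run without recompiling
restored = torch.jit.load("resnet50_neuron_b5.pt")
logits = restored(example)
```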

**Note**

Cost per 1M inferences is calculated using the US East (N. Virginia) On-Demand hourly rate.

A *Real Time* application runs inference at batch size 1 for minimal latency; a *Batch* application maximizes throughput at the lowest cost per inference.
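The cost column follows directly from the throughput column and the instance's hourly price: cost per 1M inferences = hourly rate / (throughput × 3600) × 10⁶. Below is a quick check against the first NLP row; the $0.228/hour inf1.xlarge price is an assumption based on historical us-east-1 On-Demand pricing, so verify it against current rates.

```python
def cost_per_1m_inferences(hourly_rate_usd: float, throughput_per_sec: float) -> float:
    """Dollars per one million inferences at a sustained throughput."""
    inferences_per_hour = throughput_per_sec * 3600
    return hourly_rate_usd / inferences_per_hour * 1_000_000

# inf1.xlarge at an assumed $0.228/hour, 1179 inferences/sec (first NLP row)
print(f"${cost_per_1m_inferences(0.228, 1179):.3f}")  # ~$0.054, matching the table
```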

This document is relevant for: Inf1