This document is relevant for: Inf1

Inf1 Inference Performance
The following tables contain the reference inference performance for models in the tutorials. Follow the links in each row to reproduce similar results in your own environment. Refer to the Setup Environment documentation to create a new environment based on the latest Neuron release.
Last update: November 23, 2022
Natural Language Processing
| Model | Scripts | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT base (bert-base-cased) | | PyTorch 1.11.0 | inf1.xlarge | 1179 | 41 | 42 | $0.054 | Batch | 2.5.0 | Data Parallel | 6 | fp32, sequence-length=128 |
| BERT base (bert-base-uncased) | | PyTorch 1.11.0 | inf1.xlarge | 1180 | 41 | 41 | $0.054 | Batch | 2.5.0 | Data Parallel | 6 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased) | | PyTorch 1.11.0 | inf1.xlarge | 1948 | 25 | 25 | $0.033 | Batch | 2.5.0 | Data Parallel | 6 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased-finetuned-sst-2-english) | | PyTorch 1.11.0 | inf1.xlarge | 1953 | 29 | 29 | $0.032 | Batch | 2.5.0 | Data Parallel | 7 | fp32, sequence-length=128 |
| DistilRoBERTa base (distilroberta-base) | | PyTorch 1.11.0 | inf1.xlarge | 2134 | 22 | 25 | $0.030 | Batch | 2.5.0 | Data Parallel | 6 | fp32, sequence-length=128 |
*Throughput and latency numbers in this table were computed using NeuronPerf. To reproduce these results, install NeuronPerf and run the provided scripts.
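Reproducing these numbers assumes NeuronPerf is installed in your environment. A minimal install sketch, assuming the AWS Neuron pip repository as the package source (verify the repository URL and package name against the current Setup Environment guide):

```shell
# NeuronPerf is distributed via the AWS Neuron pip repository;
# confirm the index URL against the current Neuron setup documentation.
pip install neuronperf --extra-index-url=https://pip.repos.neuron.amazonaws.com
```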
| Model | Tutorial | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT base | | PyTorch 1.11.0 | inf1.xlarge | 966 | 21 | 23 | $0.066 | Batch | 2.5.0 | Data Parallel | 6 | fp32, bert-base-cased-finetuned-mrpc, sequence-length=128 |
| BERT base | | Tensorflow 2.8 | inf1.6xlarge | 3036 | 42 | 43 | $0.108 | Batch | 2.5.0 | Data Parallel | 128 | fp32, distilbert-base-uncased-finetuned-sst-2-english, sequence-length=128 |
| BERT base | | PyTorch 1.11.0 | inf1.6xlarge | 1996 | 6 | 6 | $0.164 | Real Time | 2.5.0 | Model Pipeline | 1 | fp32, bert-base-uncased, sequence-length=128 |
*Throughput and latency numbers in this table were generated using Neuron Tutorials.
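The Latency P50 and P99 columns in these tables are percentiles over many timed requests: P50 is the median latency and P99 bounds the slowest 1% of requests. A minimal sketch of how such percentiles can be computed from recorded per-request latencies, using the nearest-rank method (the helper name is hypothetical, not part of any Neuron tool):

```python
import math

def latency_percentile(latencies_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest recorded latency such that
    at least pct percent of samples are at or below it."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# 100 synthetic request latencies: 1 ms, 2 ms, ..., 100 ms
samples = [float(ms) for ms in range(1, 101)]
print(latency_percentile(samples, 50))  # -> 50.0  (median)
print(latency_percentile(samples, 99))  # -> 99.0
```

Benchmarking tools such as NeuronPerf report these percentiles directly; the sketch only illustrates what the columns mean.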
Computer Vision
| Model | Tutorial | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenPose | | Tensorflow 1.15 | inf1.xlarge | 57 | 60 | 67 | $1.105 | Real Time | 2.5.0 | Data Parallel | 1 | fp16 |
| Resnet-50 | | Tensorflow 1.15 | inf1.xlarge | 2162 | 18 | 23 | $0.029 | Batch | 2.5.0 | Data Parallel | 10 | fp16 |
| Resnet-50 | | PyTorch 1.11.0 | inf1.xlarge | 947 | 21 | 22 | $0.067 | Batch | 2.5.0 | Data Parallel | 5 | fp32 |
| YOLO v4 | | PyTorch 1.11.0 | inf1.2xlarge | 177 | 41 | 55 | $0.569 | Real Time | 2.5.0 | Data Parallel | 1 | fp32 |
*Throughput and latency numbers in this table were generated using Neuron Tutorials.
Note

Cost per 1M inferences is calculated using the US East (N. Virginia) On-Demand hourly rate.

A Real Time application runs inference at batch size 1 for minimal latency. A Batch application maximizes throughput at the lowest cost per inference.
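The cost column follows directly from the throughput column and the instance's hourly rate. A sketch of the calculation, assuming the US East (N. Virginia) On-Demand rates in effect when these tables were generated (hard-coded below as an assumption; check current EC2 pricing, and expect small differences from rounding of the published throughput):

```python
# Assumed US East (N. Virginia) On-Demand hourly rates (USD) at the time
# these tables were generated; verify against current EC2 pricing.
HOURLY_RATE_USD = {
    "inf1.xlarge": 0.228,
    "inf1.2xlarge": 0.362,
    "inf1.6xlarge": 1.180,
}

def cost_per_1m_inferences(instance_type: str, throughput_per_sec: float) -> float:
    """USD cost to serve 1 million inferences at a sustained throughput."""
    inferences_per_hour = throughput_per_sec * 3600
    return HOURLY_RATE_USD[instance_type] / inferences_per_hour * 1_000_000

# BERT base (bert-base-cased) on inf1.xlarge at 1179 inferences/sec:
print(round(cost_per_1m_inferences("inf1.xlarge", 1179), 3))  # -> 0.054
```

For example, the inf1.6xlarge Tensorflow BERT row (3036 inferences/sec) works out to about $0.108 per 1M inferences, matching the table.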