Neuron Inference Performance
The following tables list the reference inference performance for models in the Neuron Tutorials. Follow the link in each row to replicate similar results in your own environment. Refer to the Setup Environment documentation to create a new environment based on the latest Neuron release.
Last update: October 27th, 2021
Natural Language Processing
| Model | Tutorial | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT base | | PyTorch 1.9.1 | inf1.xlarge | 966 | 22 | 23 | $0.066 | Batch | 1.16.0 | Data Parallel | 6 | fp32, bert-base-cased-finetuned-mrpc, sequence-length=128 |
| BERT base | | PyTorch 1.9.1 | inf1.6xlarge | 2037 | 6 | 6 | $0.161 | Real Time | 1.16.0 | Model Pipeline | 1 | fp32, bert-base-uncased, sequence-length=128 |
| BERT base | | Tensorflow 2.5 | inf1.6xlarge | 1035 | 123 | 125 | $0.317 | Batch | 1.16.0 | Data Parallel | 128 | fp32, distilbert-base-uncased-finetuned-sst-2-english, sequence-length=128 |
Note

Cost per 1M inferences is calculated using the US East (N. Virginia) On-Demand hourly rate.

A Real Time application runs at batch size 1 for minimal latency; a Batch application runs at the batch size that maximizes throughput and minimizes cost per inference.
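Each cost figure follows directly from the row's average throughput and the instance's hourly price. Below is a minimal sketch of that arithmetic; the dollar rates are assumptions based on us-east-1 On-Demand pricing at the time of writing, so verify them against current EC2 pricing before relying on the output.

```python
# Sketch: derive "cost per 1M inferences" from average throughput and the
# instance's On-Demand hourly rate. Rates below are assumed values.
HOURLY_RATE_USD = {
    "inf1.xlarge": 0.228,   # assumed us-east-1 On-Demand rate
    "inf1.2xlarge": 0.362,  # assumed
    "inf1.6xlarge": 1.180,  # assumed
}

def cost_per_million(instance_type: str, avg_throughput_per_sec: float) -> float:
    """Cost (USD) to serve 1M inferences at the observed average throughput."""
    inferences_per_hour = avg_throughput_per_sec * 3600
    return HOURLY_RATE_USD[instance_type] / inferences_per_hour * 1_000_000

# Example: BERT base on inf1.xlarge at 966 inferences/sec
print(f"${cost_per_million('inf1.xlarge', 966):.3f}")  # ~$0.066, matching the table
```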
Computer Vision
| Model | Tutorial | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenPose | | Tensorflow 1.15 | inf1.xlarge | 58 | 62 | 69 | $1.101 | Real Time | 1.16.0 | Data Parallel | 1 | fp16 |
| Resnet-50 | | Tensorflow 1.15 | inf1.xlarge | 2270 | 17 | 21 | $0.028 | Batch | 1.16.0 | Data Parallel | 10 | fp16 |
| Resnet-50 | | PyTorch 1.9.1 | inf1.xlarge | 919 | 22 | 23 | $0.069 | Batch | 1.16.0 | Data Parallel | 5 | fp32 |
| YOLO v4 | | PyTorch 1.9.1 | inf1.2xlarge | 185 | 42 | 61 | $0.544 | Real Time | 1.16.0 | Data Parallel | 1 | fp32 |
Note

Cost per 1M inferences is calculated using the US East (N. Virginia) On-Demand hourly rate.

A Real Time application runs at batch size 1 for minimal latency; a Batch application runs at the batch size that maximizes throughput and minimizes cost per inference.
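The throughput and latency columns can be collected with a simple timing loop around a compiled model. Below is a minimal sketch of how average throughput and P50/P99 latency might be measured; the model path and input shape are placeholders, not values from the tutorials.

```python
import time
import numpy as np
import torch

# Hypothetical: a model previously compiled with torch_neuron.trace() and
# saved to disk. The path and input shape below are placeholders.
model = torch.jit.load("bert_neuron.pt")
example = (torch.zeros(1, 128, dtype=torch.long),)  # batch size 1, sequence-length=128

latencies_ms = []
num_requests = 1000

start = time.time()
for _ in range(num_requests):
    t0 = time.time()
    model(*example)                                  # one inference request
    latencies_ms.append((time.time() - t0) * 1000)   # per-request latency in ms
elapsed = time.time() - start

batch_size = example[0].shape[0]
print(f"Avg throughput: {num_requests * batch_size / elapsed:.0f} inferences/sec")
print(f"Latency P50: {np.percentile(latencies_ms, 50):.1f} ms")
print(f"Latency P99: {np.percentile(latencies_ms, 99):.1f} ms")
```

Note that this single-threaded loop drives only one NeuronCore at a time; the tutorials run multiple workers in parallel (the Run Mode column), so a loop like this will understate the throughput numbers in the tables.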