This document is relevant for: Inf1, Inf2, Trn1, Trn1n

Inf2 Performance


Last update: Apr 12th, 2023

Inference Performance

Batch application type:

| Model | Scripts | Framework | Inst. Type | Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| albert-base-v2 | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 2438 | 3.15 | 5.25 | $0.086 | Batch | 2.9.0 | Data Parallel | 4 | FP32 | Matmult-BF16 |
| bert-base-cased | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 2599 | 6.13 | 6.45 | $0.081 | Batch | 2.9.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| bert-base-cased-finetuned-mrpc | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 2978 | 5.33 | 5.7 | $0.071 | Batch | 2.9.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| bert-large-cased | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 866 | 18.13 | 21.47 | $0.243 | Batch | 2.9.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| distilbert-base-cased | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 3721 | 8.96 | 11.61 | $0.057 | Batch | 2.9.0 | Data Parallel | 4 | FP32 | Matmult-BF16 |
| opt-13b | | PyTorch 1.13.0 | Inf2.48xlarge | 1355 | 141.6 | 151.9 | $2.661 | Batch | 2.9.0 | Tensor Parallel | 5 | | |
| opt-30b | | PyTorch 1.13.0 | Inf2.48xlarge | 627 | 82.6 | 106.9 | $5.752 | Batch | 2.9.0 | Tensor Parallel | 64 | | |
| opt-66b | | PyTorch 1.13.0 | Inf2.48xlarge | 733 | 248.6 | 257.8 | $4.917 | Batch | 2.9.0 | Tensor Parallel | 256 | | |
| roberta-base | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 2379 | 3.26 | 4.43 | $0.089 | Batch | 2.9.0 | Data Parallel | 4 | FP32 | Matmult-BF16 |
| roberta-large | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 886 | 8.86 | 10.61 | $0.238 | Batch | 2.9.0 | Data Parallel | 4 | FP32 | Matmult-BF16 |
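The "Cost per 1M inferences" column follows directly from the instance's hourly on-demand rate and the measured throughput. A minimal sketch of the arithmetic (the hourly rates below are assumptions based on typical us-east-1 on-demand pricing, not values stated in this document; check current AWS pricing):

```python
def cost_per_million_inferences(hourly_rate_usd: float, throughput_per_sec: float) -> float:
    """Cost to serve 1M inferences at a sustained throughput.

    cost = (hourly rate / 3600 s) / (inferences per second) * 1e6
    """
    return hourly_rate_usd / 3600.0 / throughput_per_sec * 1e6

# Assumed on-demand rates (hypothetical values for illustration):
INF2_XLARGE_HOURLY = 0.7582
INF2_48XLARGE_HOURLY = 12.9813

# bert-base-cased batch row: 2599 inferences/sec on Inf2.xlarge
print(round(cost_per_million_inferences(INF2_XLARGE_HOURLY, 2599), 3))    # → 0.081
# opt-13b batch row: 1355 inferences/sec on Inf2.48xlarge
print(round(cost_per_million_inferences(INF2_48XLARGE_HOURLY, 1355), 3))  # → 2.661
```

With these assumed rates the formula reproduces the table's $0.081 and $2.661 figures, which is why higher-throughput configurations show a lower cost per million inferences on the same instance.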

Real Time application type:

| Model | Scripts | Framework | Inst. Type | Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| albert-base-v2 | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 1649.38 | 1.19 | 1.53 | $0.128 | Real Time | 2.9.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| bert-base-cased | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 1730.86 | 1.14 | 1.37 | $0.122 | Real Time | 2.9.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| bert-base-cased-finetuned-mrpc | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 1885.25 | 1.05 | 1.17 | $0.112 | Real Time | 2.9.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| bert-large-cased | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 647.61 | 3.07 | 3.43 | $0.325 | Real Time | 2.9.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| distilbert-base-cased | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 2612.46 | 0.72 | 1.15 | $0.081 | Real Time | 2.9.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| opt-13b | | PyTorch 1.13.0 | Inf2.48xlarge | 36.1 | 28.0 | 28.3 | $99.885 | Real Time | 2.9.0 | Tensor Parallel | 1 | | |
| opt-30b | | PyTorch 1.13.0 | Inf2.48xlarge | 20.7 | 48.4 | 50.1 | $174.195 | Real Time | 2.9.0 | Tensor Parallel | 1 | | |
| opt-66b | | PyTorch 1.13.0 | Inf2.48xlarge | 14.9 | 65.7 | 74.2 | $242.002 | Real Time | 2.9.0 | Tensor Parallel | 1 | | |
| roberta-base | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 1726.45 | 1.14 | 1.42 | $0.122 | Real Time | 2.9.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| roberta-large | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 628.55 | 3.17 | 3.52 | $0.335 | Real Time | 2.9.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
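The Latency P50 and P99 columns are percentiles over the per-request latency samples collected during a run. One common way to compute them is the nearest-rank method, sketched below for illustration (the actual benchmark scripts may use a different interpolation scheme):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100.0 * len(ordered))  # 1-based rank into the sorted samples
    return ordered[rank - 1]

# Hypothetical per-request latencies in milliseconds
latencies_ms = [1.1, 1.2, 1.0, 1.3, 5.0, 1.2, 1.1, 1.4, 1.2, 1.3]
print(percentile(latencies_ms, 50))  # → 1.2 (median request)
print(percentile(latencies_ms, 99))  # → 5.0 (tail request)
```

This is why P99 is the more meaningful number for real-time applications: a single slow outlier barely moves the median but dominates the tail.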

Note: See the Neuron Glossary for abbreviations and terms.
