This document is relevant for: Inf2

Inf2 Performance

Last update: Apr 12th, 2023

Language Models Inference Performance

Throughput-optimized (Batch) configurations:

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 2835.99 | 10.79 | 20.65 | $0.074 | Batch | 2.10.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 |
| bert-base-cased | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 2838.14 | 5.58 | 6.15 | $0.074 | Batch | 2.10.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 |
| bert-base-cased-finetuned-mrpc | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Sequence Classification | 3185.58 | 10.82 | 15.51 | $0.066 | Batch | 2.10.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 |
| bert-large-cased | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 822.93 | 40.41 | 50.57 | $0.256 | Batch | 2.10.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 |
| bert-large-uncased | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 797.58 | 20.48 | 27.88 | $0.264 | Batch | 2.10.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 |
| camembert-base | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 2883.78 | 11.08 | 11.32 | $0.073 | Batch | 2.10.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 |
| distilbert-base-cased | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 5042.29 | 6.22 | 7.19 | $0.042 | Batch | 2.10.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 |
| google/electra-base-discriminator | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 2865.69 | 5.54 | 6.07 | $0.073 | Batch | 2.10.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 |
| gpt2 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 1541.04 | 20.35 | 26.52 | $0.137 | Batch | 2.10.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 |
| gpt2-large | Benchmark | PyTorch 1.13.1 | Inf2.48xlarge | Raw Output (AutoModel) | 374.32 | 85.15 | 87.61 | $9.633 | Batch | 2.10.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 |
| gpt2-medium | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 765.13 | 41.22 | 47.85 | $0.275 | Batch | 2.10.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 |
| roberta-base | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 2845.24 | 5.58 | 6.12 | $0.074 | Batch | 2.10.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 |
| roberta-large | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 944.23 | 16.52 | 22.84 | $0.223 | Batch | 2.10.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 |
| xlm-roberta-base | Benchmark | PyTorch 1.13.1 | Inf2.48xlarge | Raw Output (AutoModel) | 30.12 | 1035.27 | 1270.77 | $119.716 | Batch | 2.10.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 |

Latency-optimized (Real Time, batch size 1) configurations:

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 1993.09 | 1.0 | 1.3 | $0.106 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| bert-base-cased | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 1872.71 | 1.07 | 1.2 | $0.112 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| bert-base-cased-finetuned-mrpc | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Sequence Classification | 2018.15 | 0.98 | 1.11 | $0.104 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| bert-large-cased | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 676.43 | 2.92 | 3.29 | $0.311 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| bert-large-uncased | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 692.94 | 2.84 | 3.25 | $0.304 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| camembert-base | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 1851.02 | 1.07 | 1.2 | $0.114 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| distilbert-base-cased | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 3095.17 | 0.64 | 0.76 | $0.068 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| google/electra-base-discriminator | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 1922.0 | 1.04 | 1.17 | $0.110 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| gpt2 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 1110.92 | 1.73 | 2.3 | $0.190 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| gpt2-large | Benchmark | PyTorch 1.13.1 | Inf2.48xlarge | Raw Output (AutoModel) | 205.53 | 9.58 | 9.87 | $17.544 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| gpt2-medium | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 437.6 | 4.46 | 5.42 | $0.481 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| roberta-base | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 1851.43 | 1.08 | 1.2 | $0.114 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| roberta-large | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 707.36 | 2.82 | 2.96 | $0.298 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| xlm-roberta-base | Benchmark | PyTorch 1.13.1 | Inf2.48xlarge | Raw Output (AutoModel) | 25.39 | 78.92 | 92.39 | $142.018 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
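For context, here is a minimal sketch of how a row such as bert-base-cased could be reproduced with torch_neuronx. The compiler flags are an assumption inferred from the "Compilation Autocast Data Type" column (Matmult-BF16 corresponds to auto-casting matrix multiplies to BF16); the published numbers come from the linked benchmark scripts, and this sketch measures a single NeuronCore rather than the full data-parallel setup:

```python
import time
import torch
import torch_neuronx
from transformers import AutoModel, AutoTokenizer

name = "bert-base-cased"
batch_size, seq_len = 8, 128  # values from the bert-base-cased batch row

tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, torchscript=True)
model.eval()

text = ["benchmark input"] * batch_size
inputs = tokenizer(text, max_length=seq_len, padding="max_length",
                   truncation=True, return_tensors="pt")
example = (inputs["input_ids"], inputs["attention_mask"])

# FP32 weights; matrix-multiply ops auto-cast to BF16 at compile time.
neuron_model = torch_neuronx.trace(
    model, example,
    compiler_args=["--auto-cast", "matmult", "--auto-cast-type", "bf16"],
)

# Simple latency/throughput measurement on one NeuronCore.
latencies = []
for _ in range(100):
    start = time.time()
    neuron_model(*example)
    latencies.append(time.time() - start)
latencies.sort()
p50_ms = latencies[len(latencies) // 2] * 1000
throughput = batch_size / (sum(latencies) / len(latencies))
print(f"P50 latency: {p50_ms:.2f} ms, throughput ~ {throughput:.1f} inf/s")
```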

Large Language Models Inference Performance

Throughput-optimized (Batch) configurations:

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| opt-13b | Benchmark | Transformers Neuron | Inf2.48xlarge | Text Generation | 2247.61 | 95.32 | 120.91 | $1.604 | Batch | 2.10.0 | Tensor Parallel | 8 | 1 | 256 | 128 | FP16 | Matmult-BF16 |
| opt-30b | Benchmark | Transformers Neuron | Inf2.48xlarge | Text Generation | 1384.42 | 162.93 | 173.34 | $2.605 | Batch | 2.10.0 | Tensor Parallel | 8 | 1 | 256 | 128 | FP16 | Matmult-BF16 |
| opt-66b | Benchmark | Transformers Neuron | Inf2.48xlarge | Text Generation | 1400.11 | 156.34 | 848.62 | $2.575 | Batch | 2.10.0 | Tensor Parallel | 24 | 1 | 64 | 128 | FP16 | Matmult-BF16 |

Latency-optimized (Real Time, batch size 1) configurations:

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| opt-13b | Benchmark | Transformers Neuron | Inf2.48xlarge | Text Generation | 35.66 | 27.82 | 29.42 | $101.131 | Real Time | 2.10.0 | Tensor Parallel | 8 | 1 | 1 | 2048 | FP16 | Matmult-BF16 |
| opt-30b | Benchmark | Transformers Neuron | Inf2.48xlarge | Text Generation | 19.43 | 51.11 | 54.71 | $185.533 | Real Time | 2.10.0 | Tensor Parallel | 8 | 1 | 1 | 2048 | FP16 | Matmult-BF16 |
| opt-66b | Benchmark | Transformers Neuron | Inf2.48xlarge | Text Generation | 14.33 | 68.92 | 74.33 | $251.576 | Real Time | 2.10.0 | Tensor Parallel | 24 | 1 | 1 | 2048 | FP16 | Matmult-BF16 |
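The "Transformers Neuron" rows use the transformers-neuronx package, which shards each decoder layer across NeuronCores at the degree shown in the "TP Degree" column. The following sketch of the opt-13b real-time configuration assumes the from_pretrained/to_neuron/sample API of transformers-neuronx releases contemporary with this Neuron version; the checkpoint path and prompt are placeholders, not taken from the benchmark scripts:

```python
import torch
from transformers_neuronx.opt.model import OPTForSampling

# Assumed: the HF checkpoint was pre-split for Neuron (e.g. with
# transformers_neuronx.module.save_pretrained_split) into ./opt-13b-split.
model = OPTForSampling.from_pretrained(
    "./opt-13b-split",   # hypothetical local path to the split checkpoint
    tp_degree=8,         # "TP Degree" column: shard layers across 8 NeuronCores
    batch_size=1,        # real-time row; the batch rows use batch_size=256
    amp="f16",           # "Model Data Type" column: FP16
)
model.to_neuron()        # compile and load onto the NeuronCores

# Generate up to a 2048-token sequence (the real-time rows' sequence length).
input_ids = torch.tensor([[1, 2, 3, 4]])  # placeholder prompt token ids
with torch.inference_mode():
    generated = model.sample(input_ids, sequence_length=2048)
print(generated.shape)
```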

Vision Models Inference Performance

Throughput-optimized (Batch) configurations (Stable Diffusion 2.1 is benchmarked at batch size 1 in both tables):

| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Inferences per dollar | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion 2.1 | 512x512 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Generation | 0.44 | 2286.0 | 2566.0 | 2,075/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion 2.1 | 768x768 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Generation | 0.13 | 7956.0 | 8087.0 | 598/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| UNet | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Segmentation | 850.53 | 18.76 | 19.33 | 4,039,442/$ | Batch | 2.10.0 | Data Parallel | 4 | FP32 | Matmult-BF16 |
| google/vit-base-patch16-224 | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 1403.75 | 223.53 | 258.99 | 6,666,868/$ | Batch | 2.10.0 | Data Parallel | 32 | FP32 | Matmult-BF16 |
| resnet101 | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 3052.69 | 83.81 | 83.94 | 14,498,273/$ | Batch | 2.10.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| resnet152 | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 2310.55 | 110.68 | 110.86 | 10,973,598/$ | Batch | 2.10.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| resnet18 | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 5793.64 | 5.52 | 5.6 | 27,515,973/$ | Batch | 2.10.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| resnet34 | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 4570.57 | 7.0 | 7.06 | 21,707,197/$ | Batch | 2.10.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| resnet50 | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 3896.43 | 8.21 | 8.28 | 18,505,472/$ | Batch | 2.10.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| vgg11 | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 3347.13 | 75.79 | 76.56 | 15,896,674/$ | Batch | 2.10.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| vgg16 | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 1781.93 | 17.95 | 18.06 | 8,462,983/$ | Batch | 2.10.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |

Note

Cost per 1M inferences is calculated using the On-Demand hourly rate; the vision tables report the inverse metric, inferences per dollar.

"Real Time" rows run inference at batch size 1 for minimal latency. "Batch" rows run at the batch size that maximizes throughput and minimizes cost per inference.
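The cost columns follow directly from throughput and the instance's hourly price. Below is a minimal sketch of the arithmetic, assuming us-east-1 On-Demand rates for inf2.xlarge and inf2.48xlarge at the time of writing (verify against current EC2 pricing before relying on them):

```python
# cost per 1M inferences = hourly rate / (throughput * 3600 s) * 1e6
HOURLY_RATE = {"inf2.xlarge": 0.7582, "inf2.48xlarge": 12.9813}  # assumed USD/hr

def cost_per_million(throughput_per_sec: float, instance: str) -> float:
    """Dollars per one million inferences at the given sustained throughput."""
    inferences_per_hour = throughput_per_sec * 3600
    return HOURLY_RATE[instance] / inferences_per_hour * 1e6

def inferences_per_dollar(throughput_per_sec: float, instance: str) -> float:
    """Inverse metric used by the vision tables."""
    return throughput_per_sec * 3600 / HOURLY_RATE[instance]

# Matches the bert-base-cased batch row: ~$0.074 per 1M inferences.
print(round(cost_per_million(2838.14, "inf2.xlarge"), 3))
# Approximates the resnet50 batch row: ~18.5M inferences per dollar.
print(round(inferences_per_dollar(3896.43, "inf2.xlarge")))
```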

Latency-optimized (Real Time, batch size 1) configurations:

| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Inferences per dollar | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion 2.1 | 512x512 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Generation | 0.44 | 2286.0 | 2566.0 | 2,075/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion 2.1 | 768x768 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Generation | 0.13 | 7956.0 | 8087.0 | 598/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| UNet | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Segmentation | 334.54 | 2.96 | 3.76 | 1,588,868/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| google/vit-base-patch16-224 | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 358.36 | 2.81 | 3.72 | 1,701,983/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet101 | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 830.24 | 1.19 | 1.32 | 3,943,078/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet152 | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 694.81 | 1.43 | 1.53 | 3,299,903/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet18 | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 1147.19 | 0.88 | 1.0 | 5,448,377/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet34 | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 1034.72 | 0.95 | 1.3 | 4,914,233/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet50 | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 970.46 | 1.01 | 1.14 | 4,609,040/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| vgg11 | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 1103.54 | 1.97 | 2.05 | 5,241,082/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| vgg16 | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 418.54 | 2.38 | 2.48 | 1,987,779/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
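The "Data Parallel" run mode in these tables replicates one compiled model across the instance's NeuronCores and splits each input batch across the replicas. A minimal sketch using torch_neuronx.DataParallel, assuming a previously traced and saved ResNet-50 (the file name is hypothetical):

```python
import torch
import torch_neuronx

# Load a previously compiled model (hypothetical artifact produced by
# torch_neuronx.trace + torch.jit.save) and wrap it for data parallelism.
neuron_model = torch.jit.load("resnet50_neuron.pt")
dp_model = torch_neuronx.DataParallel(neuron_model)

# inf2.xlarge exposes 2 NeuronCores, so a batch of 16 is split into two
# batches of 8, one per core; the throughput-optimized rows size the batch
# so that every core stays saturated.
images = torch.rand(16, 3, 224, 224)
with torch.inference_mode():
    output = dp_model(images)
print(output.shape)
```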


Note

See the Neuron Glossary for abbreviations and terms.
