This document is relevant for: Inf1, Inf2, Trn1, Trn1n

Inf2 Performance


Last update: Apr 12th, 2023

Inference Performance

Batch application type:

| Model | Scripts | Framework | Inst. Type | Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| albert-base-v2 | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 2438 | 3.15 | 5.25 | $0.086 | Batch | 2.9.0 | Data Parallel | 4 | FP32 | Matmult-BF16 |
| bert-base-cased | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 2599 | 6.13 | 6.45 | $0.081 | Batch | 2.9.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| bert-base-cased-finetuned-mrpc | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 2978 | 5.33 | 5.7 | $0.071 | Batch | 2.9.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| bert-large-cased | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 866 | 18.13 | 21.47 | $0.243 | Batch | 2.9.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| distilbert-base-cased | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 3721 | 8.96 | 11.61 | $0.057 | Batch | 2.9.0 | Data Parallel | 4 | FP32 | Matmult-BF16 |
| opt-13b | | PyTorch 1.13.0 | Inf2.48xlarge | 1355 | 141.6 | 151.9 | $2.661 | Batch | 2.9.0 | Tensor Parallel | 5 | | |
| opt-30b | | PyTorch 1.13.0 | Inf2.48xlarge | 627 | 82.6 | 106.9 | $5.752 | Batch | 2.9.0 | Tensor Parallel | 64 | | |
| opt-66b | | PyTorch 1.13.0 | Inf2.48xlarge | 733 | 248.6 | 257.8 | $4.917 | Batch | 2.9.0 | Tensor Parallel | 256 | | |
| roberta-base | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 2379 | 3.26 | 4.43 | $0.089 | Batch | 2.9.0 | Data Parallel | 4 | FP32 | Matmult-BF16 |
| roberta-large | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 886 | 8.86 | 10.61 | $0.238 | Batch | 2.9.0 | Data Parallel | 4 | FP32 | Matmult-BF16 |
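The "Cost per 1M inferences" column follows directly from the instance's hourly on-demand rate and the measured throughput. A minimal sketch of the arithmetic (the hourly rates below are assumptions based on typical us-east-1 on-demand pricing, not values stated in this document; check current AWS pricing):

```python
def cost_per_million_inferences(hourly_rate_usd: float, throughput_per_sec: float) -> float:
    """Cost to serve 1M inferences at a sustained throughput.

    cost = (hourly rate / 3600 s) / (inferences per second) * 1e6
    """
    return hourly_rate_usd / 3600.0 / throughput_per_sec * 1e6

# Assumed on-demand rates (hypothetical values for illustration):
INF2_XLARGE_HOURLY = 0.7582
INF2_48XLARGE_HOURLY = 12.9813

# bert-base-cased batch row: 2599 inferences/sec on Inf2.xlarge
print(round(cost_per_million_inferences(INF2_XLARGE_HOURLY, 2599), 3))    # → 0.081
# opt-13b batch row: 1355 inferences/sec on Inf2.48xlarge
print(round(cost_per_million_inferences(INF2_48XLARGE_HOURLY, 1355), 3))  # → 2.661
```

With these assumed rates the formula reproduces the table's $0.081 and $2.661 figures, which is why higher-throughput configurations show a lower cost per million inferences on the same instance.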

Real Time application type:

| Model | Scripts | Framework | Inst. Type | Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| albert-base-v2 | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 1649.38 | 1.19 | 1.53 | $0.128 | Real Time | 2.9.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| bert-base-cased | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 1730.86 | 1.14 | 1.37 | $0.122 | Real Time | 2.9.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| bert-base-cased-finetuned-mrpc | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 1885.25 | 1.05 | 1.17 | $0.112 | Real Time | 2.9.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| bert-large-cased | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 647.61 | 3.07 | 3.43 | $0.325 | Real Time | 2.9.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| distilbert-base-cased | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 2612.46 | 0.72 | 1.15 | $0.081 | Real Time | 2.9.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| opt-13b | | PyTorch 1.13.0 | Inf2.48xlarge | 36.1 | 28.0 | 28.3 | $99.885 | Real Time | 2.9.0 | Tensor Parallel | 1 | | |
| opt-30b | | PyTorch 1.13.0 | Inf2.48xlarge | 20.7 | 48.4 | 50.1 | $174.195 | Real Time | 2.9.0 | Tensor Parallel | 1 | | |
| opt-66b | | PyTorch 1.13.0 | Inf2.48xlarge | 14.9 | 65.7 | 74.2 | $242.002 | Real Time | 2.9.0 | Tensor Parallel | 1 | | |
| roberta-base | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 1726.45 | 1.14 | 1.42 | $0.122 | Real Time | 2.9.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| roberta-large | Benchmark | PyTorch 1.13.0 | Inf2.xlarge | 628.55 | 3.17 | 3.52 | $0.335 | Real Time | 2.9.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
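The Latency P50 and P99 columns are percentiles over the per-request latency samples collected during a run. One common way to compute them is the nearest-rank method, sketched below for illustration (the actual benchmark scripts may use a different interpolation scheme):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100.0 * len(ordered))  # 1-based rank into the sorted samples
    return ordered[rank - 1]

# Hypothetical per-request latencies in milliseconds
latencies_ms = [1.1, 1.2, 1.0, 1.3, 5.0, 1.2, 1.1, 1.4, 1.2, 1.3]
print(percentile(latencies_ms, 50))  # → 1.2 (median request)
print(percentile(latencies_ms, 99))  # → 5.0 (tail request)
```

This is why P99 is the more meaningful number for real-time applications: a single slow outlier barely moves the median but dominates the tail.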

Note: See the Neuron Glossary for abbreviations and terms.
