This document is relevant for: Inf2

Inf2 Inference Performance

Last update: April 11th, 2024

Encoder Models

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 3039.05 | 5.25 | 5.34 | $0.069 | Batch | 2.18.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2713.61 | 2.92 | 3.41 | $0.078 | Batch | 2.18.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | Benchmark | PyTorch 2.1 | Inf2.xlarge | Sequence Classification | 3269.04 | 9.8 | 10.07 | $0.064 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2845.18 | 5.57 | 7.33 | $0.074 | Batch | 2.18.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 937.59 | 16.95 | 18.93 | $0.225 | Batch | 2.18.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 936.66 | 8.47 | 9.71 | $0.225 | Batch | 2.18.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2871.72 | 10.53 | 14.81 | $0.073 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 4721.11 | 1.67 | 1.95 | $0.045 | Batch | 2.18.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 4714.5 | 1.68 | 2.03 | $0.045 | Batch | 2.18.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 4794.99 | 6.15 | 10.11 | $0.044 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2949.72 | 10.76 | 14.1 | $0.071 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2875.1 | 5.43 | 7.32 | $0.073 | Batch | 2.18.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 992.13 | 7.94 | 8.98 | $0.212 | Batch | 2.18.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | Benchmark | PyTorch 2.1 | Inf2.48xlarge | Raw Output (AutoModel) | 50.39 | 633.24 | 703.87 | $71.558 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2045.85 | 0.97 | 1.1 | $0.103 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 1928.26 | 1.03 | 1.23 | $0.109 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | Benchmark | PyTorch 2.1 | Inf2.xlarge | Sequence Classification | 2116.28 | 0.93 | 1.0 | $0.099 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 1929.1 | 1.03 | 1.19 | $0.109 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 730.18 | 2.72 | 2.82 | $0.288 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 728.39 | 2.75 | 2.84 | $0.289 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 1932.63 | 1.03 | 1.13 | $0.109 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 3200.64 | 0.61 | 0.89 | $0.066 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 3201.45 | 0.61 | 0.89 | $0.066 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 3199.37 | 0.61 | 0.89 | $0.066 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 1962.09 | 1.02 | 1.11 | $0.107 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 1934.61 | 1.03 | 1.16 | $0.109 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 731.29 | 2.72 | 2.91 | $0.288 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | Benchmark | PyTorch 2.1 | Inf2.48xlarge | Raw Output (AutoModel) | 42.01 | 47.62 | 61.71 | $85.832 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
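The Cost per 1M inferences column follows directly from throughput and the instance's On-Demand hourly rate. A minimal sketch of that arithmetic, assuming a us-east-1 On-Demand rate of $0.7582/hour for Inf2.xlarge (rates vary by region; this figure is an assumption, not part of the benchmark data):

```python
def cost_per_million_inferences(hourly_rate_usd: float, throughput_inf_per_sec: float) -> float:
    """Dollars per second divided by inferences per second, scaled to 1M inferences."""
    return hourly_rate_usd / 3600.0 / throughput_inf_per_sec * 1e6

# albert-base-v2, Batch mode: 3039.05 inferences/second on Inf2.xlarge
# ($0.7582/hour is an assumed us-east-1 On-Demand rate)
print(f"${cost_per_million_inferences(0.7582, 3039.05):.3f}")  # -> $0.069, matching the table
```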

Decoder Models

| Model | Scripts | Framework | Inst. Type | Task | Output Token Throughput (tokens/sec) | TTFT Latency P50 (ms) | TTFT Latency P99 (ms) | TPOT Latency P50 (ms) | TPOT Latency P99 (ms) | Application Type | Neuron Version | Run Mode | TP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type | On-Demand hourly rate (USD) | Cost per 1M tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-2-7b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 725.83 | 77.36 | 87.28 | 12.11 | 13.06 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $4.968 |
| Llama-2-7b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 577.97 | 80.12 | 89.69 | 16.39 | 17.81 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 8192 | 128 | 8064 | FP16 | Matmult-BF16 | 12.981 | $6.239 |
| Llama-2-13b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 589.89 | 108.81 | 113.89 | 14.9 | 15.79 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $6.113 |
| Llama-2-13b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 351.76 | 7083.73 | 7158.32 | 20.99 | 21.8 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | 12.981 | $10.251 |
| Llama-2-13b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 178.57 | 5141.32 | 5160.93 | 21.71 | 22.74 | Batch | 2.18.0 | Tensor Parallel | 24 | 4 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | 12.981 | $20.193 |
| Llama-2-70b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 30.06 | 76.6 | 77.12 | 32.9 | 33.42 | Batch | 2.18.0 | Tensor Parallel | 24 | 1 | 256 | 128 | 128 | FP16 | Matmult-BF16 | 12.981 | $119.940 |
| Llama-2-70b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 29.92 | 96.44 | 98.47 | 33.13 | 33.46 | Batch | 2.18.0 | Tensor Parallel | 24 | 1 | 512 | 256 | 256 | FP16 | Matmult-BF16 | 12.981 | $120.499 |
| Llama-2-70b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 30.07 | 76.33 | 86.53 | 33.15 | 34.08 | Batch | 2.18.0 | Tensor Parallel | 24 | 1 | 1152 | 128 | 1024 | FP16 | Matmult-BF16 | 12.981 | $119.914 |
| Llama-2-70b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 29.43 | 277.02 | 280.13 | 33.73 | 34.01 | Batch | 2.18.0 | Tensor Parallel | 24 | 1 | 2048 | 1024 | 1024 | FP16 | Matmult-BF16 | 12.981 | $122.539 |
| Llama-2-70b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 28.91 | 275.97 | 284.77 | 34.82 | 35.44 | Batch | 2.18.0 | Tensor Parallel | 24 | 1 | 3072 | 1024 | 2048 | FP16 | Matmult-BF16 | 12.981 | $124.711 |
| Llama-2-70b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 28.33 | 810.44 | 814.88 | 34.9 | 35.14 | Batch | 2.18.0 | Tensor Parallel | 24 | 1 | 4096 | 2048 | 2048 | FP16 | Matmult-BF16 | 12.981 | $127.292 |
| Mistral-7B-Instruct-v0.2 | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 761.89 | 77.62 | 86.63 | 11.64 | 12.5 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $4.733 |
| Mistral-7B-Instruct-v0.2 | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 450.38 | 4740.12 | 4783.75 | 16.55 | 17.53 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | 12.981 | $8.006 |
| Mistral-7B-Instruct-v0.2 | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 411.05 | 11085.12 | 11125.86 | 18.01 | 19.96 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | 12.981 | $8.772 |
| CodeLlama-13b-hf | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 546.51 | 115.81 | 121.5 | 15.87 | 17.21 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $6.598 |
| CodeLlama-13b-hf | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 333.24 | 7115.98 | 7231.01 | 22.27 | 23.81 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | 12.981 | $10.821 |
| CodeLlama-13b-hf | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 178.79 | 5136.62 | 5192.59 | 21.67 | 22.73 | Batch | 2.18.0 | Tensor Parallel | 24 | 4 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | 12.981 | $20.168 |

| Model | Scripts | Framework | Inst. Type | Task | Output Token Throughput (tokens/sec) | TTFT Latency P50 (ms) | TTFT Latency P99 (ms) | TPOT Latency P50 (ms) | TPOT Latency P99 (ms) | Application Type | Neuron Version | Run Mode | TP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type | On-Demand hourly rate (USD) | Cost per 1M tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-2-7b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 156.13 | 27.64 | 33.77 | 6.47 | 7.08 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $23.095 |
| Llama-2-7b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 145.17 | 29.21 | 33.39 | 7.34 | 7.8 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 8192 | 128 | 8064 | FP16 | Matmult-BF16 | 12.981 | $24.839 |
| Llama-2-13b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 112.52 | 25.85 | 26.9 | 9.17 | 9.33 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $32.046 |
| Llama-2-13b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 97.42 | 333.78 | 340.99 | 10.17 | 10.38 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | 12.981 | $37.015 |
| Llama-2-13b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 73.17 | 994.18 | 999.8 | 13.5 | 13.98 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | 12.981 | $49.282 |
| Llama-2-70b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 30.06 | 76.6 | 77.12 | 32.9 | 33.42 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 256 | 128 | 128 | FP16 | Matmult-BF16 | 12.981 | $119.940 |
| Llama-2-70b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 29.92 | 96.44 | 98.47 | 33.13 | 33.46 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 512 | 256 | 256 | FP16 | Matmult-BF16 | 12.981 | $120.499 |
| Llama-2-70b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 30.07 | 76.33 | 86.53 | 33.15 | 34.08 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 1152 | 128 | 1024 | FP16 | Matmult-BF16 | 12.981 | $119.914 |
| Llama-2-70b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 29.43 | 277.02 | 280.13 | 33.73 | 34.01 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 2048 | 1024 | 1024 | FP16 | Matmult-BF16 | 12.981 | $122.539 |
| Llama-2-70b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 28.91 | 275.97 | 284.77 | 34.82 | 35.44 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 3072 | 1024 | 2048 | FP16 | Matmult-BF16 | 12.981 | $124.711 |
| Llama-2-70b | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 28.33 | 810.44 | 814.88 | 34.9 | 35.14 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 4096 | 2048 | 2048 | FP16 | Matmult-BF16 | 12.981 | $127.292 |
| Mistral-7B-Instruct-v0.2 | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 149.74 | 27.34 | 29.21 | 6.86 | 7.08 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $24.081 |
| Mistral-7B-Instruct-v0.2 | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 81.7 | 557.96 | 562.86 | 7.87 | 11.65 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | 12.981 | $44.133 |
| Mistral-7B-Instruct-v0.2 | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 95.99 | 539.59 | 557.1 | 10.33 | 10.61 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | 12.981 | $37.563 |
| CodeLlama-13b-hf | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 112.71 | 27.02 | 33.25 | 9.12 | 9.38 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $31.994 |
| CodeLlama-13b-hf | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 97.52 | 338.67 | 340.46 | 10.15 | 10.55 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | 12.981 | $36.975 |
| CodeLlama-13b-hf | Sample | Transformers NeuronX | Inf2.48xlarge | Text Generation | 73.68 | 989.5 | 1000.66 | 13.44 | 13.86 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | 12.981 | $48.940 |
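In these tables, TTFT (time to first token) reflects prefill latency and TPOT (time per output token) reflects decode latency. Since the On-Demand hourly rate ($12.981 for Inf2.48xlarge) is given per row, the Cost per 1M tokens column can be reproduced from the output-token throughput; a minimal sketch:

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_sec: float) -> float:
    """Cost of generating 1M output tokens at a sustained token throughput."""
    return hourly_rate_usd / 3600.0 / tokens_per_sec * 1e6

# Llama-2-7b, Batch mode: 725.83 output tokens/sec at $12.981/hour (Inf2.48xlarge)
print(f"${cost_per_million_tokens(12.981, 725.83):.3f}")  # -> $4.968, matching the table

# End-to-end request latency decomposes (approximately) into prefill plus decode:
#   latency ~= TTFT + TPOT * (output_length - 1)
ttft_ms, tpot_ms, output_length = 77.36, 12.11, 3968  # Llama-2-7b P50 values
print(f"{ttft_ms + tpot_ms * (output_length - 1):.0f} ms")  # ~48118 ms per request
```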

Encoder-Decoder Models

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M tokens | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | Tutorial | NeuronX Distributed | Inf2.24xlarge | Text Generation | 117.61 | 8.51 | 8.53 | $15.328 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | Tutorial | NeuronX Distributed | Inf2.24xlarge | Text Generation | 111.92 | 8.97 | 8.98 | $16.108 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |

Note (Encoder-Decoder models only): Throughput (tokens/second) and Latency per Token both count input and output tokens.
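Because both metrics count input and output tokens, Latency per Token is effectively the reciprocal of the throughput. A quick check against the google/flan-t5-xl row (128 input + 84 output tokens):

```python
total_tokens = 128 + 84     # input length + output length for the flan-t5-xl row
throughput = 117.61         # tokens/second (Batch table)

latency_per_token_ms = 1000.0 / throughput
print(f"{latency_per_token_ms:.2f} ms/token")   # -> 8.50 ms, vs. the table's 8.51 ms P50
print(f"{total_tokens * latency_per_token_ms:.0f} ms end-to-end")  # ~1803 ms per sequence
```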

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M tokens | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | Tutorial | NeuronX Distributed | Inf2.24xlarge | Text Generation | 117.6 | 8.5 | 8.53 | $15.330 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | Tutorial | NeuronX Distributed | Inf2.24xlarge | Text Generation | 108.18 | 9.25 | 9.26 | $16.665 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |

Note: Throughput (tokens/second) and Latency per Token both count input and output tokens.

Vision Transformers Models

| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| deepmind/multimodal-perceiver | 16x224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Multimodal Autoencoding | 0.83 | 1250.0 | 1271.0 | $253.681 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | None |
| deepmind/vision-perceiver-conv | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 126.5 | 14.14 | 14.2 | $1.664 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-fourier | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 67.9 | 29.5 | 29.68 | $3.101 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-learned | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 99.6 | 18.6 | 18.7 | $2.114 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| google/vit-base-patch16-224 | 224x224 | Benchmark | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 1773.97 | 4.5 | 4.69 | $0.119 | Batch | 2.18.0 | Data Parallel | 2 | FP32 | Matmult-BF16 |
| google/vit-base-patch16-224 | 224x224 | Benchmark | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 728.93 | 1.36 | 1.4 | $0.289 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| openai/clip-vit-base-patch32 | 224x224 | Benchmark | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 6099.53 | 46.31 | 66.27 | $0.035 | Batch | 2.18.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| openai/clip-vit-base-patch32 | 224x224 | Benchmark | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 158.38 | 6.31 | 6.34 | $1.329 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| openai/clip-vit-large-patch14 | 224x224 | Benchmark | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 304.07 | 105.9 | 110.58 | $0.692 | Batch | 2.18.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| openai/clip-vit-large-patch14 | 224x224 | Benchmark | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 73.23 | 13.65 | 13.71 | $2.875 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |

Convolutional Neural Networks (CNN) Models

| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UNet | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Segmentation | 996.91 | 16.03 | 16.12 | $0.211 | Batch | 2.18.0 | Data Parallel | 4 | FP32 | Matmult-BF16 |
| resnet101 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 3260.01 | 78.46 | 78.57 | $0.065 | Batch | 2.18.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| resnet152 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 2397.56 | 106.7 | 106.83 | $0.088 | Batch | 2.18.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| resnet18 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 7068.94 | 4.51 | 4.56 | $0.030 | Batch | 2.18.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| resnet34 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 5116.51 | 6.24 | 6.31 | $0.041 | Batch | 2.18.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| resnet50 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 4476.77 | 7.13 | 7.2 | $0.047 | Batch | 2.18.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| vgg11 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 4671.31 | 54.75 | 54.85 | $0.045 | Batch | 2.18.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| vgg16 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 2144.44 | 14.91 | 14.98 | $0.098 | Batch | 2.18.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |

| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UNet | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Segmentation | 439.64 | 2.26 | 2.31 | $0.479 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet101 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 1027.8 | 0.96 | 1.0 | $0.205 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet152 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 861.61 | 1.15 | 1.2 | $0.244 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet18 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 1743.72 | 0.55 | 0.62 | $0.121 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet34 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 1368.5 | 0.73 | 0.74 | $0.154 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet50 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 1264.62 | 0.78 | 0.81 | $0.166 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| vgg11 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 544.75 | 1.82 | 1.88 | $0.387 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| vgg16 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 450.83 | 2.21 | 2.26 | $0.467 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |

Stable Diffusion Models

| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion 1.5 | 512x512 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.48 | 2089.0 | 2093.0 | $439.573 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion 2 Inpainting | 936x624 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.16 | 6045.0 | 6063.4 | $1,276.094 | Real Time | 2.18.0 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| Stable Diffusion 2.1 | 512x512 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.6 | 1655.0 | 1663.0 | $348.602 | Real Time | 2.18.0 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| Stable Diffusion 2.1 | 768x768 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.18 | 5504.0 | 5519.0 | $1,156.899 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion XL Base | 1024x1024 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.08 | 12200.0 | 12260.0 | $2,567.751 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion XL Base & Refiner | 1024x1024 | Benchmark | PyTorch 2.1 | Inf2.8xlarge | Image Generation | 0.09 | 10741.0 | 11006.0 | $5,884.110 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |

Note: Cost per 1M images is calculated using the On-Demand hourly rate. Real Time refers to batch size 1 inference for minimal latency; Batch refers to maximum throughput at minimum cost per inference.
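The P50/P99 columns throughout this page are latency percentiles over repeated timed runs at the listed batch size (1 for Real Time rows). A minimal sketch of how such percentiles could be collected, where `model` and `example_input` stand for a hypothetical compiled Neuron model and a matching input tensor:

```python
import time

import numpy as np

def latency_percentiles(model, example_input, iters: int = 100):
    """Time repeated single-request inferences and report P50/P99 latency in ms."""
    for _ in range(10):          # warm up so one-time costs don't skew the stats
        model(example_input)
    latencies_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        model(example_input)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return np.percentile(latencies_ms, 50), np.percentile(latencies_ms, 99)
```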


Note: See Neuron Glossary for abbreviations and terms.

This document is relevant for: Inf2