This document is relevant for: Inf2

Inf2 Inference Performance#

Last update: September 16th, 2024

Encoder Models#

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| albert-base-v2 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2999.88 | 5.32 | 5.49 | $0.030 | Batch | 2.20.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2697.72 | 2.95 | 3.11 | $0.034 | Batch | 2.20.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | Benchmark | PyTorch 2.1 | Inf2.xlarge | Sequence Classification | 2907.45 | 11.03 | 11.32 | $0.031 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2756.19 | 5.75 | 6.35 | $0.033 | Batch | 2.20.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 906.91 | 16.95 | 18.93 | $0.100 | Batch | 2.20.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 919.08 | 8.73 | 9.12 | $0.099 | Batch | 2.18.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2904.22 | 10.98 | 11.51 | $0.031 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 4756.25 | 1.67 | 1.81 | $0.019 | Batch | 2.20.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 4741.08 | 1.68 | 1.81 | $0.019 | Batch | 2.20.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 5043.38 | 6.3 | 6.88 | $0.018 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2756.87 | 11.54 | 12.19 | $0.033 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2743.63 | 5.84 | 5.95 | $0.033 | Batch | 2.20.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 969.49 | 8.23 | 8.68 | $0.094 | Batch | 2.20.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | Benchmark | PyTorch 2.1 | Inf2.48xlarge | Raw Output (AutoModel) | 51.29 | 628.73 | 695.07 | $30.369 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| albert-base-v2 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2034.71 | 0.98 | 1.05 | $0.045 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 41.24 | 1.02 | 1.09 | $2.209 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | Benchmark | PyTorch 2.1 | Inf2.xlarge | Sequence Classification | 2117.52 | 0.93 | 1.0 | $0.043 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 1974.52 | 1.01 | 1.09 | $0.046 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 718.66 | 2.78 | 2.86 | $0.127 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 728.39 | 2.78 | 2.86 | $0.125 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 1931.2 | 1.03 | 1.09 | $0.047 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 3255.95 | 0.61 | 0.67 | $0.028 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 3255.97 | 0.61 | 0.68 | $0.028 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 3248.31 | 0.61 | 0.68 | $0.028 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2019.69 | 0.98 | 1.08 | $0.045 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 1926.26 | 1.03 | 1.1 | $0.047 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | Benchmark | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 721.21 | 2.76 | 2.84 | $0.126 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | Benchmark | PyTorch 2.1 | Inf2.48xlarge | Raw Output (AutoModel) | 41.24 | 48.27 | 58.71 | $37.773 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
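The throughput and latency percentiles in the tables above come from repeated timed inference calls. As an illustration only (the benchmark scripts linked above do the authoritative bookkeeping), a minimal sketch of how such request timings reduce to P50/P99 latency and throughput might look like this; the sequential sum of request latencies stands in for the benchmark's wall-clock time:

```python
import statistics

def summarize(latencies_ms, batch_size):
    """Condense per-request latencies (ms) into P50, P99, and throughput."""
    # Empirical percentiles over the observed request latencies.
    quantiles = statistics.quantiles(latencies_ms, n=100)
    p50, p99 = quantiles[49], quantiles[98]
    # Throughput counts every inference in every batch over the total time.
    total_seconds = sum(latencies_ms) / 1000.0
    throughput = batch_size * len(latencies_ms) / total_seconds
    return p50, p99, throughput

# With a uniform 5 ms request latency at batch size 8:
p50, p99, throughput = summarize([5.0] * 100, batch_size=8)
# p50 = p99 = 5.0 ms; throughput = 8 * 100 / 0.5 s = 1600 inferences/second
```

This also shows why the Batch rows report higher latencies than the Real Time rows: a larger batch size raises per-request latency while multiplying the inferences completed per second.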

Encoder-Decoder Models#

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| google/flan-t5-xl | Tutorial | NeuronX Distributed | Inf2.24xlarge | Text Generation | 117.61 | 8.51 | 8.53 | $6.623 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | Tutorial | NeuronX Distributed | Inf2.24xlarge | Text Generation | 111.92 | 8.97 | 8.98 | $6.959 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |

Note

For encoder-decoder models, both Throughput (tokens/second) and Latency per Token count input as well as output tokens.
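Since the rows above run with batch size 1 and DP degree 1, latency per token is effectively the reciprocal of token throughput. A quick sanity check against the google/flan-t5-xl row:

```python
# Single-stream inference: latency per token ~ 1 / token throughput.
throughput_tokens_per_s = 117.61          # google/flan-t5-xl, from the table
latency_per_token_ms = 1000.0 / throughput_tokens_per_s
# ~8.50 ms, consistent with the reported Latency per Token P50 of 8.51 ms
```

The tiny gap between 8.50 and 8.51 ms reflects measurement overhead outside pure token generation.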

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| google/flan-t5-xl | Tutorial | NeuronX Distributed | Inf2.24xlarge | Text Generation | 117.6 | 8.5 | 8.53 | $6.623 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | Tutorial | NeuronX Distributed | Inf2.24xlarge | Text Generation | 108.18 | 9.25 | 9.26 | $7.200 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |

Note

Both Throughput (tokens/second) and Latency per Token count input as well as output tokens.

Vision Transformers Models#

| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepmind/multimodal-perceiver | 16x224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Multimodal Autoencoding | 0.83 | 1250.0 | 1271.0 | $109.772 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | |
| deepmind/vision-perceiver-conv | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 126.5 | 14.14 | 14.2 | $0.720 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-fourier | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 67.9 | 29.5 | 29.68 | $1.342 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-learned | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 99.6 | 18.6 | 18.7 | $0.915 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| google/vit-base-patch16-224 | 224x224 | Benchmark | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 1773.97 | 4.5 | 4.69 | $0.051 | Batch | 2.18.0 | Data Parallel | 2 | FP32 | Matmult-BF16 |
| openai/clip-vit-base-patch32 | 224x224 | Benchmark | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 6099.53 | 46.31 | 66.27 | $0.015 | Batch | 2.18.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| openai/clip-vit-large-patch14 | 224x224 | Benchmark | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 304.07 | 105.9 | 110.58 | $0.300 | Batch | 2.18.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |

| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepmind/multimodal-perceiver | 16x224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Multimodal Autoencoding | 0.83 | 1250.0 | 1271.0 | $109.772 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | |
| deepmind/vision-perceiver-conv | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 126.5 | 14.14 | 14.2 | $0.720 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-fourier | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 67.9 | 29.5 | 29.68 | $1.342 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-learned | 224x224 | Benchmark | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 99.6 | 18.6 | 18.7 | $0.915 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| google/vit-base-patch16-224 | 224x224 | Benchmark | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 728.93 | 1.36 | 1.4 | $0.125 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| openai/clip-vit-base-patch32 | 224x224 | Benchmark | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 158.38 | 6.31 | 6.34 | $0.575 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| openai/clip-vit-large-patch14 | 224x224 | Benchmark | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 73.23 | 13.65 | 13.71 | $1.244 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |

Convolutional Neural Networks (CNN) Models#

| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UNet | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Segmentation | 999.94 | 15.99 | 16.08 | $0.091 | Batch | 2.20.0 | Data Parallel | 4 | FP32 | Matmult-BF16 |
| resnet101 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 3178.53 | 80.43 | 80.56 | $0.029 | Batch | 2.20.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| resnet152 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 2430.45 | 105.24 | 105.36 | $0.037 | Batch | 2.20.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| resnet18 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 6944.75 | 4.59 | 4.66 | $0.013 | Batch | 2.20.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| resnet34 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 5116.51 | 6.24 | 6.31 | $0.018 | Batch | 2.20.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| resnet50 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 4420.47 | 7.23 | 7.31 | $0.021 | Batch | 2.20.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| vgg11 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 4687.13 | 54.55 | 54.65 | $0.019 | Batch | 2.20.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| vgg16 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 2095.37 | 15.25 | 15.32 | $0.043 | Batch | 2.20.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |

| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UNet | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Segmentation | 443.54 | 2.26 | 2.28 | $0.205 | Real Time | 2.20.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet101 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 984.12 | 1.01 | 1.03 | $0.093 | Real Time | 2.20.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet152 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 828.62 | 1.2 | 1.23 | $0.110 | Real Time | 2.20.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet18 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 1637.7 | 0.6 | 0.63 | $0.056 | Real Time | 2.20.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet34 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 1368.5 | 0.73 | 0.74 | $0.067 | Real Time | 2.20.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet50 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 1246.26 | 0.79 | 0.85 | $0.073 | Real Time | 2.20.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| vgg11 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 630.69 | 1.61 | 1.63 | $0.144 | Real Time | 2.20.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| vgg16 | 224x224 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Classification | 441.12 | 2.26 | 2.3 | $0.207 | Real Time | 2.20.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |

Stable Diffusion Models#

| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Stable Diffusion 1.5 | 512x512 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.48 | 2089.0 | 2093.0 | $190.211 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion 2 Inpainting | 936x624 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.16 | 6045.0 | 6063.4 | $552.189 | Real Time | 2.18.0 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| Stable Diffusion 2.1 | 512x512 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.6 | 1655.0 | 1663.0 | $150.846 | Real Time | 2.18.0 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| Stable Diffusion 2.1 | 768x768 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.18 | 5504.0 | 5519.0 | $500.611 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion XL Base | 1024x1024 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.08 | 12200.0 | 12260.0 | $1,111.111 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion XL Base & Refiner | 1024x1024 | Benchmark | PyTorch 2.1 | Inf2.8xlarge | Image Generation | 0.09 | 10741.0 | 11006.0 | $2,538.829 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |

Note

Cost per 1M images is calculated using the RI-Effective hourly rate. Real Time denotes batch size 1 inference for minimal latency; Batch denotes maximum throughput at minimum cost per inference.
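The cost columns follow directly from throughput and the instance's hourly rate: the time to serve one million requests, converted to hours, multiplied by the rate. A sketch of that arithmetic (the `0.33` rate below is a placeholder, not a published price):

```python
def cost_per_million(hourly_rate_usd, throughput_per_second):
    # Hours needed to serve 1M inferences, times the hourly instance rate.
    hours = 1_000_000 / throughput_per_second / 3600
    return hourly_rate_usd * hours

# With an assumed ~$0.33/hour RI-Effective rate for Inf2.xlarge, the
# Stable Diffusion 1.5 row (0.48 images/sec) lands near the table's $190.
estimate = cost_per_million(hourly_rate_usd=0.33, throughput_per_second=0.48)
```

Consult current EC2 Inf2 pricing for the actual RI-Effective rates; only the formula, not the rate, is implied by the tables.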


Diffusion Transformer Models#

| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PixArt Alpha | 256x256 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Generation | 1.98 | 502.59 | 537.26 | $46.132 | Real Time | 2.2 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| PixArt Alpha | 512x512 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.56 | 1769.76 | 1775.7 | $161.259 | Real Time | 2.2 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| PixArt Sigma | 256x256 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Generation | 1.86 | 540.83 | 548.41 | $48.984 | Real Time | 2.2 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| PixArt Sigma | 512x512 | Benchmark | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.54 | 1841.88 | 1850.68 | $167.792 | Real Time | 2.2 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |

Note

Cost per 1M images is calculated using the RI-Effective hourly rate. Real Time denotes batch size 1 inference for minimal latency; Batch denotes maximum throughput at minimum cost per inference.


Note

See the Neuron Glossary for abbreviations and terms.

This document is relevant for: Inf2