This document is relevant for: Trn1

Trn1/Trn1n Inference Performance

Last update: September 16th, 2024

Encoder Models

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | Benchmark | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3113.66 | 10.28 | 10.33 | $0.046 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | Benchmark | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3124.93 | 10.23 | 10.3 | $0.046 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | Benchmark | PyTorch 2.5 | trn1.2xlarge | Sequence Classification | 3499.81 | 9.14 | 9.16 | $0.041 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | Benchmark | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3128.66 | 10.23 | 10.28 | $0.045 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | Benchmark | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 1062.71 | 15.04 | 15.12 | $0.134 | Batch | 2.21.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | Benchmark | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 1061.39 | 7.54 | 7.59 | $0.134 | Batch | 2.21.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | Benchmark | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3101.04 | 10.3 | 10.36 | $0.046 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | Benchmark | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 5704.33 | 5.6 | 5.67 | $0.025 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | Benchmark | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 5705.78 | 5.6 | 5.68 | $0.025 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | Benchmark | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 5710.31 | 5.6 | 5.65 | $0.025 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | Benchmark | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3133.47 | 10.22 | 10.27 | $0.045 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | Benchmark | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3102.21 | 10.29 | 10.44 | $0.046 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | Benchmark | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 1059.13 | 15.09 | 15.16 | $0.134 | Batch | 2.21.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | Benchmark | PyTorch 2.5 | trn1.32xlarge | Raw Output (AutoModel) | 46.56 | 343.58 | 347.47 | $48.901 | Batch | 2.21.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
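The throughput, latency, and batch size columns of a row can be cross-checked against each other. The sketch below applies Little's law (items in flight = throughput × latency) to a few encoder rows and divides by the batch size to estimate how many batches were being processed concurrently. It assumes that P50 latency is a reasonable stand-in for mean latency; the function name `batches_in_flight` is illustrative and not part of any Neuron API.

```python
def batches_in_flight(throughput_per_sec: float, latency_ms: float, batch_size: int) -> float:
    """Estimate concurrent batches via Little's law.

    Assumption: latency_ms (P50 from the table) approximates mean latency.
    """
    inferences_in_flight = throughput_per_sec * (latency_ms / 1000.0)
    return inferences_in_flight / batch_size


# (throughput, P50 latency in ms, batch size) copied from the encoder table.
rows = {
    "albert-base-v2": (3113.66, 10.28, 16),
    "bert-large-cased": (1062.71, 15.04, 8),
    "bert-large-uncased": (1061.39, 7.54, 4),
}

implied = {name: batches_in_flight(*values) for name, values in rows.items()}
for name, value in implied.items():
    print(f"{name}: {value:.2f} batches in flight")
```

For these three rows the estimate comes out close to 2, which suggests two batches were served concurrently in Data Parallel mode. A row whose estimate deviates strongly from a whole number may be worth re-checking.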

Encoder-Decoder Models

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | Tutorial | NeuronX Distributed | trn1.32xlarge | Text Generation | 122.52 | 8.16 | 8.19 | $18.584 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | Tutorial | NeuronX Distributed | trn1.32xlarge | Text Generation | 116.29 | 8.58 | 8.66 | $19.580 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |

Note

Only for Encoder-Decoder

Throughput (tokens/second) counts both input and output tokens

Latency per Token counts both input and output tokens

Applicable to all models

Cost per 1M inferences is calculated using RI-Effective hourly rate.

Real Time application refers to batch size 1 inference for minimal latency. Batch application refers to maximum throughput with minimum cost-per-inference.
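As a rough illustration of the cost column, the sketch below assumes that cost per 1M inferences equals the hourly rate divided by the number of inferences completed in an hour, scaled to one million. This page does not state the formula or the RI-Effective rate itself, so both the formula and the helper names (`cost_per_million`, `implied_hourly_rate`) are assumptions. The code backs an implied hourly rate out of one table row and reuses it for another row on the same instance type.

```python
SECONDS_PER_HOUR = 3600


def cost_per_million(throughput_per_sec: float, hourly_rate_usd: float) -> float:
    """USD for one million inferences at a steady throughput (assumed formula)."""
    inferences_per_hour = throughput_per_sec * SECONDS_PER_HOUR
    return hourly_rate_usd / inferences_per_hour * 1_000_000


def implied_hourly_rate(throughput_per_sec: float, cost_per_1m_usd: float) -> float:
    """Invert the assumed formula to recover the hourly rate a row implies."""
    return cost_per_1m_usd * throughput_per_sec * SECONDS_PER_HOUR / 1_000_000


# Throughput and cost copied from the albert-base-v2 row (trn1.2xlarge).
rate = implied_hourly_rate(3113.66, 0.046)

# Reuse the implied rate for the bert-large-cased row (same instance type).
estimate = cost_per_million(1062.71, rate)

print(f"implied rate: ${rate:.3f}/h")
print(f"bert-large-cased estimate: ${estimate:.3f} per 1M inferences")
```

Because the published costs are rounded to three decimals, the estimate can differ from the table value in the last digit. The implied rate is a back-calculation and should not be read as an official price.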

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | Tutorial | NeuronX Distributed | trn1.32xlarge | Text Generation | 120.29 | 8.31 | 8.34 | $18.929 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | Tutorial | NeuronX Distributed | trn1.32xlarge | Text Generation | 110.23 | 9.07 | 9.12 | $20.656 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |

Note

Only for Encoder-Decoder

Throughput (tokens/second) counts both input and output tokens

Latency per Token counts both input and output tokens
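To make the per-token accounting concrete, the sketch below relates the two columns for a batch size 1 run. It assumes that throughput is the reciprocal of per-token latency when only one request is in flight, and that the end-to-end time of a request is the per-token latency multiplied by the input plus output token count. Neither relation is stated on this page, and the function names are illustrative.

```python
def tokens_per_second(latency_per_token_ms: float, batch_size: int = 1) -> float:
    """Throughput implied by per-token latency (assumes one batch in flight)."""
    return batch_size * 1000.0 / latency_per_token_ms


def request_latency_ms(latency_per_token_ms: float, input_length: int, output_length: int) -> float:
    """End-to-end time for one request, counting input and output tokens."""
    total_tokens = input_length + output_length
    return latency_per_token_ms * total_tokens


# google/flan-t5-xl, Real Time row: P50 latency per token 8.31 ms,
# input length 128, output length 84, batch size 1.
flan_throughput = tokens_per_second(8.31)
flan_request_ms = request_latency_ms(8.31, 128, 84)

print(f"implied throughput: {flan_throughput:.2f} tokens/second")
print(f"implied request latency: {flan_request_ms:.1f} ms")
```

The implied throughput lands within a fraction of a token per second of the published 120.29 tokens/second, which is consistent with the note that both metrics count input and output tokens.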

Note

Cost per 1M inferences is calculated using RI-Effective hourly rate.

Real Time application refers to batch size 1 inference for minimal latency. Batch application refers to maximum throughput with minimum cost-per-inference.
