This document is relevant for: Trn1, Trn1n

Trn1/Trn1n Inference Performance

Last update: April 30th, 2024

Encoder Models

Batch (maximum throughput):

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3098.91 | 10.32 | 10.39 | $0.120 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3176.04 | 10.07 | 10.13 | $0.117 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | Benchmark | PyTorch 2.1 | trn1.2xlarge | Sequence Classification | 3584.4 | 8.92 | 8.96 | $0.104 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3165.0 | 10.09 | 10.21 | $0.118 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 1058.66 | 15.09 | 15.21 | $0.352 | Batch | 2.18.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 1056.08 | 7.57 | 7.65 | $0.352 | Batch | 2.18.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3072.36 | 10.41 | 10.57 | $0.121 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 5554.69 | 5.75 | 5.88 | $0.067 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 5585.58 | 5.72 | 5.8 | $0.067 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 5557.51 | 5.75 | 5.87 | $0.067 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3167.84 | 10.09 | 10.19 | $0.118 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3082.21 | 10.34 | 10.69 | $0.121 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 1053.24 | 15.19 | 15.26 | $0.353 | Batch | 2.18.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | Benchmark | PyTorch 2.1 | trn1.32xlarge | Raw Output (AutoModel) | 40.74 | 392.23 | 396.12 | $146.608 | Batch | 2.18.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |

Real Time (batch size 1, minimal latency):

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 2122.39 | 0.94 | 0.98 | $0.175 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 1999.59 | 0.99 | 1.03 | $0.186 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | Benchmark | PyTorch 2.1 | trn1.2xlarge | Sequence Classification | 2154.24 | 0.93 | 0.95 | $0.173 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 1999.54 | 0.99 | 1.03 | $0.186 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 739.62 | 2.69 | 2.75 | $0.503 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 740.18 | 2.69 | 2.74 | $0.503 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 2000.81 | 0.99 | 1.03 | $0.186 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3169.09 | 0.63 | 0.66 | $0.117 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3164.73 | 0.64 | 0.66 | $0.118 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3165.42 | 0.63 | 0.66 | $0.118 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 2018.08 | 0.99 | 1.02 | $0.184 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 2001.64 | 0.99 | 1.03 | $0.186 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | Benchmark | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 734.46 | 2.73 | 2.77 | $0.507 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | Benchmark | PyTorch 2.1 | trn1.32xlarge | Raw Output (AutoModel) | 40.11 | 49.9 | 51.54 | $148.911 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
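One way to sanity-check the encoder Batch rows: under Data Parallel, each NeuronCore serves its own copy of the model, so throughput is roughly (number of cores × batch size) / latency. A minimal sketch, assuming the two NeuronCores of a trn1.2xlarge (the core count does not appear in the tables and is the only value here not taken from a table row):

```python
def expected_throughput(num_cores: int, batch_size: int, latency_s: float) -> float:
    """Approximate inferences/sec when each core runs a full batch concurrently."""
    return num_cores * batch_size / latency_s

# albert-base-v2 Batch row: batch 16 at 10.32 ms P50 on 2 NeuronCores
est = expected_throughput(num_cores=2, batch_size=16, latency_s=0.01032)
print(round(est))  # ~3100, close to the reported 3098.91 inference/sec
```

The small gap between the estimate and the reported number reflects measurement overhead and the use of P50 rather than mean latency.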

Decoder Models

Batch (maximum throughput):

| Model | Scripts | Framework | Inst. Type | Task | Output Token Throughput (tokens/sec) | TTFT Latency P50 (ms) | TTFT Latency P99 (ms) | TPOT Latency P50 (ms) | TPOT Latency P99 (ms) | Application Type | Neuron Version | Run Mode | TP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type | Weight Storage Data Type | On-Demand hourly rate ($/hr) | Cost per 1M tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3-8B | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 933.5 | 55.16 | 61.47 | 9.95 | 10.1 | Batch | 2.18.1 | Tensor Parallel | 32 | 8 | 8192 | 128 | 8064 | FP16 | Matmult-BF16 | int8 | 21.5 | $6.398 |
| Llama-3-8B | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 770.16 | 1265.95 | 1292.94 | 10.04 | 10.33 | Batch | 2.18.1 | Tensor Parallel | 32 | 8 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | int8 | 21.5 | $7.754 |
| Llama-3-8B | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 1142.7 | 49.05 | 52.79 | 7.65 | 7.94 | Batch | 2.18.1 | Tensor Parallel | 32 | 8 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | int8 | 21.5 | $5.226 |
| Llama-3-70B | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 120.36 | 1661.12 | 1672.71 | 32.33 | 33.27 | Batch | 2.18.1 | Tensor Parallel | 32 | 4 | 4096 | 2048 | 2048 | FP16 | Matmult-BF16 | bf16 | 21.5 | $49.619 |
| Llama-3-70B | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 140.51 | 129.86 | 132.03 | 28.38 | 29.11 | Batch | 2.18.1 | Tensor Parallel | 32 | 4 | 1152 | 128 | 1024 | FP16 | Matmult-BF16 | bf16 | 21.5 | $42.504 |
| Llama-3-70B | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 138.01 | 130.37 | 130.48 | 28.08 | 28.53 | Batch | 2.18.1 | Tensor Parallel | 32 | 4 | 256 | 128 | 128 | FP16 | Matmult-BF16 | bf16 | 21.5 | $43.273 |
| Llama-2-7b | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 917.25 | 66.4 | 70.64 | 10.1 | 10.46 | Batch | 2.18.0 | Tensor Parallel | 32 | 8 | 8192 | 128 | 8064 | FP16 | Matmult-BF16 | bf16 | 21.5 | $6.511 |
| Llama-2-13b | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 371.7 | 6668.7 | 6689.8 | 19.86 | 21.06 | Batch | 2.18.0 | Tensor Parallel | 32 | 8 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | bf16 | 21.5 | $16.067 |
| Llama-2-13b | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 184.28 | 4628.45 | 4635.25 | 21.09 | 22.39 | Batch | 2.18.0 | Tensor Parallel | 32 | 4 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | bf16 | 21.5 | $32.408 |
| Llama-2-70b | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 141.45 | 156.85 | 158.41 | 26.72 | 30.17 | Batch | 2.18.0 | Tensor Parallel | 32 | 4 | 256 | 128 | 128 | FP16 | Matmult-BF16 | bf16 | 21.5 | $42.220 |
| Llama-2-70b | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 143.43 | 270.16 | 270.56 | 26.91 | 27.91 | Batch | 2.18.0 | Tensor Parallel | 32 | 4 | 512 | 256 | 256 | FP16 | Matmult-BF16 | bf16 | 21.5 | $41.640 |
| Llama-2-70b | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 145.13 | 156.69 | 161.41 | 27.21 | 30.6 | Batch | 2.18.0 | Tensor Parallel | 32 | 4 | 1152 | 128 | 1024 | FP16 | Matmult-BF16 | bf16 | 21.5 | $41.151 |
| Llama-2-70b | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 133.25 | 1478.64 | 1479.78 | 28.55 | 29.5 | Batch | 2.18.0 | Tensor Parallel | 32 | 4 | 2048 | 1024 | 1024 | FP16 | Matmult-BF16 | bf16 | 21.5 | $44.819 |
| Llama-2-70b | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 129.28 | 1478.85 | 1482.93 | 31.67 | 32.02 | Batch | 2.18.0 | Tensor Parallel | 32 | 4 | 3072 | 1024 | 2048 | FP16 | Matmult-BF16 | bf16 | 21.5 | $46.197 |
| Llama-2-70b | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 120.63 | 2722.03 | 2730.95 | 31.79 | 33.23 | Batch | 2.18.0 | Tensor Parallel | 32 | 4 | 4096 | 2048 | 2048 | FP16 | Matmult-BF16 | bf16 | 21.5 | $49.509 |
| Mistral-7B-Instruct-v0.2 | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 484.58 | 8614.85 | 8630.24 | 15.44 | 15.94 | Batch | 2.18.0 | Tensor Parallel | 32 | 8 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | bf16 | 21.5 | $12.325 |
| CodeLlama-13b-hf | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 370.98 | 6625.16 | 6628.26 | 19.92 | 20.95 | Batch | 2.18.0 | Tensor Parallel | 32 | 8 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | bf16 | 21.5 | $16.099 |
| CodeLlama-13b-hf | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 184.18 | 4626.17 | 4630.67 | 21.1 | 22.17 | Batch | 2.18.0 | Tensor Parallel | 32 | 4 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | bf16 | 21.5 | $32.426 |
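The Cost per 1M tokens column follows directly from the On-Demand hourly rate and the output token throughput, both of which appear in each decoder row. A minimal sketch of that calculation, checked against the first Llama-3-8B Batch row:

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_sec: float) -> float:
    """USD to generate 1M output tokens at a sustained throughput."""
    seconds_per_million = 1_000_000 / tokens_per_sec
    return hourly_rate_usd * seconds_per_million / 3600  # convert hours -> seconds

# Llama-3-8B Batch row: $21.5/hr at 933.5 tokens/sec
print(round(cost_per_million_tokens(21.5, 933.5), 3))  # 6.398, matching the table
```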

Real Time (batch size 1, minimal latency):

| Model | Scripts | Framework | Inst. Type | Task | Output Token Throughput (tokens/sec) | TTFT Latency P50 (ms) | TTFT Latency P99 (ms) | TPOT Latency P50 (ms) | TPOT Latency P99 (ms) | Application Type | Neuron Version | Run Mode | TP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type | Weight Storage Data Type | On-Demand hourly rate ($/hr) | Cost per 1M tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3-8B | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 157.25 | 17.09 | 21.62 | 7.03 | 7.16 | Real Time | 2.18.1 | Tensor Parallel | 32 | 1 | 8192 | 128 | 8064 | FP16 | Matmult-BF16 | int8 | 21.5 | $37.979 |
| Llama-3-8B | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 140.5 | 153.02 | 159.13 | 7.04 | 7.13 | Real Time | 2.18.1 | Tensor Parallel | 32 | 1 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | int8 | 21.5 | $42.507 |
| Llama-3-8B | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 178.19 | 14.75 | 22.94 | 5.86 | 6.0 | Real Time | 2.18.1 | Tensor Parallel | 32 | 1 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | int8 | 21.5 | $33.516 |
| Llama-3-70B | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 37.7 | 547.0 | 553.89 | 26.2 | 26.79 | Real Time | 2.18.1 | Tensor Parallel | 32 | 1 | 4096 | 2048 | 2048 | FP16 | Matmult-BF16 | bf16 | 21.5 | $158.398 |
| Llama-3-70B | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 40.64 | 53.2 | 59.5 | 24.48 | 26.17 | Real Time | 2.18.1 | Tensor Parallel | 32 | 1 | 1152 | 128 | 1024 | FP16 | Matmult-BF16 | bf16 | 21.5 | $146.961 |
| Llama-3-70B | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 40.81 | 52.53 | 52.79 | 26.48 | 24.22 | Real Time | 2.18.1 | Tensor Parallel | 32 | 1 | 256 | 128 | 128 | FP16 | Matmult-BF16 | bf16 | 21.5 | $146.342 |
| Llama-2-7b | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 161.71 | 13.32 | 14.12 | 6.7 | 6.85 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 8192 | 128 | 8064 | FP16 | Matmult-BF16 | bf16 | 21.5 | $36.932 |
| Llama-2-13b | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 60.43 | 864.14 | 865.91 | 9.84 | 10.15 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | bf16 | 21.5 | $98.823 |
| Llama-2-13b | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 31.4 | 2367.93 | 2369.14 | 13.41 | 15.77 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | bf16 | 21.5 | $190.204 |
| Llama-2-70b | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 39.29 | 53.91 | 54.95 | 25.18 | 26.58 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 256 | 128 | 128 | FP16 | Matmult-BF16 | bf16 | 21.5 | $152.020 |
| Llama-2-70b | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 39.18 | 81.88 | 98.78 | 25.27 | 25.76 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 512 | 256 | 256 | FP16 | Matmult-BF16 | bf16 | 21.5 | $152.443 |
| Llama-2-70b | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 39.16 | 57.75 | 64.76 | 25.45 | 26.13 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 1152 | 128 | 1024 | FP16 | Matmult-BF16 | bf16 | 21.5 | $152.493 |
| Llama-2-70b | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 38.1 | 232.48 | 239.03 | 26.04 | 26.18 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 2048 | 1024 | 1024 | FP16 | Matmult-BF16 | bf16 | 21.5 | $156.771 |
| Llama-2-70b | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 37.71 | 236.78 | 241.15 | 26.62 | 27.03 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 3072 | 1024 | 2048 | FP16 | Matmult-BF16 | bf16 | 21.5 | $158.375 |
| Llama-2-70b | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 36.78 | 690.96 | 695.92 | 26.85 | 27.04 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 4096 | 2048 | 2048 | FP16 | Matmult-BF16 | bf16 | 21.5 | $162.376 |
| Mistral-7B-Instruct-v0.2 | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 49.56 | 1322.87 | 1325.86 | 9.89 | 10.18 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | bf16 | 21.5 | $120.508 |
| CodeLlama-13b-hf | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 60.22 | 868.64 | 870.98 | 9.86 | 10.24 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | bf16 | 21.5 | $99.181 |
| CodeLlama-13b-hf | Sample | Transformers NeuronX | trn1.32xlarge | Text Generation | 31.38 | 2372.93 | 2375.92 | 13.4 | 13.79 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | bf16 | 21.5 | $190.333 |
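TTFT (time to first token) and TPOT (time per output token) combine into an end-to-end request latency estimate: total time ≈ TTFT + TPOT × (output tokens − 1). A rough sketch using the Llama-2-7b Real Time row; since the table reports P50 values and medians do not compose exactly, treat the result as an approximation rather than a reproduction of the measured throughput:

```python
def request_latency_s(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Approximate end-to-end generation time: prefill plus per-token decode steps."""
    return (ttft_ms + tpot_ms * (output_tokens - 1)) / 1000

# Llama-2-7b Real Time row: TTFT 13.32 ms, TPOT 6.7 ms, 8064 output tokens
t = request_latency_s(13.32, 6.7, 8064)  # roughly 54 seconds for the full response
```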

Encoder-Decoder Models

Batch (maximum throughput):

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | Tutorial | NeuronX Distributed | trn1.32xlarge | Text Generation | 122.52 | 8.16 | 8.19 | $48.745 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | Tutorial | NeuronX Distributed | trn1.32xlarge | Text Generation | 116.29 | 8.58 | 8.66 | $51.356 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |

Note

Only for Encoder-Decoder:

- Throughput (tokens/second) counts both input and output tokens.
- Latency per Token counts both input and output tokens.

Applicable to all models:

- Cost per 1M inferences is calculated using the On-Demand hourly rate.
- A Real Time application refers to batch size 1 inference for minimal latency; a Batch application refers to maximum throughput at minimum cost per inference.
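Because the Encoder-Decoder rows above run at DP degree 1 and batch size 1, throughput and per-token latency are simply reciprocals, and both count the 128 input plus 84 output tokens of each request. A quick consistency check against the google/flan-t5-xl Batch row:

```python
# flan-t5-xl Batch row: 8.16 ms per token corresponds to ~122.5 tokens/second.
tokens_per_second = 1000 / 8.16

# Each request processes input + output tokens, so end-to-end time per request is:
total_tokens = 128 + 84                       # input length + output length
request_time_s = total_tokens * 8.16 / 1000   # ~1.73 s per request
```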

Real Time (batch size 1, minimal latency):

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | Tutorial | NeuronX Distributed | trn1.32xlarge | Text Generation | 120.29 | 8.31 | 8.34 | $49.649 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | Tutorial | NeuronX Distributed | trn1.32xlarge | Text Generation | 110.23 | 9.07 | 9.12 | $54.180 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |

