This document is relevant for: Inf2, Trn1, Trn1n

Trn1/Trn1n Performance#

Last update: April 12th, 2023

Training Performance (Trn1 / Trn1n)#

| Model | Model Data-Type | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Minibatch | Optimizer | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HuggingFace BERT-Large Ph1 pre-training | FP32 | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16Nodes(DP) | 16 | 262144 | Adam | 48414 | weak scaling | 2.9.0 | Hugging Face BERT Pretraining Tutorial | 1.13.0.1.6.0 | U20 |
| HuggingFace BERT-Large Ph2 pre-training | FP32 | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16Nodes(DP) | 2 | 524288 | Adam | 7522 | weak scaling | 2.9.0 | Hugging Face BERT Pretraining Tutorial | 1.13.0.1.6.0 | U20 |
| HuggingFace BERT-Large Ph1 pre-training | FP32 | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16Nodes(DP) | 16 | 262144 | Lamb | 41242 | strong scaling | 2.9.0 | Hugging Face BERT Pretraining Tutorial | 1.13.0.1.6.0 | U20 |
| GPT3-6.7B pre-training | FP32 | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [8xNC(TP)x4(DP)] x 16Nodes(DP) | 1 | 1024 | | 129 | | 2.9.0 | Megatron-LM GPT Pretraining Tutorial | 1.13.0.1.6.0 | U20 |
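For orientation, the Global Minibatch column is consistent with microbatch x data-parallel degree x gradient-accumulation steps, where the data-parallel degree follows from the Topology column: 32 NeuronCores per node in pure data parallel for BERT-Large, and 8-way tensor parallelism leaving 4 data-parallel groups per node for GPT3-6.7B. A minimal sanity check of that arithmetic (the gradient-accumulation counts below are derived from the table, not quoted from the benchmark configurations):

```python
# Sanity-check the Global Minibatch column:
#   global_minibatch = microbatch * dp_degree * accumulation_steps
# dp_degree is read off the Topology column; accumulation_steps is the
# derived (not published) quantity.
rows = [
    # model,           microbatch, dp_degree, global_minibatch
    ("BERT-Large Ph1", 16,         32 * 16,   262144),  # [32xNC(DP)] x 16 nodes
    ("BERT-Large Ph2",  2,         32 * 16,   524288),
    ("GPT3-6.7B",       1,          4 * 16,     1024),  # 8-way TP -> 4 DP groups/node
]
for model, microbatch, dp_degree, global_mb in rows:
    accumulation_steps = global_mb // (microbatch * dp_degree)
    assert microbatch * dp_degree * accumulation_steps == global_mb
    print(f"{model}: {microbatch} * {dp_degree} * {accumulation_steps} = {global_mb}")
```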

Inference Performance#

| Model | Framework | Inst. Type | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| albert-base-v2 | PyTorch 1.13.0 | trn1.2xlarge | 2693 | 11.86 | 12.16 | $0.138 | Batch | 2.8.0 | Data Parallel | 16 | bf16 |
| bert-base-cased | PyTorch 1.13.0 | trn1.2xlarge | 2607 | 6.11 | 6.27 | $0.143 | Batch | 2.8.0 | Data Parallel | 8 | bf16 |
| bert-base-cased-finetuned-mrpc | PyTorch 1.13.0 | trn1.2xlarge | 2990 | 5.33 | 5.37 | $0.124 | Batch | 2.8.0 | Data Parallel | 8 | bf16 |
| bert-large-cased | PyTorch 1.13.0 | trn1.2xlarge | 958 | 8.33 | 8.51 | $0.388 | Batch | 2.8.0 | Data Parallel | 4 | bf16 |
| roberta-base | PyTorch 1.13.0 | trn1.2xlarge | 2565 | 6.21 | 6.42 | $0.145 | Batch | 2.8.0 | Data Parallel | 8 | bf16 |
| roberta-large | PyTorch 1.13.0 | trn1.2xlarge | 948 | 8.42 | 8.53 | $0.393 | Batch | 2.8.0 | Data Parallel | 4 | bf16 |

Note

Throughput optimization was performed by selecting a batch size that maximized the metric. All compiler flags, data types, and parameters are identical between model configurations. Cost per 1M inferences is calculated using the On-Demand hourly rate.

Real Time application refers to batch size 1 inference for minimal latency. Batch application refers to maximum throughput with minimum cost-per-inference.
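The cost column can be reproduced directly from the throughput column. A minimal sketch, assuming a trn1.2xlarge On-Demand rate of $1.34/hour (the rate is an assumption based on us-east-1 pricing at the time of writing; substitute your region's current rate):

```python
# Reproduce "Cost per 1M inferences" from throughput and the hourly rate.
HOURLY_RATE_USD = 1.34  # assumed trn1.2xlarge On-Demand rate; verify for your region

def cost_per_1m_inferences(throughput_per_sec: float) -> float:
    cost_per_inference = (HOURLY_RATE_USD / 3600.0) / throughput_per_sec
    return cost_per_inference * 1_000_000

for model, throughput in [("albert-base-v2", 2693), ("roberta-large", 948)]:
    print(f"{model}: ${cost_per_1m_inferences(throughput):.3f}")
# -> albert-base-v2: $0.138, roberta-large: $0.393 (matching the table above)
```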

| Model | Framework | Inst. Type | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| albert-base-v2 | PyTorch 1.13.0 | trn1.2xlarge | 1792 | 1.09 | 1.54 | $0.208 | Real Time | 2.8.0 | Data Parallel | 1 | bf16 |
| bert-base-cased | PyTorch 1.13.0 | trn1.2xlarge | 1792 | 1.1 | 1.49 | $0.208 | Real Time | 2.8.0 | Data Parallel | 1 | bf16 |
| bert-base-cased-finetuned-mrpc | PyTorch 1.13.0 | trn1.2xlarge | 1834 | 1.09 | 1.12 | $0.203 | Real Time | 2.8.0 | Data Parallel | 1 | bf16 |
| bert-large-cased | PyTorch 1.13.0 | trn1.2xlarge | 658 | 3.03 | 3.09 | $0.566 | Real Time | 2.8.0 | Data Parallel | 1 | bf16 |
| roberta-base | PyTorch 1.13.0 | trn1.2xlarge | 1729 | 1.15 | 1.2 | $0.215 | Real Time | 2.8.0 | Data Parallel | 1 | bf16 |
| roberta-large | PyTorch 1.13.0 | trn1.2xlarge | 645 | 3.09 | 3.13 | $0.577 | Real Time | 2.8.0 | Data Parallel | 1 | bf16 |

Note

Latency optimization was performed by selecting a batch size that minimized the metric. All compiler flags, data types, and parameters are identical between model configurations. Cost per 1M inferences is calculated using the On-Demand hourly rate.

Real Time application refers to batch size 1 inference for minimal latency. Batch application refers to maximum throughput with minimum cost-per-inference.
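Every row above lists Data Parallel as the run mode: the compiled model is replicated across the instance's NeuronCores (two on trn1.2xlarge) and each invocation's batch is split between them. A minimal sketch of that setup with torch-neuronx, including a crude P50/P99 latency measurement; the model choice, sequence length, and timing loop are illustrative assumptions, not the actual benchmark harness:

```python
# Illustrative sketch: compile a HuggingFace encoder with torch-neuronx and
# run it in data-parallel mode across the NeuronCores of a trn1.2xlarge.
import time

import torch
import torch_neuronx
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "bert-base-cased"  # any model name from the tables above
BATCH = 8                  # this model's Batch-row batch size

tokenizer = AutoTokenizer.from_pretrained(MODEL)
# torchscript=True makes the model return plain tuples, which tracing needs
model = AutoModelForSequenceClassification.from_pretrained(MODEL, torchscript=True)
model.eval()

enc = tokenizer(["an example input"] * BATCH, padding="max_length",
                max_length=128, return_tensors="pt")
example = (enc["input_ids"], enc["attention_mask"])

# Compile for NeuronCores (trace invokes the Neuron compiler).
traced = torch_neuronx.trace(model, example)

# Replicate the compiled model across all visible NeuronCores; inputs are
# split along the batch dimension at call time.
dp_model = torch_neuronx.DataParallel(traced)

# Crude latency percentiles over repeated invocations (after warmup).
latencies_ms = []
with torch.no_grad():
    for _ in range(100):   # warmup
        dp_model(*example)
    for _ in range(1000):
        start = time.perf_counter()
        dp_model(*example)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
latencies_ms.sort()
print(f"P50: {latencies_ms[len(latencies_ms) // 2]:.2f} ms  "
      f"P99: {latencies_ms[int(len(latencies_ms) * 0.99)]:.2f} ms")
```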

This document is relevant for: Inf2, Trn1, Trn1n