Trn1/Trn1n Performance
This document is relevant for: Inf2, Trn1, Trn1n
Last update: April 12th, 2023
Training Performance (Trn1 / Trn1n)
| Model | Model Data-Type | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Minibatch | Optimizer | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HuggingFace BERT-Large Ph1 pre-training | FP32 | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16Nodes(DP) | 16 | 262144 | Adam | 48414 | weak scaling | 2.9.0 | | 1.13.0.1.6.0 | U20 |
| HuggingFace BERT-Large Ph2 pre-training | FP32 | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16Nodes(DP) | 2 | 524288 | Adam | 7522 | weak scaling | 2.9.0 | | 1.13.0.1.6.0 | U20 |
| HuggingFace BERT-Large Ph1 pre-training | FP32 | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16Nodes(DP) | 16 | 262144 | Lamb | 41242 | strong scaling | 2.9.0 | | 1.13.0.1.6.0 | U20 |
| GPT3-6.7B pre-training | FP32 | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [8xNC(TP)x4(DP)] x 16Nodes(DP) | 1 | 1024 | | 129 | | 2.9.0 | | 1.13.0.1.6.0 | U20 |
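The global minibatch in the table is the product of the microbatch size, the number of data-parallel workers implied by the topology column, and a gradient-accumulation factor. The accumulation factors below are inferred from the table values rather than stated in it; a minimal sketch of the arithmetic:

```python
# Sketch of how the Global Minibatch column relates to the other columns.
# The gradient-accumulation factors are inferred from the table, not stated in it.

def global_minibatch(microbatch, data_parallel_workers, grad_accum_steps):
    """Effective global batch size consumed per optimizer step."""
    return microbatch * data_parallel_workers * grad_accum_steps

# BERT-Large Ph1: 32 NeuronCores/node (DP) x 16 nodes = 512 data-parallel workers
assert global_minibatch(16, 32 * 16, 32) == 262144   # implies 32 accumulation steps
# BERT-Large Ph2
assert global_minibatch(2, 32 * 16, 512) == 524288   # implies 512 accumulation steps
# GPT3-6.7B: 8-way tensor parallel, so 32 NC / 8 = 4 DP groups per node x 16 nodes = 64 workers
assert global_minibatch(1, 4 * 16, 16) == 1024       # implies 16 accumulation steps
```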
Inference Performance
| Model | Framework | Inst. Type | Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | PyTorch 1.13.0 | trn1.2xlarge | 2693 | 11.86 | 12.16 | $0.138 | Batch | 2.8.0 | Data Parallel | 16 | bf16 |
| bert-base-cased | PyTorch 1.13.0 | trn1.2xlarge | 2607 | 6.11 | 6.27 | $0.143 | Batch | 2.8.0 | Data Parallel | 8 | bf16 |
| bert-base-cased-finetuned-mrpc | PyTorch 1.13.0 | trn1.2xlarge | 2990 | 5.33 | 5.37 | $0.124 | Batch | 2.8.0 | Data Parallel | 8 | bf16 |
| bert-large-cased | PyTorch 1.13.0 | trn1.2xlarge | 958 | 8.33 | 8.51 | $0.388 | Batch | 2.8.0 | Data Parallel | 4 | bf16 |
| roberta-base | PyTorch 1.13.0 | trn1.2xlarge | 2565 | 6.21 | 6.42 | $0.145 | Batch | 2.8.0 | Data Parallel | 8 | bf16 |
| roberta-large | PyTorch 1.13.0 | trn1.2xlarge | 948 | 8.42 | 8.53 | $0.393 | Batch | 2.8.0 | Data Parallel | 4 | bf16 |
Note

Throughput optimization was performed by selecting the batch size that maximized throughput. All compiler flags, data types, and parameters are identical between model configurations. Cost per 1M inferences is calculated using the On-Demand hourly rate.

Real Time application refers to batch size 1 inference for minimal latency. Batch application refers to maximum throughput with minimum cost-per-inference.
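The cost column can be reproduced from the throughput figure and the instance's On-Demand price. A minimal sketch, assuming an On-Demand rate of roughly $1.34/hour for trn1.2xlarge (an illustrative figure; actual pricing varies by region):

```python
# Sketch: cost per 1M inferences from throughput and an assumed On-Demand hourly rate.
# The $1.34/hour trn1.2xlarge rate is an assumption for illustration; check current pricing.

def cost_per_million(throughput_per_sec, hourly_rate_usd):
    inferences_per_hour = throughput_per_sec * 3600
    return hourly_rate_usd / inferences_per_hour * 1_000_000

print(round(cost_per_million(2693, 1.34), 3))  # ~0.138, in line with the albert-base-v2 row
```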
| Model | Framework | Inst. Type | Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | PyTorch 1.13.0 | trn1.2xlarge | 1792 | 1.09 | 1.54 | $0.208 | Real Time | 2.8.0 | Data Parallel | 1 | bf16 |
| bert-base-cased | PyTorch 1.13.0 | trn1.2xlarge | 1792 | 1.1 | 1.49 | $0.208 | Real Time | 2.8.0 | Data Parallel | 1 | bf16 |
| bert-base-cased-finetuned-mrpc | PyTorch 1.13.0 | trn1.2xlarge | 1834 | 1.09 | 1.12 | $0.203 | Real Time | 2.8.0 | Data Parallel | 1 | bf16 |
| bert-large-cased | PyTorch 1.13.0 | trn1.2xlarge | 658 | 3.03 | 3.09 | $0.566 | Real Time | 2.8.0 | Data Parallel | 1 | bf16 |
| roberta-base | PyTorch 1.13.0 | trn1.2xlarge | 1729 | 1.15 | 1.2 | $0.215 | Real Time | 2.8.0 | Data Parallel | 1 | bf16 |
| roberta-large | PyTorch 1.13.0 | trn1.2xlarge | 645 | 3.09 | 3.13 | $0.577 | Real Time | 2.8.0 | Data Parallel | 1 | bf16 |
Note

Latency optimization was performed by selecting the batch size that minimized latency. All compiler flags, data types, and parameters are identical between model configurations. Cost per 1M inferences is calculated using the On-Demand hourly rate.

Real Time application refers to batch size 1 inference for minimal latency. Batch application refers to maximum throughput with minimum cost-per-inference.
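The figures above are for models compiled with torch-neuronx and run in data-parallel mode at the listed batch size. As a rough illustration of how such a measurement can be reproduced, the sketch below traces a Hugging Face model with torch_neuronx.trace and times batch-1 inference; the model name, sequence length, and iteration counts are illustrative choices, not the exact benchmark settings:

```python
# Minimal sketch: compile a model with torch-neuronx and time batch-1 inference.
# Model name, sequence length, and loop counts are illustrative, not the benchmark settings.
import time

import torch_neuronx
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, torchscript=True).eval()

encoded = tokenizer("hello world", padding="max_length", max_length=128, return_tensors="pt")
example = (encoded["input_ids"], encoded["attention_mask"])

# Compile for the NeuronCore; trace returns a module that executes on the device.
neuron_model = torch_neuronx.trace(model, example)

# Warm up, then measure average per-request latency at batch size 1.
for _ in range(10):
    neuron_model(*example)

n = 1000
start = time.time()
for _ in range(n):
    neuron_model(*example)
elapsed = time.time() - start
print(f"avg latency: {elapsed / n * 1000:.2f} ms, throughput: {n / elapsed:.0f} /sec")
```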