This document is relevant for: Trn1, Trn1n

Megatron-LM GPT Pretraining Tutorial#

GPT is a large language model that excels at many natural language processing (NLP) tasks. It is derived from the decoder part of the Transformer. Neuron Reference For Megatron-LM is a library that enables large-scale distributed training of language models such as GPT and is adapted from Megatron-LM. This tutorial explains how to run the Neuron reference for Megatron-LM GPT pretraining on Trainium.

The AWS Neuron SDK provides access to Trainium devices through an extension of PyTorch/XLA - a library that includes the familiar PyTorch interface along with XLA-specific additions. For Trainium customers, this means that existing PyTorch training scripts can be executed on Trn1 instances with minimal code modifications. For additional details relating to PyTorch/XLA, please refer to the official PyTorch/XLA documentation.

To run on Trainium, the Neuron Reference For Megatron-LM library includes the following changes:

  • GPU devices are replaced with PyTorch/XLA devices.

  • The PyTorch/XLA distributed backend is used to bridge the PyTorch distributed APIs to XLA communication semantics.

  • PyTorch/XLA MpDeviceLoader is used for the data ingestion pipelines. MpDeviceLoader helps improve performance by overlapping three execution steps: tracing, compilation, and loading data batches to the device.

  • CUDA APIs are mapped to generic PyTorch APIs.

  • CUDA fused optimizers are replaced with generic PyTorch alternatives.

The GPT example in this tutorial is an adaptation of the original Megatron-LM GPT example, trained using the Wikipedia dataset.


Logs used in tutorials do not present the latest performance numbers.

For the latest performance numbers, visit Neuron Performance.

Install PyTorch Neuron#

Before running the tutorial please follow the installation instructions at:

Set the instance storage to 512 GB or more if you intend to run multiple experiments and save many checkpoints.

Download Preprocessed Wikipedia Dataset#

Download the vocabulary file, the merge table file, and the preprocessed Wikipedia dataset using the following commands:

export DATA_DIR=~/examples_datasets/gpt2

mkdir -p ${DATA_DIR} && cd ${DATA_DIR}

aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.bin .  --no-sign-request
aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.idx .  --no-sign-request
aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/license.txt .  --no-sign-request

See section Preparing Wikipedia dataset from scratch if you would like to recreate the preprocessed dataset from scratch.

Setting up the training environment on trn1.32xlarge#

Please follow the instructions to set up a Python virtual environment with the Neuron packages.

Install the Python 3 development package needed to build the data helpers tool. If you are on Amazon Linux, do:

sudo yum install -y python3-devel

If you are on Ubuntu, do:

sudo apt install -y python3-dev

Clone the AWS Neuron Reference for Megatron-LM package, install dependencies, and build the data helpers tool:

cd ~/
git clone
pip install pybind11 regex
pushd .
cd aws-neuron-reference-for-megatron-lm/megatron/data/
make
popd

GPT Pretraining Python Script#

The GPT pretraining Python script is a wrapper that imports the Megatron-LM library modules and sets up the pieces needed by the Megatron-LM trainer: the GPT model, loss function, forward pass, and data provider. It is adapted from the corresponding script in the original Megatron-LM. The Neuron changes are:

  • Uses the XLA device.

  • Does not use mpu.broadcast_data, which is currently unsupported. Instead, each worker reads the data in parallel.

  • Uses the int datatype instead of long for token data.

The script is available at ~/aws-neuron-reference-for-megatron-lm/

GPT Training Shell Script#

The GPT training shell script runs the above Python script with the following model configuration (for a 6.7-billion-parameter model):

  • Number of layers: 32

  • Hidden size: 4096

  • Number of attention heads: 32

  • Sequence length: 2048

  • Max positional embeddings size: 2048
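As a rough sanity check, the configuration above can be related to the 6.7 billion figure with a common back-of-the-envelope formula. The sketch below is an approximation that ignores biases and layer norms; the function name is hypothetical, and the vocabulary size of 50257 (the GPT-2 BPE vocabulary) is an assumption, since the library may pad the vocabulary to a different size.

```python
def estimate_gpt_params(num_layers, hidden_size, vocab_size, max_positions):
    """Approximate GPT parameter count, ignoring biases and layer norms."""
    # Per transformer layer: about 4*h^2 weights for attention (Q, K, V and
    # the output projection) plus about 8*h^2 for the MLP (h -> 4h -> h).
    per_layer = 12 * hidden_size ** 2
    # Token embeddings plus learned positional embeddings.
    embeddings = (vocab_size + max_positions) * hidden_size
    return num_layers * per_layer + embeddings


# vocab_size=50257 is an assumption (GPT-2 BPE vocabulary, unpadded).
total = estimate_gpt_params(num_layers=32, hidden_size=4096,
                            vocab_size=50257, max_positions=2048)
print(f"{total / 1e9:.2f}B parameters")
```

Under these assumptions the estimate comes to roughly 6.66 billion parameters, which is consistent with the 6.7B label.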

The following training parameters are used:

  • The number of gradient accumulation microsteps is 64, with worker batch size of 1.

  • The tensor parallelism degree is 8.

  • The data parallelism degree is 4.

  • The number of workers is 32.
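The worker count is the product of the two parallelism degrees (8 x 4 = 32). The sketch below shows one way to map a global worker rank to its position in that mesh. It assumes the layout used by upstream Megatron-LM, where tensor-parallel peers occupy consecutive ranks; the function name is hypothetical and the Neuron reference may arrange ranks differently.

```python
def mesh_coordinates(rank, tensor_parallel=8, data_parallel=4):
    """Map a global worker rank to (data_parallel_rank, tensor_parallel_rank)."""
    world_size = tensor_parallel * data_parallel
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside 0..{world_size - 1}")
    # Assumed layout: tensor-parallel peers occupy consecutive ranks, so the
    # quotient selects the model replica and the remainder the shard within it.
    return divmod(rank, tensor_parallel)


for rank in (0, 7, 8, 31):
    dp_rank, tp_rank = mesh_coordinates(rank)
    print(f"rank {rank:2d} -> data-parallel {dp_rank}, tensor-parallel {tp_rank}")
```

With this layout, ranks 0 to 7 hold the eight shards of the first model replica, ranks 8 to 15 the second, and so on.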

Additionally, the script uses:

  • CPU initialization

  • AdamW optimizer (default).

  • Gradient clipping.

  • No CUDA fusions (bias-gelu, masked-softmax, bias-dropout)

  • Disabled contiguous buffer in local DDP

  • The option --distributed-backend xla selects the XLA distributed backend, which bridges the PyTorch distributed APIs to XLA communication semantics.

See this link for a full list of options and their descriptions.


Not all options are supported. Currently, only the tensor-parallel and data-parallel modes in Neuron Reference For Megatron-LM are supported. We support a tensor-parallel degree of 8 and a data-parallel degree of up to 64.

The script for running on a single node is available at ~/aws-neuron-reference-for-megatron-lm/examples/

This shell script expects dataset files to be located in ~/examples_datasets/gpt2/ following the steps above. If you place the dataset files in another location, please update the DATA_PATH variable in the shell script.

Initiating a Training Job#

To run the GPT example, first activate the Python virtual environment, change to the Megatron-LM package location, and allow execute permission on the script:

source ~/aws_neuron_venv_pytorch/bin/activate
cd ~/aws-neuron-reference-for-megatron-lm/
chmod +x *.sh

Next, run the parallel compilations of graphs in order to reduce compilation time during the actual run.

neuron_parallel_compile ./examples/

This command performs a short trial run of the training script to extract graphs, compiles those graphs in parallel, and then populates the persistent cache with the compiled graphs. This helps reduce the compilation time during the actual run of the training script.


Please ignore the results of the trial run as they are not the actual execution results.

If some or all the graphs were already compiled and cached in the persistent cache, then fewer or none of the graphs would need compilation. To force recompilation, you can remove the cache directory at /var/tmp/neuron-compile-cache/.

Rerunning the precompilation step is recommended after changing the script (for example, the batch size, number of layers, or number of workers). Compilation only happens when the model graph or its parameters or compilation flags change.
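The caching behavior described above can be pictured with a toy model. The sketch below is purely illustrative and is not how the Neuron compiler computes its cache keys: the function names, the in-memory dictionary, and the choice of hashing the graph text together with the flags are all assumptions made for the example.

```python
import hashlib
import json

# Toy stand-in for the persistent cache (the real cache lives on disk).
cache = {}


def cache_key(graph_text, compiler_flags):
    """Return a stable digest for a (graph, flags) pair."""
    payload = json.dumps({"graph": graph_text, "flags": sorted(compiler_flags)})
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def compile_if_needed(graph_text, compiler_flags):
    """Return (artifact, was_compiled); the compilation itself is simulated."""
    key = cache_key(graph_text, compiler_flags)
    if key in cache:
        return cache[key], False  # cache hit: nothing to compile
    cache[key] = f"compiled-{key[:12]}"  # placeholder for a compiled graph
    return cache[key], True


# Hypothetical graph descriptions and flag names, for illustration only.
print(compile_if_needed("gpt layers=32 batch=1", ["--example-flag"]))
print(compile_if_needed("gpt layers=32 batch=1", ["--example-flag"]))
print(compile_if_needed("gpt layers=32 batch=2", ["--example-flag"]))
```

The first and third calls report a compilation, while the second is served from the cache because neither the graph nor the flags changed.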

Finally, run the script for the actual run:


During the run, you will see output like the following, with a line showing throughput and loss statistics at every global step:

`iteration     4873/   10000 | consumed samples:       311872 | elapsed time per iteration (ms): 8718.9 | learning rate: 1.500E-04 | global batch size:    64 | lm loss: 3.296875E+00 | grad norm: 0.430 | throughput: 7.340`
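A log line like the one above can be parsed into named fields for ad hoc monitoring. The sketch below is a minimal parser written for this example (the function name is hypothetical). It also checks two relationships that hold for the sample line: consumed samples equal the iteration count times the global batch size, and the reported throughput matches the global batch size divided by the iteration time in seconds. The second relationship is inferred from the sample line rather than stated by the library.

```python
import re

LOG_LINE = (
    "iteration     4873/   10000 | consumed samples:       311872 | "
    "elapsed time per iteration (ms): 8718.9 | learning rate: 1.500E-04 | "
    "global batch size:    64 | lm loss: 3.296875E+00 | grad norm: 0.430 | "
    "throughput: 7.340"
)


def parse_iteration_line(line):
    """Split a training log line on '|' and return its fields as a dict."""
    fields = {}
    for part in line.split("|"):
        part = part.strip()
        match = re.match(r"iteration\s+(\d+)/\s*(\d+)$", part)
        if match:
            fields["iteration"] = int(match.group(1))
            fields["train_iters"] = int(match.group(2))
        elif ":" in part:
            # Split on the last colon so names such as
            # "elapsed time per iteration (ms)" stay intact.
            name, value = part.rsplit(":", 1)
            fields[name.strip()] = float(value)
    return fields


metrics = parse_iteration_line(LOG_LINE)
seconds = metrics["elapsed time per iteration (ms)"] / 1000
print("consumed samples check:",
      metrics["consumed samples"] == metrics["iteration"] * metrics["global batch size"])
print("derived throughput (seq/sec):", round(metrics["global batch size"] / seconds, 3))
```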

Monitoring Training Job Progress#

Using a single Trn1 instance with 32 NeuronCores, the current GPT pretraining will run for ~81 hours. During this time, you will see the average loss metric begin at 11 and ultimately converge to ~3.2. Throughput for the training job will be ~7.3 seq/sec.

Monitoring Training Job Progress using neuron-top#

With the training job still running, launch a second SSH connection into the trn1 instance, and use the neuron-top command to examine the aggregate NeuronCore utilization.

Monitoring Training Job Progress using TensorBoard#

The demo includes TensorBoard-compatible logging, which allows the learning rate and training metrics to be monitored in real time. By default, the training script logs metrics to the following TensorBoard log directory: ~/aws-neuron-reference-for-megatron-lm/tb_*.

In order to view your training metrics in TensorBoard, first run the following commands in your SSH session:

source ~/aws_neuron_venv_pytorch/bin/activate
cd ~/aws-neuron-reference-for-megatron-lm/
tensorboard --logdir ./

Once running, open a new SSH connection to the instance and port-forward TCP port 6006 with the SSH -L option (ex: -L 6006:...). Once the tunnel is established, TensorBoard can be accessed in a web browser at the following URL: http://localhost:6006. Please note that you will not be able to access TensorBoard if you disconnect your port-forwarding SSH session to the Trainium instance.

Finishing the tutorial#

Once you are ready, and the training throughput is as expected, there are a couple of options for finishing the GPT pretraining demo:

  • Allow the training script to run to completion. If you would like to observe the training script run to completion, it is recommended to launch the training script from a terminal multiplexer such as tmux or screen, and then detach the session so that the training script can run in the background. With this approach, you can safely let the training script run unattended, without risk of an SSH disconnection causing the training job to stop running.

  • Stop the training job early. To stop the training job early, press CTRL-C in the terminal window in which you launched the training script. In some cases, if you manually cancel a job using CTRL-C and later want to run the job again, you might first need to terminate all the Python processes with the command killall -9 python3.

Running a multi-node GPT#

We use SLURM to launch multi-node GPT training jobs. As with single-node runs, there is a precompilation step followed by the actual run. To precompile:

sbatch examples/pretrain_gpt3_6.7B_compile.slurm

This will precompile the script examples/ on all the nodes and populate the caches.

To run the compiled model:

sbatch examples/pretrain_gpt3_6.7B.slurm

The number of nodes is currently set to 16. Since the tensor-parallel degree is 8, the data-parallel degree is automatically computed to be 64, resulting in an 8x64 two-dimensional parallelism mesh.
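The data-parallel degree follows from simple arithmetic: the total number of workers divided by the tensor-parallel degree. The sketch below works this out, assuming 32 NeuronCores (workers) per Trn1 instance as in the single-node setup; the function name is hypothetical.

```python
def data_parallel_degree(num_nodes, cores_per_node=32, tensor_parallel=8):
    """Derive the data-parallel degree from the cluster size."""
    world_size = num_nodes * cores_per_node
    if world_size % tensor_parallel:
        raise ValueError(
            f"{world_size} workers cannot be split into groups of {tensor_parallel}"
        )
    return world_size // tensor_parallel


print(data_parallel_degree(num_nodes=16))  # 16 * 32 / 8 = 64
print(data_parallel_degree(num_nodes=1))   # 1 * 32 / 8 = 4
```

The single-node case reproduces the data-parallel degree of 4 used earlier in the tutorial.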

The TensorBoard logs are written by the last rank to the TensorBoard log directory ~/aws-neuron-reference-for-megatron-lm/tb_*.

Compared to the single-node script, we use an increased batch size of 1024, which gives us a throughput bump of ~98 seq/sec. The number of iterations is also increased, and the learning-rate and weight-decay hyperparameters are changed accordingly.

Checkpointing GPT Model#

A new mode of checkpointing using serialized tensor and staggered save/load is supported to alleviate memory pressure. To save the model, add the lines:

--save-xser $CHECKPOINT_PATH
--save-interval 1500

This saves a checkpoint to the provided path every 1500 iterations.


Please note that each checkpoint contains all the model weights, optimizer states, and RNG states (~76 GB for a 32-layer model). Frequent checkpointing can quickly exhaust disk storage, so make sure there is enough disk space.
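To get a feel for the disk budget, the sketch below estimates how much space a run needs and compares it with the free space on a volume. It assumes that every checkpoint is kept (none are overwritten or deleted), that each one is about 76 GB, and that the run lasts 10000 iterations as in the sample log line; the function names are hypothetical.

```python
import shutil

CHECKPOINT_GB = 76  # approximate size of one checkpoint for the 32-layer model


def checkpoints_written(train_iters, save_interval):
    """Number of interval-triggered checkpoints during a run."""
    return train_iters // save_interval


def required_gb(train_iters, save_interval, checkpoint_gb=CHECKPOINT_GB):
    """Disk space needed if every checkpoint is kept."""
    return checkpoints_written(train_iters, save_interval) * checkpoint_gb


def has_room(path, needed_gb):
    """Return True if the volume holding `path` has at least `needed_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb


needed = required_gb(train_iters=10000, save_interval=1500)
print(f"{checkpoints_written(10000, 1500)} checkpoints, about {needed} GB")
print("enough free space here:", has_room(".", needed))
```

Under these assumptions, a 10000-iteration run with --save-interval 1500 writes 6 checkpoints, or about 456 GB, which is close to the 512 GB minimum storage recommended earlier.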

To load the checkpoint, we first need to remove --use-cpu-initialization from the script and then add

--load-xser $CHECKPOINT_PATH


Please note that not removing the --use-cpu-initialization flag may lead to out-of-memory errors and unstable resumption of training.

Preparing Wikipedia Dataset from Scratch#

The process of preparing the Wikipedia dataset follows the original Megatron-LM documentation. You will need a large c5 instance such as c5n.18xlarge running the latest Deep Learning AMI. First, download the Wikipedia dataset. Depending on network bandwidth, the download is expected to take about 65 minutes.

export WIKI_DIR=~/examples_datasets/wiki
mkdir -p $WIKI_DIR && cd $WIKI_DIR


Download the vocabulary and merge table files for the desired model. This example uses the GPT-2 model:

export DATA_DIR=~/examples_datasets/gpt2
export GPT2_DATA=${DATA_DIR}/gpt2

mkdir -p ${GPT2_DATA} && cd ${GPT2_DATA}


mkdir -p ${GPT2_DATA}/checkpoint
wget --content-disposition -O ${GPT2_DATA}/checkpoint/

Extract the downloaded data using WikiExtractor (this step takes about 2 hours):

git clone /tmp/wikiextractor
cd /tmp/wikiextractor
python -m wikiextractor.WikiExtractor --json ~/examples_datasets/wiki/enwiki-latest-pages-articles.xml.bz2 --output ~/examples_datasets/wiki/text/ -q --processes 70 2>&1 | tee wikiextract.out &

WikiExtractor first preprocesses the templates of all pages sequentially, then runs a Map/Reduce process that extracts the pages and converts them to the loose JSON format required by Megatron-LM.
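Before running the later preprocessing steps, it can be useful to spot-check the extracted records. The sketch below is a minimal validator, assuming that the loose JSON format means one JSON object per line with the article body under a "text" key; the function name and the sample records are made up for illustration.

```python
import json


def check_loose_json(lines, text_key="text"):
    """Count usable, empty, and malformed records in loose-JSON input."""
    counts = {"ok": 0, "empty": 0, "malformed": 0}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts["malformed"] += 1
            continue
        text = record.get(text_key) if isinstance(record, dict) else None
        if isinstance(text, str) and text.strip():
            counts["ok"] += 1
        else:
            counts["empty"] += 1  # missing key or empty article body
    return counts


# Made-up sample records; real input would be an open file handle.
sample = [
    '{"id": "1", "title": "A", "text": "First article."}',
    '{"id": "2", "title": "B", "text": ""}',
    "not json",
]
print(check_loose_json(sample))
```

Because the function accepts any iterable of lines, an open file can be passed directly, so large files are checked one line at a time.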

Once the extraction completes, merge the text files (this takes about 2 minutes):

conda activate pytorch_latest_p37
cd ~/examples_datasets/wiki
find ~/examples_datasets/wiki/text/ -name 'wiki*' | parallel -m -j 70 "cat {} >> mergedfile.json"

The mergedfile.json file is 16 GB on disk. Use it to create the binary data format for Megatron GPT2 with the following command. NOTE: Refer to this solution if an IndexError: list index out of range occurs.

python ~/aws-neuron-reference-for-megatron-lm/tools/ \
    --input ~/examples_datasets/wiki/mergedfile.json \
    --output-prefix my-gpt2 \
    --vocab ~/examples_datasets/gpt2/gpt2-vocab.json \
    --dataset-impl mmap \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file ~/examples_datasets/gpt2/gpt2-merges.txt \
    --append-eod \
    --workers 70

Files my-gpt2_text_document.* are generated after about 12 minutes.

Known issues and limitations#

No broadcast support#

Currently, mpu.broadcast_data is unsupported on Trainium.

No pipeline parallel support#

Currently, only tensor parallel and data parallel are supported and there is no pipeline parallel support in Neuron Reference For Megatron-LM.

Dropout is disabled#

Currently, dropout is disabled in the example.

“Failed accept4: Too many open files”#

When running the Megatron-LM GPT3 6.7B example above on the Ubuntu Server 20.04 LTS (HVM) or Ubuntu Server 22.04 LTS (HVM) AMIs, you may encounter the following “Failed accept4: Too many open files” error:

E0301 08:06:14.272283286   72588]    Failed accept4: Too many open files
2023-03-01 08:06:15.515834: F tensorflow/libtpu/neuron/] Check failed: fd != -1 Opening lock file failed with errno 24

The reason is that on these AMIs, “ulimit -n” is set to 1024, which is much lower than on, for example, Amazon Linux 2 AMI (HVM) - Kernel 5.10, where it is set to 65535 by default. To work around this issue, increase “ulimit -n” to a higher value such as 65535, which matches Amazon Linux 2 AMI (HVM) - Kernel 5.10 and is sufficient for the Megatron-LM GPT3 6.7B example. This can also be set within the shell script (which is run using the SLURM srun command) so that it applies to each worker process.

ulimit -n 65535
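The same limit can also be inspected or raised from Python with the standard resource module, which may be handy for checking what a worker process actually sees. This is a sketch of an alternative to the shell command above, not something the training script is known to do; the function name is hypothetical, and the change applies only to the current process and its children.

```python
import resource


def raise_open_file_limit(target=65535):
    """Raise the soft open-file limit toward `target`; return the soft limit in effect."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited or soft >= target:
        return soft  # already high enough
    # An unprivileged process cannot raise the soft limit above the hard limit.
    new_soft = target if hard == unlimited else min(target, hard)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        return soft  # the OS refused the change; keep the old limit
    return new_soft


print("soft open-file limit:", raise_open_file_limit())
```

If the hard limit itself is below 65535, it must be raised by an administrator, since this sketch only adjusts the soft limit.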

Error: cannot import name ‘helpers’ from ‘’#

You may encounter the error “cannot import name ‘helpers’ from ‘’” like below:

Exception in device=NEURONT:0: cannot import name 'helpers' from '' (/home/ec2-user/aws-neuron-reference-for-megatron-lm/megatron/data/
Traceback (most recent call last):
  File "/home/ec2-user/aws_neuron_venv_pytorch_p37/lib64/python3.7/site-packages/torch_xla/distributed/", line 373, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/home/ec2-user/aws_neuron_venv_pytorch_p37/lib64/python3.7/site-packages/torch_xla/distributed/", line 367, in _start_fn
    fn(gindex, *args)
  File "/home/ec2-user/aws-neuron-reference-for-megatron-lm/", line 138, in pretrain_mp
    forward_step, args_defaults={'tokenizer_type': 'GPT2BPETokenizer'})
  File "/home/ec2-user/aws-neuron-reference-for-megatron-lm/megatron/", line 162, in pretrain
  File "/home/ec2-user/aws-neuron-reference-for-megatron-lm/megatron/", line 1021, in build_train_valid_test_data_iterators
  File "/home/ec2-user/aws-neuron-reference-for-megatron-lm/", line 128, in train_valid_test_datasets_provider
    skip_warmup=(not args.mmap_warmup))
  File "/home/ec2-user/aws-neuron-reference-for-megatron-lm/megatron/data/", line 43, in build_train_valid_test_datasets
    seq_length, seed, skip_warmup)
  File "/home/ec2-user/aws-neuron-reference-for-megatron-lm/megatron/data/", line 118, in _build_train_valid_test_datasets
    train_dataset = build_dataset(0, 'train')
  File "/home/ec2-user/aws-neuron-reference-for-megatron-lm/megatron/data/", line 115, in build_dataset
    seq_length, seed)
  File "/home/ec2-user/aws-neuron-reference-for-megatron-lm/megatron/data/", line 156, in __init__
    num_samples, seq_length, seed)
  File "/home/ec2-user/aws-neuron-reference-for-megatron-lm/megatron/data/", line 274, in _build_index_mappings
    from import helpers
ImportError: cannot import name 'helpers' from '' (/home/ec2-user/aws-neuron-reference-for-megatron-lm/megatron/data/

To fix this, please go into aws-neuron-reference-for-megatron-lm/megatron/data/ and do “make”:

pip install pybind11
pushd .
cd aws-neuron-reference-for-megatron-lm/megatron/data/
make
popd

Error: Out of space while checkpointing#

You may see an error like the one below. Model checkpoints are large because they contain all the model weights, optimizer states, and RNG states. If you checkpoint frequently, storage can run out quickly. Please make sure you have enough disk space.

Traceback (most recent call last):
  File "/home/ec2-user/aws_neuron_venv_pytorch_p37/lib64/python3.7/site-packages/torch/", line 380, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/home/ec2-user/aws_neuron_venv_pytorch_p37/lib64/python3.7/site-packages/torch/", line 604, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
OSError: [Errno 28] No space left on device