This document is relevant for: Inf2, Trn1, Trn1n

Training GPT-NeoX 20B with Tensor Parallelism and ZeRO-1 Optimizer (neuronx-distributed)#

In this section, we show how to pretrain a GPT-NeoX 20B model using tensor parallelism with the sequence parallel optimization in the neuronx-distributed package. Please refer to the Neuron Samples repository to view the files in this tutorial.

This GPT-NeoX 20B tutorial differs from the GPT-NeoX 6.9B tutorial in the following ways:

  • sequence parallel optimization has been applied

  • parallel cross entropy has been applied

  • the model size has been increased from 6.9B to 20B

  • the TP degree has been increased from 8 to 32

Setting up the environment is the same as for the GPT-NeoX 6.9B tutorial.

Let’s download the scripts for pretraining:

mkdir -p ~/examples/tp_dp_gpt_neox_hf_pretrain
cd ~/examples/tp_dp_gpt_neox_hf_pretrain
wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/master/examples/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain.py
wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/master/examples/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain.sh
wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/master/examples/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain/modeling_gpt_neox_nxd.py
wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/master/examples/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain/utils.py
wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/master/examples/training/tp_dp_gpt_neox_hf_pretrain/common/adamw_fp32_optim_params.py
wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/master/examples/training/tp_dp_gpt_neox_hf_pretrain/common/get_dataset.py
wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/master/examples/training/tp_dp_gpt_neox_hf_pretrain/common/requirements.txt
python3 -m pip install -r requirements.txt

Next let’s download and pre-process the dataset:

cd ~/examples/tp_dp_gpt_neox_hf_pretrain
python3 get_dataset.py

At this point, you are all set to start training.

Running training

We first pre-compile the graphs using neuron_parallel_compile. Suppose the cluster queue name is compute1-dy-training-0 and we are using nodes 1-4; let's run the command below:

sbatch --exclusive \
--nodelist=compute1-dy-training-0-[1-4] \
--cpus-per-task 128 \
--wrap="srun neuron_parallel_compile bash $(pwd)/tp_dp_gpt_neox_20b_hf_pretrain.sh"

This script uses a tensor-parallel size of 32. This will automatically set the ZeRO-1 sharding degree to 4 (4 nodes * 32 workers per node / tensor_parallel_size of 32 = 4). Once the graphs are compiled, we can run training and observe the loss go down. To run the training, we use the same command as above but without neuron_parallel_compile.

sbatch --exclusive \
--nodelist=compute1-dy-training-0-[1-4] \
--cpus-per-task 128 \
--wrap="srun bash $(pwd)/tp_dp_gpt_neox_20b_hf_pretrain.sh"

Sequence Parallel

We made the following model-level modifications to enable sequence parallelism (a minimal sketch follows the list):

  • turn on sequence_parallel_enabled of ColumnParallelLinear and RowParallelLinear in GPTNeoXAttention and GPTNeoXMLP;

  • replace torch LayerNorm in GPTNeoXLayer and GPTNeoXModel with neuronx-distributed LayerNorm with sequence_parallel_enabled turned on;

  • dimension transposition of intermediate states in the forward function of GPTNeoXAttention;

  • dimension transposition and collective communication of intermediate states in the forward function of GPTNeoXModel.
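
The sketch below illustrates the first two changes: sequence-parallel versions of the attention projections and a LayerNorm. The hidden size and the LayerNorm import path are assumptions here, and the snippet presumes the model-parallel state has already been initialized; see modeling_gpt_neox_nxd.py for the actual modeling code.

# Minimal sketch of the model-level changes (assumes parallel state is
# already initialized; see modeling_gpt_neox_nxd.py for the real code).
from neuronx_distributed.parallel_layers import ColumnParallelLinear, RowParallelLinear
# The import path of the sequence-parallel LayerNorm is an assumption here.
from neuronx_distributed.parallel_layers.layer_norm import LayerNorm

hidden_size = 6144  # GPT-NeoX 20B hidden size

# GPTNeoXAttention / GPTNeoXMLP: shard the projections and keep the
# activations split along the sequence dimension between them.
query_key_value = ColumnParallelLinear(
    hidden_size, 3 * hidden_size,
    gather_output=False,
    sequence_parallel_enabled=True,
)
dense = RowParallelLinear(
    hidden_size, hidden_size,
    input_is_parallel=True,
    sequence_parallel_enabled=True,
)

# GPTNeoXLayer / GPTNeoXModel: torch.nn.LayerNorm is replaced with the
# neuronx-distributed LayerNorm so its input can stay sequence-sharded.
input_layernorm = LayerNorm(hidden_size, eps=1e-5, sequence_parallel_enabled=True)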

At the training script level, we enable:

  • all-reduce sequence parallel gradients at the gradient accumulation boundary.

Please check modeling_gpt_neox_nxd.py and tp_dp_gpt_neox_20b_hf_pretrain.py for details.
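A hedged sketch of that gradient all-reduce step is shown below. The helper name is hypothetical and the real logic lives in tp_dp_gpt_neox_20b_hf_pretrain.py, but the idea is to all-reduce, across the tensor-parallel group, the gradients of parameters that belong to sequence-parallel layers (for example the LayerNorms):

# Hypothetical helper illustrating the all-reduce of sequence-parallel
# gradients at the gradient accumulation boundary; see
# tp_dp_gpt_neox_20b_hf_pretrain.py for the actual implementation.
import torch
import torch.distributed as dist
from neuronx_distributed.parallel_layers import parallel_state

def allreduce_sequence_parallel_grads(model: torch.nn.Module) -> None:
    tp_group = parallel_state.get_tensor_model_parallel_group()
    for param in model.parameters():
        # Assumption: sequence-parallel layers mark their parameters with a
        # sequence_parallel_enabled attribute. Their gradients are partial
        # sums over the sequence shards and must be reduced across TP ranks.
        if getattr(param, "sequence_parallel_enabled", False) and param.grad is not None:
            dist.all_reduce(param.grad, group=tp_group)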

Parallel Cross Entropy

To enable parallel cross entropy, we made the following model-level modifications:

  • replace the CrossEntropyLoss with neuronx-distributed parallel_cross_entropy in the forward function of GPTNeoXForCausalLM.

  • use ColumnParallelLinear for the embed_out layer in GPTNeoXForCausalLM.

Please check modeling_gpt_neox_nxd.py for details.
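
A minimal sketch of these two changes follows. It assumes parallel_cross_entropy is importable from neuronx_distributed.parallel_layers.loss_functions, that the model-parallel state is already initialized, and uses assumed model sizes; check modeling_gpt_neox_nxd.py for the exact imports and shapes.

# Minimal sketch: parallel LM head + parallel cross entropy
# (assumed import path; see modeling_gpt_neox_nxd.py for the real code).
import torch
from neuronx_distributed.parallel_layers import ColumnParallelLinear
from neuronx_distributed.parallel_layers.loss_functions import parallel_cross_entropy

hidden_size, vocab_size = 6144, 50432  # GPT-NeoX 20B sizes (assumed here)

# embed_out keeps its output sharded over the vocabulary dimension
# (gather_output=False), so the loss is computed on the sharded logits.
embed_out = ColumnParallelLinear(hidden_size, vocab_size, bias=False, gather_output=False)

def causal_lm_loss(hidden_states: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    logits = embed_out(hidden_states)              # [batch, seq, vocab / tp]
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    # parallel_cross_entropy returns a per-token loss computed without
    # gathering the full vocabulary on any single worker.
    return parallel_cross_entropy(shift_logits, shift_labels).mean()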

This document is relevant for: Inf2, Trn1, Trn1n