This document is relevant for: Inf2, Trn1, Trn2
Training GPT-NeoX 20B with Tensor Parallelism and ZeRO-1 Optimizer
In this section, we showcase how to pretrain a GPT-NeoX 20B model using the sequence parallel optimization
of tensor parallelism in the neuronx-distributed package. Please refer to the Neuron Samples repository to view the files used in this tutorial.
This GPT-NeoX 20B tutorial differs from the GPT-NeoX 6.9B tutorial in the following ways:
sequence parallel optimization has been applied
parallel cross entropy has been applied
the model size has been increased from 6.9B to 20B
the TP degree has been increased from 8 to 32
Setting up the environment is the same as for the GPT-NeoX 6.9B tutorial.
Let’s download the scripts for pretraining:
cd ~/neuronx-distributed/examples/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain/
ln -sf ~/neuronx-distributed/examples/training/tp_dp_gpt_neox_hf_pretrain/common/adamw_fp32_optim_params.py ./
ln -sf ~/neuronx-distributed/examples/training/tp_dp_gpt_neox_hf_pretrain/common/get_dataset.py ./
ln -sf ~/neuronx-distributed/examples/training/tp_dp_gpt_neox_hf_pretrain/common/requirements.txt ./
python3 -m pip install -r requirements.txt
Next let’s download and pre-process the dataset:
python3 get_dataset.py
At this point, you are all set to start training.
Running training
We first pre-compile the graphs using neuron_parallel_compile.
Let’s run the command below:
sbatch --exclusive \
--nodes 4 \
--cpus-per-task 128 \
--wrap="srun neuron_parallel_compile bash $(pwd)/tp_dp_gpt_neox_20b_hf_pretrain.sh"
This script uses a tensor-parallel size of 32.
This automatically sets the ZeRO-1 sharding degree to 4 (4 nodes * 32 workers per node / tensor_parallel_size).
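As a quick sanity check, the sharding degree follows directly from the job geometry. The snippet below is a minimal sketch of that arithmetic only; the variable names are illustrative and do not come from the training script, and it assumes 32 workers per node (e.g. a trn1.32xlarge node).
# Sketch of how the ZeRO-1 (data-parallel) sharding degree is derived.
# Variable names are illustrative, not taken from tp_dp_gpt_neox_20b_hf_pretrain.sh.
nodes = 4                    # matches --nodes 4 in the sbatch command
workers_per_node = 32        # assumption: 32 workers per node (trn1.32xlarge)
tensor_parallel_size = 32    # TP degree used by this tutorial

world_size = nodes * workers_per_node                        # 128 workers in total
zero1_sharding_degree = world_size // tensor_parallel_size   # 128 / 32 = 4
print(zero1_sharding_degree)                                 # prints 4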
Once the graphs are compiled, we can run the training and observe the loss go down.
To run the training, we use the same command as above, but without neuron_parallel_compile.
sbatch --exclusive \
--nodes 4 \
--cpus-per-task 128 \
--wrap="srun bash $(pwd)/tp_dp_gpt_neox_20b_hf_pretrain.sh"
Sequence Parallel
We made the following model-level modifications to enable sequence parallelism:
turn on sequence_parallel_enabled of ColumnParallelLinear and RowParallelLinear in GPTNeoXAttention and GPTNeoXMLP;
replace torch LayerNorm in GPTNeoXLayer and GPTNeoXModel with the neuronx-distributed LayerNorm with sequence_parallel_enabled turned on;
dimension transposition of intermediate states in the forward function of GPTNeoXAttention;
dimension transposition and collective communication of intermediate states in the forward function of GPTNeoXModel.
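To illustrate the first two items, here is a minimal sketch of how such parallel layers could be constructed. It is not an excerpt from modeling_gpt_neox_nxd.py; the import paths and layer arguments reflect the neuronx-distributed parallel_layers API as we understand it, and the code assumes the model-parallel groups have already been initialized.
# Illustrative sketch only, not an excerpt from modeling_gpt_neox_nxd.py.
# Assumes neuronx_distributed.parallel_layers exposes ColumnParallelLinear,
# RowParallelLinear and LayerNorm, and that initialize_model_parallel() has
# already been called to set up the tensor-parallel groups.
from neuronx_distributed.parallel_layers import ColumnParallelLinear, RowParallelLinear
from neuronx_distributed.parallel_layers.layer_norm import LayerNorm

hidden_size, intermediate_size = 6144, 24576   # GPT-NeoX 20B dimensions

# GPTNeoXMLP-style projections with sequence parallelism turned on.
dense_h_to_4h = ColumnParallelLinear(
    hidden_size, intermediate_size,
    gather_output=False,
    sequence_parallel_enabled=True,
)
dense_4h_to_h = RowParallelLinear(
    intermediate_size, hidden_size,
    input_is_parallel=True,
    sequence_parallel_enabled=True,
)

# neuronx-distributed LayerNorm replacing torch.nn.LayerNorm so that the
# normalization also runs on the sequence-parallel layout.
input_layernorm = LayerNorm(hidden_size, sequence_parallel_enabled=True)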
At the training script level, we enable:
all-reduce sequence parallel gradients at the gradient accumulation boundary.
Please check modeling_gpt_neox_nxd.py and tp_dp_gpt_neox_20b_hf_pretrain.py for details.
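On the training-script side, the gradients of sequence-parallel parameters (for example the LayerNorm weights) are partial per tensor-parallel rank and need an all-reduce over the tensor-parallel group before the optimizer step. The sketch below shows that pattern under two assumptions: that sequence-parallel parameters carry a sequence_parallel_enabled attribute (an illustrative tag, not necessarily the attribute the script uses), and that parallel_state exposes the tensor-parallel process group.
# Sketch of all-reducing sequence parallel gradients at the gradient
# accumulation boundary. The sequence_parallel_enabled parameter attribute is
# an assumption for illustration; see tp_dp_gpt_neox_20b_hf_pretrain.py for
# the actual implementation.
import torch
from neuronx_distributed.parallel_layers import parallel_state

def allreduce_sequence_parallel_gradients(model):
    grads = [
        p.grad
        for p in model.parameters()
        if p.grad is not None and getattr(p, "sequence_parallel_enabled", False)
    ]
    for grad in grads:
        torch.distributed.all_reduce(
            grad, group=parallel_state.get_tensor_model_parallel_group()
        )

# Called once per optimizer step, i.e. at the gradient accumulation boundary,
# right before optimizer.step().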
Parallel Cross Entropy
To enable parallel cross entropy, we made the following model-level modifications:
replace the CrossEntropyLoss with the neuronx-distributed parallel_cross_entropy in the forward function of GPTNeoXForCausalLM;
use ColumnParallelLinear for the embed_out layer in GPTNeoXForCausalLM.
Please check modeling_gpt_neox_nxd.py for details.
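As an illustration, a minimal sketch of the loss computation with parallel_cross_entropy could look like the following. Treat it as a sketch rather than an excerpt from modeling_gpt_neox_nxd.py: the import path, the shift-and-reshape logic, and the final mean reduction are assumptions on our part.
# Illustrative sketch, not an excerpt from modeling_gpt_neox_nxd.py.
# parallel_cross_entropy consumes logits that are still sharded along the
# vocabulary dimension (the output of a ColumnParallelLinear embed_out with
# gather_output=False), so the full vocabulary never needs to be gathered.
from neuronx_distributed.parallel_layers.loss_functions import parallel_cross_entropy

def causal_lm_loss(lm_logits, labels):
    # Standard causal LM shift: predict token t+1 from position t.
    shift_logits = lm_logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    # Assumption: parallel_cross_entropy returns a per-token loss tensor,
    # which we reduce with a mean, as with the usual CrossEntropyLoss.
    loss = parallel_cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
    return loss.mean()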
This document is relevant for: Inf2, Trn1, Trn2