HuggingFace Llama3.1/Llama3-70B Pretraining#
In this example, we will compile and train a HuggingFace Llama3.1/Llama3-70B model on multiple trn1 or newly launched trn2 instances using ParallelCluster with the NxD Training (NxDT) library.
The example has the following main sections: setting up the environment, downloading the dataset, pre-compiling the model, training the model, monitoring training, and continual pre-training with downloaded Meta model weights.
Setting up the environment#
ParallelCluster Setup#
In this example, we will use 16 trn1.32xlarge instances or 8 trn2.48xlarge instances with ParallelCluster. Please follow the instructions here to create a cluster: Train your model on ParallelCluster
ParallelCluster automates the creation of Trainium clusters and provides the Slurm job management system for scheduling and managing distributed training jobs. Please note that the home directory on your ParallelCluster head node is shared with all of the worker nodes via NFS.
Install Dependencies#
Once you have launched ParallelCluster, please follow this guide on how to install the latest Neuron packages: PyTorch Neuron Setup Guide.
Next, we will need to install NxDT and its dependencies. Please see the following installation guide: NxDT Installation Guide
Download the dataset#
Let’s download the training data script for our experiments:
wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/master/examples/training/llama/get_dataset.py
Then download the config.json file:
For Llama-3.1-70B:
wget -P ~/ https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/master/examples/training/llama/tp_pp_llama_hf_pretrain/70B_config_llama3.1/config.json
For Llama-3-70B:
wget -P ~/ https://raw.githubusercontent.com/aws-neuron/neuronx-distributed/master/examples/training/llama/tp_pp_llama_hf_pretrain/70B_config_llama3/config.json
To tokenize the data, we must request the tokenizer from Hugging Face and Meta by following the instructions at the following link: HuggingFace Llama 3.1 70B Model.
Use of the Llama models is governed by the Meta license. In order to download the model weights and tokenizer, please visit the above website and accept the license before requesting access. After access has been granted, you may use the following python3 script along with your own Hugging Face token to download and save the tokenizer.
from huggingface_hub import login
from transformers import AutoTokenizer
login(token='your_own_hugging_face_token')
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-70B')
# For llama3 uncomment line below
# tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-70B')
tokenizer.save_pretrained(".")
For Llama3.1/Llama3, make sure your base directory has the following files:
'./tokenizer_config.json', './special_tokens_map.json', './tokenizer.json'
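To confirm the tokenizer was saved correctly, you can optionally load it back from the base directory and encode a short sample string. This is a minimal sketch that only relies on the files listed above:
import os
from transformers import AutoTokenizer

# Check that the expected tokenizer files are present in the base directory
expected = ["tokenizer_config.json", "special_tokens_map.json", "tokenizer.json"]
missing = [f for f in expected if not os.path.exists(f)]
assert not missing, f"Missing tokenizer files: {missing}"

# Load the saved tokenizer and encode a sample string
tokenizer = AutoTokenizer.from_pretrained(".")
print(tokenizer.encode("Hello, Trainium!"))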
Next, let’s download and pre-process the dataset:
mkdir ~/examples_datasets/
python3 get_dataset.py --llama-version 3
Note: In case you see an error of the following form when downloading data: huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name'. Use `repo_type` argument if needed.
This could be because of a stale cache. Try deleting the cache using:
sudo rm -rf ~/.cache/
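Once the script completes successfully, you can optionally sanity-check what was produced. The exact output layout depends on the get_dataset.py version, so this sketch simply walks ~/examples_datasets/ and lists the files it finds with their sizes:
import os

# Optional sanity check: list whatever get_dataset.py wrote under ~/examples_datasets/
data_root = os.path.expanduser("~/examples_datasets")
for dirpath, _, filenames in os.walk(data_root):
    for name in filenames:
        path = os.path.join(dirpath, name)
        print(f"{os.path.getsize(path) / 1e6:10.1f} MB  {path}")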
Pre-compile the model#
By default, PyTorch Neuron uses a just-in-time (JIT) compilation flow that sequentially compiles all of the neural network compute graphs as they are encountered during a training job. The compiled graphs are cached in a local compiler cache so that subsequent training jobs can leverage the compiled graphs and avoid compilation (so long as the graph signatures and Neuron version have not changed).
An alternative to the JIT flow is to use the included neuron_parallel_compile command to perform ahead-of-time (AOT) compilation. In the AOT compilation flow, the compute graphs are first identified and extracted during a short simulated training run, and the extracted graphs are then compiled and cached using parallel compilation, which is considerably faster than the JIT flow.
First, clone the open-source neuronx-distributed-training library:
git clone https://github.com/aws-neuron/neuronx-distributed-training
cd neuronx-distributed-training/examples
Now, ensure that you are using the proper config file in the conf/ directory. In the train.sh file, ensure that the CONF_FILE variable is properly set to the config for the model you want to use. In our case, it will be hf_llama3_70B_config.yaml for training on a trn1 cluster, and hf_llama3_70B_trn2_config.yaml for trn2.
In this tutorial, we will train the Llama3-70B model on multiple compute nodes. For training on trn1, please make sure hf_llama3_70B_config has the right configuration:
trainer:
  devices: 32
  num_nodes: 16
For pretraining on trn2, hf_llama3_70B_trn2_config would contain:
trainer:
  devices: 64
  lnc: 2 # default for trn2 workloads
  num_nodes: 8
On trn2 instances, the configuration lnc: 2 indicates that two physical NeuronCores are grouped into one logical NeuronCore (LNC), so each node exposes 64 logical devices. The other supported configuration is lnc: 1, in which case each node exposes 128 logical devices.
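As a quick sanity check, the per-node devices value in the config should equal the number of physical NeuronCores on the instance divided by the lnc setting, and the training world size is devices multiplied by num_nodes. A minimal arithmetic sketch, assuming 128 physical NeuronCores per trn2.48xlarge node (consistent with lnc: 1 exposing 128 logical devices):
# Minimal arithmetic sketch for the trn2 configuration above.
physical_cores_per_node = 128  # assumption: physical NeuronCores per trn2.48xlarge node
lnc = 2                        # logical-to-physical grouping factor from the config
num_nodes = 8

devices_per_node = physical_cores_per_node // lnc  # 64, matching "devices: 64"
world_size = devices_per_node * num_nodes          # 512 workers for the 8-node trn2 job
print(devices_per_node, world_size)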
The default config here is a 70B parameter model, but users can also add their own conf/*.yaml files and run different configs and hyperparameters if desired. Please see Config Overview for examples and usage of the .yaml config files.
On the trn1 cluster, run the following commands to launch an AOT pre-compilation job:
export COMPILE=1
export CONF_FILE=hf_llama3_70B_config
sbatch --exclusive \
--nodes 16 \
--cpus-per-task 128 \
--wrap="srun ./train.sh"
On the trn2 cluster, run the following:
export COMPILE=1
export CONF_FILE=hf_llama3_70B_trn2_config
sbatch --exclusive \
--nodes 8 \
--cpus-per-task 128 \
--wrap="srun ./train.sh"
Once you have launched the precompilation job, run the squeue command to view the Slurm job queue on your cluster. If you have not recently run a job on your cluster, it may take 4-5 minutes for the requested trn1.32xlarge or trn2.48xlarge nodes to be launched and initialized. Once the job is running, squeue should show output similar to the following:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
7 compute1 wrap ubuntu R 5:11 16 compute1-st-queue1-i1-[1-16]
You can view the output of the precompilation job by examining the file named slurm-ZZ.out, where ZZ represents the JOBID of your job in the squeue output above.
tail -f slurm-7.out
Once the precompilation job is complete, you should see a message similar to the following in the logs:
2024-11-07 09:57:13.000144: 39810 INFO ||NEURON_PARALLEL_COMPILE||: Total graphs: 36
2024-11-07 09:57:13.000144: 39810 INFO ||NEURON_PARALLEL_COMPILE||: Total successful compilations: 36
2024-11-07 09:57:13.000144: 39810 INFO ||NEURON_PARALLEL_COMPILE||: Total failed compilations: 0
At this point, you can press CTRL-C to exit the tail command.
Note
The number of graphs will differ based on package versions, models, and other factors. This is just an example.
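If you want to confirm what was cached, you can inspect the Neuron compiler cache. This is an optional sketch; /var/tmp/neuron-compile-cache is the usual default location, but your cluster may redirect the cache to a shared filesystem, so adjust the path accordingly:
import os

# Count compiled graph artifacts (NEFF files) in the Neuron compiler cache.
# Assumption: the cache lives at the default /var/tmp/neuron-compile-cache on this node.
cache_dir = "/var/tmp/neuron-compile-cache"
neffs = []
for dirpath, _, filenames in os.walk(cache_dir):
    neffs += [os.path.join(dirpath, f) for f in filenames if f.endswith(".neff")]
print(f"Found {len(neffs)} compiled graphs under {cache_dir}")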
Training the model#
You can launch the pre-training job similarly to the compilation job, using the same training script but with the COMPILE environment variable turned off.
On trn1 ParallelCluster:
export COMPILE=0
export CONF_FILE=hf_llama3_70B_config
sbatch --exclusive \
--nodes 16 \
--cpus-per-task 128 \
--wrap="srun ./train.sh"
On trn2 ParallelCluster:
export COMPILE=0
export CONF_FILE=hf_llama3_70B_trn2_config
sbatch --exclusive \
--nodes 8 \
--cpus-per-task 128 \
--wrap="srun ./train.sh"
As outlined above, you can again use the squeue command to view the job queue, and monitor the job in the same way with the tail command to see the training logs.
Once the model is loaded onto the Trainium accelerators and training has commenced,
you will begin to see output indicating the job progress:
Example:
Epoch 0: 3%|▎ | 3/91 [16:05<7:52:06, 321.89s/it, loss=6.7, v_num=2, reduced_train_loss=13.40, lr=7.5e-9, parameter_norm=5536.0, global_step=1.000, consumed_samples=2048.0]
Epoch 0: 3%|▎ | 3/91 [16:05<7:52:06, 321.89s/it, loss=4.47, v_num=2, reduced_train_loss=13.40, lr=7.5e-9, parameter_norm=5536.0, global_step=2.000, consumed_samples=3072.0]
Epoch 0: 4%|▍ | 4/91 [21:20<7:44:18, 320.22s/it, loss=4.47, v_num=2, reduced_train_loss=13.40, lr=7.5e-9, parameter_norm=5536.0, global_step=2.000, consumed_samples=3072.0]
Epoch 0: 4%|▍ | 4/91 [21:20<7:44:18, 320.22s/it, loss=3.35, v_num=2, reduced_train_loss=13.40, lr=7.5e-9, parameter_norm=5536.0, global_step=3.000, consumed_samples=4096.0]
Note
The convergence shown here is for demonstration purposes only and will differ based on instance type, model, and other factors.
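If you want a quick look at the loss trend without TensorBoard (covered in the Monitoring Training section below), you can scrape the reduced_train_loss values out of the Slurm log. A minimal sketch of a hypothetical helper, assuming the log lines keep the key=value format shown above:
import re
import sys

# Hypothetical helper (scrape_loss.py): extract (global_step, reduced_train_loss) pairs.
# Usage: python3 scrape_loss.py slurm-7.out
pattern = re.compile(r"reduced_train_loss=([\d.]+).*?global_step=([\d.]+)")
with open(sys.argv[1]) as f:
    for line in f:
        match = pattern.search(line)
        if match:
            print(int(float(match.group(2))), float(match.group(1)))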
Monitoring Training#
Tensorboard monitoring#
In addition to the text-based job monitoring described in the previous section,
you can also use tools such as TensorBoard to monitor training job progress.
To view an ongoing training job in TensorBoard, you first need to identify the
experiment directory associated with your ongoing job.
This will typically be the most recently created directory under ~/neuronx-distributed-training/examples/nemo_experiments/hf_llama/.
Once you have identified the directory, cd into it, and then launch TensorBoard:
cd ~/neuronx-distributed-training/examples/nemo_experiments/hf_llama/8/
tensorboard --logdir ./
With TensorBoard running, you can then view the TensorBoard dashboard by browsing to http://localhost:6006 on your local machine. If you cannot access TensorBoard at this address, please make sure that you have port-forwarded TCP port 6006 when SSH'ing into the head node:
ssh -i YOUR_KEY.pem ubuntu@HEAD_NODE_IP_ADDRESS -L 6006:127.0.0.1:6006
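If you prefer to read the metrics programmatically instead of through the dashboard, the event files in the experiment directory can be loaded with TensorBoard's EventAccumulator API. A minimal, optional sketch; the scalar tag names depend on the NxDT logger version, so print them first:
import os
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Point this at your experiment directory; the version subdirectory ("8" here) will differ per run.
log_dir = os.path.expanduser("~/neuronx-distributed-training/examples/nemo_experiments/hf_llama/8/")
acc = EventAccumulator(log_dir)
acc.Reload()
print(acc.Tags()["scalars"])  # list the available scalar tags, then read one with acc.Scalars(tag)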
neuron-top / neuron-monitor / neuron-ls#
The neuron-top tool can be used to view useful information about NeuronCore utilization, vCPU and RAM utilization, and loaded graphs on a per-node basis. To use neuron-top during an ongoing training job, SSH into one of the active compute nodes and run neuron-top:
ssh compute1-st-queue1-i1-1 # to determine which compute nodes are in use, run the squeue command
neuron-top
Similarly, once you are logged into one of the active compute nodes, you can also use other Neuron tools such as neuron-monitor and neuron-ls to capture performance and utilization statistics and to understand NeuronCore allocation.
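For scripted monitoring, neuron-monitor emits periodic JSON reports on stdout that can be piped into your own tooling. Below is a hedged sketch that only prints the top-level keys of each report; it assumes each report arrives as a single JSON line, and the exact report schema depends on your neuron-monitor version and configuration.
import json
import sys

# Hypothetical helper (peek_monitor.py): run on a compute node as
#   neuron-monitor | python3 peek_monitor.py
# Assumes one JSON report per line on stdin; skips anything that does not parse.
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        report = json.loads(line)
    except json.JSONDecodeError:
        continue
    print(sorted(report.keys()))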
Continual Pre-training with Downloaded Meta Model Weights#
If you want to perform continual pre-training using the model weights provided by Meta, follow these steps:
Ensure you have the config.json file, which should have been downloaded as described in the Download the dataset section.
Download the model and convert the state_dict to NxDT checkpoint format#
Get the conversion scripts described in the Checkpoint Conversion guide.
Use the hf_model_name argument to specify the HuggingFace model identifier of the model you want to download and convert to NxDT format.
Run the following to download the model and convert the state_dict to an NxDT sharded checkpoint.
On a trn1 cluster:
python3 ./checkpoint_converter_scripts/checkpoint_converter.py \
--model_style hf \
--hf_model_name meta-llama/Meta-Llama-3-70B \
--hw_backend trn1 \
--tp_size 32 --pp_size 8 --n_layers 80 \
--output_dir /fsx/pretrained_weight/ \
--convert_from_full_state --save_xser True \
--kv_size_multiplier 4 --qkv_linear True \
--config ~/config.json
On a trn2 cluster:
python3 ./checkpoint_converter_scripts/checkpoint_converter.py \
--model_style hf \
--hf_model_name meta-llama/Meta-Llama-3-70B \
--hw_backend trn2 \
--tp_size 32 --pp_size 4 --n_layers 80 \
--output_dir /fsx/pretrained_weight/ \
--convert_from_full_state --save_xser True \
--kv_size_multiplier 4 --qkv_linear True \
--config ~/config.json
Note
This conversion process requires a large amount of host memory. Please run it on a trn1.32xlarge or trn2.48xlarge compute node. In this example, the converted model is stored on FSx for Lustre so that it can be accessed by all compute nodes.
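Before launching the training job, you can optionally verify that the converter wrote sharded checkpoint files to the output directory. This sketch makes no assumptions about the exact directory layout the converter produces; it simply reports the file count and total size:
import os

# Optional check that the conversion produced output under /fsx/pretrained_weight/
ckpt_root = "/fsx/pretrained_weight"
num_files, total_bytes = 0, 0
for dirpath, _, filenames in os.walk(ckpt_root):
    for name in filenames:
        num_files += 1
        total_bytes += os.path.getsize(os.path.join(dirpath, name))
print(f"{num_files} files, {total_bytes / 1e9:.1f} GB under {ckpt_root}")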
Start the continual training job by loading converted checkpoints#
In order to start the continual training job with this converted model loaded as the initial weights, please update the config file (hf_llama3_70B_config.yaml or hf_llama3_70B_trn2_config.yaml) as shown below:
exp_manager:
  ...
  resume_from_checkpoint: /fsx/pretrained_weight/ # manually set the checkpoint file to load from
  ...
model:
  # Miscellaneous
  use_cpu_initialization: False # Init weights on the CPU (slow for large models)
  weight_init_only: True
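Before submitting the job, you can optionally validate that the edited config points at an existing checkpoint directory. A minimal sketch using PyYAML; it assumes the keys shown in the snippet above and that the YAML file contains no custom tags (adjust the file name for your cluster):
import os
import yaml

# Optional validation of the edited config; assumes the keys shown in the snippet above.
cfg_path = "conf/hf_llama3_70B_config.yaml"  # or conf/hf_llama3_70B_trn2_config.yaml on trn2
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

ckpt = cfg["exp_manager"]["resume_from_checkpoint"]
assert os.path.isdir(ckpt), f"Checkpoint path does not exist: {ckpt}"
assert cfg["model"]["weight_init_only"] is True, "weight_init_only should be True for continual pre-training"
print(f"Will resume from {ckpt}")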
Compared to the initial pre-training loss values, you should see lower loss values when training starts from Meta's model weights. Logs from one such sample run are shown below:
Epoch 0: 3%|▎ | 3/91 [16:09<7:53:59, 323.17s/it, loss=0.834, v_num=7, reduced_train_loss=1.670, lr=7.5e-9, parameter_norm=4736.0, global_step=1.000, consumed_samples=2048.0]
Epoch 0: 3%|▎ | 3/91 [16:09<7:53:59, 323.17s/it, loss=0.556, v_num=7, reduced_train_loss=1.670, lr=7.5e-9, parameter_norm=4736.0, global_step=2.000, consumed_samples=3072.0]
Epoch 0: 4%|▍ | 4/91 [21:25<7:46:02, 321.41s/it, loss=0.556, v_num=7, reduced_train_loss=1.670, lr=7.5e-9, parameter_norm=4736.0, global_step=2.000, consumed_samples=3072.0]
Epoch 0: 4%|▍ | 4/91 [21:25<7:46:02, 321.41s/it, loss=0.417, v_num=7, reduced_train_loss=1.670, lr=7.5e-9, parameter_norm=4736.0, global_step=3.000, consumed_samples=4096.0]
Troubleshooting Guide#
For issues with NxDT, please see: NxDT Known Issues