HuggingFace Llama3-8B Direct Preference Optimization (DPO) based Fine-tuning#
In this example, we will compile and fine-tune a pre-trained
HF Llama3-8B model on a single instance with the NxD Training (NxDT)
library
using Direct Preference Optimization (DPO).
The pre-trained Llama3-8B model serves as the foundation, and we
build on this base by fine-tuning and aligning the model to adapt
it to a specific task or dataset.
The example walks through setting up the environment, configuring DPO through a YAML file, preparing the dataset and pretrained checkpoint, pre-compiling the model, running the training job, and monitoring its progress.
Setting up the environment#
Install Dependencies#
Once you have launched a Trn1 instance, please follow this guide to install the latest Neuron packages: PyTorch Neuron Setup Guide.
Next, we need to install NxDT and its dependencies.
Please see the following installation guide for installing NxDT:
NxDT Installation Guide.
Then download requirements_dpo.txt
and install the DPO-specific dependencies with:
wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed-training/master/requirements_dpo.txt
pip install -r requirements_dpo.txt
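Optionally, you can confirm that the Neuron PyTorch packages import cleanly before proceeding. The snippet below is a quick sanity check; the package names assume a standard torch-neuronx installation:
import torch
import torch_neuronx  # provided by the torch-neuronx package

# If both imports succeed, the PyTorch Neuron stack is available in this environment.
print("torch version:", torch.__version__)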
DPO-YAML Configuration Overview#
You can configure a variety of DPO-specific and model parameters for finetuning through the YAML file.
exp_manager:
  resume_from_checkpoint: /pretrained_ckpt
data:
  train_dir: /example_datasets/llama3_8b/training.jsonl
  val_dir: null
  dev_choose_samples: 2250
  seq_length: 4096
  alignment_strategy:
    dpo:
      kl_beta: 0.01
      loss_type: sigmoid
      max_dpo_prompt_length: 2048
      precompute_ref_log_probs: True
      truncation_mode: keep_start
tokenizer:
  type: /llama3_tokenizer
model:
  weight_init_only: True
- exp_manager
  - resume_from_checkpoint
    - Manually sets the checkpoint file (pretrained or post-SFT checkpoint) to load from.
    - Type: str
    - Default: /pretrained_ckpt
    - Required: True (start with a pretrained checkpoint)
- data
  - train_dir
    - DPO training data, as a jsonl or arrow file. Because DPO uses the HF-style ModelAlignment dataloader, HF-style data file paths are used as well.
    - Type: str
    - Required: True
  - val_dir
    - DPO validation data, as a jsonl or arrow file. Because DPO uses the HF-style ModelAlignment dataloader, HF-style data file paths are used as well.
    - Type: str
    - Required: False
  - dev_choose_samples
    - If set, only that many records from the head of the dataset are used instead of the full dataset. Set to null to use the full dataset.
    - Type: integer
    - Default: null
    - Required: False
  - seq_length
    - Sequence length for the training job. For DPO, this is the total sequence length of the prompt and the (chosen/rejected) response concatenated together.
    - Type: integer
    - Required: True
  - alignment_strategy
    - Set only when using fine-tuning-specific alignment algorithms (SFT, DPO, etc.) and their related hyperparameters. The parameters below are DPO-specific.
    - dpo
      - kl_beta
        - KL-divergence beta that controls how far the policy model may diverge from the reference model (see the illustrative loss sketch after this list).
        - Type: float
        - Default: 0.01
        - Required: True
      - loss_type
        - Currently only the sigmoid version of the DPO loss is supported.
        - Type: str
        - Default: sigmoid
        - Required: True
      - max_dpo_prompt_length
        - Maximum length of the prompt within the concatenated prompt and (chosen/rejected) response input.
        - Type: integer
        - Required: True
      - precompute_ref_log_probs
        - Enables precomputation of the reference model log probabilities using a pre-fit hook. False is not currently supported.
        - Type: bool
        - Required: True
      - truncation_mode
        - Defines how to truncate when the size of (prompt + response) exceeds seq_length. Options: ["keep_start", "keep_end"]
        - Type: str
        - Default: keep_start
        - Required: True
- tokenizer
  - type
    - Tokenizer path/type.
    - Type: str
    - Default: /llama3_tokenizer
    - Required: True
- model
  - weight_init_only
    - Load only the model states and ignore the optimizer states from the checkpoint directory.
    - Type: bool
    - Default: True
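The kl_beta and loss_type parameters correspond to the standard sigmoid (Bradley-Terry) DPO objective: the policy is rewarded for increasing its log-probability margin of the chosen response over the rejected one, relative to the reference model, with kl_beta scaling that margin. The snippet below is only an illustrative sketch of this loss, not NxDT's implementation; the function name and arguments are hypothetical:
import torch
import torch.nn.functional as F

def dpo_sigmoid_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, kl_beta=0.01):
    # Log-ratios of the policy vs. the reference model for chosen and rejected responses
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Sigmoid DPO loss: kl_beta scales the reward margin and thereby controls
    # how strongly the policy is allowed to diverge from the reference model
    logits = kl_beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Example with per-sequence summed log-probabilities for a batch of two preference pairs
loss = dpo_sigmoid_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-15.0, -14.0]),
                        torch.tensor([-13.0, -10.0]), torch.tensor([-14.0, -13.5]))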
Download the dataset#
This tutorial makes use of a preprocessed version of the Intel orca_dpo_pairs preference dataset, stored in S3. The dataset can be downloaded to your cluster or instance by running the following AWS CLI commands on the head node or on your Trn1 instance:
export DATA_DIR=~/examples_datasets/llama3_8b
mkdir -p ${DATA_DIR} && cd ${DATA_DIR}
aws s3 cp s3://neuron-s3/training_datasets/llama/dpo/data_dpo.jsonl . --no-sign-request
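Each line of the downloaded file is one JSON record. A quick way to confirm which fields it contains (a preprocessed preference dataset is expected to carry prompt, chosen, and rejected entries) is:
import json

# Peek at the first record of the preference dataset to confirm its fields.
with open("data_dpo.jsonl") as f:
    first_record = json.loads(f.readline())

print(list(first_record.keys()))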
Convert data to DPO-specific Preference data format#
If you directly downloaded the Intel orca_dpo_pairs dataset yourself, you need to convert the data into the preference dataset format using the script below.
Note
For different datasets with different field names, make necessary changes to the script accordingly.
from datasets import load_dataset

def preference_data_format(example):
    # Format system message
    system = "<|im_start|>\n" + example['system'] + "<|im_end|>\n"
    # Format instruction
    prompt = "<|im_start|> " + example['question'] + "<|im_end|>\n<|im_start|>assistant\n"
    # Format chosen answer
    chosen = example['chosen'] + "<|im_end|>\n"
    # Format rejected answer
    rejected = example['rejected'] + "<|im_end|>\n"
    return {
        "prompt": system + prompt,
        "chosen": chosen,
        "rejected": rejected,
    }

# This particular dataset has the following fields: "system", "question", "chosen", "rejected"
dataset = load_dataset("json", data_files="orca_rlhf.jsonl", split="train")

# Save the original column names
original_columns = dataset.column_names

# Format the dataset into prompt/chosen/rejected records
dataset = dataset.map(
    preference_data_format,
    remove_columns=original_columns
)

# Save the converted preference dataset
dataset.to_json("data_dpo.jsonl")
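Before launching training, it can also be useful to verify that the tokenized records fit within the max_dpo_prompt_length and seq_length values configured in the YAML file; otherwise truncation_mode decides which side of the sequence is cut. The following is a minimal sanity check, assuming the tokenizer files have already been copied into a local llama3_tokenizer folder (see the next section):
import json
from transformers import AutoTokenizer

# Assumes the tokenizer has been saved locally to ./llama3_tokenizer (see the next section).
tokenizer = AutoTokenizer.from_pretrained("llama3_tokenizer")

MAX_PROMPT_LEN = 2048  # max_dpo_prompt_length from the YAML config
SEQ_LEN = 4096         # seq_length from the YAML config

too_long = 0
with open("data_dpo.jsonl") as f:
    for line in f:
        record = json.loads(line)
        prompt_len = len(tokenizer(record["prompt"])["input_ids"])
        chosen_len = len(tokenizer(record["chosen"])["input_ids"])
        if prompt_len > MAX_PROMPT_LEN or prompt_len + chosen_len > SEQ_LEN:
            too_long += 1

print(f"{too_long} records exceed the configured lengths and would be truncated")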
Download pretrained model checkpoint and tokenizer#
In this tutorial, we will use a pretrained Llama3-8B checkpoint (a post-SFT checkpoint is preferred) from the original repository. Follow the steps to download the tokenizer and model checkpoint from the pretraining stage: https://llama.meta.com/llama-downloads/
Alternatively, the model checkpoint and tokenizer can also be downloaded from HuggingFace by following this guide.
You can also directly download and convert the HuggingFace model checkpoint using Direct HuggingFace Model Conversion.
Create a folder llama3_tokenizer
and copy the tokenizer contents to it.
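If you are pulling the tokenizer from HuggingFace, one simple way to populate this folder is with the transformers API. The snippet below is a minimal sketch; it assumes you have been granted access to the gated meta-llama/Meta-Llama-3-8B repository and are logged in (for example via huggingface-cli login):
from transformers import AutoTokenizer

# Download the Llama3-8B tokenizer and save it into the folder referenced by tokenizer.type.
# The repository ID below is an assumption; substitute the checkpoint you are actually using.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.save_pretrained("llama3_tokenizer")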
Modify the following paths in the YAML file based on your specific directory configuration:
- model.model_config
- exp_manager.resume_from_checkpoint
- tokenizer.type
- train_dir and val_dir
You can use your own Llama model, pretrained checkpoint, and tokenizer by modifying the hf_llama3_8B_DPO_config.yaml file.
Checkpoint Conversion#
Follow this Checkpoint Conversion Guide to convert the HF-style Llama3-8B checkpoint to the NxDT-supported format and store it in the pretrained_ckpt directory.
Then set the config parameter exp_manager.resume_from_checkpoint to the path of the converted pretrained checkpoint.
Pre-compile the model#
By default, PyTorch Neuron uses a just in time (JIT) compilation flow that sequentially compiles all of the neural network compute graphs as they are encountered during a training job. The compiled graphs are cached in a local compiler cache so that subsequent training jobs can leverage the compiled graphs and avoid compilation (so long as the graph signatures and Neuron version have not changed).
An alternative to the JIT flow is to use the included neuron_parallel_compile
command to perform ahead of time (AOT) compilation. In the AOT compilation flow,
the compute graphs are first identified and extracted during a short simulated training run,
and the extracted graphs are then compiled and cached using parallel compilation,
which is considerably faster than the JIT flow.
First, clone the open-source neuronx-distributed-training library (https://github.com/aws-neuron/neuronx-distributed-training).
Now, ensure that you are using the proper config file in the conf/ directory.
In the train.sh
file, ensure that the CONF_FILE
variable is properly
set to the config for the model you want to use. In our case,
it will be hf_llama3_8B_DPO_config.yaml. The default config here is an 8B-parameter model,
but users can also add their own conf/*.yaml
files and run different configs and
hyperparameters if desired. Please see Config Overview
for examples and usage for the .yaml
config files.
Next, run the following commands to launch an AOT pre-compilation job on your instance:
export COMPILE=1
export CONF_FILE=hf_llama3_8B_DPO_config
./train.sh
The compile output and logs will be shown directly in the terminal and you will see logs similar to this:
2024-10-24 18:49:49.000950: INFO ||NEURON_PARALLEL_COMPILE||: Total graphs: 32
2024-10-24 18:49:49.000950: INFO ||NEURON_PARALLEL_COMPILE||: Total successful compilations: 32
2024-10-24 18:49:49.000950: INFO ||NEURON_PARALLEL_COMPILE||: Total failed compilations: 0
Output like the above indicates that compilation completed successfully.
Note
The number of graphs will differ based on package versions, models, and other factors. This is just an example.
Training the model#
The fine-tuning job is launched in almost exactly the same way as the compile job.
We now turn off the COMPILE environment variable and
run the same training script to start the DPO fine-tuning run.
On a single instance:
export COMPILE=0
export CONF_FILE=hf_llama3_8B_DPO_config
./train.sh
Once the model is loaded onto the Trainium accelerators and training has commenced, you will begin to see output indicating the job progress:
Example:
Epoch 0: 0%| | 1/250 [00:20<1:26:16, 20.79s/it, loss=2.24, v_num=9-50, reduced_train_loss=2.240, global_step=0.000, consumed_samples=4.000]
Epoch 0: 1%| | 2/250 [00:38<1:18:33, 19.01s/it, loss=6.45, v_num=9-50, reduced_train_loss=2.240, global_step=1.000, consumed_samples=8.000]
Epoch 0: 1%| | 3/250 [00:45<1:02:29, 15.18s/it, loss=6.45, v_num=9-50, reduced_train_loss=2.240, global_step=1.000, consumed_samples=8.000]
Monitoring Training#
Tensorboard monitoring#
In addition to the text-based job monitoring described in the previous section,
you can also use standard tools such as TensorBoard to monitor training job progress.
To view an ongoing training job in TensorBoard, you first need to identify the
experiment directory associated with your ongoing job.
This will typically be the most recently created directory under
~/neuronx-distributed-training/examples/nemo_experiments/hf_llama3_8B/.
Once you have identified the directory, cd into it and then launch TensorBoard:
cd ~/neuronx-distributed-training/examples/nemo_experiments/hf_llama3_8B/
tensorboard --logdir ./
With TensorBoard running, you can then view the TensorBoard dashboard by browsing to
http://localhost:6006
on your local machine. If you cannot access TensorBoard at this address,
please make sure that you have port-forwarded TCP port 6006 when SSH'ing into the head node, for example:
ssh -i YOUR_KEY.pem ubuntu@HEAD_NODE_IP_ADDRESS -L 6006:127.0.0.1:6006
neuron-top / neuron-monitor / neuron-ls#
The neuron-top
tool can be used to view useful information about NeuronCore utilization, vCPU and RAM utilization,
and loaded graphs on a per-node basis. To use neuron-top during an ongoing training job, run neuron-top:
ssh compute1-dy-queue1-i1-1 # to determine which compute nodes are in use, run the squeue command
neuron-top
Similarly, once you are logged into one of the active compute nodes, you can also use other Neuron tools such as neuron-monitor and neuron-ls to capture performance and utilization statistics and to understand NeuronCore allocation.
Troubleshooting Guide#
For issues with NxDT, please see: NxDT Known Issues.