HuggingFace Llama3-8B Direct Preference Optimization (DPO) based Fine-tuning#

In this example, we will compile and fine-tune a pre-trained HF Llama3-8B model on a single instance with the NxD Training (NxDT) library using Direct Preference Optimization (DPO). The pre-trained Llama3-8B model serves as the foundation, and we will build upon this base by fine-tuning and aligning the model to adapt it to a specific task or dataset. The example covers the following steps: setting up the environment, downloading and converting the dataset, downloading and converting the pretrained checkpoint and tokenizer, pre-compiling the model, training, and monitoring.

Setting up the environment#

Install Dependencies#

Once you have launched a Trn1 instance, follow this guide to install the latest Neuron packages: PyTorch Neuron Setup Guide.

Next, we will need to install NxDT and its dependencies. Please see the following installation guide for installing NxDT: NxDT Installation Guide.

Then download requirements_dpo.txt and install the packages it lists using the command:

pip install -r requirements_dpo.txt

DPO-YAML Configuration Overview#

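You can configure a variety of DPO-specific and model parameters for fine-tuning through the YAML file. The DPO-related entries of the config are shown below, followed by a description of each parameter.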

    resume_from_checkpoint: /pretrained_ckpt

    train_dir: /example_datasets/llama3_8b/training.jsonl
    val_dir: null
    dev_choose_samples: 2250
    seq_length: 4096

            kl_beta: 0.01
            loss_type: sigmoid
            max_dpo_prompt_length: 2048
            precompute_ref_log_probs: True
            truncation_mode: keep_start

        type: /llama3_tokenizer

    weight_init_only: True


resume_from_checkpoint

Manually set the checkpoint file (pretrained or post-SFT checkpoint) to load from.

  • Type: str

  • Default: /pretrained_ckpt

  • Required: True (start with pretrained checkpoint)



train_dir

DPO training data, as a jsonl or arrow file.

Because DPO uses the HF-style ModelAlignment dataloader, the data file paths also follow the HF style.

  • Type: str

  • Required: True


val_dir

DPO validation data, as a jsonl or arrow file.

Because DPO uses the HF-style ModelAlignment dataloader, the data file paths also follow the HF style.

  • Type: str

  • Required: False


dev_choose_samples

If set, only that many records from the head of the dataset are used instead of the full dataset. Set to null to use the full dataset.

  • Type: integer

  • Default: null

  • Required: False


seq_length

Sets the sequence length for the training job. For DPO, this is the total sequence length of the prompt and the (chosen/rejected) response concatenated together.

  • Type: integer

  • Required: True


Set only when using fine-tuning-specific algorithms (SFT, DPO, etc.), together with their related hyperparameters. The DPO-specific parameters are described below.



kl_beta

KL-divergence beta that controls how far the policy model may diverge from the reference model.

  • Type: float

  • Default: 0.01

  • Required: True


loss_type

Only the sigmoid version of the optimized DPO loss is currently supported.

  • Type: str

  • Default: sigmoid

  • Required: True
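For intuition about how kl_beta and the sigmoid loss interact, the sketch below implements the standard sigmoid DPO loss for a single preference pair in plain Python. The function and argument names are illustrative and are not taken from NxDT, whose implementation may differ in detail (for example in batching and numerical handling).

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, kl_beta=0.01):
    """Standard sigmoid DPO loss for one preference pair (illustrative only)."""
    # How much more the policy favors each response than the reference model does.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # kl_beta scales the preference margin between chosen and rejected responses.
    logits = kl_beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(x)) is equal to log(1 + exp(-x)).
    return math.log1p(math.exp(-logits))
```

When the policy and reference models assign identical log probabilities, the margin is zero and the loss equals log(2). The loss falls below that value as the policy favors the chosen response more strongly than the reference model does.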


max_dpo_prompt_length

Sets the maximum length of the prompt within the concatenated prompt and (chosen/rejected) response input.

  • Type: integer

  • Required: True


precompute_ref_log_probs

Enables precomputation of the reference model log probabilities using a pre-fit hook. False is not currently supported.

  • Type: bool

  • Required: True


truncation_mode

Defines how to truncate when the size (prompt + response) exceeds seq_length. Options: keep_start, keep_end.

  • Type: str

  • Default: keep_start

  • Required: True
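The following sketch illustrates one plausible reading of how seq_length, max_dpo_prompt_length and truncation_mode interact. It is modeled on the behavior of HF-style DPO data collation: the prompt is shortened first according to truncation_mode, then the response is trimmed if needed. This is an assumption for illustration, the function name is hypothetical, and NxDT's actual truncation logic may differ.

```python
def truncate_pair(prompt_ids, response_ids, seq_length,
                  max_dpo_prompt_length, truncation_mode="keep_start"):
    """Truncate token lists so that prompt + response fits in seq_length."""
    if not 0 < max_dpo_prompt_length <= seq_length:
        raise ValueError("expected 0 < max_dpo_prompt_length <= seq_length")
    if len(prompt_ids) + len(response_ids) > seq_length:
        # First shorten the prompt according to truncation_mode.
        if truncation_mode == "keep_start":
            prompt_ids = prompt_ids[:max_dpo_prompt_length]
        elif truncation_mode == "keep_end":
            prompt_ids = prompt_ids[-max_dpo_prompt_length:]
        else:
            raise ValueError(f"unknown truncation_mode: {truncation_mode}")
    if len(prompt_ids) + len(response_ids) > seq_length:
        # Still too long: keep only the start of the response.
        response_ids = response_ids[:seq_length - len(prompt_ids)]
    return prompt_ids, response_ids
```

Under this reading, keep_start preserves the beginning of an over-long prompt while keep_end preserves the tokens closest to the response.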



type

Sets the tokenizer path/type.

  • Type: str

  • Default: /llama3_tokenizer

  • Required: True



weight_init_only

Load only the model states and ignore the optimizer states from the checkpoint directory.

  • Type: bool

  • Default: True
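The constraints described above can be checked before launching a job. The sketch below is a minimal sanity check over a flat dictionary of the DPO settings. The helper name and the flat layout are assumptions for illustration (the real YAML nests these keys), and the check that max_dpo_prompt_length is smaller than seq_length is a common-sense assumption rather than a documented NxDT rule.

```python
def check_dpo_settings(cfg):
    """Return a list of problems found in a flat dict of DPO settings."""
    problems = []
    if cfg.get("loss_type") != "sigmoid":
        problems.append("loss_type: only 'sigmoid' is currently supported")
    if cfg.get("precompute_ref_log_probs") is not True:
        problems.append("precompute_ref_log_probs: must be True")
    if cfg.get("truncation_mode") not in ("keep_start", "keep_end"):
        problems.append("truncation_mode: must be 'keep_start' or 'keep_end'")
    seq_length = cfg.get("seq_length")
    max_prompt = cfg.get("max_dpo_prompt_length")
    if not isinstance(seq_length, int) or not isinstance(max_prompt, int):
        problems.append("seq_length and max_dpo_prompt_length: must be integers")
    elif max_prompt >= seq_length:
        # Assumption: the prompt must leave room for the response.
        problems.append("max_dpo_prompt_length: should be smaller than seq_length")
    return problems

# Values taken from the example configuration above.
example = {
    "seq_length": 4096,
    "kl_beta": 0.01,
    "loss_type": "sigmoid",
    "max_dpo_prompt_length": 2048,
    "precompute_ref_log_probs": True,
    "truncation_mode": "keep_start",
}
print(check_dpo_settings(example))  # → []
```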

Download the dataset#

This tutorial makes use of a preprocessed version of intel-orca_dpo_pairs preference dataset that is stored in S3. The dataset can be downloaded to your cluster or instance by running the following AWS CLI commands on the head node or your Trn1 instance:

export DATA_DIR=~/examples_datasets/llama3_8b
mkdir -p ${DATA_DIR} && cd ${DATA_DIR}
aws s3 cp s3://neuron-s3/training_datasets/llama/dpo/data_dpo.jsonl .  --no-sign-request
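After the download, it can be useful to confirm that the file has the expected shape before training. The sketch below reads the first few JSON Lines records and checks for the prompt, chosen and rejected fields. That the preprocessed file uses exactly these field names is an assumption based on the conversion script in the next section, and the helper name is hypothetical.

```python
import json
from itertools import islice

REQUIRED_FIELDS = {"prompt", "chosen", "rejected"}

def check_preference_file(path, max_records=5):
    """Check that the first few JSONL records contain the preference fields."""
    checked = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(islice(f, max_records), start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                raise ValueError(f"line {line_no} is missing fields: {sorted(missing)}")
            checked += 1
    return checked
```

For example, calling check_preference_file on the downloaded data_dpo.jsonl returns the number of records inspected, or raises ValueError naming the first malformed record.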

Convert data to DPO-specific Preference data format#

If you directly downloaded the Intel ORCA_dpo_pairs dataset, then you need to convert the data into preference dataset format using the script below.


For different datasets with different field names, make necessary changes to the script accordingly.

from datasets import load_dataset

def preference_data_format(example):
    """Map one raw record to the prompt/chosen/rejected preference format."""
    # Format system message
    system = "<|im_start|>\n" + example['system'] + "<|im_end|>\n"

    # Format instruction
    prompt = "<|im_start|> " + example['question'] + "<|im_end|>\n<|im_start|>assistant\n"

    # Format chosen answer
    chosen = example['chosen'] + "<|im_end|>\n"

    # Format rejected answer
    rejected = example['rejected'] + "<|im_end|>\n"

    return {
        "prompt": system + prompt,
        "chosen": chosen,
        "rejected": rejected,
    }

# Particular dataset with following fields: "system", "question", "chosen", "rejected"
dataset = load_dataset("json", data_files="orca_rlhf.jsonl", split="train")

# Save columns
original_columns = dataset.column_names

# Format dataset, dropping the original columns so only the preference fields remain
dataset = dataset.map(preference_data_format, remove_columns=original_columns)

# Save converted preference dataset as JSON Lines
# (the output filename is an assumption, chosen to match the training file name used above)
dataset.to_json("data_dpo.jsonl")
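To make the target format concrete, the sketch below applies the same string formatting as the script above to a single made-up record, without needing the datasets library. The sample record is invented for illustration.

```python
def preference_data_format(example):
    """Same formatting as the conversion script, applied to one record."""
    system = "<|im_start|>\n" + example["system"] + "<|im_end|>\n"
    prompt = "<|im_start|> " + example["question"] + "<|im_end|>\n<|im_start|>assistant\n"
    return {
        "prompt": system + prompt,
        "chosen": example["chosen"] + "<|im_end|>\n",
        "rejected": example["rejected"] + "<|im_end|>\n",
    }

# Made-up record with the four fields the script expects.
sample = {
    "system": "You are a helpful assistant.",
    "question": "What is 2 + 2?",
    "chosen": "4",
    "rejected": "5",
}
formatted = preference_data_format(sample)
print(repr(formatted["prompt"]))
# → '<|im_start|>\nYou are a helpful assistant.<|im_end|>\n<|im_start|> What is 2 + 2?<|im_end|>\n<|im_start|>assistant\n'
print(repr(formatted["chosen"]))
# → '4<|im_end|>\n'
```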

Download pretrained model checkpoint and tokenizer#

In this tutorial, we will use a pretrained Llama3-8B checkpoint (post-SFT checkpoint preferred) from the original repository. Follow the steps to download tokenizer and model checkpoint from the pretraining stage:

Alternatively, the model checkpoint and tokenizer can also be downloaded from HuggingFace by following this guide

You can also directly download and convert the HuggingFace model checkpoint using Direct HuggingFace Model Conversion.

Create a folder llama3_tokenizer and copy the tokenizer contents to it.

Modify the following paths in YAML file based on your specific directory configuration:

  1. model.model_config

  2. exp_manager.resume_from_checkpoint

  3. tokenizer.type

  4. train_dir and val_dir

You can use your Llama model, pretrained checkpoint and tokenizer by modifying the hf_llama3_8B_DPO_config.yaml file.
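Since several paths must be edited by hand, a quick existence check can catch typos before a long job starts. The sketch below looks up dotted keys such as exp_manager.resume_from_checkpoint in a nested dictionary and reports paths that do not exist. The helper names are hypothetical, and how the dictionary is obtained is left open: loading the YAML with a generic parser is an assumption that may not hold if the config relies on interpolation.

```python
import os

def get_dotted(cfg, dotted_key):
    """Look up a nested key such as 'exp_manager.resume_from_checkpoint'."""
    node = cfg
    for part in dotted_key.split("."):
        node = node[part]
    return node

def missing_paths(cfg, dotted_keys):
    """Return (key, path) pairs whose configured path does not exist on disk."""
    missing = []
    for key in dotted_keys:
        path = get_dotted(cfg, key)
        # A null entry (for example val_dir: null) is skipped rather than reported.
        if path is not None and not os.path.exists(os.path.expanduser(str(path))):
            missing.append((key, path))
    return missing
```

An empty result means every listed path was found. Any reported pair points at the config entry that still needs to be updated.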

Checkpoint Conversion#

Follow this Checkpoint Conversion Guide to convert the HF-style Llama3-8B checkpoint to the NxDT-supported format and store it in the pretrained_ckpt directory. Then set the config parameter exp_manager.resume_from_checkpoint to the path of the converted pretrained checkpoint.

Pre-compile the model#

By default, PyTorch Neuron uses a just in time (JIT) compilation flow that sequentially compiles all of the neural network compute graphs as they are encountered during a training job. The compiled graphs are cached in a local compiler cache so that subsequent training jobs can leverage the compiled graphs and avoid compilation (so long as the graph signatures and Neuron version have not changed).

An alternative to the JIT flow is to use the included neuron_parallel_compile command to perform ahead of time (AOT) compilation. In the AOT compilation flow, the compute graphs are first identified and extracted during a short simulated training run, and the extracted graphs are then compiled and cached using parallel compilation, which is considerably faster than the JIT flow.

First, clone the open-source neuronx-distributed-training library

Now, ensure that you are using the proper config file in the conf/ directory. In the file, ensure that the CONF_FILE variable is set to the config for the model you want to use. In our case, it will be hf_llama3_8B_DPO_config.yaml. The default config here is an 8B parameter model, but users can also add their own conf/*.yaml files and run different configs and hyperparameters if desired. Please see Config Overview for examples and usage for the .yaml config files.

Next, run the following commands to launch an AOT pre-compilation job on your instance:

export COMPILE=1
export CONF_FILE=hf_llama3_8B_DPO_config

The compile output and logs will be shown directly in the terminal and you will see logs similar to this:

2024-10-24 18:49:49.000950: INFO ||NEURON_PARALLEL_COMPILE||: Total graphs: 32
2024-10-24 18:49:49.000950: INFO ||NEURON_PARALLEL_COMPILE||: Total successful compilations: 32
2024-10-24 18:49:49.000950: INFO ||NEURON_PARALLEL_COMPILE||: Total failed compilations: 0

These lines indicate that your compilation has completed successfully.


The number of graphs will differ based on package versions, models, and other factors. This is just an example.
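If the compile output is captured to a file, the summary can also be checked programmatically. The sketch below parses the three summary lines in the format shown above. It assumes that this log format is stable, which may not hold across Neuron versions, and the helper names are hypothetical.

```python
import re

SUMMARY_RE = re.compile(
    r"\|\|NEURON_PARALLEL_COMPILE\|\|: Total "
    r"(graphs|successful compilations|failed compilations): (\d+)"
)

def compile_summary(log_text):
    """Extract the counts from neuron_parallel_compile summary lines."""
    return {name: int(count) for name, count in SUMMARY_RE.findall(log_text)}

def compile_succeeded(log_text):
    """True if a summary is present, nothing failed, and all graphs compiled."""
    summary = compile_summary(log_text)
    return (
        summary.get("failed compilations") == 0
        and "graphs" in summary
        and summary["graphs"] == summary.get("successful compilations")
    )

# The example log lines from above.
log = """\
2024-10-24 18:49:49.000950: INFO ||NEURON_PARALLEL_COMPILE||: Total graphs: 32
2024-10-24 18:49:49.000950: INFO ||NEURON_PARALLEL_COMPILE||: Total successful compilations: 32
2024-10-24 18:49:49.000950: INFO ||NEURON_PARALLEL_COMPILE||: Total failed compilations: 0
"""
print(compile_summary(log))
# → {'graphs': 32, 'successful compilations': 32, 'failed compilations': 0}
```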

Training the model#

The fine-tuning job is launched in almost exactly the same way as the compile job. We now turn off the COMPILE environment variable and run the same training script to start fine-tuning.

On a single instance:

export COMPILE=0
export CONF_FILE=hf_llama3_8B_DPO_config

Once the model is loaded onto the Trainium accelerators and training has commenced, you will begin to see output indicating the job progress:


Epoch 0:   0%|          | 1/250 [00:20<1:26:16, 20.79s/it, loss=2.24, v_num=9-50, reduced_train_loss=2.240, global_step=0.000, consumed_samples=4.000]
Epoch 0:   1%|          | 2/250 [00:38<1:18:33, 19.01s/it, loss=6.45, v_num=9-50, reduced_train_loss=2.240, global_step=1.000, consumed_samples=8.000]
Epoch 0:   1%|          | 3/250 [00:45<1:02:29, 15.18s/it, loss=6.45, v_num=9-50, reduced_train_loss=2.240, global_step=1.000, consumed_samples=8.000]
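The progress lines can be parsed to track the loss or to estimate the remaining time. The sketch below handles lines in the format shown above. It assumes the progress bar reports seconds per iteration (s/it) and that the metric names match the example, and the helper names are hypothetical.

```python
import re

PROGRESS_RE = re.compile(r"(\d+)/(\d+) \[[^,]*, ([\d.]+)s/it, (.*)\]")

def parse_progress(line):
    """Parse one progress line; return None if the line does not match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    step, total, sec_per_it, rest = match.groups()
    # The trailing part is a comma-separated list of name=value metrics.
    metrics = dict(item.split("=", 1) for item in rest.split(", "))
    return {
        "step": int(step),
        "total_steps": int(total),
        "sec_per_it": float(sec_per_it),
        "loss": float(metrics["loss"]),
        "consumed_samples": float(metrics["consumed_samples"]),
    }

def remaining_seconds(progress):
    """Rough estimate of remaining time from the current step rate."""
    return (progress["total_steps"] - progress["step"]) * progress["sec_per_it"]

line = ("Epoch 0:   0%|          | 1/250 [00:20<1:26:16, 20.79s/it, loss=2.24, "
        "v_num=9-50, reduced_train_loss=2.240, global_step=0.000, consumed_samples=4.000]")
progress = parse_progress(line)
print(progress["step"], progress["total_steps"], progress["loss"])  # → 1 250 2.24
```

The estimate from remaining_seconds uses only the current rate, so it will fluctuate early in training while step times settle.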

Monitoring Training#

Tensorboard monitoring#

In addition to the text-based job monitoring described in the previous section, you can also use standard tools such as TensorBoard to monitor training job progress. To view an ongoing training job in TensorBoard, you first need to identify the experiment directory associated with your ongoing job. This will typically be the most recently created directory under ~/neuronx-distributed-training/examples/nemo_experiments/hf_llama3_8B/. Once you have identified the directory, cd into it, and then launch TensorBoard:

cd ~/neuronx-distributed-training/examples/nemo_experiments/hf_llama3_8B/
tensorboard --logdir ./
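Locating the most recent experiment directory can be scripted. The sketch below returns the most recently modified subdirectory of a base directory. It uses modification time as a stand-in for creation time, which is an approximation, and the helper name is hypothetical.

```python
import os

def latest_experiment_dir(base_dir):
    """Return the most recently modified subdirectory of base_dir, or None."""
    base_dir = os.path.expanduser(base_dir)
    subdirs = [entry.path for entry in os.scandir(base_dir) if entry.is_dir()]
    if not subdirs:
        return None
    return max(subdirs, key=os.path.getmtime)
```

Passing the nemo_experiments/hf_llama3_8B/ path from above yields a directory that can then be given to tensorboard --logdir.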

With TensorBoard running, you can then view the TensorBoard dashboard by browsing to http://localhost:6006 on your local machine. If you cannot access TensorBoard at this address, please make sure that you have port-forwarded TCP port 6006 when SSH’ing into the head node:

ssh -i YOUR_KEY.pem ubuntu@HEAD_NODE_IP_ADDRESS -L 6006:

neuron-top / neuron-monitor / neuron-ls#

The neuron-top tool can be used to view useful information about NeuronCore utilization, vCPU and RAM utilization, and loaded graphs on a per-node basis. To use neuron-top during an ongoing training job, run neuron-top:

ssh compute1-dy-queue1-i1-1  # to determine which compute nodes are in use, run the squeue command

Similarly, once you are logged into one of the active compute nodes, you can also use other Neuron tools such as neuron-monitor and neuron-ls to capture performance and utilization statistics and to understand NeuronCore allocation.

Troubleshooting Guide#

For issues with NxDT, please see: NxDT Known Issues