This document is relevant for: Trn1, Trn2
Fine-tune T5 model on Trn1#
Note
Update 01/03/24: This tutorial is currently broken and the AWS Neuron team is working on the fix.
In this tutorial, we show how to fine-tune a Hugging Face (HF) T5 model using HF trainer API. This example fine-tunes a T5 model for a text-summarization task on CNN/DailyMail dataset.
Note
The logs shown in this tutorial do not reflect the latest performance numbers.
For the latest performance numbers, visit Neuron performance.
Setup and compilation#
Before running the tutorial please follow the installation instructions at:
Install PyTorch Neuron on Trn1
Please set the instance storage to 512 GB or more if you also want to run through the BERT pretraining and GPT pretraining tutorials.
For all of the commands below, make sure you are in the virtual environment that you created above before running them:
source ~/aws_neuron_venv_pytorch/bin/activate
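Optionally, you can verify that the Neuron packages are visible from this environment before continuing (a quick sanity check; the exact package names listed may vary across Neuron SDK releases):
# List Neuron-related packages installed in the active virtual environment
pip list | grep -i neuron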
First, we install a recent version of the HF transformers, scikit-learn, and evaluate packages in our environment, and download the source matching the installed version. In this example, we use version 4.26.0 and the text-summarization example from the HF transformers source:
export HF_VER=4.26.0
pip install -U transformers==$HF_VER datasets evaluate scikit-learn rouge_score pandas==1.4.0
cd ~/
git clone https://github.com/huggingface/transformers --branch v$HF_VER
cd ~/transformers/examples/pytorch/summarization
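As an optional sanity check (the exact tag reported may differ depending on how the repository was cloned), you can confirm that the installed transformers version matches the source tree you just checked out:
# Installed package version
python3 -c "import transformers; print(transformers.__version__)"
# Version of the checked-out source tree
git -C ~/transformers describe --tags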
Single-worker training#
We will run the text-summarization fine-tuning task following the example in README.md located at ~/transformers/examples/pytorch/summarization.
We use full BF16 casting (XLA_USE_BF16=1) for best performance. First, paste the following script into your terminal to create a “run.sh” file and make it executable:
tee run.sh > /dev/null <<EOF
#!/bin/bash
if [ "\$NEURON_PARALLEL_COMPILE" == "1" ]
then
XLA_USE_BF16=1 python3 ./run_summarization.py \
--model_name_or_path t5-small \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--do_train \
--do_eval \
--source_prefix "summarize: " \
--max_source_length 512 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 4 \
--overwrite_output_dir \
--pad_to_max_length \
--max_steps 100 \
--max_eval_samples 100 \
--gradient_accumulation_steps=32 \
--output_dir /tmp/tst-summarization |& tee log_run
else
XLA_USE_BF16=1 python3 ./run_summarization.py \
--model_name_or_path t5-small \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--do_train \
--do_eval \
--source_prefix "summarize: " \
--max_source_length 512 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 4 \
--overwrite_output_dir \
--pad_to_max_length \
--gradient_accumulation_steps=32 \
--output_dir /tmp/tst-summarization |& tee log_run
fi
EOF
chmod +x run.sh
We optionally precompile the model and training script using neuron_parallel_compile to warm up the persistent graph cache (Neuron Cache) such that the actual run has fewer compilations (faster run time):
neuron_parallel_compile ./run.sh
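If you want to confirm that graphs were extracted and compiled, you can inspect the persistent cache directory. The path below is the commonly used default location for the Neuron cache and is an assumption; your setup may use a different location if the cache directory has been configured explicitly:
# Default Neuron persistent cache location (an assumption; may differ on your system)
ls /var/tmp/neuron-compile-cache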
Note: For these auto-regressive models, do not run the predict_with_generate method during the precompile step. The neuron_parallel_compile utility runs the training script in graph-extraction mode, so the graphs are not actually executed and the outputs at each step are invalid. Because auto-regressive generation at each step depends on the output of the previous step, the generate step would fail when fed these invalid outputs.
Precompilation is optional and only needs to be done once unless hyperparameters such as batch size are modified. After the optional precompilation, the actual run will be faster with minimal additional compilations.
./run.sh
If precompilation was not done, the first execution of ./run.sh will be slower due to serial compilations. Rerunning the same script a second time will be quicker because the compiled graphs are already stored in the persistent cache.
The above script runs the T5-small fine-tuning on a single process.
Note: As you may have noticed, we are not running predict_with_generate as part of training. This is because predict_with_generate requires auto-regressive sampling, where the inputs to the decoder are created by appending the outputs of previous steps. This changes the shape of the decoder inputs at each step and therefore produces a new graph; in other words, the current generate API provided by HF transformers leads to repeated compilations. We are working on a Neuron-friendly version of the generate API, which will be made available in a future release and will enable running predict_with_generate as part of the training script.
As a workaround, we can run predict_with_generate on CPU after the model is trained. Once training is complete, a trained checkpoint is saved. We can then load the trained model and run predict_with_generate to compute the final accuracy.
To do so, add the following lines to run_summarization.py before all of the imports (in particular, before transformers is imported):
import libneuronxla
# Disable configuring xla env
def _configure_env():
    pass
libneuronxla.configure_environment = _configure_env
You can now run the following command; it should run the predict method on the CPU device:
NEURON_NUM_DEVICES=0 python3 ./run_summarization.py \
--model_name_or_path <CHECKPOINT_DIR> \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--do_predict \
--predict_with_generate \
--source_prefix "summarize: " \
--per_device_eval_batch_size 4 \
--max_source_length 512 \
--pad_to_max_length \
--no_cuda \
--output_dir /tmp/tst-summarization |& tee log_run
Note: To run on CPU, make sure that NEURON_NUM_DEVICES is set to 0. This ensures that no XLA devices are created and the trainer uses the default device (CPU).
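For example, the training runs above wrote their outputs to /tmp/tst-summarization (the --output_dir used in run.sh), so, assuming the default save behavior of the HF Trainer, that directory can serve as <CHECKPOINT_DIR>:
# The fine-tuned checkpoint from run.sh is written to the training output directory
ls /tmp/tst-summarization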
Multi-worker Training#
The above script will run one worker on one NeuronCore. To run on multiple cores, first add these lines to the top of run_summarization.py to disable Distributed Data Parallel (DDP) when using torchrun (see the Known issues and limitations section below):
# Disable DDP for torchrun
from transformers import __version__, Trainer
Trainer._wrap_model = lambda self, model, training=True, dataloader=None: model
Then launch the run_summarization.py script with torchrun, using the --nproc_per_node=N option to specify the number of workers (N=2 for trn1.2xlarge, and N=2, 8, or 32 for trn1.32xlarge). The following example runs 2 workers. Paste the following script into your terminal to create a “run_2w.sh” file and make it executable:
tee run_2w.sh > /dev/null <<EOF
#!/bin/bash
if [ "\$NEURON_PARALLEL_COMPILE" == "1" ]
then
XLA_USE_BF16=1 torchrun --nproc_per_node=2 ./run_summarization.py \
--model_name_or_path t5-small \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--do_train \
--do_eval \
--source_prefix "summarize: " \
--max_source_length 512 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 4 \
--overwrite_output_dir \
--pad_to_max_length \
--max_steps 100 \
--max_eval_samples 100 \
--gradient_accumulation_steps=32 \
--output_dir /tmp/tst-summarization |& tee log_run
else
XLA_USE_BF16=1 torchrun --nproc_per_node=2 ./run_summarization.py \
--model_name_or_path t5-small \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--do_train \
--do_eval \
--source_prefix "summarize: " \
--max_source_length 512 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 4 \
--overwrite_output_dir \
--pad_to_max_length \
--gradient_accumulation_steps=32 \
--output_dir /tmp/tst-summarization |& tee log_run
fi
EOF
chmod +x run_2w.sh
Again, we optionally precompile the model and training script using neuron_parallel_compile to warm up the persistent graph cache (Neuron Cache), ignoring the results from this precompile run as it is only for extracting and compiling the XLA graphs:
neuron_parallel_compile ./run_2w.sh
Precompilation is optional and only needs to be done once unless hyperparameters such as batch size are modified. After the optional precompilation, the actual run will be faster with minimal additional compilations.
./run_2w.sh
During the run, you will notice that the “Total train batch size” reported in the logs is now double that of single-worker training and the “Total optimization steps” is now half the single-worker number. Also, if you open neuron-top in a separate terminal, you should see two cores being utilized.
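As a back-of-the-envelope check (a sketch, assuming the Trainer reports the total train batch size as per-device batch size times number of data-parallel workers times gradient accumulation steps), the values in run_2w.sh work out to:
# 32 (per-device batch) x 2 (workers) x 32 (gradient accumulation steps)
echo $((32 * 2 * 32))   # 2048 samples contribute to each optimizer step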
To train a T5-large model, you can set the model_name_or_path argument to t5-large. Please note that currently running t5-large on a trn1-2xl machine can result in host OOM during compilation. Hence, it is recommended that you run t5-large model training on a trn1-32xl machine.
On a trn1-32xl machine, you can create a run_32w.sh file in the terminal using the following commands:
tee run_32w.sh > /dev/null <<EOF
#!/bin/bash
if [ "\$NEURON_PARALLEL_COMPILE" == "1" ]
then
XLA_USE_BF16=1 torchrun --nproc_per_node=32 ./run_summarization.py \
--model_name_or_path t5-large \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--do_train \
--do_eval \
--source_prefix "summarize: " \
--max_source_length 512 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--overwrite_output_dir \
--pad_to_max_length \
--max_steps 100 \
--max_eval_samples 100 \
--gradient_accumulation_steps=11 \
--output_dir /tmp/tst-summarization |& tee log_run
else
XLA_USE_BF16=1 torchrun --nproc_per_node=32 ./run_summarization.py \
--model_name_or_path t5-large \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--do_train \
--do_eval \
--source_prefix "summarize: " \
--max_source_length 512 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--overwrite_output_dir \
--pad_to_max_length \
--gradient_accumulation_steps=11 \
--output_dir /tmp/tst-summarization |& tee log_run
fi
EOF
chmod +x run_32w.sh
You can now follow the same steps as listed above. This script runs t5-large model training using 32 data-parallel workers.
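As with the two-worker run, you can open neuron-top in a separate terminal during training to confirm that all 32 cores are being utilized (the output format may vary across Neuron tool versions):
# Monitor NeuronCore utilization while run_32w.sh is training
neuron-top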
Known issues and limitations#
The following are currently known issues:
Long compilation times: this can be alleviated with the neuron_parallel_compile tool, which extracts graphs from a short trial run and compiles them in parallel ahead of the actual run, as shown above.
T5-Large compilation causing processes to get killed on trn1-2xl: it is recommended to run t5-large model training on a trn1-32xl machine, as it avoids host (CPU) OOM and also provides faster training by making use of 32 data-parallel workers.