.. _torch-hf-t5-finetune:

Fine-tune T5 model on Trn1
==========================

.. note:: Update 01/03/24: This tutorial is currently broken and the AWS Neuron team is working on the fix.

In this tutorial, we show how to fine-tune a Hugging Face (HF) T5 model using the HF Trainer API. This example fine-tunes a T5 model for a text-summarization task on the CNN/DailyMail dataset.

.. contents:: Table of Contents
   :local:
   :depth: 2

.. include:: ../note-performance.txt

Setup and compilation
---------------------

Before running the tutorial please follow the installation instructions at: :ref:`Install PyTorch Neuron on Trn1 `

Please set the instance storage to *512GB* or more if you also want to run through the BERT pretraining and GPT pretraining tutorials.

For all the commands below, make sure you are in the virtual environment that you have created above before you run the commands:

.. code:: shell

   source ~/aws_neuron_venv_pytorch/bin/activate

First, we install a recent version of HF transformers, the scikit-learn and evaluate packages in our environment, and download the source matching the installed version. In this example, we use version 4.26.0 and the text-summarization example from the HF transformers source:

.. code:: bash

   export HF_VER=4.26.0
   pip install -U transformers==$HF_VER datasets evaluate scikit-learn rouge_score pandas==1.4.0
   cd ~/
   git clone https://github.com/huggingface/transformers --branch v$HF_VER
   cd ~/transformers/examples/pytorch/summarization

Single-worker training
----------------------

We will run the text-summarization fine-tuning task following the example in README.md located at ``~/transformers/examples/pytorch/summarization``. We use full BF16 casting (``XLA_USE_BF16=1``) to enable best performance.

First, paste the following script into your terminal to create a “run.sh” file and change it to executable:

.. code:: ipython3

   tee run.sh > /dev/null <

We optionally precompile the model and training script using ``neuron_parallel_compile`` to warm up the persistent graph cache (Neuron Cache) so that the actual run has fewer compilations (faster run time):

.. code:: ipython3

   neuron_parallel_compile ./run.sh

Note: For these auto-regressive models, do not run the ``predict_with_generate`` method when doing the precompile step. This is because the ``neuron_parallel_compile`` utility runs the training script in graph-extraction mode without actually executing the graphs, so the outputs at each step are invalid. Since auto-regressive generation at each step depends on the output of the previous step, the generate step would fail.

Precompilation is optional and only needs to be done once, unless hyperparameters such as batch size are modified. After the optional precompilation, the actual run will be faster with minimal additional compilations.

.. code:: ipython3

   ./run.sh

If precompilation was not done, the first execution of ./run.sh will be slower due to serial compilations. Rerunning the same script a second time shows quicker execution, as the compiled graphs are already stored in the persistent cache.

Running the above script runs the T5-small fine-tuning on a single process.

**Note:** As you may have noticed, we are not running ``predict_with_generate`` as part of training. This is because ``predict_with_generate`` requires auto-regressive sampling, where the inputs to the decoder are created by appending the outputs of previous steps. This causes the inputs to the decoder to change shape and thereby results in a new graph.
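To make this concrete, here is a minimal, hypothetical sketch (plain PyTorch tensors, not the actual HF ``generate`` implementation) of how the decoder input grows by one token at every step, so each step presents the compiler with a new input shape:

.. code:: ipython3

   import torch

   # Hypothetical greedy-decoding loop: the decoder input grows every step.
   decoder_input_ids = torch.tensor([[0]])    # decoder start token, shape [1, 1]
   for step in range(3):
       next_token = torch.tensor([[42]])      # stand-in for the model's predicted token
       decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
       print(decoder_input_ids.shape)         # [1, 2], [1, 3], [1, 4] -- each new shape means a new XLA graph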
In other words, the current ``generate`` API provided by HF transformers leads to repeated compilations. We are working on a Neuron-friendly version of the ``generate`` API, which will be made available as part of a future release and will enable running ``predict_with_generate`` as part of the training script.

As a workaround, we can run ``predict_with_generate`` on CPU after the model is trained. Once training is completed, a trained checkpoint is saved. We can load the trained model and run ``predict_with_generate`` to compute the final accuracy.

To do so, add the following lines to run_summarization.py before ``transformers`` is imported, i.e. before all the ``import`` statements:

.. code:: ipython3

   import libneuronxla
   # Disable configuring xla env
   def _configure_env():
       pass
   libneuronxla.configure_environment = _configure_env

You can now run the following, and it should run the predict method on the CPU device:

.. code:: ipython3

   NEURON_NUM_DEVICES=0 python3 ./run_summarization.py \
       --model_name_or_path <CHECKPOINT_DIR> \
       --dataset_name cnn_dailymail \
       --dataset_config "3.0.0" \
       --do_predict \
       --predict_with_generate \
       --source_prefix "summarize: " \
       --per_device_eval_batch_size 4 \
       --max_source_length 512 \
       --pad_to_max_length \
       --no_cuda \
       --output_dir /tmp/tst-summarization |& tee log_run

Note: To run on CPU, we need to make sure that ``NEURON_NUM_DEVICES`` is set to 0. This ensures that no XLA devices are created and the trainer uses the default device (CPU).

.. _multi_worker_training:

Multi-worker Training
---------------------

The above script runs one worker on one NeuronCore. To run on multiple cores, first add these lines to the top of run_summarization.py to disable Distributed Data Parallel (DDP) when using torchrun (see the Known issues and limitations section below):

.. code:: ipython3

   # Disable DDP for torchrun
   from transformers import __version__, Trainer
   Trainer._wrap_model = lambda self, model, training=True, dataloader=None: model

Then launch the run_summarization.py script with torchrun, using the ``--nproc_per_node=N`` option to specify the number of workers (N=2 for trn1.2xlarge, and N=2, 8, or 32 for trn1.32xlarge). The following example runs 2 workers.

Paste the following script into your terminal to create a “run_2w.sh” file and change it to executable:

.. code:: ipython3

   tee run_2w.sh > /dev/null <
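As a rough sketch of what such a launcher can look like (not the exact script from this tutorial), the version below wraps ``run_summarization.py`` in ``torchrun --nproc_per_node=2`` with ``XLA_USE_BF16=1``, reusing the dataset, prefix, sequence-length, and output-directory flags from the CPU command above; the training-specific values (``--per_device_train_batch_size``, ``--max_steps``) are illustrative assumptions:

.. code:: bash

   # Illustrative two-worker launcher; batch size and step count are assumed values.
   tee run_2w.sh > /dev/null <<'EOF'
   #!/bin/bash
   XLA_USE_BF16=1 torchrun --nproc_per_node=2 ./run_summarization.py \
       --model_name_or_path t5-small \
       --dataset_name cnn_dailymail \
       --dataset_config "3.0.0" \
       --do_train \
       --source_prefix "summarize: " \
       --max_source_length 512 \
       --per_device_train_batch_size 4 \
       --max_steps 100 \
       --pad_to_max_length \
       --overwrite_output_dir \
       --output_dir /tmp/tst-summarization |& tee log_run
   EOF

   chmod +x run_2w.sh

Once created, the script can be precompiled and run the same way as ``run.sh``, i.e. ``neuron_parallel_compile ./run_2w.sh`` followed by ``./run_2w.sh``.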