{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Transformers MarianMT Tutorial\n", "\n", "In this tutorial, you will deploy the [HuggingFace MarianMT](https://huggingface.co/transformers/v4.0.1/model_doc/marian.html) model for text translation.\n", "\n", "This Jupyter notebook should be run on an inf1.6xlarge instance since you will be loading and compiling several large models.\n", "\n", "Verify that this Jupyter notebook is running the Python kernel environment that was set up according to the [PyTorch Installation Guide](../../../frameworks/torch/torch-neuron/setup/pytorch-install.html). You can select the kernel from the \"Kernel -> Change Kernel\" option on the top of this Jupyter notebook page.\n", "\n", "To generate text, you will be using the beam search algorithm to incrementally generate token candidates until the full output text has been created. Unlike simple single-pass models, this algorithm divides the work into two distinct phases:\n", "\n", "- **Encoder**: Convert the input text into an encoded representation. (Executed once)\n", "- **Decoder**: Use the encoded representation of the input text and the current output tokens to incrementally generate the set of next best candidate tokens. (Executed many times)\n", "\n", "In this tutorial you will perform the following steps:\n", "\n", "- **Compile**: Compile both the Encoder and Decoder for Neuron using simplified interfaces for inference.\n", "- **Infer**: Run on CPU and Neuron and compare results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install Dependencies:\n", "\n", "This tutorial has the following dependencies:\n", "\n", "- `transformers==4.0.1`\n", "- `torch-neuron==1.7.*`\n", "- `sentencepiece`\n", "- `neuron-cc[tensorflow]`\n", "\n", "The following will install the required `transformers` version. Note that encoder/decoder API changes across different minor versions requires that you are specific about the version used. Also note that the `torch-neuron` version is pinned due to `transformer` compatibility issues." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install --force-reinstall --extra-index-url=https://pip.repos.neuron.amazonaws.com \"torch-neuron==1.7.*\" \"transformers==4.0.1\" \"protobuf<4\" sentencepiece \"neuron-cc[tensorflow]\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parameters\n", "\n", "The parameters of a generative model can be tuned for different use-cases. In this example, you'll tailor the parameters to a single inference beam search for an on-demand inference use-case. See the [MarianConfig](https://huggingface.co/transformers/v4.0.1/model_doc/marian.html#marianconfig) for parameter details.\n", "\n", "Rather than varying the encoder/decoder token sizes at runtime, you must define these parameters prior to compilation. The encoder/decoder token sizes are important tunable parameters as a large token sequence will offer greater sentence length flexibility but perform worse than a small token sequence.\n", "\n", "To maximize performance on Neuron, the `num_beams`, `max_encode_length` and `max_decoder_length` should be made as small as possible for the use-case.\n", "\n", "For this tutorial you will use a model that translates sentences of up to 32 token from English to German." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_name = \"Helsinki-NLP/opus-mt-en-de\" # English -> German model\n", "num_texts = 1 # Number of input texts to decode\n", "num_beams = 4 # Number of beams per input text\n", "max_encoder_length = 32 # Maximum input token length\n", "max_decoder_length = 32 # Maximum output token length" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports\n", "\n", "On text generation tasks, HuggingFace Transformers defines a [GenerationMixin](https://huggingface.co/transformers/v4.0.1/main_classes/model.html?highlight=generate#transformers.generation_utils.GenerationMixin) base class which provides standard methods and algorithms to generate text. For this tutorial, you will be using the beam search algorithm on encoder/decoder architectures.\n", "\n", "To be able to use these methods, you will be defining your own class derived from the GenerationMixin class to run a beam search. This will invoke the encoder and decoder layers in a way that is compatible with fixed sized inputs and traced modules. This means you must import the base class and the output objects ([Seq2SeqLMOutput](https://huggingface.co/transformers/v4.0.1/main_classes/output.html#transformers.modeling_outputs.Seq2SeqLMOutput), [BaseModelOutput](https://huggingface.co/transformers/v4.0.1/main_classes/output.html#transformers.modeling_outputs.BaseModelOutput)) used by the [beam_search](https://huggingface.co/transformers/v4.0.1/main_classes/model.html?highlight=generate#transformers.generation_utils.GenerationMixin.beam_search) algorithm." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "import torch\n", "import numpy as np\n", "from torch.nn import functional as F\n", "\n", "from transformers import MarianMTModel, MarianTokenizer, MarianConfig\n", "from transformers.generation_utils import GenerationMixin\n", "from transformers.modeling_outputs import Seq2SeqLMOutput, BaseModelOutput\n", "from transformers.modeling_utils import PreTrainedModel\n", "\n", "import torch_neuron" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## CPU Model Execution\n", "\n", "Start by executing the model on CPU to test its execution.\n", "\n", "The following defines the inference function which will be used to compare the Neuron and CPU output. In this example you will display all beam search sequences that were generated. On a real on-demand use case, set the `num_return_sequences` to `1` to return only the top result." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def infer(model, tokenizer, text):\n", "\n", " # Truncate and pad the max length to ensure that the token size is compatible with fixed-sized encoder (Not necessary for pure CPU execution)\n", " batch = tokenizer(text, max_length=max_decoder_length, truncation=True, padding='max_length', return_tensors=\"pt\")\n", " output = model.generate(**batch, max_length=max_decoder_length, num_beams=num_beams, num_return_sequences=num_beams)\n", " results = [tokenizer.decode(t, skip_special_tokens=True) for t in output]\n", "\n", " print('Texts:')\n", " for i, summary in enumerate(results):\n", " print(i + 1, summary)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_cpu = MarianMTModel.from_pretrained(model_name)\n", "model_cpu.eval()\n", "\n", "tokenizer_cpu = MarianTokenizer.from_pretrained(model_name)\n", "\n", "sample_text = \"I am a small frog.\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "infer(model_cpu, tokenizer_cpu, sample_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Encoder & Decoder Modules\n", "Here you will define wrappers around the encoder and decoder portions of the generation model that are compatible with `torch.jit.trace` as well as fixed-sized inputs.\n", "\n", "Important features which are distinct from the default configuration:\n", "\n", "1. A fixed sized `causal_mask` in the `NeuronDecoder` rather than varying the size for each iteration. This is because Neuron requires padded `input_ids` input rather than the default behavior where this grows for each beam search iteration.\n", "2. Disabled `return_dict`. When this is enabled, the network uses `dataclass` type outputs which are not compatible with `torch.jit.trace`.\n", "3. Disabled `use_cache`. When this option is enabled, the network expects a collection of cache tensors which grow upon each iteration. Since Neuron requires fixed sized inputs, this must be disabled.\n", "4. The `GenerationMixin:beam_search` implementation uses only the logits for the current iteration index from the original decoder layer output. Since inputs are padded, performance can be improved by selecting only a subset of the hidden state prior to the final linear layer. For efficiency on Neuron, this reduction (`reduce`) uses an elementwise-multiply to mask out the unused hidden values and then sums along an axis.\n", "5. Since a reduction step is insterted between the decoder output and the final logit calculation, the original `model` attribute is not used. Instead the `NeuronDecoder` class combines the decoder, reducer, and linear layers into a combined forward pass. In the original model there is a clear distinction between the decoder layer and the final linear layer. These layers are fused together to get one large fully optimized graph." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def reduce(hidden, index):\n", " _, n_length, _ = hidden.shape\n", "\n", " # Create selection mask\n", " mask = torch.arange(n_length, dtype=torch.float32) == index\n", " mask = mask.view(1, -1, 1)\n", "\n", " # Broadcast mask\n", " masked = torch.multiply(hidden, mask)\n", "\n", " # Reduce along 1st dimension\n", " summed = torch.sum(masked, 1)\n", " return torch.unsqueeze(summed, 1)\n", "\n", "\n", "class NeuronEncoder(torch.nn.Module):\n", "\n", " def __init__(self, model):\n", " super().__init__()\n", " self.encoder = model.model.encoder\n", "\n", " def forward(self, input_ids, attention_mask):\n", " return self.encoder(input_ids, attention_mask=attention_mask, return_dict=False)\n", "\n", "\n", "class NeuronDecoder(torch.nn.Module):\n", "\n", " def __init__(self, model, max_length):\n", " super().__init__()\n", " self.weight = model.model.shared.weight.clone().detach()\n", " self.bias = model.final_logits_bias.clone().detach()\n", " self.decoder = model.model.decoder\n", " self.max_length = max_length\n", "\n", " def forward(self, input_ids, attention_mask, encoder_outputs, index):\n", "\n", " # Build a fixed sized causal mask for the padded decoder input ids\n", " mask = np.triu(np.ones((self.max_length, self.max_length)), 1)\n", " mask[mask == 1] = -np.inf\n", " causal_mask = torch.tensor(mask, dtype=torch.float)\n", "\n", " # Invoke the decoder\n", " hidden, = self.decoder(\n", " input_ids=input_ids,\n", " encoder_hidden_states=encoder_outputs,\n", " encoder_padding_mask=attention_mask,\n", " decoder_padding_mask=None,\n", " decoder_causal_mask=causal_mask,\n", " return_dict=False,\n", " use_cache=False,\n", " )\n", "\n", " # Reduce decoder outputs to the specified index (current iteration)\n", " hidden = reduce(hidden, index)\n", "\n", " # Compute final linear layer for token probabilities\n", " logits = F.linear(\n", " hidden,\n", " self.weight,\n", " bias=self.bias\n", " )\n", " return logits" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GenerationMixin Class\n", "To be able to use `GenerationMixin:beam_search` you must define your own class implementation that invokes the traced `NeuronEncoder` and `NeuronDecoder` modules. The standard generator model implementation will not work by default because it is not designed to invoke the traced models with padded inputs.\n", "\n", "Below, the `NeuronGeneration:trace` method uses the loaded generator model and traces both the Encoder and Decoder.\n", "\n", "Next, the following methods are copied directly from the to the original class to ensure that inference behavior is identical:\n", "- `adjust_logits_during_generation`\n", "- `_force_token_id_to_be_generated`\n", "\n", "To invoke the Encoder and Decoder traced modules in a way that is compatible with the `GenerationMixin:beam_search` implementation, the `get_encoder`, `__call__`, and `prepare_inputs_for_generation` methods are overriden.\n", "\n", "Lastly, the class defines methods for serialization so that the model can be easily saved and loaded." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class NeuronGeneration(PreTrainedModel, GenerationMixin):\n", "\n", " def trace(self, model, num_texts, num_beams, max_encoder_length, max_decoder_length):\n", " \"\"\"\n", " Traces the encoder and decoder modules for use on Neuron.\n", " \n", " This function fixes the network to the given sizes. 
"        This function fixes the network to the given sizes. Once the model has been\n",
"        compiled to a given size, the inputs to these networks must always be of\n",
"        fixed size.\n",
"\n",
"        Args:\n",
"            model (GenerationMixin): The transformer-type generator model to trace\n",
"            num_texts (int): The number of input texts to translate at once\n",
"            num_beams (int): The number of beams to compute per text\n",
"            max_encoder_length (int): The maximum number of encoder tokens\n",
"            max_decoder_length (int): The maximum number of decoder tokens\n",
"        \"\"\"\n",
"        self.config.max_decoder_length = max_decoder_length\n",
"\n",
"        # Trace the encoder\n",
"        inputs = (\n",
"            torch.ones((num_texts, max_encoder_length), dtype=torch.long),\n",
"            torch.ones((num_texts, max_encoder_length), dtype=torch.long),\n",
"        )\n",
"        encoder = NeuronEncoder(model)\n",
"        self.encoder = torch_neuron.trace(encoder, inputs)\n",
"\n",
"        # Trace the decoder (with expanded inputs)\n",
"        batch_size = num_texts * num_beams\n",
"        inputs = (\n",
"            torch.ones((batch_size, max_decoder_length), dtype=torch.long),\n",
"            torch.ones((batch_size, max_encoder_length), dtype=torch.long),\n",
"            torch.ones((batch_size, max_encoder_length, model.config.d_model), dtype=torch.float),\n",
"            torch.tensor(0),\n",
"        )\n",
"        decoder = NeuronDecoder(model, max_decoder_length)\n",
"        self.decoder = torch_neuron.trace(decoder, inputs)\n",
"\n",
"    # ------------------------------------------------------------------------\n",
"    # Beam Search Methods (Copied directly from transformers)\n",
"    # ------------------------------------------------------------------------\n",
"\n",
"    def adjust_logits_during_generation(self, logits, cur_len, max_length):\n",
"        if cur_len == 1 and self.config.force_bos_token_to_be_generated:\n",
"            self._force_token_id_to_be_generated(logits, self.config.bos_token_id)\n",
"        elif cur_len == max_length - 1 and self.config.eos_token_id is not None:\n",
"            self._force_token_id_to_be_generated(logits, self.config.eos_token_id)\n",
"        return logits\n",
"\n",
"    @staticmethod\n",
"    def _force_token_id_to_be_generated(scores, token_id) -> None:\n",
"        scores[:, [x for x in range(scores.shape[1]) if x != token_id]] = -float(\"inf\")\n",
"\n",
"    # ------------------------------------------------------------------------\n",
"    # Encoder/Decoder Invocation\n",
"    # ------------------------------------------------------------------------\n",
"\n",
"    def prepare_inputs_for_generation(\n",
"        self,\n",
"        decoder_input_ids,\n",
"        encoder_outputs=None,\n",
"        attention_mask=None,\n",
"        **model_kwargs\n",
"    ):\n",
"        # Pad the inputs for Neuron\n",
"        current_length = decoder_input_ids.shape[1]\n",
"        pad_size = self.config.max_decoder_length - current_length\n",
"        return dict(\n",
"            input_ids=F.pad(decoder_input_ids, (0, pad_size)),\n",
"            attention_mask=attention_mask,\n",
"            encoder_outputs=encoder_outputs.last_hidden_state,\n",
"            current_length=torch.tensor(current_length - 1),\n",
"        )\n",
"\n",
"    def get_encoder(self):\n",
"        \"\"\"Helper to invoke the encoder and wrap the results in the expected structure\"\"\"\n",
"        def encode(input_ids, attention_mask, **kwargs):\n",
"            output, = self.encoder(input_ids, attention_mask)\n",
"            return BaseModelOutput(\n",
"                last_hidden_state=output,\n",
"            )\n",
"        return encode\n",
"\n",
"    def __call__(self, input_ids, attention_mask, encoder_outputs, current_length, **kwargs):\n",
"        \"\"\"Helper to invoke the decoder and wrap the results in the expected structure\"\"\"\n",
"        logits = self.decoder(input_ids, attention_mask, encoder_outputs, current_length)\n",
"        return Seq2SeqLMOutput(logits=logits)\n",
"\n",
"    # ------------------------------------------------------------------------\n",
"    # Serialization\n",
"    # ------------------------------------------------------------------------\n",
"\n",
"    def save_pretrained(self, directory):\n",
"        if os.path.isfile(directory):\n",
"            print(f\"Provided path ({directory}) should be a directory, not a file\")\n",
"            return\n",
"        os.makedirs(directory, exist_ok=True)\n",
"        torch.jit.save(self.encoder, os.path.join(directory, 'encoder.pt'))\n",
"        torch.jit.save(self.decoder, os.path.join(directory, 'decoder.pt'))\n",
"        self.config.save_pretrained(directory)\n",
"\n",
"    @classmethod\n",
"    def from_pretrained(cls, directory):\n",
"        config = MarianConfig.from_pretrained(directory)\n",
"        obj = cls(config)\n",
"        obj.encoder = torch.jit.load(os.path.join(directory, 'encoder.pt'))\n",
"        obj.decoder = torch.jit.load(os.path.join(directory, 'decoder.pt'))\n",
"        return obj\n",
"\n",
"    @property\n",
"    def device(self):\n",
"        return torch.device('cpu')"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"## Execution\n",
"\n",
"Putting everything together, the process to deploy the model is as follows:\n",
"\n",
"1. Compile the model\n",
"2. Serialize an artifact\n",
"3. Load the serialized artifact\n",
"4. Execute the model on Neuron"
] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [
"# This is the name of the folder where the artifacts will be stored on disk\n",
"neuron_name = 'NeuronMarianMT'"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"model_neuron = NeuronGeneration(model_cpu.config)\n",
"\n",
"# 1. Compile the model\n",
"# Note: This may take a couple of minutes since both the encoder and decoder will be compiled\n",
"model_neuron.trace(\n",
"    model=model_cpu,\n",
"    num_texts=num_texts,\n",
"    num_beams=num_beams,\n",
"    max_encoder_length=max_encoder_length,\n",
"    max_decoder_length=max_decoder_length,\n",
")\n",
"\n",
"# 2. Serialize an artifact\n",
"# After this call you will have an `encoder.pt`, `decoder.pt` and `config.json` in the neuron_name folder\n",
"model_neuron.save_pretrained(neuron_name)"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# 3. Load the serialized artifact\n",
"model_neuron = NeuronGeneration.from_pretrained(neuron_name)"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# 4. Execute the model on Neuron\n",
"infer(model_neuron, tokenizer_cpu, sample_text)\n"
] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [
"Comparing the Neuron execution to the original CPU implementation, you should see the exact same generated text.\n"
] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [
"# CPU execution for comparison\n",
"infer(model_cpu, tokenizer_cpu, sample_text)\n"
] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [
"## Appendix - BART (Mask Filling Task)\n",
"\n",
"The `NeuronGeneration` class can also be applied to the BART model for the task of filling in mask tokens.\n"
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [
"from transformers import BartForConditionalGeneration, BartTokenizer\n",
"bart_name = \"facebook/bart-large\"\n",
"bart_model = BartForConditionalGeneration.from_pretrained(bart_name, force_bos_token_to_be_generated=True)\n",
"bart_tokenizer = BartTokenizer.from_pretrained(bart_name)\n",
"bart_text = \"UN Chief Says There Is No <mask> in Syria\""
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [
"# CPU Execution\n",
"infer(bart_model, bart_tokenizer, bart_text)"
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [
"# Neuron Execution\n",
"bart_neuron = NeuronGeneration(bart_model.config)\n",
"bart_neuron.trace(\n",
"    model=bart_model,\n",
"    num_texts=num_texts,\n",
"    num_beams=num_beams,\n",
"    max_encoder_length=max_encoder_length,\n",
"    max_decoder_length=max_decoder_length,\n",
")\n",
"infer(bart_neuron, bart_tokenizer, bart_text)"
] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [
"## Appendix - Pegasus (Summarization Task)\n",
"\n",
"The `NeuronGeneration` class can also be applied to the Pegasus model for summarization.\n"
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [
"from transformers import PegasusForConditionalGeneration, PegasusTokenizer\n",
"pegasus_name = 'google/pegasus-xsum'\n",
"pegasus_model = PegasusForConditionalGeneration.from_pretrained(pegasus_name)\n",
"pegasus_tokenizer = PegasusTokenizer.from_pretrained(pegasus_name)\n",
"pegasus_text = \"PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires.\""
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [
"# CPU Execution\n",
"infer(pegasus_model, pegasus_tokenizer, pegasus_text)"
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [
"# Neuron Execution\n",
"pegasus_neuron = NeuronGeneration(pegasus_model.config)\n",
"pegasus_neuron.trace(\n",
"    model=pegasus_model,\n",
"    num_texts=num_texts,\n",
"    num_beams=num_beams,\n",
"    max_encoder_length=max_encoder_length,\n",
"    max_decoder_length=max_decoder_length,\n",
")\n",
"infer(pegasus_neuron, pegasus_tokenizer, pegasus_text)"
] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.9 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.9" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 4 }