{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Transformers MarianMT Tutorial\n", "\n", "In this tutorial, you will deploy the [HuggingFace MarianMT](https://huggingface.co/transformers/v4.0.1/model_doc/marian.html) model for text translation.\n", "\n", "This Jupyter notebook should be run on an inf1.6xlarge instance since you will be loading and compiling several large models.\n", "\n", "Verify that this Jupyter notebook is running the Python kernel environment that was set up according to the [PyTorch Installation Guide](../../../frameworks/torch/torch-neuron/setup/pytorch-install.html). You can select the kernel from the \"Kernel -> Change Kernel\" option on the top of this Jupyter notebook page.\n", "\n", "To generate text, you will be using the beam search algorithm to incrementally generate token candidates until the full output text has been created. Unlike simple single-pass models, this algorithm divides the work into two distinct phases:\n", "\n", "- **Encoder**: Convert the input text into an encoded representation. (Executed once)\n", "- **Decoder**: Use the encoded representation of the input text and the current output tokens to incrementally generate the set of next best candidate tokens. (Executed many times)\n", "\n", "In this tutorial you will perform the following steps:\n", "\n", "- **Compile**: Compile both the Encoder and Decoder for Neuron using simplified interfaces for inference.\n", "- **Infer**: Run on CPU and Neuron and compare results.\n", "\n", "Finally, a completely unrolled decoder will be built which simplifies the implementation at the cost of performing fixed-length inferences." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install Dependencies:\n", "\n", "This tutorial has the following dependencies:\n", "\n", "- `transformers==4.25.1`\n", "- `torch-neuron`\n", "- `sentencepiece`\n", "- `neuron-cc[tensorflow]`\n", "\n", "The following will install the required `transformers` version. Note that encoder/decoder API changes across different minor versions requires that you are specific about the version used. Also note that the `torch-neuron` version is pinned due to `transformer` compatibility issues." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install sentencepiece transformers==4.26.1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parameters\n", "\n", "The parameters of a generative model can be tuned for different use-cases. In this example, you'll tailor the parameters to a single inference beam search for an on-demand inference use-case. See the [MarianConfig](https://huggingface.co/transformers/v4.0.1/model_doc/marian.html#marianconfig) for parameter details.\n", "\n", "Rather than varying the encoder/decoder token sizes at runtime, you must define these parameters prior to compilation. The encoder/decoder token sizes are important tunable parameters as a large token sequence will offer greater sentence length flexibility but perform worse than a small token sequence.\n", "\n", "To maximize performance on Neuron, the `num_beams`, `max_encode_length` and `max_decoder_length` should be made as small as possible for the use-case.\n", "\n", "For this tutorial you will use a model that translates sentences of up to 32 token from English to German." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%env TOKENIZERS_PARALLELISM=True #Supresses tokenizer warnings making errors easier to detect\n", "model_name = \"Helsinki-NLP/opus-mt-en-de\" # English -> German model\n", "num_texts = 1 # Number of input texts to decode\n", "num_beams = 4 # Number of beams per input text\n", "max_encoder_length = 32 # Maximum input token length\n", "max_decoder_length = 32 # Maximum output token length" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## CPU Model Inference\n", "\n", "Start by executing the model on CPU to test its execution.\n", "\n", "The following defines the inference function which will be used to compare the Neuron and CPU output. In this example you will display all beam search sequences that were generated. For a real on-demand use case, set the `num_beams` to `1` to return only the top result." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def infer(model, tokenizer, text):\n", "\n", " # Truncate and pad the max length to ensure that the token size is compatible with fixed-sized encoder (Not necessary for pure CPU execution)\n", " batch = tokenizer(text, max_length=max_decoder_length, truncation=True, padding='max_length', return_tensors=\"pt\")\n", " output = model.generate(**batch, max_length=max_decoder_length, num_beams=num_beams, num_return_sequences=num_beams)\n", " results = [tokenizer.decode(t, skip_special_tokens=True) for t in output]\n", "\n", " print('Texts:')\n", " for i, summary in enumerate(results):\n", " print(i + 1, summary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that after loading the model, we also set the maximum length. This will later be used to limit the size of the compiled model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from transformers import MarianMTModel, MarianTokenizer\n", "\n", "model_cpu = MarianMTModel.from_pretrained(model_name)\n", "model_cpu.config.max_length = max_decoder_length\n", "model_cpu.eval()\n", "\n", "tokenizer = MarianTokenizer.from_pretrained(model_name)\n", "\n", "sample_text = \"I am a small frog.\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "infer(model_cpu, tokenizer, sample_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Padded Model\n", "In order to perform inference on Neuron, the model must be changed in a way that it supports tracing and fixed-sized inputs. One way in which this is possible is to use a pad the model inputs to the maximum possible tensor sizes. The benefit of using a padded model is that it supports variable length text generation up to a specified length `max_decoder_length`. A consequence of padding is that it can negatively impact performance due to large data transfers.\n", "\n", "### PaddedEncoder & PaddedDecoder Modules\n", "Here you will define wrappers around the encoder and decoder portions of the generation model that are compatible with `torch.jit.trace` as well as fixed-sized inputs.\n", "\n", "The following are important features which are distinct from the default configuration:\n", "\n", "1. Disabled `return_dict`. When this is enabled, the network uses `dataclass` type outputs which are not compatible with `torch.jit.trace`.\n", "2. Disabled `use_cache`. When this option is enabled, the network expects a collection of cache tensors which grow upon each iteration. 
Since Neuron requires fixed-sized inputs, this must be disabled.\n", "3. The `GenerationMixin:beam_search` implementation uses only the logits for the current iteration index from the original decoder layer output. Since inputs must be padded, performance can be improved by selecting only a subset of the hidden state prior to the final linear layer. For efficiency on Neuron, this reduction uses an elementwise multiply to mask out the unused hidden values and then sums along an axis.\n", "4. Since a reduction step is inserted between the decoder output and the final logit calculation, the original `model` attribute is not used. Instead, the `PaddedDecoder` class combines the decoder, reduction, and linear layers into a single forward pass. In the original model there is a clear distinction between the decoder layer and the final linear layer; here these layers are fused together to produce one large, fully optimized graph." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "from torch.nn import functional as F\n", "\n", "\n", "class PaddedEncoder(torch.nn.Module):\n", "\n", "    def __init__(self, model):\n", "        super().__init__()\n", "        self.encoder = model.model.encoder\n", "        self.main_input_name = 'input_ids'\n", "\n", "    def forward(self, input_ids, attention_mask):\n", "        return self.encoder(input_ids, attention_mask=attention_mask, return_dict=False)\n", "\n", "\n", "class PaddedDecoder(torch.nn.Module):\n", "\n", "    def __init__(self, model):\n", "        super().__init__()\n", "        self.weight = model.model.shared.weight.clone().detach()\n", "        self.bias = model.final_logits_bias.clone().detach()\n", "        self.decoder = model.model.decoder\n", "\n", "    def forward(self, input_ids, attention_mask, encoder_outputs, index):\n", "\n", "        # Invoke the decoder\n", "        hidden, = self.decoder(\n", "            input_ids=input_ids,\n", "            encoder_hidden_states=encoder_outputs,\n", "            encoder_attention_mask=attention_mask,\n", "            return_dict=False,\n", "            use_cache=False,\n", "        )\n", "\n", "        _, n_length, _ = hidden.shape\n", "\n", "        # Create selection mask\n", "        mask = torch.arange(n_length, dtype=torch.float32) == index\n", "        mask = mask.view(1, -1, 1)\n", "\n", "        # Broadcast mask\n", "        masked = torch.multiply(hidden, mask)\n", "\n", "        # Reduce along 1st dimension\n", "        hidden = torch.sum(masked, 1, keepdim=True)\n", "\n", "        # Compute final linear layer for token probabilities\n", "        logits = F.linear(\n", "            hidden,\n", "            self.weight,\n", "            bias=self.bias\n", "        )\n", "        return logits\n" ] }, 
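{ "cell_type": "markdown", "metadata": {}, "source": [ "The mask-and-sum reduction in `PaddedDecoder` is equivalent to directly indexing the hidden state at the current position. The cell below is a minimal sketch, using a small random tensor rather than real model outputs, that illustrates this equivalence." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch: multiplying by a one-hot mask and summing along the sequence\n", "# axis selects the same values as directly indexing that position.\n", "hidden_example = torch.rand(1, 4, 8)  # example [batch, sequence, hidden] tensor\n", "index_example = torch.tensor(2)       # position of the token currently being generated\n", "mask_example = (torch.arange(4, dtype=torch.float32) == index_example).view(1, -1, 1)\n", "reduced = torch.sum(hidden_example * mask_example, 1, keepdim=True)\n", "print(torch.allclose(reduced, hidden_example[:, 2:3, :]))  # expect: True" ] }, 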
{ "cell_type": "markdown", "metadata": {}, "source": [ "### PaddedGenerator - GenerationMixin Class\n", "\n", "On text generation tasks, HuggingFace Transformers defines a [GenerationMixin](https://huggingface.co/transformers/v4.0.1/main_classes/model.html?highlight=generate#transformers.generation_utils.GenerationMixin) base class which provides standard methods and algorithms to generate text. For this tutorial, you will be using the beam search algorithm on encoder/decoder architectures.\n", "\n", "To be able to use these methods, you will define your own class derived from the GenerationMixin class to run a beam search. This will invoke the encoder and decoder layers in a way that is compatible with fixed-sized inputs and traced modules. This means you must import the base class and the output objects ([Seq2SeqLMOutput](https://huggingface.co/transformers/v4.0.1/main_classes/output.html#transformers.modeling_outputs.Seq2SeqLMOutput), [BaseModelOutput](https://huggingface.co/transformers/v4.0.1/main_classes/output.html#transformers.modeling_outputs.BaseModelOutput)) used by the [beam_search](https://huggingface.co/transformers/v4.0.1/main_classes/model.html?highlight=generate#transformers.generation_utils.GenerationMixin.beam_search) algorithm.\n", "\n", "The `GenerationMixin:generate` method will use `GenerationMixin:beam_search`, which requires you to define your own class implementation that invokes the `PaddedEncoder` and `PaddedDecoder` modules using padded inputs. The standard generator model implementation will not work by default because it is intended to infer with variable-sized (growing) input tensors.\n", "\n", "The `from_model` method is defined to create the `PaddedGenerator` from an existing pretrained generator class.\n", "\n", "To invoke the Encoder and Decoder traced modules in a way that is compatible with the `GenerationMixin:beam_search` implementation, the `get_encoder`, `__call__`, and `prepare_inputs_for_generation` methods are overridden.\n", "\n", "Lastly, the class defines methods for serialization so that the model can be easily saved and loaded." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "from transformers import GenerationMixin, AutoConfig\n", "from transformers.modeling_outputs import Seq2SeqLMOutput, BaseModelOutput\n", "from transformers.modeling_utils import PreTrainedModel\n", "\n", "\n", "class PaddedGenerator(PreTrainedModel, GenerationMixin):\n", "\n", "    @classmethod\n", "    def from_model(cls, model):\n", "        generator = cls(model.config)\n", "        generator.encoder = PaddedEncoder(model)\n", "        generator.decoder = PaddedDecoder(model)\n", "        return generator\n", "\n", "    def prepare_inputs_for_generation(\n", "        self,\n", "        input_ids,\n", "        encoder_outputs=None,\n", "        attention_mask=None,\n", "        **kwargs,\n", "    ):\n", "        # Pad the inputs for Neuron\n", "        current_length = input_ids.shape[1]\n", "        pad_size = self.config.max_length - current_length\n", "        return dict(\n", "            input_ids=F.pad(input_ids, (0, pad_size)),\n", "            attention_mask=attention_mask,\n", "            encoder_outputs=encoder_outputs.last_hidden_state,\n", "            current_length=torch.tensor(current_length - 1),\n", "        )\n", "\n", "    def get_encoder(self):\n", "        def encode(input_ids, attention_mask, **kwargs):\n", "            output, = self.encoder(input_ids, attention_mask)\n", "            return BaseModelOutput(\n", "                last_hidden_state=output,\n", "            )\n", "        return encode\n", "\n", "    def forward(self, input_ids, attention_mask, encoder_outputs, current_length, **kwargs):\n", "        logits = self.decoder(input_ids, attention_mask, encoder_outputs, current_length)\n", "        return Seq2SeqLMOutput(logits=logits)\n", "\n", "    @property\n", "    def device(self):  # Attribute required by beam search\n", "        return torch.device('cpu')\n", "\n", "    def save_pretrained(self, directory):\n", "        if os.path.isfile(directory):\n", "            print(f\"Provided path ({directory}) should be a directory, not a file\")\n", "            return\n", "        os.makedirs(directory, exist_ok=True)\n", "        torch.jit.save(self.encoder, os.path.join(directory, 'encoder.pt'))\n", "        torch.jit.save(self.decoder, os.path.join(directory, 'decoder.pt'))\n", "        self.config.save_pretrained(directory)\n", "\n", "    @classmethod\n", "    def from_pretrained(cls, directory):\n", "        config = AutoConfig.from_pretrained(directory)\n", "        obj = cls(config)\n", "        obj.encoder = torch.jit.load(os.path.join(directory, 'encoder.pt'))\n", "        obj.decoder = torch.jit.load(os.path.join(directory, 'decoder.pt'))\n", "        setattr(obj.encoder, 'main_input_name', 'input_ids')  # Attribute required by beam search\n", "        return obj\n" ] }, 
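{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the padding performed in `prepare_inputs_for_generation` concrete, the short sketch below (using hypothetical token id values) shows how the growing sequence of generated tokens is right-padded to the fixed compiled length before each decoder call." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch with hypothetical token ids: beam search hands the decoder a\n", "# growing [batch, current_length] tensor, which is right-padded with zeros up\n", "# to the fixed max_decoder_length expected by the compiled decoder.\n", "growing_ids = torch.tensor([[0, 1523, 20]])           # 3 tokens generated so far\n", "pad_size = max_decoder_length - growing_ids.shape[1]  # pad out to 32 positions\n", "padded_ids = F.pad(growing_ids, (0, pad_size))\n", "print(padded_ids.shape)  # torch.Size([1, 32])" ] }, 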
" config = AutoConfig.from_pretrained(directory)\n", " obj = cls(config)\n", " obj.encoder = torch.jit.load(os.path.join(directory, 'encoder.pt'))\n", " obj.decoder = torch.jit.load(os.path.join(directory, 'decoder.pt'))\n", " setattr(obj.encoder, 'main_input_name', 'input_ids') # Attribute required by beam search\n", " return obj\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Padded CPU Inference\n", "To start, it is important to ensure that the transformations we have made to the model were successful. Using the classes defined above we can test that the padded model execution on CPU is identical to the original output also running on CPU." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "padded_model_cpu = PaddedGenerator.from_model(model_cpu)\n", "infer(padded_model_cpu, tokenizer, sample_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Padded Neuron Tracing & Inference\n", "\n", "Now that the padded version of model is confirmed to produce the same outputs as the non-padded version, the model can be compiled for Neuron." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "import torch_neuron\n", "\n", "\n", "def trace(model, num_texts, num_beams, max_decoder_length, max_encoder_length):\n", " \"\"\"\n", " Traces the encoder and decoder modules for use on Neuron.\n", "\n", " This function fixes the network to the given sizes. Once the model has been\n", " compiled to a given size, the inputs to these networks must always be of\n", " fixed size.\n", "\n", " Args:\n", " model (PaddedGenerator): The padded generator to compile for Neuron\n", " num_texts (int): The number of input texts to translate at once\n", " num_beams (int): The number of beams to compute per text\n", " max_decoder_length (int): The maximum number of tokens to be generated\n", " max_encoder_length (int): The maximum number of input tokens that will be encoded\n", " \"\"\"\n", "\n", " # Trace the encoder\n", " inputs = (\n", " torch.ones((num_texts, max_encoder_length), dtype=torch.long),\n", " torch.ones((num_texts, max_encoder_length), dtype=torch.long),\n", " )\n", " encoder = torch_neuron.trace(model.encoder, inputs)\n", "\n", " # Trace the decoder (with expanded inputs)\n", " batch_size = num_texts * num_beams\n", " inputs = (\n", " torch.ones((batch_size, max_decoder_length), dtype=torch.long),\n", " torch.ones((batch_size, max_encoder_length), dtype=torch.long),\n", " torch.ones((batch_size, max_encoder_length, model.config.d_model), dtype=torch.float),\n", " torch.tensor(0),\n", " )\n", " decoder = torch_neuron.trace(model.decoder, inputs)\n", " \n", " traced = PaddedGenerator(model.config)\n", " traced.encoder = encoder\n", " traced.decoder = decoder\n", " setattr(encoder, 'main_input_name', 'input_ids') # Attribute required by beam search\n", " return traced" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "padded_model_neuron = trace(padded_model_cpu, num_texts, num_beams, max_decoder_length, max_encoder_length)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comparing the Neuron execution to the original CPU implementation, you will see the exact same generated text.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# CPU execution for comparison\n", "infer(padded_model_neuron, tokenizer, sample_text)" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "### Padded Neuron Serialization\n", "Finally, we can test that we can serialize and reload the model so that it can be used later in its precompiled format." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "padded_model_neuron.save_pretrained('NeuronPaddedMarianMT')\n", "padded_model_loaded = PaddedGenerator.from_pretrained('NeuronPaddedMarianMT')\n", "infer(padded_model_loaded, tokenizer, sample_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Greedy Unrolled Model\n", "An unrolled version of the model can achieve better performance in some cases since all operations will be executed on the Neuron hardware without returning to CPU. The consequence of this type of model is that since the generation loop execution never returns to CPU, the entire sequence up to `max_decoder_length` is performed in a single forward pass.\n", "\n", "The following module performs greedy text generation. Unlike the original beam search text generation, this implementation always selects the most probable token and does not generate multiple result texts.\n", "\n", "### GreedyUnrolledGenerator Module" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class GreedyUnrolledGenerator(torch.nn.Module):\n", " \n", " def __init__(self, model):\n", " super().__init__()\n", " self.config = model.config\n", " self.model = model\n", " \n", " def forward(self, input_ids, attention_mask):\n", " \n", " # Generate the encoder state for the input tokens. This is only done once and the state is reused.\n", " encoder_outputs, = self.model.model.encoder(input_ids, attention_mask=attention_mask, return_dict=False)\n", " \n", " # Set the intial state for the decode loop. This will grow per decoder iteration\n", " tokens = torch.full((input_ids.size(0), 2), self.config.decoder_start_token_id)\n", " \n", " # Iteratively invoke the decoder on incrementally generated `tokens` to generate a `next_token`.\n", " # Note that unlike the GeneratorMixin.generate function, there is no early-exit if the stop token \n", " # has been reached. This will always run a fixed number of iterations.\n", " for i in range(self.config.max_length):\n", " \n", " hidden, = self.model.model.decoder(\n", " input_ids=tokens,\n", " encoder_hidden_states=encoder_outputs,\n", " encoder_attention_mask=attention_mask,\n", " return_dict=False,\n", " use_cache=False,\n", " ) # size: [batch, current_length, vocab_size]\n", " \n", " logits = F.linear(\n", " hidden[:, -1, :],\n", " self.model.model.shared.weight,\n", " bias=self.model.final_logits_bias\n", " )\n", " next_tokens = torch.argmax(logits, dim=1, keepdims=True)\n", " tokens = torch.cat([tokens, next_tokens], dim=1)\n", " \n", " return tokens" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Greedy CPU Inference\n", "The inference code must be updated since the `generate` method is no longer used. This is because the entire generative inference loop occurs within the `GreedyUnrolledGenerator.forward` method." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def infer_greedy(model, tokenizer, text):\n", " batch = tokenizer(text, max_length=max_decoder_length, truncation=True, padding='max_length', return_tensors=\"pt\")\n", " inputs = batch['input_ids'], batch['attention_mask']\n", " tokens = greedy_cpu(*inputs)\n", " print('Texts:')\n", " for i, t in enumerate(tokens):\n", " result = tokenizer.decode(t, skip_special_tokens=True)\n", " print(i + 1, result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like in previous section of this tutorial, first the greedy model is executed on CPU to validate that the correct results were produced. In this example, the generated text matches the first result of the original beam search." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_cpu.config.max_length = 8 # This controls the number of decoder loops. Reduced to improve compilation speed.\n", "greedy_cpu = GreedyUnrolledGenerator(model_cpu)\n", "infer_greedy(greedy_cpu, tokenizer, sample_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Greedy Neuron Tracing & Inference\n", "Similarly the tracing is simplified since the now the `GreedyUnrolledGenerator.forward` can be compiled as a single unit. \n", "\n", "For compilation efficiency, two changes will be made compared to normal compilaition:\n", "- `torch.jit.freeze` is used because it can *sometimes* speed up compilation by in the case where a module is re-used multiple times. In this case, it is more efficient because the `self.model.model.decoder` is used in a loop. \n", "- The `torch_neuron.trace` option `fallback` is set to `False`. This forces all operations to execute on Neuron. Most of the time this is not recommended or efficient. In this case, it is more efficient because it means a single subgraph is produced rather than many. Usually one subgraph would be produced per decoder iteration since `aten::embedding` is executed in a loop. The `aten::embedding` operation is otherwise exected on CPU by default since this is usually more efficient than executing on Neuron.\n", "\n", "You may notice that compilation will take significantly longer with the unrolled model since the model inserts new operations into the compute graph for every single decoder iteration. This creates a much larger model graph even though the weights are re-used." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "example = (\n", " torch.ones((num_texts, max_encoder_length), dtype=torch.long),\n", " torch.ones((num_texts, max_encoder_length), dtype=torch.long),\n", ")\n", "greedy_cpu.eval()\n", "greedy_trace = torch.jit.trace(greedy_cpu, example)\n", "greedy_frozen = torch.jit.freeze(greedy_trace)\n", "greedy_neuron = torch_neuron.trace(greedy_frozen, example, fallback=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "infer_greedy(greedy_neuron, tokenizer, sample_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Greedy Neuron Serialization\n", "Unlike the previous version of the model that used the `GenerationMixin` base class. This greedy version of the model can be serialized using the regular `torch.jit.save` and `torch.jit.load` utilities since it is a pure torchscript module." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "torch.jit.save(greedy_neuron, 'greedy_neuron.pt')\n", "loaded_greedy_neuron = torch.jit.load('greedy_neuron.pt')\n", "infer_greedy(loaded_greedy_neuron, tokenizer, sample_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Appendix\n", "### BART (Mask Filling Task)\n", "\n", "These `PaddedGenerator` class can be applied to the BART model for the task of filling in mask tokens.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from transformers import BartForConditionalGeneration, BartTokenizer\n", "bart_name = \"facebook/bart-large\"\n", "bart_model = BartForConditionalGeneration.from_pretrained(bart_name)\n", "bart_model.config.max_length = max_decoder_length\n", "bart_tokenizer = BartTokenizer.from_pretrained(bart_name)\n", "bart_text = \"UN Chief Says There Is No in Syria\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# CPU Execution\n", "infer(bart_model, bart_tokenizer, bart_text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Neuron Execution\n", "paddded_bart = PaddedGenerator.from_model(bart_model)\n", "bart_neuron = trace(paddded_bart, num_texts, num_beams, max_decoder_length, max_encoder_length)\n", "infer(bart_neuron, bart_tokenizer, bart_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pegasus (Summarization Task)\n", "\n", "These `PaddedGenerator` class can be applied to the Pegasus model for summarization.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from transformers import PegasusForConditionalGeneration, PegasusTokenizer\n", "pegasus_name = 'google/pegasus-xsum'\n", "pegasus_model = PegasusForConditionalGeneration.from_pretrained(pegasus_name)\n", "pegasus_model.config.max_length = max_decoder_length\n", "pegasus_tokenizer = PegasusTokenizer.from_pretrained(pegasus_name)\n", "pegasus_text = \"PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires.\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# CPU Execution\n", "infer(pegasus_model, pegasus_tokenizer, pegasus_text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Neuron Execution\n", "paddded_pegasus = PaddedGenerator.from_model(pegasus_model)\n", "pegasus_neuron = trace(paddded_pegasus, num_texts, num_beams, max_decoder_length, max_encoder_length)\n", "infer(pegasus_neuron, pegasus_tokenizer, pegasus_text)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 4 }