{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Compiling and Deploying HuggingFace Pretrained BERT\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Introduction\n", "\n", "In this tutorial we will compile and deploy BERT-base version of HuggingFace 🤗 Transformers BERT for Inferentia. The full list of HuggingFace's pretrained BERT models can be found in the BERT section on this page https://huggingface.co/transformers/pretrained_models.html. \n", "\n", "This Jupyter notebook should be run on an instance which is inf1.6xlarge or larger. The compile part of this tutorial requires inf1.6xlarge and not the inference itself. For simplicity we will run this tutorial on inf1.6xlarge but in real life scenario the compilation should be done on a compute instance and the deployment on inf1 instance to save costs.\n", "\n", "Verify that this Jupyter notebook is running the Python kernel environment that was set up according to the [PyTorch Installation Guide](../../../../frameworks/torch/torch-neuron/setup/pytorch-install.html). You can select the kernel from the \"Kernel -> Change Kernel\" option on the top of this Jupyter notebook page." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install Dependencies:\n", "This tutorial requires the following pip packages:\n", "\n", "- `torch-neuron`\n", "- `neuron-cc[tensorflow]`\n", "- `transformers`\n", "\n", "Most of these packages will be installed when configuring your environment using the Neuron PyTorch setup guide. The additional dependencies must be installed here." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%env TOKENIZERS_PARALLELISM=True #Supresses tokenizer warnings making errors easier to detect\n", "!pip install --upgrade \"transformers==4.6.0\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compile the model into an AWS Neuron optimized TorchScript\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import tensorflow # to workaround a protobuf version conflict issue\n", "import torch\n", "import torch.neuron\n", "from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig\n", "import transformers\n", "import os\n", "import warnings\n", "\n", "# Setting up NeuronCore groups for inf1.6xlarge with 16 cores\n", "num_cores = 16 # This value should be 4 on inf1.xlarge and inf1.2xlarge\n", "os.environ['NEURON_RT_NUM_CORES'] = str(num_cores)\n", "\n", "# Build tokenizer and model\n", "tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased-finetuned-mrpc\")\n", "model = AutoModelForSequenceClassification.from_pretrained(\"bert-base-cased-finetuned-mrpc\", return_dict=False)\n", "\n", "# Setup some example inputs\n", "sequence_0 = \"The company HuggingFace is based in New York City\"\n", "sequence_1 = \"Apples are especially bad for your health\"\n", "sequence_2 = \"HuggingFace's headquarters are situated in Manhattan\"\n", "\n", "max_length=128\n", "paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, max_length=max_length, padding='max_length', truncation=True, return_tensors=\"pt\")\n", "not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, max_length=max_length, padding='max_length', truncation=True, return_tensors=\"pt\")\n", "\n", "# Run the original PyTorch model on compilation exaple\n", "paraphrase_classification_logits = model(**paraphrase)[0]\n", "\n", "# Convert example inputs to a format that is compatible with TorchScript tracing\n", "example_inputs_paraphrase = paraphrase['input_ids'], paraphrase['attention_mask'], paraphrase['token_type_ids']\n", "example_inputs_not_paraphrase = not_paraphrase['input_ids'], not_paraphrase['attention_mask'], not_paraphrase['token_type_ids']\n", "\n", "# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron\n", "model_neuron = torch.neuron.trace(model, example_inputs_paraphrase)\n", "\n", "# Verify the TorchScript works on both example inputs\n", "paraphrase_classification_logits_neuron = model_neuron(*example_inputs_paraphrase)\n", "not_paraphrase_classification_logits_neuron = model_neuron(*example_inputs_not_paraphrase)\n", "\n", "# Save the TorchScript for later use\n", "model_neuron.save('bert_neuron.pt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may inspect `model_neuron.graph` to see which part is running on CPU versus running on the accelerator. All native `aten` operators in the graph will be running on CPU." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(model_neuron.graph)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Deploy the AWS Neuron optimized TorchScript\n", "\n", "To deploy the AWS Neuron optimized TorchScript, you may choose to load the saved TorchScript from disk and skip the slow compilation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load TorchScript back\n", "model_neuron = torch.jit.load('bert_neuron.pt')\n", "# Verify the TorchScript works on both example inputs\n", "paraphrase_classification_logits_neuron = model_neuron(*example_inputs_paraphrase)\n", "not_paraphrase_classification_logits_neuron = model_neuron(*example_inputs_not_paraphrase)\n", "classes = ['not paraphrase', 'paraphrase']\n", "paraphrase_prediction = paraphrase_classification_logits_neuron[0][0].argmax().item()\n", "not_paraphrase_prediction = not_paraphrase_classification_logits_neuron[0][0].argmax().item()\n", "print('BERT says that \"{}\" and \"{}\" are {}'.format(sequence_0, sequence_2, classes[paraphrase_prediction]))\n", "print('BERT says that \"{}\" and \"{}\" are {}'.format(sequence_0, sequence_1, classes[not_paraphrase_prediction]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's run the model in parallel on four cores" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_input_with_padding(batch, batch_size, max_length):\n", " ## Reformulate the batch into three batch tensors - default batch size batches the outer dimension\n", " encoded = batch['encoded']\n", " inputs = torch.squeeze(encoded['input_ids'], 1)\n", " attention = torch.squeeze(encoded['attention_mask'], 1)\n", " token_type = torch.squeeze(encoded['token_type_ids'], 1)\n", " quality = list(map(int, batch['quality']))\n", "\n", " if inputs.size()[0] != batch_size:\n", " print(\"Input size = {} - padding\".format(inputs.size()))\n", " remainder = batch_size - inputs.size()[0]\n", " zeros = torch.zeros( [remainder, max_length], dtype=torch.long )\n", " inputs = torch.cat( [inputs, zeros] )\n", " attention = torch.cat( [attention, zeros] )\n", " token_type = torch.cat( [token_type, zeros] )\n", "\n", " assert(inputs.size()[0] == batch_size and inputs.size()[1] == max_length)\n", " assert(attention.size()[0] == batch_size and attention.size()[1] == max_length)\n", " assert(token_type.size()[0] == batch_size and token_type.size()[1] == max_length)\n", "\n", " return (inputs, attention, token_type), quality\n", "\n", "def count(output, quality):\n", " assert output.size(0) >= len(quality)\n", " correct_count = 0\n", " count = len(quality)\n", " \n", " batch_predictions = [ row.argmax().item() for row in output ]\n", "\n", " for a, b in zip(batch_predictions, quality):\n", " if int(a)==int(b):\n", " correct_count += 1\n", "\n", " return correct_count, count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data parallel inference\n", "In the below cell, we use the data parallel approach for inference. In this approach, we load multiple models, all of them running in parallel. Each model is loaded onto a single NeuronCore. In the below implementation, we launch 16 models, thereby utilizing all the 16 cores on an inf1.6xlarge.\n", "\n", "> Note: Now if you try to decrease the num_cores in the above cells, please restart the notebook and run `!sudo rmmod neuron; sudo modprobe neuron` step in cell 2 to clear the Neuron cores.\n", "\n", "Since, we can run more than 1 model concurrently, the throughput for the system goes up. To achieve maximum gain in throughput, we need to efficiently feed the models so as to keep them busy at all times. In the below setup, this is done by using a producer-consumer model. We maintain a common python queue shared across all the models. The common queue enables feeding data continuously to the models." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from parallel import NeuronSimpleDataParallel\n", "from bert_benchmark_utils import BertTestDataset, BertResults\n", "import time\n", "import functools\n", "\n", "max_length = 128\n", "num_cores = 16\n", "batch_size = 1\n", "\n", "tsv_file=\"glue_mrpc_dev.tsv\"\n", "\n", "data_set = BertTestDataset( tsv_file=tsv_file, tokenizer=tokenizer, max_length=max_length )\n", "data_loader = torch.utils.data.DataLoader(data_set, batch_size=batch_size, shuffle=True)\n", "\n", "#Result aggregation class (code in bert_benchmark_utils.py)\n", "results = BertResults(batch_size, num_cores)\n", "def result_handler(output, result_id, start, end, input_dict):\n", " correct_count, inference_count = count(output[0], input_dict.pop(result_id))\n", " elapsed = end - start\n", " results.add_result(correct_count, inference_count, [elapsed], [end], [start])\n", "\n", "parallel_neuron_model = NeuronSimpleDataParallel('bert_neuron.pt', num_cores)\n", "\n", "#Starting the inference threads\n", "parallel_neuron_model.start_continuous_inference()\n", "\n", "# Warm up the cores\n", "z = torch.zeros( [batch_size, max_length], dtype=torch.long )\n", "batch = (z, z, z)\n", "for _ in range(num_cores*4):\n", " parallel_neuron_model.infer(batch, -1, None)\n", " \n", "input_dict = {}\n", "input_id = 0\n", "for _ in range(30):\n", " for batch in data_loader:\n", " batch, quality = get_input_with_padding(batch, batch_size, max_length)\n", " input_dict[input_id] = quality\n", " callback_fn = functools.partial(result_handler, input_dict=input_dict)\n", " parallel_neuron_model.infer(batch, input_id, callback_fn)\n", " input_id+=1\n", "\n", "# Stop inference \n", "parallel_neuron_model.stop()\n", "\n", "\n", "with open(\"benchmark.txt\", \"w\") as f:\n", " results.report(f, window_size=1)\n", "\n", "with open(\"benchmark.txt\", \"r\") as f:\n", " for line in f:\n", " print(line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now recompile with a larger batch size of six sentence pairs" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "batch_size = 6\n", "\n", "example_inputs_paraphrase = (\n", " torch.cat([paraphrase['input_ids']] * batch_size,0), \n", " torch.cat([paraphrase['attention_mask']] * batch_size,0), \n", " torch.cat([paraphrase['token_type_ids']] * batch_size,0)\n", ")\n", "\n", "# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron\n", "model_neuron_batch = torch.neuron.trace(model, example_inputs_paraphrase)\n", "\n", "## Save the batched model\n", "model_neuron_batch.save('bert_neuron_b{}.pt'.format(batch_size))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rerun inference with batch 6" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from parallel import NeuronSimpleDataParallel\n", "from bert_benchmark_utils import BertTestDataset, BertResults\n", "import time\n", "import functools\n", "\n", "max_length = 128\n", "num_cores = 16\n", "batch_size = 6\n", "\n", "data_set = BertTestDataset( tsv_file=tsv_file, tokenizer=tokenizer, max_length=max_length )\n", "data_loader = torch.utils.data.DataLoader(data_set, batch_size=batch_size, shuffle=True)\n", "\n", "#Result aggregation class (code in bert_benchmark_utils.py)\n", "results = BertResults(batch_size, num_cores)\n", "def result_handler(output, result_id, start, end, input_dict):\n", " correct_count, inference_count = count(output[0], input_dict.pop(result_id))\n", " elapsed = end - start\n", " results.add_result(correct_count, inference_count, [elapsed], [end], [start])\n", "\n", "parallel_neuron_model = NeuronSimpleDataParallel('bert_neuron_b{}.pt'.format(batch_size), num_cores)\n", "\n", "#Starting the inference threads\n", "parallel_neuron_model.start_continuous_inference()\n", "\n", "# Adding to the input queue to warm all cores\n", "z = torch.zeros( [batch_size, max_length], dtype=torch.long )\n", "batch = (z, z, z)\n", "for _ in range(num_cores*4):\n", " parallel_neuron_model.infer(batch, -1, None)\n", "\n", "input_dict = {}\n", "input_id = 0\n", "for _ in range(30):\n", " for batch in data_loader:\n", " batch, quality = get_input_with_padding(batch, batch_size, max_length)\n", " input_dict[input_id] = quality\n", " callback_fn = functools.partial(result_handler, input_dict=input_dict)\n", " parallel_neuron_model.infer(batch, input_id, callback_fn)\n", " input_id+=1\n", "\n", "# Stop inference \n", "parallel_neuron_model.stop()\n", "\n", "with open(\"benchmark_b{}.txt\".format(batch_size), \"w\") as f:\n", " results.report(f, window_size=1)\n", "\n", "with open(\"benchmark_b{}.txt\".format(batch_size), \"r\") as f:\n", " for line in f:\n", " print(line)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.9 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.9" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 4 }