{ "cells": [ { "cell_type": "markdown", "id": "e91cf83b", "metadata": {}, "source": [ "# Running Huggingface DistilBERT with TensorFlow-Neuron" ] }, { "cell_type": "markdown", "id": "71394e1e", "metadata": {}, "source": [ "In this tutorial you will compile and deploy DistilBERT version of HuggingFace 🤗 Transformers BERT for Inferentia using TensorFlow-Neuron. The full list of HuggingFace's pretrained BERT models can be found in the BERT section on this page https://huggingface.co/transformers/pretrained_models.html. you can also read about HuggingFace's pipeline feature here: https://huggingface.co/transformers/main_classes/pipelines.html\n", "\n", "This Jupyter notebook should be run on an instance which is inf1.6xlarge or larger, but in real life scenario the compilation should be done on a compute instance and the deployment on inf1 instance to save costs." ] }, { "cell_type": "markdown", "id": "828ef9bd", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "id": "5becc549", "metadata": {}, "source": [ "To run this tutorial please follow the instructions for [TensorFlow-Neuron Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/tensorflow-neuron.html#setup-tensorflow-neuron) and the [Jupyter Notebook Quickstart](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html) and set your kernel to \"Python (tensorflow-neuron)\" .\n", "\n", "Next, install some additional dependencies." ] }, { "cell_type": "code", "execution_count": null, "id": "ee1a3b84", "metadata": {}, "outputs": [], "source": [ "%env TOKENIZERS_PARALLELISM=True #Supresses tokenizer warnings making errors easier to detect\n", "!pip install transformers==4.30.2\n", "!pip install ipywidgets" ] }, { "cell_type": "markdown", "id": "c301cfce", "metadata": {}, "source": [ "## Download From Huggingface and Compile for AWS-Neuron" ] }, { "cell_type": "code", "execution_count": null, "id": "92e8050d", "metadata": { "scrolled": true }, "outputs": [], "source": [ "import tensorflow as tf\n", "import tensorflow_neuron as tfn\n", "from transformers import DistilBertTokenizer, TFDistilBertModel\n", "\n", "# Create a wrapper for the roberta model that will accept inputs as a list\n", "# instead of a dictionary. This will allow the compiled model to be saved\n", "# to disk with the model.save() fucntion.\n", "class DistilBertWrapper(tf.keras.Model):\n", " def __init__(self, model):\n", " super().__init__()\n", " self.model = model\n", " def __call__(self, example_inputs):\n", " return self.model({'input_ids' : example_inputs[0], 'attention_mask' : example_inputs[1]})\n", " \n", "\n", "tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')\n", "model = DistilBertWrapper(TFDistilBertModel.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english'))\n", "\n", "batch_size = 16\n", "\n", "# create example inputs with a batch size of 16\n", "text = [\"Paris is the of France.\"] * batch_size\n", "encoded_input = tokenizer(text, return_tensors='tf', padding='max_length', max_length=64)\n", "\n", "# turn inputs into a list\n", "example_input = [encoded_input['input_ids'], encoded_input['attention_mask']]\n", "\n", "#compile\n", "model_neuron = tfn.trace(model, example_input)\n", "\n", "print(\"Running on neuron:\", model_neuron(example_input))\n", "\n", "# save the model to disk to save recompilation time for next usage\n", "model_neuron.save('./distilbert-neuron-b16')" ] }, { "cell_type": "markdown", "id": "0f2e159a", "metadata": {}, "source": [ "## Run Basic Inference Benchmarking" ] }, { "cell_type": "code", "execution_count": null, "id": "ccf22e74", "metadata": { "scrolled": true }, "outputs": [], "source": [ "import numpy as np\n", "import concurrent.futures\n", "import time\n", "\n", "reloaded_neuron_model = tf.keras.models.load_model('./distilbert-neuron-b16')\n", "print(\"Reloaded model running on neuron:\", reloaded_neuron_model(example_input))\n", "\n", "num_threads = 4\n", "num_inferences = 1000\n", "\n", "latency_list = []\n", "def inference_with_latency_calculation(example_input):\n", " global latency_list\n", " start = time.time()\n", " result = reloaded_neuron_model(example_input)\n", " end = time.time()\n", " latency_list.append((end-start) * 1000)\n", " return result\n", "\n", "start = time.time()\n", "with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:\n", " futures = []\n", " for i in range(num_inferences):\n", " futures.append(executor.submit(inference_with_latency_calculation, example_input))\n", " for future in concurrent.futures.as_completed(futures):\n", " get_result = future.result()\n", "end = time.time()\n", "\n", "total_time = end - start\n", "throughput = (num_inferences * batch_size)/total_time\n", "\n", "print(f\"Throughput was {throughput} samples per second.\")\n", "print(f\"Latency p50 was {np.percentile(latency_list, 50)} ms\")\n", "print(f\"Latency p90 was {np.percentile(latency_list, 90)} ms\")\n", "print(f\"Latency p95 was {np.percentile(latency_list, 95)} ms\")\n", "print(f\"Latency p99 was {np.percentile(latency_list, 99)} ms\")\n", "assert(throughput >= 1930.0)" ] }, { "cell_type": "code", "execution_count": null, "id": "b31b82fc", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" } }, "nbformat": 4, "nbformat_minor": 5 }