{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ResNet50 model for Inferentia\n", "\n", "\n", "## Introduction:\n", "\n", "In this tutorial we will compile and deploy a ResNet50 model for inference on Inferentia. \n", "\n", "This Jupyter notebook should run on an inf1.6xlarge instance. The inference part of this tutorial requires an inf1 instance, not the compilation stage. For simplicity we will run this tutorial on an inf1.6xlarge, but in real life scenarios the compilation should be done on a compute instance and the deployment on an inf1 instance to save costs. \n", "\n", "In this tutorial we provide three main sections:\n", "\n", "1. Compile the ResNet50 model and infer with a batch size of 1\n", "\n", "2. Run the same compiled model on multiple NeuronCores using `torch.neuron.DataParallel` and dynamic batching\n", "\n", "3. Compile the ResNet50 model with a batch size of 5 and run it on multiple NeuronCores using `torch.neuron.DataParallel` for optimal performance on Inferentia\n", "\n", "Verify that this Jupyter notebook is running the Python kernel environment that was set up according to the [PyTorch Installation Guide](../../../frameworks/torch/torch-neuron/setup/pytorch-install.html). You can select the kernel from the \"Kernel -> Change Kernel\" option on the top of this Jupyter notebook page." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install Dependencies:\n", "This tutorial requires the following pip packages:\n", "\n", "- `torch>=1.8`\n", "- `torch-neuron`\n", "- `torchvision`\n", "- `neuron-cc[tensorflow]`\n", "\n", "These will be installed by default when configuring your environment using the Neuron PyTorch setup guide." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compile model for Neuron\n", "\n", "The following step will compile the ResNet50 model for Inferentia. This will take a few minutes. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compile model for Neuron\n", "\n", "The following step will compile the ResNet50 model for Inferentia. This will take a few minutes. At the end of script execution, the compiled model is saved as `resnet50_neuron.pt` in your local directory." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "from torchvision import models, transforms, datasets\n", "import torch_neuron\n", "\n", "# Create an example input for compilation\n", "image = torch.zeros([1, 3, 224, 224], dtype=torch.float32)\n", "\n", "# Load a pretrained ResNet50 model\n", "model = models.resnet50(pretrained=True)\n", "\n", "# Tell the model we are using it for evaluation (not training)\n", "model.eval()\n", "\n", "# Analyze the model - this will show operator support and operator count\n", "torch.neuron.analyze_model(model, example_inputs=[image])\n", "\n", "# Compile the model using torch.neuron.trace to create a Neuron model\n", "# that is optimized for the Inferentia hardware\n", "model_neuron = torch.neuron.trace(model, example_inputs=[image])\n", "\n", "# The output of the compilation step will report the percentage of operators that \n", "# are compiled to Neuron, for example:\n", "#\n", "# INFO:Neuron:The neuron partitioner created 1 sub-graphs\n", "# INFO:Neuron:Neuron successfully compiled 1 sub-graphs, Total fused subgraphs = 1, Percent of model sub-graphs successfully compiled = 100.0%\n", "# \n", "# We will also be warned if there are operators that are not placed on the Inferentia hardware\n", "\n", "# Save the compiled model\n", "model_neuron.save(\"resnet50_neuron.pt\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Run inference on Inferentia\n", "\n", "We can use the compiled Neuron model to run inference on Inferentia.\n", "\n", "In the following example, we preprocess a sample image for inference using the CPU model and Neuron model. We compare the predicted labels from the CPU model and the Neuron model to verify that they are the same.\n", "\n", "Important: Do not perform inference with a Neuron traced model on a non-Neuron supported instance, as the results will not be calculated properly." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Define a preprocessing function\n", "\n", "We define a basic image preprocessing function that loads a sample image and labels, normalizes and batches the image, and transforms the image into a tensor for inference using the compiled Neuron model."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import os\n", "from urllib import request\n", "\n", "# Create an image directory containing a sample image of a small kitten\n", "os.makedirs(\"./torch_neuron_test/images\", exist_ok=True)\n", "request.urlretrieve(\"https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg\",\n", " \"./torch_neuron_test/images/kitten_small.jpg\")\n", "\n", "# Fetch labels to output the top classifications\n", "request.urlretrieve(\"https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json\",\"imagenet_class_index.json\")\n", "idx2label = []\n", "\n", "# Read the labels and create a list to hold them for classification \n", "with open(\"imagenet_class_index.json\", \"r\") as read_file:\n", " class_idx = json.load(read_file)\n", " idx2label = [class_idx[str(k)][1] for k in range(len(class_idx))]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def preprocess(batch_size=1, num_neuron_cores=1):\n", " # Define a normalization function using the ImageNet mean and standard deviation\n", " normalize = transforms.Normalize(\n", " mean=[0.485, 0.456, 0.406],\n", " std=[0.229, 0.224, 0.225])\n", "\n", " # Resize the sample image to [1, 3, 224, 224], normalize it, and turn it into a tensor\n", " eval_dataset = datasets.ImageFolder(\n", " os.path.dirname(\"./torch_neuron_test/\"),\n", " transforms.Compose([\n", " transforms.Resize([224, 224]),\n", " transforms.ToTensor(),\n", " normalize,\n", " ])\n", " )\n", " image, _ = eval_dataset[0]\n", " image = torch.tensor(image.numpy()[np.newaxis, ...])\n", "\n", " # Create a \"batched\" image with enough images to go on each of the available NeuronCores\n", " # batch_size is the per-core batch size\n", " # num_neuron_cores is the number of NeuronCores being used\n", " batch_image = image\n", " for i in range(batch_size * num_neuron_cores - 1):\n", " batch_image = torch.cat([batch_image, image], 0)\n", " \n", " return batch_image" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run inference using the Neuron model\n", "\n", "We import the necessary python modules, load the torch-neuron compiled model, and run inference on Inferentia. \n", "\n", "By default, the Neuron model will run on a single NeuronCore. In the next section, we will see how to run the Neuron model on multiple NeuronCores to fully saturate our hardware for optimal performance on Inferentia. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "from torchvision import models, transforms, datasets\n", "import torch_neuron\n", "\n", "# Get a sample image\n", "image = preprocess()\n", "\n", "# Run inference using the CPU model\n", "output_cpu = model(image)\n", "\n", "# Load the compiled Neuron model\n", "model_neuron = torch.jit.load('resnet50_neuron.pt')\n", "\n", "# Run inference using the Neuron model\n", "output_neuron = model_neuron(image)\n", "\n", "# Verify that the CPU and Neuron predictions are the same by comparing\n", "# the top-5 results\n", "top5_cpu = output_cpu[0].sort()[1][-5:]\n", "top5_neuron = output_neuron[0].sort()[1][-5:]\n", "\n", "# Lookup and print the top-5 labels\n", "top5_labels_cpu = [idx2label[idx] for idx in top5_cpu]\n", "top5_labels_neuron = [idx2label[idx] for idx in top5_neuron]\n", "print(\"CPU top-5 labels: {}\".format(top5_labels_cpu))\n", "print(\"Neuron top-5 labels: {}\".format(top5_labels_neuron))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run Inference using torch.neuron.DataParallel\n", "\n", "To fully leverage the Inferentia hardware we want to use all avaialable NeuronCores. An inf1.xlarge and inf1.2xlarge have four NeuronCores, an inf1.6xlarge has 16 NeuronCores, and an inf1.24xlarge has 64 NeuronCores. For maximum performance on Inferentia hardware, we can use `torch.neuron.DataParallel` to utilize all available NeuronCores.\n", "\n", "`torch.neuron.DataParallel` implements data parallelism at the module level by duplicating the Neuron model on all available NeuronCores and distributing data across the different cores for parallelized inference.\n", "\n", "In the following section, we will run inference using the `torch.neuron.DataParallel` module to fully saturate the Inferentia hardware. We benchmark the model to collect throughput and latency statistics.\n", "\n", "Note: `torch.neuron.DataParallel` is new with Neuron 1.16.0. Please ensure you are using the latest Neuron package to run the following sections. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define a benchmarking function\n", "\n", "We create a function that handles benchmarking the Neuron model to collect throughput and latency metrics. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from time import time\n", "\n", "def benchmark(model, image):\n", " print('Input image shape is {}'.format(list(image.shape)))\n", " \n", " # The first inference loads the model so exclude it from timing \n", " results = model(image)\n", " \n", " # Collect throughput and latency metrics\n", " latency = []\n", " throughput = []\n", "\n", " # Run inference for 100 iterations and calculate metrics\n", " num_infers = 100\n", " for _ in range(num_infers):\n", " delta_start = time()\n", " results = model(image)\n", " delta = time() - delta_start\n", " latency.append(delta)\n", " throughput.append(image.size(0)/delta)\n", " \n", " # Calculate and print the model throughput and latency\n", " print(\"Avg. 
"    print(\"Avg. Throughput: {:.0f} images/sec, Max Throughput: {:.0f} images/sec\".format(np.mean(throughput), np.max(throughput)))\n", "    print(\"Latency P50: {:.0f} ms\".format(np.percentile(latency, 50)*1000.0))\n", "    print(\"Latency P90: {:.0f} ms\".format(np.percentile(latency, 90)*1000.0))\n", "    print(\"Latency P95: {:.0f} ms\".format(np.percentile(latency, 95)*1000.0))\n", "    print(\"Latency P99: {:.0f} ms\\n\".format(np.percentile(latency, 99)*1000.0))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Run Inference using torch.neuron.DataParallel\n", "\n", "We create the `torch.neuron.DataParallel` module using the compiled Neuron model, get a sample image, and benchmark the parallelized model on Neuron." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a torch.neuron.DataParallel module using the compiled Neuron model\n", "# By default, torch.neuron.DataParallel will use four cores on an inf1.xlarge\n", "# or inf1.2xlarge, 16 cores on an inf1.6xlarge, and 64 cores on an inf1.24xlarge\n", "model_neuron_parallel = torch.neuron.DataParallel(model_neuron)\n", "\n", "# Get sample image with batch size=1 per NeuronCore\n", "batch_size = 1\n", "\n", "# For an inf1.xlarge or inf1.2xlarge, set num_neuron_cores = 4\n", "num_neuron_cores = 16\n", "\n", "image = preprocess(batch_size=batch_size, num_neuron_cores=num_neuron_cores)\n", "\n", "# Benchmark the model\n", "benchmark(model_neuron_parallel, image)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Run inference with dynamic batch sizes\n", "\n", "Batch size has a direct impact on model performance. The Inferentia chip is optimized to run with small batch sizes. This means that a Neuron compiled model can outperform a GPU model, even when running single-digit batch sizes.\n", "\n", "As a general best practice, we recommend optimizing your model's throughput by compiling the model with a small batch size and gradually increasing it to find the peak throughput on Inferentia.\n", "\n", "Dynamic batching is a feature that allows you to use tensor batch sizes that the Neuron model was not originally compiled against. This is necessary because the underlying Inferentia hardware will always execute inferences with the batch size used during compilation. Fixed batch size execution allows tuning the input batch size for optimal performance. For example, batch size 1 may be best suited for an ultra-low latency on-demand inference application, while batch size > 1 can be used to maximize throughput for offline inferencing. Dynamic batching is implemented by slicing large input tensors into chunks that match the batch size used during the `torch.neuron.trace` compilation call. \n", "\n", "The `torch.neuron.DataParallel` class automatically enables dynamic batching on eligible models. This allows us to run inference in applications that have inputs with a variable batch size without needing to recompile the model.\n", "\n", "In the following example, we use the same `torch.neuron.DataParallel` module to run inference using several different batch sizes. Notice that latency increases consistently as the batch size increases. Throughput increases as well, up until a certain point where the input size becomes too large to be efficient."
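, "\n", "\n", "Before running that sweep, the next cell is a small illustrative sketch of the slicing idea described above. It is not how `torch.neuron.DataParallel` is implemented internally; it simply reuses the batch-1 Neuron model and the `preprocess` helper defined earlier to show how a larger batch can be served in compiled-batch-size chunks." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch only: torch.neuron.DataParallel performs this slicing internally\n", "# and also distributes the chunks across NeuronCores.\n", "# A model compiled with batch size 1 can serve a larger batch by splitting the input\n", "# into chunks that match the compiled batch size and concatenating the outputs.\n", "compiled_batch_size = 1\n", "\n", "# Build a batch of 7 copies of the sample image using the helper defined above\n", "large_batch = preprocess(batch_size=7)\n", "\n", "# Slice the batch into chunks of the compiled batch size and run each chunk\n", "chunks = torch.split(large_batch, compiled_batch_size, dim=0)\n", "outputs = torch.cat([model_neuron(chunk) for chunk in chunks], dim=0)\n", "print('Sliced {} into {} chunks; combined output shape: {}'.format(\n", "    list(large_batch.shape), len(chunks), list(outputs.shape)))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We now benchmark the `torch.neuron.DataParallel` module with input batch sizes from 2 to 7."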
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# using the same DataParallel model_neuron_parallel model, we can run\n", "# inference on inputs with a variable batch size without recompiling\n", "batch_sizes = [2, 3, 4, 5, 6, 7]\n", "for batch_size in batch_sizes:\n", " print('Batch size: {}'.format(batch_size))\n", " image = preprocess(batch_size=batch_size, num_neuron_cores=num_neuron_cores)\n", " \n", " # Benchmark the model for each input batch size\n", " benchmark(model_neuron_parallel, image)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compile and Infer with different batch sizes on multiple NeuronCores\n", "\n", "Dynamic batching using small batch sizes can result in sub-optimal throughput because it involves slicing tensors into chunks and iteratively sending data to the hardware. Using a larger batch size at compilation time can use the Inferentia hardware more efficiently in order to maximize throughput. You can test the tradeoff between individual request latency and total throughput by fine-tuning the input batch size.\n", "\n", "In the following example, we recompile our model using a batch size of 5 and run the model using `torch.neuron.DataParallel` to fully saturate our Inferentia hardware for optimal performance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create an input with batch size 5 for compilation\n", "batch_size = 5\n", "image = torch.zeros([batch_size, 3, 224, 224], dtype=torch.float32)\n", "\n", "# Recompile the ResNet50 model for inference with batch size 5\n", "model_neuron = torch.neuron.trace(model, example_inputs=[image])\n", "\n", "# Export to saved model\n", "model_neuron.save(\"resnet50_neuron_b{}.pt\".format(batch_size))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run inference with batch size of 5 using the Neuron model compiled for a batch size of 5." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "batch_size = 5\n", "\n", "# Load compiled Neuron model\n", "model_neuron = torch.jit.load(\"resnet50_neuron_b{}.pt\".format(batch_size))\n", "\n", "# Create DataParallel model\n", "model_neuron_parallel = torch.neuron.DataParallel(model_neuron)\n", "\n", "# Get sample image with batch size=5\n", "image = preprocess(batch_size=batch_size, num_neuron_cores=num_neuron_cores)\n", "\n", "# Benchmark the model\n", "benchmark(model_neuron_parallel, image)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can experiment with different batch size values to see what gives the best overall throughput on Inferentia." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.9 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.9" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 4 }