{ "cells": [ { "cell_type": "markdown", "id": "4674f667", "metadata": {}, "source": [ "# Deploy a pretrained PyTorch BERT model from HuggingFace on Amazon SageMaker with Neuron container" ] }, { "cell_type": "markdown", "id": "b3e39838", "metadata": {}, "source": [ "## Overview" ] }, { "cell_type": "markdown", "id": "a92c454f", "metadata": {}, "source": [ "In this tutotial we will deploy on SageMaker a pretraine BERT Base model from HuggingFace Transformers, using the [AWS Deep Learning Containers](https://github.com/aws/deep-learning-containers). We will use the same model as shown in the [Neuron Tutorial \"PyTorch - HuggingFace Pretrained BERT Tutorial\"](../../../../frameworks/torch/torch-neuronx/tutorials/training/bert.html#). We will compile the model and build a custom AWS Deep Learning Container, to include the HuggingFace Transformers Library. \n", "\n", "This Jupyter Notebook should run on a ml.c5.4xlarge SageMaker Notebook instance. You can set up your SageMaker Notebook instance by following the [Get Started with Amazon SageMaker Notebook Instances](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-console.html) documentation. \n", "\n", "> We recommend increasing the size of the base root volume of you SM notebook instance, to accomodate the models and containers built locally. A root volume of 10Gb should suffice. \n" ] }, { "cell_type": "markdown", "id": "37445ad2", "metadata": {}, "source": [ "## Install Dependencies:" ] }, { "cell_type": "markdown", "id": "3ecd765f", "metadata": {}, "source": [ "This tutorial requires the following pip packages:" ] }, { "cell_type": "markdown", "id": "cae3092c", "metadata": {}, "source": [ "- torch-neuron\n", "- neuron-cc[tensorflow]\n", "- transformers" ] }, { "cell_type": "code", "execution_count": null, "id": "066c3731", "metadata": {}, "outputs": [], "source": [ "%env TOKENIZERS_PARALLELISM=True #Supresses tokenizer warnings making errors easier to detect\n", "!pip install --upgrade --no-cache-dir torch-neuron neuron-cc[tensorflow] torchvision torch --extra-index-url=https://pip.repos.neuron.amazonaws.com\n", "!pip install --upgrade --no-cache-dir 'transformers==4.6.0'" ] }, { "cell_type": "markdown", "id": "a4796d3a", "metadata": {}, "source": [ "## Compile the model into an AWS Neuron optimized TorchScript" ] }, { "cell_type": "code", "execution_count": null, "id": "6fe85f8e", "metadata": {}, "outputs": [], "source": [ "import torch\n", "import torch_neuron\n", "\n", "from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig" ] }, { "cell_type": "code", "execution_count": null, "id": "0c5c253a", "metadata": {}, "outputs": [], "source": [ "# Build tokenizer and model\n", "tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased-finetuned-mrpc\")\n", "model = AutoModelForSequenceClassification.from_pretrained(\"bert-base-cased-finetuned-mrpc\", return_dict=False)\n", "\n", "# Setup some example inputs\n", "sequence_0 = \"The company HuggingFace is based in New York City\"\n", "sequence_1 = \"Apples are especially bad for your health\"\n", "sequence_2 = \"HuggingFace's headquarters are situated in Manhattan\"\n", "\n", "max_length=128\n", "paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, max_length=max_length, padding='max_length', truncation=True, return_tensors=\"pt\")\n", "not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, max_length=max_length, padding='max_length', truncation=True, return_tensors=\"pt\")\n", "\n", "# Run the original PyTorch model on compilation exaple\n", 
"paraphrase_classification_logits = model(**paraphrase)[0]\n", "\n", "# Convert example inputs to a format that is compatible with TorchScript tracing\n", "example_inputs_paraphrase = paraphrase['input_ids'], paraphrase['attention_mask'], paraphrase['token_type_ids']\n", "example_inputs_not_paraphrase = not_paraphrase['input_ids'], not_paraphrase['attention_mask'], not_paraphrase['token_type_ids']" ] }, { "cell_type": "code", "execution_count": null, "id": "44255ada", "metadata": {}, "outputs": [], "source": [ "%%time\n", "# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron\n", "# This step may need 3-5 min\n", "model_neuron = torch.neuron.trace(model, example_inputs_paraphrase, verbose=1, compiler_workdir='./compilation_artifacts')" ] }, { "cell_type": "markdown", "id": "5c4752ac", "metadata": {}, "source": [ "You may inspect **model_neuron.graph** to see which part is running on CPU versus running on the accelerator. All native **aten** operators in the graph will be running on CPU." ] }, { "cell_type": "code", "execution_count": null, "id": "dc00889e", "metadata": {}, "outputs": [], "source": [ "# See which part is running on CPU versus running on the accelerator.\n", "print(model_neuron.graph)" ] }, { "cell_type": "markdown", "id": "775fb30d", "metadata": {}, "source": [ "Save the compiled model, so it can be packaged and sent to S3." ] }, { "cell_type": "code", "execution_count": null, "id": "027c4f53", "metadata": {}, "outputs": [], "source": [ "# Save the TorchScript for later use\n", "model_neuron.save('neuron_compiled_model.pt')" ] }, { "cell_type": "markdown", "id": "d362c579", "metadata": {}, "source": [ "### Package the pre-trained model and upload it to S3\n", "\n", "To make the model available for the SageMaker deployment, you will TAR the serialized graph and upload it to the default Amazon S3 bucket for your SageMaker session. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "29c7f7b4", "metadata": {}, "outputs": [], "source": [ "# Now you'll create a model.tar.gz file to be used by SageMaker endpoint\n", "!tar -czvf model.tar.gz neuron_compiled_model.pt" ] }, { "cell_type": "code", "execution_count": null, "id": "1beadca0", "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import time\n", "from sagemaker.utils import name_from_base\n", "import sagemaker" ] }, { "cell_type": "code", "execution_count": null, "id": "06ad87d4", "metadata": {}, "outputs": [], "source": [ "# upload model to S3\n", "role = sagemaker.get_execution_role()\n", "sess=sagemaker.Session()\n", "region=sess.boto_region_name\n", "bucket=sess.default_bucket()\n", "sm_client=boto3.client('sagemaker')" ] }, { "cell_type": "code", "execution_count": null, "id": "5205ec55", "metadata": {}, "outputs": [], "source": [ "model_key = '{}/model/model.tar.gz'.format('inf1_compiled_model')\n", "model_path = 's3://{}/{}'.format(bucket, model_key)\n", "boto3.resource('s3').Bucket(bucket).upload_file('model.tar.gz', model_key)\n", "print(\"Uploaded model to S3:\")\n", "print(model_path)" ] }, { "cell_type": "markdown", "id": "e8b425d4", "metadata": {}, "source": [ "## Build and Push the container" ] }, { "cell_type": "markdown", "id": "430e6ed2", "metadata": {}, "source": [ "The following shell code shows how to build the container image using docker build and push the container image to ECR using docker push.\n", "The Dockerfile in this example is available in the ***container*** folder.\n", "Here's an example of the Dockerfile:\n", "\n", "```Dockerfile\n", "FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-neuron:1.7.1-neuron-py36-ubuntu18.04\n", "\n", "# Install packages \n", "RUN pip install \"transformers==4.7.0\"\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "3970025d", "metadata": {}, "outputs": [], "source": [ "!cat container/Dockerfile" ] }, { "cell_type": "markdown", "id": "62f78b0f", "metadata": {}, "source": [ "Before running the next cell, make sure your SageMaker IAM role has access to ECR. If not, you can attache the role `AmazonEC2ContainerRegistryPowerUser` to your IAM role ARN, which allows you to upload image layers to ECR. \n", "\n", "It takes 5 minutes to build docker images and upload image to ECR" ] }, { "cell_type": "code", "execution_count": null, "id": "ecd51acf", "metadata": {}, "outputs": [], "source": [ "%%sh\n", "\n", "# The name of our algorithm\n", "algorithm_name=neuron-py36-inference\n", "\n", "cd container\n", "\n", "account=$(aws sts get-caller-identity --query Account --output text)\n", "\n", "# Get the region defined in the current configuration (default to us-west-2 if none defined)\n", "region=$(aws configure get region)\n", "region=${region:-us-west-2}\n", "\n", "fullname=\"${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest\"\n", "\n", "# If the repository doesn't exist in ECR, create it.\n", "\n", "aws ecr describe-repositories --repository-names \"${algorithm_name}\" > /dev/null 2>&1\n", "\n", "if [ $? 
-ne 0 ]\n", "then\n", " aws ecr create-repository --repository-name \"${algorithm_name}\" > /dev/null\n", "fi\n", "\n", "# Get the login command from ECR in order to pull down the SageMaker PyTorch image\n", "aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com\n", "# Build the docker image locally with the image name and then push it to ECR\n", "# with the full name.\n", "docker build -t ${algorithm_name} . --build-arg REGION=${region}\n", "docker tag ${algorithm_name} ${fullname}\n", "\n", "# Get the login command from ECR and execute it directly\n", "aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin ${account}.dkr.ecr.${region}.amazonaws.com\n", "docker push ${fullname}" ] }, { "cell_type": "markdown", "id": "e4f6bbda", "metadata": {}, "source": [ "## Deploy Container and run inference based on the pretrained model" ] }, { "cell_type": "markdown", "id": "64e65e31", "metadata": {}, "source": [ "To deploy a pretrained PyTorch model, you'll need to use the PyTorch estimator object to create a PyTorchModel object and set a different entry_point.\n", "\n", "You'll use the PyTorchModel object to deploy a PyTorchPredictor. This creates a SageMaker Endpoint -- a hosted prediction service that we can use to perform inference." ] }, { "cell_type": "code", "execution_count": null, "id": "f343d3b1", "metadata": {}, "outputs": [], "source": [ "import sys\n", "\n", "!{sys.executable} -m pip install Transformers" ] }, { "cell_type": "code", "execution_count": null, "id": "2bd73b77", "metadata": {}, "outputs": [], "source": [ "import os\n", "import boto3\n", "import sagemaker\n", "\n", "role = sagemaker.get_execution_role()\n", "sess = sagemaker.Session()\n", "\n", "bucket = sess.default_bucket()\n", "prefix = \"inf1_compiled_model/model\"\n", "\n", "# Get container name in ECR\n", "client=boto3.client('sts')\n", "account=client.get_caller_identity()['Account']\n", "\n", "my_session=boto3.session.Session()\n", "region=my_session.region_name\n", "\n", "algorithm_name=\"neuron-py36-inference\"\n", "ecr_image='{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, algorithm_name)\n", "print(ecr_image)" ] }, { "cell_type": "markdown", "id": "9298f2a7", "metadata": {}, "source": [ "An implementation of *model_fn* is required for inference script.\n", "We are going to implement our own **model_fn** and **predict_fn** for Hugging Face Bert, and use default implementations of **input_fn** and **output_fn** defined in sagemaker-pytorch-containers.\n", "\n", "In this example, the inference script is put in ***code*** folder. Run the next cell to see it:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "cfea75b6", "metadata": {}, "outputs": [], "source": [ "!pygmentize code/inference.py" ] }, { "cell_type": "markdown", "id": "1b31a7b8", "metadata": {}, "source": [ "Path of compiled pretrained model in S3:" ] }, { "cell_type": "code", "execution_count": null, "id": "61f3556e", "metadata": {}, "outputs": [], "source": [ "key = os.path.join(prefix, \"model.tar.gz\")\n", "pretrained_model_data = \"s3://{}/{}\".format(bucket, key)\n", "print(pretrained_model_data)" ] }, { "cell_type": "markdown", "id": "e7557a5f", "metadata": {}, "source": [ "The model object is defined by using the SageMaker Python SDK's PyTorchModel and pass in the model from the estimator and the entry_point. 
{ "cell_type": "code", "execution_count": null, "id": "cfea75b6", "metadata": {}, "outputs": [], "source": [ "!pygmentize code/inference.py" ] },
{ "cell_type": "markdown", "id": "1b31a7b8", "metadata": {}, "source": [ "Path of the compiled pretrained model in S3:" ] },
{ "cell_type": "code", "execution_count": null, "id": "61f3556e", "metadata": {}, "outputs": [], "source": [ "key = os.path.join(prefix, \"model.tar.gz\")\n", "pretrained_model_data = \"s3://{}/{}\".format(bucket, key)\n", "print(pretrained_model_data)" ] },
{ "cell_type": "markdown", "id": "e7557a5f", "metadata": {}, "source": [ "The model object is defined using the SageMaker Python SDK's PyTorchModel, passing in the S3 location of the model archive and the entry_point. The endpoint's entry point for inference is the model_fn defined in **inference.py**, printed above; it loads the model and the required tokenizer.\n", "\n", "Note that **image_uri** must point to your own ECR image." ] },
{ "cell_type": "code", "execution_count": null, "id": "0bd99768", "metadata": {}, "outputs": [], "source": [ "from sagemaker.pytorch.model import PyTorchModel\n", "\n", "pytorch_model = PyTorchModel(\n", "    model_data=pretrained_model_data,\n", "    role=role,\n", "    source_dir=\"code\",\n", "    framework_version=\"1.7.1\",\n", "    entry_point=\"inference.py\",\n", "    image_uri=ecr_image\n", ")\n", "\n", "# Let SageMaker know that we've already compiled the model via neuron-cc\n", "pytorch_model._is_compiled_model = True" ] },
{ "cell_type": "markdown", "id": "67439fe7", "metadata": {}, "source": [ "The arguments to the deploy function allow us to set the number and type of instances that will be used for the Endpoint.\n", "\n", "Here you will deploy the model to a single **ml.inf1.2xlarge** instance.\n", "It may take 6-10 minutes to deploy." ] },
{ "cell_type": "code", "execution_count": null, "id": "d771fc7c", "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "predictor = pytorch_model.deploy(initial_instance_count=1, instance_type=\"ml.inf1.2xlarge\")" ] },
{ "cell_type": "code", "execution_count": null, "id": "ab6342f3", "metadata": {}, "outputs": [], "source": [ "print(predictor.endpoint_name)" ] },
{ "cell_type": "markdown", "id": "059537d9", "metadata": {}, "source": [ "Since input_fn declares that incoming requests are JSON-encoded, we need to use a JSON serializer to encode the request data into a JSON string. Likewise, since the declared response content type is JSON, we need a JSON deserializer to parse the response." ] },
{ "cell_type": "code", "execution_count": null, "id": "29e82f90", "metadata": {}, "outputs": [], "source": [ "predictor.serializer = sagemaker.serializers.JSONSerializer()\n", "predictor.deserializer = sagemaker.deserializers.JSONDeserializer()" ] },
{ "cell_type": "markdown", "id": "d006ea03", "metadata": {}, "source": [ "Now invoke the SageMaker endpoint with a list of sentences to get predictions." ] },
{ "cell_type": "code", "execution_count": null, "id": "325a87f8", "metadata": {}, "outputs": [], "source": [ "%%time\n", "result = predictor.predict(\n", "    [\n", "        \"Never allow the same bug to bite you twice.\",\n", "        \"The best part of Amazon SageMaker is that it makes machine learning easy.\",\n", "    ]\n", ")\n", "print(result)" ] },
{ "cell_type": "code", "execution_count": null, "id": "4a12410d", "metadata": {}, "outputs": [], "source": [ "%%time\n", "result = predictor.predict(\n", "    [\n", "        \"The company HuggingFace is based in New York City\",\n", "        \"HuggingFace's headquarters are situated in Manhattan\",\n", "    ]\n", ")\n", "print(result)" ] },
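{ "cell_type": "markdown", "id": "f6a7b8c9", "metadata": {}, "source": [ "The endpoint can also be invoked without the SageMaker Python SDK, for example from an application, through the low-level `sagemaker-runtime` client. A minimal sketch, assuming the same JSON request/response format used above:" ] },
{ "cell_type": "code", "execution_count": null, "id": "a7b8c9d0", "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "# Invoke the endpoint directly through the low-level runtime API\n", "runtime = boto3.client('sagemaker-runtime')\n", "response = runtime.invoke_endpoint(\n", "    EndpointName=predictor.endpoint_name,\n", "    ContentType='application/json',\n", "    Body=json.dumps([\"Never allow the same bug to bite you twice.\"]),\n", ")\n", "print(json.loads(response['Body'].read()))" ] },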
" ] }, { "cell_type": "code", "execution_count": null, "id": "088d0e75", "metadata": {}, "outputs": [], "source": [ "import numpy as np \n", "import datetime\n", "import math\n", "import time\n", "import boto3 \n", "import matplotlib.pyplot as plt\n", "from joblib import Parallel, delayed\n", "import numpy as np\n", "from tqdm import tqdm\n", "import random" ] }, { "cell_type": "code", "execution_count": null, "id": "038d9953", "metadata": {}, "outputs": [], "source": [ "def inference_latency(model,*inputs):\n", " \"\"\"\n", " infetence_time is a simple method to return the latency of a model inference.\n", "\n", " Parameters:\n", " model: torch model onbject loaded using torch.jit.load\n", " inputs: model() args\n", "\n", " Returns:\n", " latency in seconds\n", " \"\"\"\n", " error = False\n", " start = time.time()\n", " try:\n", " results = model(*inputs)\n", " except:\n", " error = True\n", " results = []\n", " return {'latency':time.time() - start, 'error': error, 'result': results}" ] }, { "cell_type": "code", "execution_count": null, "id": "d6b200ac", "metadata": {}, "outputs": [], "source": [ "def random_sentence():\n", " \n", " s_nouns = [\"A dude\", \"My mom\", \"The king\", \"Some guy\", \"A cat with rabies\", \"A sloth\", \"Your homie\", \"This cool guy my gardener met yesterday\", \"Superman\"]\n", " p_nouns = [\"These dudes\", \"Both of my moms\", \"All the kings of the world\", \"Some guys\", \"All of a cattery's cats\", \"The multitude of sloths living under your bed\", \"Your homies\", \"Like, these, like, all these people\", \"Supermen\"]\n", " s_verbs = [\"eats\", \"kicks\", \"gives\", \"treats\", \"meets with\", \"creates\", \"hacks\", \"configures\", \"spies on\", \"retards\", \"meows on\", \"flees from\", \"tries to automate\", \"explodes\"]\n", " p_verbs = [\"eat\", \"kick\", \"give\", \"treat\", \"meet with\", \"create\", \"hack\", \"configure\", \"spy on\", \"retard\", \"meow on\", \"flee from\", \"try to automate\", \"explode\"]\n", " infinitives = [\"to make a pie.\", \"for no apparent reason.\", \"because the sky is green.\", \"for a disease.\", \"to be able to make toast explode.\", \"to know more about archeology.\"]\n", " \n", " return (random.choice(s_nouns) + ' ' + random.choice(s_verbs) + ' ' + random.choice(s_nouns).lower() or random.choice(p_nouns).lower() + ' ' + random.choice(infinitives))\n", "\n", "print([random_sentence(), random_sentence()])" ] }, { "cell_type": "markdown", "id": "e2945dde", "metadata": {}, "source": [ "The following cell creates `number_of_clients` concurrent threads to run `number_of_runs` requests. Once completed, a `boto3` CloudWatch client will query for the server side latency metrics for comparison. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "69c047e3", "metadata": {}, "outputs": [], "source": [ "# Defining Auxiliary variables\n", "number_of_clients = 2\n", "number_of_runs = 1000\n", "t = tqdm(range(number_of_runs),position=0, leave=True)\n", "\n", "# Starting parallel clients\n", "cw_start = datetime.datetime.utcnow()\n", "\n", "results = Parallel(n_jobs=number_of_clients,prefer=\"threads\")(delayed(inference_latency)(predictor.predict,[random_sentence(), random_sentence()]) for mod in t)\n", "avg_throughput = t.total/t.format_dict['elapsed']\n", "\n", "cw_end = datetime.datetime.utcnow() \n", "\n", "# Computing metrics and print\n", "latencies = [res['latency'] for res in results]\n", "errors = [res['error'] for res in results]\n", "error_p = sum(errors)/len(errors) *100\n", "p50 = np.quantile(latencies[-1000:],0.50) * 1000\n", "p90 = np.quantile(latencies[-1000:],0.95) * 1000\n", "p95 = np.quantile(latencies[-1000:],0.99) * 1000\n", "\n", "print(f'Avg Throughput: :{avg_throughput:.1f}\\n')\n", "print(f'50th Percentile Latency:{p50:.1f} ms')\n", "print(f'90th Percentile Latency:{p90:.1f} ms')\n", "print(f'95th Percentile Latency:{p95:.1f} ms\\n')\n", "print(f'Errors percentage: {error_p:.1f} %\\n')\n", "\n", "# Querying CloudWatch\n", "print('Getting Cloudwatch:')\n", "cloudwatch = boto3.client('cloudwatch')\n", "statistics=['SampleCount', 'Average', 'Minimum', 'Maximum']\n", "extended=['p50', 'p90', 'p95', 'p100']\n", "\n", "# Give 5 minute buffer to end\n", "cw_end += datetime.timedelta(minutes=5)\n", "\n", "# Period must be 1, 5, 10, 30, or multiple of 60\n", "# Calculate closest multiple of 60 to the total elapsed time\n", "factor = math.ceil((cw_end - cw_start).total_seconds() / 60)\n", "period = factor * 60\n", "print('Time elapsed: {} seconds'.format((cw_end - cw_start).total_seconds()))\n", "print('Using period of {} seconds\\n'.format(period))\n", "\n", "cloudwatch_ready = False\n", "# Keep polling CloudWatch metrics until datapoints are available\n", "while not cloudwatch_ready:\n", " time.sleep(30)\n", " print('Waiting 30 seconds ...')\n", " # Must use default units of microseconds\n", " model_latency_metrics = cloudwatch.get_metric_statistics(MetricName='ModelLatency',\n", " Dimensions=[{'Name': 'EndpointName',\n", " 'Value': predictor.endpoint_name},\n", " {'Name': 'VariantName',\n", " 'Value': \"AllTraffic\"}],\n", " Namespace=\"AWS/SageMaker\",\n", " StartTime=cw_start,\n", " EndTime=cw_end,\n", " Period=period,\n", " Statistics=statistics,\n", " ExtendedStatistics=extended\n", " )\n", " # Should be 1000\n", " if len(model_latency_metrics['Datapoints']) > 0:\n", " print('{} latency datapoints ready'.format(model_latency_metrics['Datapoints'][0]['SampleCount']))\n", " side_avg = model_latency_metrics['Datapoints'][0]['Average'] / number_of_runs\n", " side_p50 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p50'] / number_of_runs\n", " side_p90 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p90'] / number_of_runs\n", " side_p95 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p95'] / number_of_runs\n", " side_p100 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p100'] / number_of_runs\n", " \n", " print(f'50th Percentile Latency:{side_p50:.1f} ms')\n", " print(f'90th Percentile Latency:{side_p90:.1f} ms')\n", " print(f'95th Percentile Latency:{side_p95:.1f} ms\\n')\n", "\n", " cloudwatch_ready = True\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "9035e681", "metadata": {}, "source": 
[ "### Cleanup\n", "Endpoints should be deleted when no longer in use, to avoid costs." ] }, { "cell_type": "code", "execution_count": null, "id": "1284ef3f", "metadata": {}, "outputs": [], "source": [ "predictor.delete_endpoint(predictor.endpoint)" ] }, { "cell_type": "code", "execution_count": null, "id": "5af53873", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.9 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.9" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 5 }