This document is relevant for: Inf2, Trn1, Trn2

BERT TorchServe Tutorial

Overview

This tutorial demonstrates the use of TorchServe with Neuron, the SDK for EC2 Inf2 and Trn1 instances. By the end of this tutorial, you will understand how TorchServe can be used to serve a model backed by EC2 Inf2/Trn1 instances. We will use a pretrained BERT-Base model to determine if one sentence is a paraphrase of another.

Run the tutorial

Open a terminal, log into your remote instance, and activate a PyTorch virtual environment (see Install PyTorch Neuron). To complete this tutorial, you will also need a compiled BERT model. You can run trace_bert_neuronx.py to obtain a traced BERT model.

You should now have a compiled bert_neuron_b6.pt file, which is required going forward.

Open a shell on the instance you prepared earlier and create a new directory named torchserve. Copy your compiled model from the previous tutorial into this new directory, or run trace_bert_neuronx.py there to generate it.

cd torchserve
python trace_bert_neuronx.py
ls

bert_neuron_b6.pt

Prepare a new Python virtual environment with the necessary Neuron and TorchServe components. Use a virtual environment to keep (most of) the various tutorial components isolated from the rest of the system in a controlled way.

pip install transformers==4.20.1 torchserve==0.7.0 torch-model-archiver==0.7.0 captum==0.6.0

Install the system requirements for TorchServe.

Amazon Linux 2023 DLAMI Base

sudo dnf -y install jq java-11-amazon-corretto-headless
sudo alternatives --config java
sudo alternatives --config javac

Ubuntu 20 DLAMI Base

sudo apt install openjdk-11-jdk -y

java -version

openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment (build 11.0.17+8-post-Ubuntu-1ubuntu218.04)
OpenJDK 64-Bit Server VM (build 11.0.17+8-post-Ubuntu-1ubuntu218.04, mixed mode, sharing)

javac -version

javac 11.0.17

Verify that TorchServe is now available.

torchserve --version

TorchServe Version is 0.7.0
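The installed package versions can also be checked from Python. The following is a minimal sketch, assuming the packages were installed with pip into the active environment. The helper names installed_version and version_report are hypothetical, and the expected versions are copied from the pip command above.

```python
from importlib import metadata

# Versions pinned by the pip command in this tutorial.
EXPECTED = {
    "transformers": "4.20.1",
    "torchserve": "0.7.0",
    "torch-model-archiver": "0.7.0",
    "captum": "0.6.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_report(expected=EXPECTED):
    """Map each package name to a (expected, installed) pair."""
    return {name: (want, installed_version(name)) for name, want in expected.items()}
```

Calling version_report() and comparing the two entries of each pair shows which packages are missing or differ from the pinned versions.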

Setup TorchServe

During this tutorial you will need to download a few files onto your instance. The simplest way to accomplish this is to paste the download links provided above each file into a wget command. (We don’t provide the links directly because they are subject to change.) For example, right-click and copy the download link for config.json shown below.

Listing 1 config.json

{
    "model_name": "bert-base-cased-finetuned-mrpc",
    "max_length": 128,
    "batch_size": 6
}

Now execute the following in your shell:

wget <paste link here>
ls

bert_neuron_b6.pt  config.json

Download the custom handler script that will eventually respond to inference requests.

Listing 2 handler_bert_neuronx.py

import os
import json
import sys
import logging
from abc import ABC

import torch
import torch_neuronx

from transformers import AutoTokenizer
from ts.torch_handler.base_handler import BaseHandler


# one core per worker
os.environ['NEURON_RT_NUM_CORES'] = '1'

logger = logging.getLogger(__name__)

class BertEmbeddingHandler(BaseHandler, ABC):
    """
    Handler class for Bert Embedding computations.
    """
    def __init__(self):
        super(BertEmbeddingHandler, self).__init__()
        self.initialized = False

    def initialize(self, ctx):
        self.manifest = ctx.manifest
        properties = ctx.system_properties
        self.device = 'cpu'
        model_dir = properties.get('model_dir')
        serialized_file = self.manifest['model']['serializedFile']
        model_pt_path = os.path.join(model_dir, serialized_file)

        # load the settings from config.json (bundled into the archive via --extra-files)
        with open('config.json') as fp:
            config = json.load(fp)
        self.max_length = config['max_length']
        self.batch_size = config['batch_size']
        self.classes = ['not paraphrase', 'paraphrase']

        self.model = torch.jit.load(model_pt_path)
        logger.debug(f'Model loaded from {model_dir}')
        self.model.to(self.device)
        self.model.eval()

        self.tokenizer = AutoTokenizer.from_pretrained(config['model_name'])
        self.initialized = True

    def preprocess(self, input_data):
        """
        Tokenization pre-processing
        """

        input_ids = []
        attention_masks = []
        token_type_ids = []
        for row in input_data:
            seq_0 = row['seq_0'].decode('utf-8')
            seq_1 = row['seq_1'].decode('utf-8')
            logger.debug(f'Received text: "{seq_0}", "{seq_1}"')

            inputs = self.tokenizer.encode_plus(
                    seq_0,
                    seq_1,
                    max_length=self.max_length,
                    padding='max_length',
                    truncation=True,
                    return_tensors='pt'
                    )

            input_ids.append(inputs['input_ids'])
            attention_masks.append(inputs['attention_mask'])
            token_type_ids.append(inputs['token_type_ids'])

        batch = (torch.cat(input_ids, 0),
                torch.cat(attention_masks, 0),
                torch.cat(token_type_ids, 0))

        return batch

    def inference(self, inputs):
        """
        Predict the class of a text using a trained transformer model.
        """

        # sanity check dimensions
        assert(len(inputs) == 3)
        num_inferences = len(inputs[0])
        assert(num_inferences <= self.batch_size)

        # insert padding if we received a partial batch
        padding = self.batch_size - num_inferences
        if padding > 0:
            pad = torch.nn.ConstantPad1d((0, 0, 0, padding), value=0)
            inputs = [pad(x) for x in inputs]

        outputs = self.model(*inputs)[0]
        predictions = []
        for i in range(num_inferences):
            prediction = self.classes[outputs[i].argmax(dim=-1).item()]
            predictions.append([prediction])
            logger.debug("Model predicted: '%s'", prediction)
        return predictions

    def postprocess(self, inference_output):
        return inference_output

Next, we need to associate the handler script with the compiled model using torch-model-archiver. Run the following commands in your terminal:

mkdir model_store
MAX_LENGTH=$(jq '.max_length' config.json)
BATCH_SIZE=$(jq '.batch_size' config.json)
MODEL_NAME=bert-max_length$MAX_LENGTH-batch_size$BATCH_SIZE
torch-model-archiver --model-name "$MODEL_NAME" --version 1.0 --serialized-file ./bert_neuron_b6.pt --handler "./handler_bert_neuronx.py" --extra-files "./config.json" --export-path model_store

Note

If you modify your model or a dependency, you will need to rerun the archiver command with the -f flag appended to update the archive.

The result of the above will be a mar file inside the model_store directory.

ls model_store

bert-max_length128-batch_size6.mar

This file is essentially an archive associated with a fixed version of your model along with its dependencies (e.g. the handler code).
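To see what the archive contains, it can be opened with standard tools. The sketch below assumes the .mar file was produced in the archiver's default format, which is a ZIP container. The helper name list_mar_contents is hypothetical.

```python
import zipfile


def list_mar_contents(mar_path):
    """Return the sorted file names stored in a .mar archive (assumed to be a ZIP file)."""
    with zipfile.ZipFile(mar_path) as archive:
        return sorted(archive.namelist())
```

For example, list_mar_contents("model_store/bert-max_length128-batch_size6.mar") should list the compiled model, the handler script, and config.json alongside the archive's manifest.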

Note

The version specified in the torch-model-archiver command can be appended to REST API requests to access a specific version of your model. For example, if your model was hosted locally on port 8080 and named “bert”, the latest version of your model would be available at http://localhost:8080/predictions/bert, while version 1.0 would be accessible at http://localhost:8080/predictions/bert/1.0. We will see how to perform inference using this API later in this tutorial.
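The URL scheme described in this note can be sketched as a small helper. The function name prediction_url and its default host and port are illustrative assumptions that match the local example above.

```python
def prediction_url(model_name, version=None, host="localhost", port=8080):
    """Build the inference URL for a model, optionally pinned to one version."""
    base = f"http://{host}:{port}/predictions/{model_name}"
    # Without a version, the request goes to the unversioned endpoint.
    return base if version is None else f"{base}/{version}"


latest = prediction_url("bert")
pinned = prediction_url("bert", version="1.0")
```

Here latest is http://localhost:8080/predictions/bert and pinned is http://localhost:8080/predictions/bert/1.0.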

Create a custom config file to set some parameters. This file will be used to configure the server at launch when we run torchserve --start.

Listing 3 torchserve.config

# bind inference API to all network interfaces over plain HTTP (SSL not enabled)
inference_address=http://0.0.0.0:8080
default_workers_per_model=1

Note

This will cause TorchServe to bind on all interfaces. For security in real-world applications, you’ll probably want to use port 8443 and enable SSL.

Run TorchServe

It’s time to start the server. Typically we’d want to launch this in a separate console, but for this demo we’ll just redirect output to a file.

torchserve --start --ncs --model-store model_store --ts-config torchserve.config >torchserve.log 2>&1

Verify that the server seems to have started okay.

curl http://127.0.0.1:8080/ping

{
  "status": "Healthy"
}

Note

If you get an error when trying to ping the server, you may have tried before the server was fully launched. Check torchserve.log for details.
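Rather than retrying the ping by hand, a script can poll until the server answers. This is a minimal sketch, assuming the ping endpoint and the "Healthy" response shown above. The function names, the retry count, and the delay are hypothetical choices, not TorchServe settings.

```python
import json
import time
import urllib.error
import urllib.request


def ping_once(url="http://127.0.0.1:8080/ping", timeout=2.0):
    """Return True if the ping endpoint answers with status "Healthy"."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response).get("status") == "Healthy"
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON body: treat as not ready yet.
        return False


def wait_until_healthy(ping=ping_once, attempts=30, delay=1.0, sleep=time.sleep):
    """Call ping() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if ping():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling wait_until_healthy() right after starting the server returns True once the ping succeeds, or False if the server never responds within the allotted attempts.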

Use the Management API to instruct TorchServe to load our model.

First, determine the number of NeuronCores available based on your instance size.

Inf2

Instance Size	# of NeuronCores
xlarge	2
8xlarge	2
24xlarge	12
48xlarge	24

Trn1

Instance Size	# of NeuronCores
2xlarge	2
32xlarge	32

MAX_BATCH_DELAY=5000 # ms timeout before a partial batch is processed
INITIAL_WORKERS=2 # Number from table above
curl -X POST "http://localhost:8081/models?url=$MODEL_NAME.mar&batch_size=$BATCH_SIZE&initial_workers=$INITIAL_WORKERS&max_batch_delay=$MAX_BATCH_DELAY"

{
  "status": "Model \"bert-max_length128-batch_size6\" Version: 1.0 registered with X initial workers"
}
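The registration request above can also be assembled in Python. The sketch below only builds the Management API URL used by the curl command. The function name register_url is hypothetical, and the default address assumes the management port 8081 shown above.

```python
from urllib.parse import urlencode


def register_url(model_name, batch_size, initial_workers, max_batch_delay_ms,
                 management="http://localhost:8081"):
    """Build the model registration URL with the same query parameters as the curl call."""
    query = urlencode({
        "url": f"{model_name}.mar",
        "batch_size": batch_size,
        "initial_workers": initial_workers,
        "max_batch_delay": max_batch_delay_ms,  # milliseconds
    })
    return f"{management}/models?{query}"


url = register_url("bert-max_length128-batch_size6", 6, 2, 5000)
```

Sending an HTTP POST to url (for example with requests.post) is equivalent to the curl command above.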

Warning

You shouldn’t set INITIAL_WORKERS above the number of NeuronCores. If you attempt to load more models than there are NeuronCores available, one of two things will occur: either the extra models will fit in device memory but performance will suffer, or you will encounter an error on your initial inference. However, you may want to use fewer cores if you are using the NeuronCore Pipeline feature.
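One way to respect this limit is to cap the requested worker count at the NeuronCore count from the tables above. The following sketch hard-codes those tables. The names NEURON_CORES and clamp_initial_workers are hypothetical, and the instance type is assumed to be written as family.size (for example inf2.xlarge).

```python
# NeuronCore counts copied from the Inf2 and Trn1 tables in this tutorial.
NEURON_CORES = {
    "inf2": {"xlarge": 2, "8xlarge": 2, "24xlarge": 12, "48xlarge": 24},
    "trn1": {"2xlarge": 2, "32xlarge": 32},
}


def clamp_initial_workers(instance_type, requested):
    """Return the requested worker count, capped at the instance's NeuronCore count."""
    if requested < 1:
        raise ValueError("at least one worker is required")
    family, size = instance_type.lower().split(".", 1)
    cores = NEURON_CORES[family][size]  # KeyError for instance types not in the tables
    return min(requested, cores)
```

For example, clamp_initial_workers("inf2.xlarge", 4) returns 2, while clamp_initial_workers("trn1.32xlarge", 8) returns 8.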

Note

Any additional attempts to configure the model after the initial curl request will cause the server to return a 409 error. You’ll need to stop/start/configure the server to realize any changes.

The MAX_BATCH_DELAY is a timeout value that determines how long to wait before processing a partial batch. This is why the handler code needs to check the batch dimension and potentially add padding. TorchServe will instantiate the number of model handlers indicated by INITIAL_WORKERS, so this value controls how many models we will load onto the Neuron device in parallel. If you want to control worker scaling more dynamically, see the docs.
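The padding logic can be illustrated without tensors. The sketch below uses plain Python lists in place of the torch tensors in the handler, and the function names are hypothetical. It shows the idea only: pad a partial batch up to the compiled batch size, then keep only the outputs that correspond to real requests.

```python
def pad_batch(rows, batch_size, pad_row):
    """Append copies of pad_row until the batch has exactly batch_size rows."""
    if len(rows) > batch_size:
        raise ValueError("received more rows than the compiled batch size")
    missing = batch_size - len(rows)
    return list(rows) + [list(pad_row) for _ in range(missing)]


def trim_outputs(outputs, num_real):
    """Discard the outputs produced for padding rows."""
    return outputs[:num_real]


# Two real requests arrive before the timeout; the model expects six rows.
real_rows = [[101, 7, 102], [101, 9, 102]]
batch = pad_batch(real_rows, batch_size=6, pad_row=[0, 0, 0])
fake_outputs = [sum(row) for row in batch]  # stand-in for a model call
results = trim_outputs(fake_outputs, len(real_rows))
```

The batch passed to the stand-in model has six rows, but only the first two results are returned to callers, which mirrors how the handler loops over num_inferences rather than batch_size.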

It looks like everything is running successfully at this point, so it’s time for an inference.

Create the infer_bert.py file below on your instance.

Listing 4 infer_bert.py

import json
import concurrent.futures
import requests

with open('config.json') as fp:
    config = json.load(fp)
max_length = config['max_length']
batch_size = config['batch_size']
name = f'bert-max_length{max_length}-batch_size{batch_size}'

# dispatch requests in parallel
url = f'http://localhost:8080/predictions/{name}'
paraphrase = {'seq_0': "HuggingFace's headquarters are situated in Manhattan",
        'seq_1': "The company HuggingFace is based in New York City"}
not_paraphrase = {'seq_0': paraphrase['seq_0'], 'seq_1': 'This is total nonsense.'}

with concurrent.futures.ThreadPoolExecutor(max_workers=batch_size) as executor:
    def worker_thread(worker_index):
        # we'll send half the requests as not_paraphrase examples for sanity
        data = paraphrase if worker_index < batch_size//2 else not_paraphrase
        try:
            response = requests.post(url, data=data)

            # Check if the response status code indicates success
            if response.status_code == 200:
                print(worker_index, response.json())
            else:
                # If the response is not successful, raise an exception with the status code and error message
                error_message = response.json().get('message', 'Unknown Error')
                raise Exception(f"Failed request with status code {response.status_code}: {error_message}")
        except Exception as e:
            # Catch all other exceptions that may be raised
            print(f"An unexpected error occurred: {e}")
            raise

    for worker_index in range(batch_size):
        executor.submit(worker_thread, worker_index)

This script will send a batch_size number of requests to our model. In this example, we are using a model that estimates the probability that one sentence is a paraphrase of another. The script sends positive examples in the first half of the batch and negative examples in the second half.

Execute the script in your terminal.

python infer_bert.py

['paraphrase']
['not paraphrase']
['not paraphrase']
['paraphrase']
['not paraphrase']
['paraphrase']

The first three threads (0, 1, 2) send the paraphrase pair and should report paraphrase, while the remaining threads should report not paraphrase. Because the requests are dispatched in parallel, the results may print in any order. If we instead modify the script to send an incomplete batch and then wait for the timeout to expire, the excess padding results will be discarded.

Benchmark TorchServe

We’ve seen how to perform a single batched inference, but how many inferences can we process per second? A separate upcoming tutorial will document performance tuning to maximize throughput. In the meantime, we can still perform a simple naïve stress test. The code below will spawn 64 worker threads, with each thread repeatedly sending single requests that TorchServe groups into batches. A separate thread will periodically print throughput and latency measurements.

Listing 5 benchmark_bert.py

import os
import argparse
import time
import numpy as np
import requests
import sys
from concurrent import futures

import torch


parser = argparse.ArgumentParser()
parser.add_argument('--url', help='Torchserve model URL', type=str, default=f'http://127.0.0.1:8080/predictions/bert-max_length128-batch_size6')
parser.add_argument('--num_thread', type=int, default=64, help='Number of threads invoking the model URL')
parser.add_argument('--batch_size', type=int, default=6)
parser.add_argument('--sequence_length', type=int, default=128)
parser.add_argument('--latency_window_size', type=int, default=1000)
parser.add_argument('--throughput_time', type=int, default=300)
parser.add_argument('--throughput_interval', type=int, default=10)
args = parser.parse_args()

data = { 'seq_0': 'A completely made up sentence.',
    'seq_1': 'Well, I suppose they are all made up.' }
live = True
num_infer = 0
latency_list = []


def one_thread(pred, feed_data):
    global latency_list
    global num_infer
    global live
    session = requests.Session()
    while True:
        start = time.time()
        result = session.post(pred, data=feed_data)
        latency = time.time() - start
        latency_list.append(latency)
        num_infer += 1
        if not live:
            break


def current_performance():
    last_num_infer = num_infer
    for _ in range(args.throughput_time // args.throughput_interval):
        current_num_infer = num_infer
        throughput = (current_num_infer - last_num_infer) / args.throughput_interval
        p50 = 0.0
        p90 = 0.0
        if latency_list:
            p50 = np.percentile(latency_list[-args.latency_window_size:], 50)
            p90 = np.percentile(latency_list[-args.latency_window_size:], 90)
        print('pid {}: current throughput {}, latency p50={:.3f} p90={:.3f}'.format(os.getpid(), throughput, p50, p90))
        sys.stdout.flush()
        last_num_infer = current_num_infer
        time.sleep(args.throughput_interval)
    global live
    live = False


with futures.ThreadPoolExecutor(max_workers=args.num_thread+1) as executor:
    executor.submit(current_performance)
    for _ in range(args.num_thread):
        executor.submit(one_thread, args.url, data)

Run the benchmarking script.

python benchmark_bert.py

pid 1214554: current throughput 0.0, latency p50=0.000 p90=0.000
pid 1214554: current throughput 713.9, latency p50=0.071 p90=0.184
pid 1214554: current throughput 737.9, latency p50=0.071 p90=0.184
pid 1214554: current throughput 731.6, latency p50=0.068 p90=0.192
pid 1214554: current throughput 732.2, latency p50=0.070 p90=0.194
pid 1214554: current throughput 733.9, latency p50=0.070 p90=0.187
pid 1214554: current throughput 739.3, latency p50=0.071 p90=0.184
...

Note

Your throughput numbers may differ from these based on instance type and size.
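The reported numbers follow from two simple calculations: throughput is the number of requests completed in an interval divided by the interval length, and the latency figures are percentiles over a recent window of samples. The sketch below reproduces both with the standard library only. The function names are hypothetical, and the percentile uses linear interpolation between neighbouring samples, which is intended to mirror the default behaviour of np.percentile used in the benchmark script.

```python
import math


def throughput(previous_count, current_count, interval_seconds):
    """Requests completed per second over one reporting interval."""
    return (current_count - previous_count) / interval_seconds


def percentile(samples, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    if not samples:
        return 0.0
    ordered = sorted(samples)
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# 7379 requests finished during a 10 second interval.
rate = throughput(0, 7379, 10)
median = percentile([1, 2, 3, 4, 5], 50)
```

With these inputs, rate is about 737.9 requests per second and median is 3.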

Congratulations! By now you should have successfully served a batched model over TorchServe.

You can now shut down TorchServe.

torchserve --stop
