Compiling and Deploying Pretrained HuggingFace Pipelines distilBERT with TensorFlow 2 Neuron

Introduction

In this tutorial we will compile and deploy the distilBERT version of HuggingFace 🤗 Transformers BERT for Inferentia. The full list of HuggingFace’s pretrained BERT models can be found in the BERT section of this page: https://huggingface.co/transformers/pretrained_models.html. You can also read about HuggingFace’s pipeline feature here: https://huggingface.co/transformers/main_classes/pipelines.html

This Jupyter notebook should be run on an inf1.6xlarge instance or larger. In a real-life scenario, however, the compilation should be done on a compute instance and the deployment on an inf1 instance to save costs.

Setting up your environment:

To run this tutorial, please make sure you deactivate any existing TensorFlow conda environments you are already using. Install TensorFlow 2.x by following the instructions at the TensorFlow Tutorial Setup Guide.

After following the Setup Guide, you need to change your kernel to Python (Neuron TensorFlow 2) by clicking Kernel->Change Kernel->Python (Neuron TensorFlow 2).

Now we can install the HuggingFace transformers and HuggingFace datasets dependencies here (TensorFlow Neuron 2.x was installed by following the Setup Guide above).

[ ]:
!pip install --upgrade "transformers==4.1.0"
!pip install --upgrade "datasets==1.4.1"
!pip install ipywidgets
[ ]:
from transformers import pipeline
import tensorflow as tf
import tensorflow.neuron as tfn

Compile the model into an AWS Neuron Optimized Model

[ ]:
#Create the HuggingFace pipeline for sentiment analysis.
#This model tries to determine whether the input text has a positive
#or a negative sentiment.
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

pipe = pipeline('sentiment-analysis', model=model_name, framework='tf')

#pipelines are extremely easy to use as they do all the tokenization,
#inference and output interpretation for you.
pipe('I love pipelines, they are very easy to use!')

As we’ve seen above, HuggingFace’s pipeline feature is a great wrapper for running inference on their models. It takes care of tokenizing the string inputs, feeds the tokenized input to the model, and finally interprets the model’s outputs and formats them in a way that is very human readable. Our goal will be to compile the underlying model inside the pipeline as well as make some edits to the tokenizer. The reason we need to edit the tokenizer is to make sure that we have a standard sequence length (in this case 128), as Neuron only accepts static input shapes.

[ ]:
neuron_pipe = pipeline('sentiment-analysis', model=model_name, framework='tf')

#the first step is to modify the underlying tokenizer to create a static
#input shape as inferentia does not work with dynamic input shapes
original_tokenizer = pipe.tokenizer


#we intercept the function call to the original tokenizer
#and inject our own code to modify the arguments
def wrapper_function(*args, **kwargs):
    kwargs['padding'] = 'max_length'
    #this is the key line here to set a static input shape
    #so that all inputs are set to a len of 128
    kwargs['max_length'] = 128
    kwargs['truncation'] = True
    kwargs['return_tensors'] = 'tf'
    return original_tokenizer(*args, **kwargs)

#insert our wrapper function as the new tokenizer as well
#as reinserting back some attribute information that was lost
#when we replaced the original tokenizer with our wrapper function
neuron_pipe.tokenizer = wrapper_function
neuron_pipe.tokenizer.decode = original_tokenizer.decode
neuron_pipe.tokenizer.mask_token_id = original_tokenizer.mask_token_id
neuron_pipe.tokenizer.pad_token_id = original_tokenizer.pad_token_id
neuron_pipe.tokenizer.convert_ids_to_tokens = original_tokenizer.convert_ids_to_tokens


#Now that our neuron_pipe is ready we can use it to
#generate an example input which is needed to compile the model.
#Note that pipe.model is the actual underlying model itself, which
#is what TensorFlow Neuron actually compiles.
example_inputs = neuron_pipe.tokenizer('we can use any string here to generate example inputs')
#compile the model by calling tfn.trace by passing in the underlying model
#and the example inputs generated by our updated tokenizer
neuron_model = tfn.trace(pipe.model, example_inputs)

#comparing a pipeline call to directly calling the underlying model
print("Pipeline call:", neuron_pipe('we can use any string here to generate example inputs'))
print("Direct call:", neuron_pipe.model(example_inputs))

Deploy the model

The great thing about using HuggingFace’s pipeline feature is that once your model is compiled, running inference on it is extremely easy!

[ ]:
#now we can insert the neuron_model and replace the cpu model
#so now we have a HuggingFace pipeline that uses an underlying neuron model!
neuron_pipe.model = neuron_model
neuron_pipe.model.config = pipe.model.config

#Now let's run inference on neuron!
neuron_pipe('I want this sentence to be negative to show a negative sentiment analysis.')
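
As mentioned in the introduction, in a real-life scenario you would compile on a compute instance and deploy the compiled model on an inf1 instance. The cell below is a minimal sketch of one way to save and reload the compiled model before swapping it into a pipeline; it assumes the object returned by tfn.trace supports the standard Keras SavedModel workflow, and the directory name './distilbert_neuron_b1' is just an example.

[ ]:
#a minimal sketch of saving and reloading the compiled model
#(assumes the traced model supports the Keras SavedModel API;
#'./distilbert_neuron_b1' is an arbitrary example directory)
model_dir = './distilbert_neuron_b1'
neuron_model.save(model_dir)

#later, on the inf1 deployment instance, reload the saved model
reloaded_neuron_model = tf.keras.models.load_model(model_dir)

#swap the reloaded model back into the pipeline, exactly as we did above
neuron_pipe.model = reloaded_neuron_model
neuron_pipe.model.config = pipe.model.config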

Using a larger batch size to increase throughput

In the cell below, we use a larger batch size for inference. By using a larger batch size, we are able to increase our NeuronCore usage, which increases our throughput.

The only thing we need to do is recompile our model following the steps we did above, except this time our example inputs for tfn.trace will have a batch size of 128. For this test, we will use HuggingFace’s datasets module to query our model.

First we need to generate some batch 128 example inputs to compile our batch 128 model.

[ ]:
from datasets import load_dataset
dataset = load_dataset('amazon_polarity')

string_inputs = dataset['test'][:128]['content']
string_inputs[0]

Next, we recompile the model similarly to what we did above, except this time we use a batch of 128 example inputs.

[ ]:
neuron_pipe_b128 = pipeline('sentiment-analysis', model=model_name, framework='tf')

neuron_pipe_b128.tokenizer = wrapper_function
neuron_pipe_b128.tokenizer.decode = original_tokenizer.decode
neuron_pipe_b128.tokenizer.mask_token_id = original_tokenizer.mask_token_id
neuron_pipe_b128.tokenizer.pad_token_id = original_tokenizer.pad_token_id
neuron_pipe_b128.tokenizer.convert_ids_to_tokens = original_tokenizer.convert_ids_to_tokens

#tokenize the 128 strings to create batch 128 example inputs
example_inputs = neuron_pipe.tokenizer(string_inputs)
#compile the underlying model with the batch 128 example inputs
neuron_model_b128 = tfn.trace(pipe.model, example_inputs)

#insert the batch 128 neuron model into the new pipeline
neuron_pipe_b128.model = neuron_model_b128
neuron_pipe_b128.model.config = pipe.model.config

Now we can do some simple benchmarking of the neuron models and the CPU model. We use a warmup inference before benchmarking the neuron models because the first inference call also loads the model onto Inferentia. Every inference after that queries the already loaded model, which is much faster.

[ ]:
import time

#warmup inf
neuron_pipe('warmup inference')
#benchmark batch 1 neuron model
start = time.time()
for string in string_inputs:
    outputs = neuron_pipe(string)
end = time.time()

print(f"Throughput for batch 1 neuron model is {128/(end - start)}.")


#warmup inf
neuron_pipe_b128(string_inputs)
#benchmark batch 128 neuron model
start = time.time()
outputs = neuron_pipe_b128(string_inputs)
end = time.time()

print(f"Throughput for batch 128 neuron model is {128/(end - start)}.")


#benchmark batch 1 CPU model
start = time.time()
for string in string_inputs:
    outputs = pipe(string)
end = time.time()

print(f"Throughput for batch 1 cpu model is {128/(end - start)}.")


#benchmark batch 128 CPU model
start = time.time()
outputs = pipe(string_inputs)
end = time.time()

print(f"Throughput for batch 128 cpu model is {128/(end - start)}.")