ResNet-50 model for Inferentia

Introduction:

In this tutorial we compile and deploy a ResNet-50 model for Inferentia. This Jupyter notebook should be run on an inf1.6xlarge instance or larger; it is the compilation step, not the inference itself, that requires inf1.6xlarge. For simplicity we run the whole tutorial on inf1.6xlarge, but in a real-life scenario the compilation should be done on a compute instance and the deployment on an inf1 instance to save costs.

In this tutorial we provide three main sections:

  1. Compile the ResNet-50 model and infer with a batch size of 1

  2. Run the same compiled model on multiple cores

  3. Compile the ResNet-50 model with a batch size of 5 and run it on multiple cores

Before running the following, verify that this Jupyter notebook is using the “conda_aws_neuron_pytorch_p36” kernel. You can select the kernel from the “Kernel -> Change Kernel” option at the top of this Jupyter notebook page.
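To quickly confirm the kernel is configured correctly, you can run a minimal import check; it should complete without errors:

[ ]:
## A quick environment sanity check - these imports should succeed on the
## conda_aws_neuron_pytorch_p36 kernel
import torch
import torch_neuron
print(torch.__version__)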

Install Dependencies:

This tutorial requires the following pip packages:

  • torch-neuron

  • torchvision

  • neuron-cc[tensorflow]

These will be installed by default when configuring your environment using the Neuron PyTorch setup guide.
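If your environment was not configured with the setup guide, the packages can be installed manually from the Neuron pip repository. A minimal sketch; pin versions to match your Neuron SDK release if needed:

[ ]:
## Install the tutorial dependencies from the Neuron pip repository - a
## sketch; version pins may be required for older Neuron SDK releases
!pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com torch-neuron "neuron-cc[tensorflow]" torchvision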

Compile the Model for Neuron

The following step compiles the ResNet-50 model. This will take a few minutes. At the end of the script's execution, the compiled model is saved as resnet50_neuron.pt in the local directory.

[ ]:
import torch
import numpy as np
import os
import torch_neuron
from torchvision import models
import logging

## Enable logging so we can see any important warnings
logger = logging.getLogger('Neuron')
logger.setLevel(logging.INFO)

## An example input - for analysis and tracing only its shape and dtype matter
image = torch.zeros([1, 3, 224, 224], dtype=torch.float32)

## Load a pretrained ResNet50 model
model = models.resnet50(pretrained=True)

## Tell the model we are using it for evaluation (not training)
model.eval()

## Analyze the model - this will show operator support and operator count
torch.neuron.analyze_model( model, example_inputs=[image] )

## Now compile the model - with logging set to "info" we will see
## what compiles for Neuron, and if there are any fallbacks
## Note: The "-O2" setting is default in recent releases, but may be needed for DLAMI
##       and older installed environments- model_neuron = torch.neuron.trace(model, example_inputs=[image], compiler_args="-O2")
model_neuron = torch.neuron.trace(model, example_inputs=[image])

# The output of this step will have the percentage of operations compiled, example:
#
# INFO:Neuron:The neuron partitioner created 1 sub-graphs
# INFO:Neuron:Neuron successfully compiled 1 sub-graphs, Total fused subgraphs = 1, Percent of model sub-graphs successfully compiled = 100.0%

## Export to saved model
model_neuron.save("resnet50_neuron.pt")
print("Compile Args, input tensor: {}, data type:'fp32', 'core': 1 ")
print("Compile success")

Run Inference on a Single Core

Initially we will use just one of the available NeuronCores. Do not perform inference with a Neuron-traced model on an instance type without Neuron support; the results will not be computed.

[ ]:
import os
import json
import numpy as np
import torch
import torch_neuron
from urllib import request
from torchvision import models, transforms, datasets
from time import time

## Create an image directory containing a small kitten
os.makedirs("./torch_neuron_test/images", exist_ok=True)
request.urlretrieve("https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg",
 "./torch_neuron_test/images/kitten_small.jpg")

## Fetch labels to output the top classifications
request.urlretrieve("https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json","imagenet_class_index.json")
idx2label = []

with open("imagenet_class_index.json", "r") as read_file:
 class_idx = json.load(read_file)
 idx2label = [class_idx[str(k)][1] for k in range(len(class_idx))]

## Import a sample image and normalize it into a tensor
normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225])

eval_dataset = datasets.ImageFolder(
    os.path.dirname("./torch_neuron_test/"),
    transforms.Compose([
        transforms.Resize([224, 224]),
        transforms.ToTensor(),
        normalize,
    ])
)
image, _ = eval_dataset[0]
image = torch.tensor(image.numpy()[np.newaxis, ...])  # add a batch dimension

## Load model
model_neuron = torch.jit.load( 'resnet50_neuron.pt' )

## Since the first inference also loads the model let's exclude it
## from timing
results = model_neuron( image )

## Predict for 100 loops
start = time()

loops = 100
for _ in range(loops):
    results = model_neuron( image )
elapsed_time = time() - start
images_sec = loops / float(elapsed_time)

# Get the top 5 results - sort() returns (values, indices), so take the
# indices of the five largest logits
top5_idx = results[0].sort()[1][-5:]

# Lookup and print the top 5 labels
top5_labels = [idx2label[idx] for idx in top5_idx]

print("Top 5 labels:\n {}".format(top5_labels) )
print("Batch size: 1, Throughput: {} images / second".format(round(images_sec,0) ) )

Run Inference on Parallel NeuronCores

To fully leverage the Inferentia hardware we want to use all of the cores. An inf1.xlarge or inf1.2xlarge has four available NeuronCores, an inf1.6xlarge has 16, and an inf1.24xlarge has 64. Here we use the concurrent.futures library to create a simple class that runs one parallel inference thread per core.

Using all of the available cores is important for achieving maximum performance on Neuron hardware. The implementation below uses an aggregated batch size:

  • It loads the model onto each of the cores

  • At input it accepts a batch that is the compiled batch size multiplied by the number of cores

  • It splits the data across the cores, and once all cores are done collates the output into a result tensor

This is intended as a good starting implementation, but you may want to vary it depending on your application.

In the following we create a data parallel class which handles larger tensor batches.

[ ]:
from concurrent import futures
import torch
import torch.neuron
import os

class NeuronSimpleDataParallel():

    def __init__(self, model_file, num_neuron_cores, batch_size=1):
        # Construct a list of models
        self.num_neuron_cores = num_neuron_cores
        self.batch_size = batch_size

        class SimpleWrapper():

            def __init__(self, model):
                self.model = model

            def eval(self):
                self.model.eval()

            def train(self):
                self.model.train()

            def __call__(self, *args):
                results = self.model(*args)

                # Make the output iterable - if it is not already a tuple or list
                if not isinstance(results, (tuple, list)):
                    results = [results]

                return results

        ## Important - please read:
        ##     https://github.com/aws/aws-neuron-sdk/blob/master/docs/tensorflow-neuron/tutorial-NeuronCore-Group.md
        ## For four cores we use
        ##     os.environ['NEURONCORE_GROUP_SIZES'] = "1,1,1,1"
        ## when launching four threads
        ## If this logic exists in worker processes, each process should use
        ##     os.environ['NEURONCORE_GROUP_SIZES'] = "1"
        ## Note: the runtime reads this setting when the first model is
        ## loaded, so set it before constructing the model list
        nc_env = ','.join(['1'] * num_neuron_cores)
        os.environ['NEURONCORE_GROUP_SIZES'] = nc_env

        self.models = [SimpleWrapper(torch.jit.load(model_file))
                       for i in range(num_neuron_cores)]

        self.executor = futures.ThreadPoolExecutor(
            max_workers=self.num_neuron_cores)

    def eval(self):
        for m in self.models:
            m.eval()

    def train(self):
        for m in self.models:
            m.train()

    def __call__(self, *args):
        assert all(isinstance(a, torch.Tensor)
                   for a in args), "Non tensor input - tensors are needed to generate batches"
        assert all(a.shape[0] % self.num_neuron_cores ==
                   0 for a in args), "Batch size must be a multiple of the number of parallel neuron cores"

        args_per_core = [[] for i in range(self.num_neuron_cores)]

        # Split args
        for a in args:
            # Based on batch size for arg
            step_size = a.shape[0] // self.num_neuron_cores
            for i in range(self.num_neuron_cores):
                # Append a slice of a view
                start = i * step_size
                end = (i + 1) * step_size

                # Slice
                args_per_core[i].append(a[start:end])

        # Call each core with their split and wait to complete
        running = {self.executor.submit(
            self.models[idx], *args_per_core[idx]): idx for idx in range(self.num_neuron_cores)}

        results = [None] * self.num_neuron_cores

        for future in futures.as_completed(running):
            idx = running[future]

            results[idx] = future.result()

        # Unsqueeze outputs that are zero dimensional or do not look batched
        # Iterate results per core
        for ic in range(len(results)):
            # Iterate result tuples
            for ir in range(len(results[ic])):
                # Unsqueeze if zero dimensional or does not look batched (i.e. first dim does not match batch)
                if len(results[ic][ir].size()) == 0 or results[ic][ir].shape[0] != self.batch_size:
                    results[ic][ir] = torch.unsqueeze(
                        results[ic][ir], 0)

        # Concatenate
        output = results[0][0]

        for i in range(1, len(results)):
            for j in range(len(results[i])):
                output = torch.cat([output, results[i][j]], 0)

        return output

Now we can update our inference code to use all 16 cores of an inf1.6xlarge:

[ ]:
import os
from time import time
import torch
import torch_neuron
import json
import numpy as np
from urllib import request
from torchvision import models, transforms, datasets

## Assuming you are working on an inf1.6xlarge, which has 16 NeuronCores;
## adjust this value to match your instance type
num_neuron_cores = 16

## Create an image directory containing a small kitten
os.makedirs("./torch_neuron_test/images", exist_ok=True)
request.urlretrieve("https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg",
                    "./torch_neuron_test/images/kitten_small.jpg")

## Fetch labels to output the top classifications
request.urlretrieve("https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json","imagenet_class_index.json")
idx2label = []

with open("imagenet_class_index.json", "r") as read_file:
    class_idx = json.load(read_file)
    idx2label = [class_idx[str(k)][1] for k in range(len(class_idx))]

## Import a sample image and normalize it into a tensor
normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225])

eval_dataset = datasets.ImageFolder(
    os.path.dirname("./torch_neuron_test/"),
    transforms.Compose([
        transforms.Resize([224, 224]),
        transforms.ToTensor(),
        normalize,
    ])
)
image, _ = eval_dataset[0]
image = torch.tensor(image.numpy()[np.newaxis, ...])

## Load model
model_neuron = NeuronSimpleDataParallel( 'resnet50_neuron.pt', num_neuron_cores )

## Create a "batch" image with enough images to go on each of the four cores
batch_image = image

for i in range(num_neuron_cores - 1):
    batch_image = torch.cat( [batch_image, image], 0 )

print(batch_image.shape)

## Since the first inference also loads the model to the chip let's exclude it
## from timing
results = model_neuron( batch_image )

latency = []
throughput = []

## Predict
loops = 100
start = time()
for _ in range(loops):
    delta_start = time()
    results = model_neuron( batch_image )
    delta = time() - delta_start
    latency.append(delta)
    throughput.append(batch_image.size(0)/delta)
elapsed_time = time() - start
images_sec = loops * batch_image.size(0) / float(elapsed_time)

# Get the top 5 results
top5_idx = results[0].sort()[1][-5:]

# Lookup and print the top 5 labels
top5_labels = [idx2label[idx] for idx in top5_idx]

print("Avg. Throughput: {:.0f}, Max Throughput: {:.0f}".format(np.mean(throughput),np.max(throughput)))
print("Latency P50: {:.0f}".format(np.percentile(latency, 50)*1000.0))
print("Latency P90: {:.0f}".format(np.percentile(latency, 90)*1000.0))
print("Latency P95: {:.0f}".format(np.percentile(latency, 95)*1000.0))
print("Latency P99: {:.0f}".format(np.percentile(latency, 99)*1000.0))

Compile and Infer with Different Batch Sizes on Multiple NeuronCores

Different models show better or worse throughput at different batch sizes. In general, Neuron models work best with smaller batch sizes than GPU inference does, even though overall a single Neuron instance may outperform a GPU instance on a given task.

As a general best practice we recommend starting with a small batch size and working up to find peak throughput.

Now that we are using all of the available cores we can experiment with compiling and running larger batch sizes on each core. In the following we compile using a batch size of 5, but you can use any value, or test several.

[ ]:
import torch
import numpy as np
import os
import torch_neuron
from torchvision import models
import logging

## Enable logging so we can see any important warnings
logger = logging.getLogger('Neuron')
logger.setLevel(logging.INFO)

batch_size = 5

image = torch.zeros([batch_size, 3, 224, 224], dtype=torch.float32)

## Load a pretrained ResNet50 model
model = models.resnet50(pretrained=True)

## Tell the model we are using it for evaluation (not training)
model.eval()

## Analyze the model - this will show operator support and operator count
analyze_results = torch.neuron.analyze_model( model, example_inputs=[image] )

print(analyze_results)

## Now compile the model
## Note: The "-O2" setting is default in recent releases, but may be needed for DLAMI
##       and older installed environments ex:model_neuron = torch.neuron.trace(model, example_inputs=[image], compiler_args="-O2")
model_neuron = torch.neuron.trace(model, example_inputs=[image])


# The output of this step will have the percentage of operations compiled, example:
#
# INFO:Neuron:The neuron partitioner created 1 sub-graphs
# INFO:Neuron:Neuron successfully compiled 1 sub-graphs, Total fused subgraphs = 1, Percent of model sub-graphs successfully compiled = 100.0%

## Export to saved model
model_neuron.save("resnet50_neuron_b{}.pt".format(batch_size))

In the following we run inference with a batch size of 5.

[ ]:
import os
from time import time
import torch
import torch_neuron
import json
import numpy as np
from urllib import request
from torchvision import models, transforms, datasets

## Assuming you are working on an inf1.6xlarge, which has 16 NeuronCores;
## adjust this value to match your instance type
num_neuron_cores = 16
batch_size = 5

## Create an image directory containing a small kitten
os.makedirs("./torch_neuron_test/images", exist_ok=True)
request.urlretrieve("https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg",
                    "./torch_neuron_test/images/kitten_small.jpg")

## Fetch labels to output the top classifications
request.urlretrieve("https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json","imagenet_class_index.json")
idx2label = []

with open("imagenet_class_index.json", "r") as read_file:
    class_idx = json.load(read_file)
    idx2label = [class_idx[str(k)][1] for k in range(len(class_idx))]

## Import a sample image and normalize it into a tensor
normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225])

eval_dataset = datasets.ImageFolder(
    os.path.dirname("./torch_neuron_test/"),
    transforms.Compose([
        transforms.Resize([224, 224]),
        transforms.ToTensor(),
        normalize,
    ])
)
image, _ = eval_dataset[0]
image = torch.tensor(image.numpy()[np.newaxis, ...])

## Load model
model_neuron = NeuronSimpleDataParallel( 'resnet50_neuron_b{}.pt'.format(batch_size), num_neuron_cores, batch_size=batch_size )

## Create a "batch" image with enough images to go on each of the four cores
batch_image = image

for i in range((num_neuron_cores * batch_size) - 1):
    batch_image = torch.cat( [batch_image, image], 0 )

## Since the first inference also loads the model to the chip let's exclude it
## from timing
results = model_neuron( batch_image )

latency = []
throughput = []

## Predict
start = time()
loops = 100
for _ in range(loops):
    delta_start = time()
    results = model_neuron( batch_image )
    delta = time() - delta_start
    latency.append(delta)
    throughput.append(batch_image.size(0)/delta)
elapsed_time = time() - start
images_sec = loops * batch_image.size(0) / elapsed_time

# Get the top 5 results
top5_idx = results[0].sort()[1][-5:]

# Lookup and print the top 5 labels
top5_labels = [idx2label[idx] for idx in top5_idx]
print("Top 5 labels:\n {}".format(top5_labels) )
print("[Batch Size: {}, Throughput:{} images / second]".format(
    batch_size, round(images_sec,0)))

print("Avg. Throughput: {:.0f}, Max Throughput: {:.0f}".format(np.mean(throughput),np.max(throughput)))
print("Latency P50: {:.0f}".format(np.percentile(latency, 50)*1000.0))
print("Latency P90: {:.0f}".format(np.percentile(latency, 90)*1000.0))
print("Latency P95: {:.0f}".format(np.percentile(latency, 95)*1000.0))
print("Latency P99: {:.0f}".format(np.percentile(latency, 99)*1000.0))

You can experiment with different batch size values to see which gives the best overall throughput.
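For example, a minimal sweep sketch that compiles and saves one model per candidate batch size; the batch sizes below are illustrative, `model` comes from the compile cell above, and each compilation takes a few minutes:

[ ]:
## Compile one saved model per candidate batch size - a sketch; benchmark each
## resulting file with NeuronSimpleDataParallel as in the previous cell
import torch
import torch_neuron

for batch_size in [1, 2, 5, 8]:
    image = torch.zeros([batch_size, 3, 224, 224], dtype=torch.float32)
    model_neuron = torch.neuron.trace(model, example_inputs=[image])
    model_neuron.save("resnet50_neuron_b{}.pt".format(batch_size))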