This document is relevant for: Inf1

NeuronPerf Benchmark Guide#

Call neuronperf[torch/tensorflow/mxnet/cpu].benchmark to measure your model performance. It chooses reasonable defaults if none are provided and returns reports that summarize the benchmarking results.

What is the default behavior of benchmark?#

That depends on how you provide your model and how your model was compiled.

The two most common ways to provide your model are:

  1. Provide the path to your compiled model

  2. Provide the path to a model index from neuronperf.compile (a JSON file)

Data Parallel#

If you chose 1, then…

Your model is benchmarked on provided inputs in 4 different configurations:
  1. A single model on 1 NeuronCore with one worker (min. latency)

  2. A single model on 1 NeuronCore with two workers (max. throughput / NC)

  3. MAX models on MAX NeuronCores with one worker (min. latency + max. instance usage)

  4. MAX models on MAX NeuronCores with two workers (max. throughput + max. instance usage)

The value MAX is automatically determined by your instance size. If it can’t be identified, those configurations will be skipped.

The primary benefit of (3) and (4) is to verify that your model scales well at maximum instance usage.
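
The four default configurations can be enumerated with a short sketch. The helper `default_configs` and its `max_models` argument are hypothetical names used for illustration and are not part of the NeuronPerf API; `max_models=None` stands in for the case where MAX can't be identified.

```python
from itertools import product


def default_configs(max_models=None):
    """Enumerate (n_models, workers_per_model) pairs for the default data parallel run.

    Hypothetical helper for illustration only; not part of NeuronPerf.
    """
    # When MAX is unknown, only the single-model configurations remain.
    n_models = [1] if max_models is None else sorted({1, max_models})
    workers_per_model = [1, 2]
    return list(product(n_models, workers_per_model))


for n, w in default_configs(max_models=16):
    print(f"n_models={n:<3} workers_per_model={w}")
```

With `max_models=16` this lists the same four combinations as the numbered list, in the same order.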

If you chose 2, then…

  • Your input parameters to benchmark (batch_sizes, etc.) are treated as filters on the index

  • Each remaining model configuration is benchmarked as described in (1)
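
To illustrate what "treated as filters" means, the sketch below filters a list of configuration dictionaries by batch size. The entries are made up, and the keys are borrowed from the debug log shown under Verbosity; the real model index format may differ. `filter_entries` is a hypothetical helper, not a NeuronPerf function.

```python
# Hypothetical index entries; the real model index JSON may be structured differently.
index_entries = [
    {"model_filename": "model_b1_p1.pt", "batch_size": 1, "pipeline_size": 1},
    {"model_filename": "model_b4_p1.pt", "batch_size": 4, "pipeline_size": 1},
    {"model_filename": "model_b8_p1.pt", "batch_size": 8, "pipeline_size": 1},
]


def filter_entries(entries, batch_sizes=None, pipeline_sizes=None):
    """Keep only entries that match every filter that was provided."""
    def keep(entry):
        if batch_sizes is not None and entry["batch_size"] not in batch_sizes:
            return False
        if pipeline_sizes is not None and entry["pipeline_size"] not in pipeline_sizes:
            return False
        return True

    return [e for e in entries if keep(e)]


selected = filter_entries(index_entries, batch_sizes=[1, 8])
print([e["model_filename"] for e in selected])
```

A filter that is left as `None` removes nothing, which mirrors the idea that omitted parameters do not narrow the index.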

Pipeline#

Pipeline mode is active when using a Neuron device and pipeline_sizes > 1. The same behavior as described in Data Parallel applies, except that only one worker configuration is executed: the optimal number of workers for your pipeline size, unless manually overridden.

Parameters#

Below are some useful and common parameters to tweak. Please see the NeuronPerf API for full details.

  • n_models controls how many models to load. The default behavior is n_models=[1, MAX].

  • workers_per_model controls how many worker threads will be feeding inputs to each model. The default is automatically determined.

  • pipeline_sizes tells the benchmarker how many cores are needed for your model so that each model instance can be loaded properly. Default is 1.

  • duration controls how long to run each configuration.

  • batch_sizes is used to inform the benchmarker of your input shape so that throughput can be computed correctly.

Almost all NeuronPerf behaviors are controllable via arguments found in the NeuronPerf API. This guide attempts to provide some context and examples for those arguments.
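
To see why the batch size matters for throughput, consider the arithmetic below. It is a simplified sketch that assumes throughput is counted as inputs per second; the function name and the sample numbers are illustrative and are not taken from NeuronPerf.

```python
def throughput(n_inferences, batch_size, elapsed_seconds):
    """Inputs processed per second, assuming each inference call handles one full batch."""
    return n_inferences * batch_size / elapsed_seconds


# The same number of inference calls yields 8x the throughput at batch size 8.
print(throughput(n_inferences=1000, batch_size=1, elapsed_seconds=10.0))  # 100.0
print(throughput(n_inferences=1000, batch_size=8, elapsed_seconds=10.0))  # 800.0
```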

Inputs#

Models accept one or more inputs to operate on. Since NeuronPerf needs to support multiple inputs for multiple models, as well as multi-input models, there are some details that may need your attention. See the NeuronPerf Framework Notes for details.

Multi-input Models#

If your model accepts multiple inputs, you must provide them in a tuple. For example, suppose you have a model like this:

class Model(torch.nn.Module):
    def forward(self, x, y, z):
        ...
        return output

In order for NeuronPerf to pass along your multiple inputs correctly, you should provide them as a tuple:

inputs = (x, y, z)
npf.torch.benchmark(model_filename, inputs, ...)

If you are compiling and/or benchmarking multiple models, you can pass differently sized inputs as a list of tuples:

inputs = [(x1, y1, z1), (x2, y2, z2), ...]
npf.torch.benchmark(model_filename, inputs, ...)
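
The accepted input shapes (a single input, a tuple for a multi-input model, and a list of tuples for multiple models) can be pictured with the sketch below. `normalize_inputs` is a hypothetical helper, not a NeuronPerf function, and plain numbers stand in for tensors.

```python
def normalize_inputs(inputs):
    """Return a list of tuples: one tuple of model arguments per input set.

    Hypothetical illustration; NeuronPerf's internal handling may differ.
    """
    if not isinstance(inputs, list):
        inputs = [inputs]  # a single input set
    return [item if isinstance(item, tuple) else (item,) for item in inputs]


def model(x, y, z):
    # Stand-in for a compiled multi-input model.
    return x + y + z


for args in normalize_inputs([(1, 2, 3), (10, 20, 30)]):
    print(model(*args))  # each tuple is unpacked into positional arguments
```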

Preprocessing and Postprocessing#

Many models have additional preprocessing and postprocessing steps involved that may add non-negligible overhead to inference time. NeuronPerf supports these use cases through the use of custom functions.

Preprocessing#

Recall that NeuronPerf expects (or wraps) each model input into a tuple. These tuples will be unpacked before calling your model.

Here is an example for a model with one input. The example multiplies the input by 5 before inference.

def preprocess_fn(x):
    return x * 5

...

# Benchmark with custom preprocessing function
reports = npf.torch.benchmark(
    filename,
    inputs,
    ...,
    preprocess_fn=preprocess_fn,
)

Or if your model expects multiple inputs:

def preprocess_fn(x, y, z):
    return x / 255, y / 255, z / 255

...

# Benchmark with custom preprocessing function
reports = npf.torch.benchmark(
    filename,
    inputs,
    ...,
    preprocess_fn=preprocess_fn,
)

Postprocessing#

Postprocessing is almost identical to preprocessing, except that your function will receive whatever the output of your model is, exactly as returned without modification. There are no type guarantees.

def postprocess_fn(x):
   return x.argmax()

...

# Benchmark with custom postprocessing function
reports = npf.torch.benchmark(
    filename,
    inputs,
    ...,
    postprocess_fn=postprocess_fn,
)
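
Conceptually, each timed inference then looks like the sketch below: preprocess, run the model, postprocess. This is an assumption about the overall flow, written with plain Python stand-ins (`run_once` and `fake_model` are hypothetical), and is not NeuronPerf's actual implementation.

```python
def run_once(model, inputs, preprocess_fn=None, postprocess_fn=None):
    """Run one inference on a tuple of inputs with optional pre/postprocessing."""
    if preprocess_fn is not None:
        inputs = preprocess_fn(*inputs)
        if not isinstance(inputs, tuple):
            inputs = (inputs,)  # re-wrap a single preprocessed input
    output = model(*inputs)
    if postprocess_fn is not None:
        output = postprocess_fn(output)  # receives the raw model output
    return output


def fake_model(x):
    # Stand-in for a compiled model: returns a list of scores.
    return [x, 2 * x, x - 1]


result = run_once(
    fake_model,
    (3,),
    preprocess_fn=lambda x: x * 5,
    postprocess_fn=lambda scores: scores.index(max(scores)),  # argmax
)
print(result)
```

Because both functions run inside the timed path in this sketch, any work they do counts toward the measured latency, which is the overhead this section is concerned with.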

Minimal Latency#

Suppose you are interested in the minimal latency achievable with your model. In this case, there is no need for more than one worker to execute at a time. You can manually specify the number of workers to use. See Worker Threads below.

Worker Threads#

The argument workers_per_model controls the number of worker threads that prepare and load examples onto a single NeuronCore at a time. A value of 1 therefore corresponds to one thread per model: if n_models=16, there are 16 worker threads, one per model. The default value is selected based on whether you are using DataParallel (i.e. pipeline_sizes == 1) or Pipeline Mode (pipeline_sizes != 1).

By default, NeuronPerf will try multiple combinations of model copies and workers. You may be interested in controlling this manually.

reports = npf.torch.benchmark('model_neuron_b1.pt', ..., workers_per_model=1)

You may also pass a list, as with other parameters:

workers_per_model = [1, 2] # Same as the default for data parallel
reports = npf.torch.benchmark('model_neuron_b1.pt', ..., workers_per_model=workers_per_model)

With the default number of Model Copies, a call to print_reports might look like this:

throughput_avg latency_ms_p50 latency_ms_p99 n_models       pipeline_size  workers_per_model batch_size     model_filename
307.25         3.251          3.277          1              1              1                 1              models/a5cff386-89ca-4bbf-9087-d0e624c3c604.pt
2746.0         5.641          6.82           16             1              1                 1              models/a5cff386-89ca-4bbf-9087-d0e624c3c604.pt
329.5          6.053          6.108          1              1              2                 1              models/a5cff386-89ca-4bbf-9087-d0e624c3c604.pt
2809.0         10.246         12.52          16             1              2                 1              models/a5cff386-89ca-4bbf-9087-d0e624c3c604.pt
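
One way to read a table like this is to compare throughput at 16 model copies against perfect linear scaling from a single copy. The calculation below uses the throughput_avg values from the sample output; `scaling_efficiency` is a hypothetical helper and not part of NeuronPerf.

```python
def scaling_efficiency(single_throughput, multi_throughput, n_models):
    """Fraction of ideal linear scaling achieved with n_models copies."""
    return multi_throughput / (single_throughput * n_models)


# workers_per_model=1 rows from the sample output
eff_one_worker = scaling_efficiency(307.25, 2746.0, n_models=16)
# workers_per_model=2 rows from the sample output
eff_two_workers = scaling_efficiency(329.5, 2809.0, n_models=16)

print(f"{eff_one_worker:.1%}")   # 55.9%
print(f"{eff_two_workers:.1%}")  # 53.3%
```

In this sample, the second worker raises throughput slightly at both 1 and 16 copies while roughly doubling the median latency.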

Model Copies#

By default, NeuronPerf will benchmark two settings for n_models:
  1. A single copy

  2. The maximum number of copies for your instance size

You can override this behavior by passing n_models to benchmark, as shown below:

reports = npf.torch.benchmark('model_neuron_b1.pt', ..., n_models=6)

or

n_models = list(range(1, 10))
reports = npf.torch.benchmark('model_neuron_b1.pt', ..., n_models=n_models)

Pipeline Mode#

By default, NeuronPerf will assume you intend to use DataParallel, with two exceptions:

  • You compiled your model using NeuronPerf for pipeline mode

  • You constructed a neuronperf_model_index that uses pipeline mode

You can also manually tell NeuronPerf that your model was compiled for pipeline mode. It is similar to how other arguments are passed.

reports = npf.torch.benchmark('model_neuron_b1.pt', ..., pipeline_sizes=2)

If you are passing multiple models in an index, then you should pass a list for pipeline_sizes.

reports = npf.torch.benchmark('model_index.json', ..., pipeline_sizes=[1, 2, 3])
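
Because each model instance occupies pipeline_size NeuronCores, fewer copies fit on an instance as the pipeline size grows. The sketch below assumes the number of copies is the core count divided by the pipeline size, rounded down. This is an illustrative assumption, and `max_model_copies` is a hypothetical helper, not part of NeuronPerf.

```python
def max_model_copies(n_cores, pipeline_size):
    """Number of whole model instances that fit when each needs pipeline_size cores."""
    if pipeline_size < 1:
        raise ValueError("pipeline_size must be at least 1")
    return n_cores // pipeline_size


# Assumed 16-core instance, matching n_models=16 in the sample output.
for size in [1, 2, 3]:
    print(size, max_model_copies(n_cores=16, pipeline_size=size))
```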

Duration#

NeuronPerf will benchmark each configuration specified for 60 seconds by default. You can control the duration by passing duration (in seconds).

reports = npf.torch.benchmark('model_index.json', ..., duration=10)

Warning

If you make the duration too short, it may expire before all models are loaded and have had time to execute.
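
Since every configuration runs for the full duration, total wall-clock time grows with the number of configurations. The estimate below is a rough lower bound that ignores model loading time; `estimate_seconds` is a hypothetical helper, not a NeuronPerf function.

```python
from itertools import product


def estimate_seconds(n_models, workers_per_model, batch_sizes, duration=60):
    """Lower bound on benchmark time: one duration per configuration."""
    configs = list(product(n_models, workers_per_model, batch_sizes))
    return len(configs) * duration


# Default data parallel run: n_models=[1, MAX], workers_per_model=[1, 2], one batch size.
print(estimate_seconds([1, 16], [1, 2], [1]))               # 240
print(estimate_seconds([1, 16], [1, 2], [1], duration=10))  # 40
```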

Custom Datasets (Beta)#

Currently, only PyTorch supports custom datasets, and the interface is subject to change. If you provide a custom dataset, it will be fully executed on each loaded model copy. So if you provide n_models=2, your dataset will be run through twice in parallel.

To use this API, call benchmark and pass a torch.utils.data.Dataset as inputs. You can create your own Dataset by implementing the interface, or use one of the available datasets. For example:

import torchvision
from torchvision.transforms import ToTensor

dataset = torchvision.datasets.FashionMNIST(
   root="data",
   train=False,
   download=True,
   transform=ToTensor()
)

reports = npf.torch.benchmark('model_index.json', inputs=dataset, batch_sizes=[8], preprocess_fn=lambda x: x[0], loop_dataset=False)

Note

The preprocess_fn is required here to extract the image input from the (image, label) tuple produced by the data loader. If the dataset is too short to measure runtime performance, set loop_dataset=True to rerun the dataset until the configured duration elapses.

Results#

Viewing and Saving#

There are currently three ways to view results.

  • neuronperf.print_reports(...)
    • Dump abbreviated results in your terminal

  • neuronperf.write_csv(...)
    • Store metrics of interest as CSV

  • neuronperf.write_json(...)
    • Store everything as JSON

See the NeuronPerf API for full details.

Full Timing Results#

NeuronPerf automatically combines and summarizes the detailed timing information collected during benchmarking. If you wish to receive everything back yourself, you can use:

results = npf.torch.benchmark('model_index.json', ..., return_timers=True)

If you later wish to produce reports the same way that NeuronPerf does internally, you can call:

reports = npf.get_reports(results)
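
Report columns such as latency_ms_p50 and latency_ms_p99 are percentiles over individual inference latencies. The sketch below shows one common way (nearest-rank) to compute them from raw timings. The layout of the data returned with return_timers=True is not described in this guide, so the input here is just a made-up list of latencies in milliseconds, and NeuronPerf may use a different percentile method.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in the range 0-100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


latencies_ms = [3.1, 3.3, 3.2, 3.4, 9.8, 3.2, 3.3, 3.1, 3.5, 3.2]  # made-up data

print(percentile(latencies_ms, 50))  # 3.2
print(percentile(latencies_ms, 99))  # 9.8
```

A single slow inference barely moves the median but dominates the p99, which is why both columns are worth reading together.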

Verbosity#

Verbosity is an integer, currently one of {0, 1, 2}, where:

  • 0 = SILENT

  • 1 = INFO (default)

  • 2 = VERBOSE / DEBUG

Example:

reports = npf.torch.benchmark(..., n_models=1, duration=5, verbosity=2)
DEBUG:neuronperf.benchmarking - Cast mode was not specified, assuming default.
INFO:neuronperf.benchmarking - Benchmarking 'resnet50.json', ~5 seconds remaining.
DEBUG:neuronperf.benchmarking - Running model config: {'model_filename': 'models/model_b1_p1_83bh3hhs.pt', 'device_type': 'neuron', 'input_idx': 0, 'batch_size': 1, 'n_models': 1, 'workers_per_model': 2, 'pipeline_size': 1, 'cast_mode': None, 'multiprocess': True, 'multiinterpreter': False, 'start_dts': '20211111-062818', 'duration': '5'}
DEBUG:neuronperf.benchmarking - Benchmarker 0 started.
DEBUG:neuronperf.benchmarking - Benchmarker 0, Worker 0 started.
DEBUG:neuronperf.benchmarking - Benchmarker 0, Worker 1 started.
DEBUG:neuronperf.benchmarking - Benchmarker 0, Worker 0 finished after 738 inferences.
DEBUG:neuronperf.benchmarking - Benchmarker 0, Worker 1 finished after 738 inferences.
DEBUG:neuronperf.benchmarking - Benchmarker 0 finished.
throughput_avg latency_ms_p50 latency_ms_p99 n_models       pipeline_size  workers_per_model batch_size     model_filename
329.667        6.073          6.109          1              1              2                 1              models/model_b1_p1_83bh3hhs.pt

Internal Process Model#

For each model loaded (see Model Copies), a process is spawned. Each process may use multiple threads (see Worker Threads). The threads will continue to load examples and keep the hardware busy.
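
The worker-thread half of this model can be sketched with the standard library. The example runs inside one process and uses a short sleep as a stand-in for the model call; `run_workers` and `fake_infer` are hypothetical names, and NeuronPerf's real benchmarker is more involved.

```python
import threading
import time


def fake_infer(x):
    # Stand-in for a model call; sleeps briefly to mimic device latency.
    time.sleep(0.001)
    return x


def run_workers(n_workers, duration_s):
    """Start n_workers threads that call the model until duration_s elapses."""
    counts = [0] * n_workers
    deadline = time.monotonic() + duration_s

    def worker(idx):
        while time.monotonic() < deadline:
            fake_infer(idx)
            counts[idx] += 1  # each thread writes only its own slot

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts


counts = run_workers(n_workers=2, duration_s=0.2)
for i, c in enumerate(counts):
    print(f"Worker {i} finished after {c} inferences.")
```

While one thread waits on the (simulated) device, the other can prepare its next input, which is the intuition behind using more than one worker per model to keep the hardware busy.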

NeuronPerf spawns processes slightly differently between frameworks. For PyTorch and Apache MXNet (Incubating), processes are forked. For TensorFlow/Keras, a fresh interpreter is launched, and benchmarkers are serialized and run as a script.

If you suspect you are having trouble due to the way processes are managed, you have two mechanisms of control:

reports = npf.torch.benchmark(..., multiprocess=False)

The default is True. Setting it to False disables multiprocessing and runs everything inside a single parent process. This may not work for all frameworks beyond the first model configuration, because process teardown is used to safely deallocate models from the hardware. It is not recommended to benchmark this way.

reports = npf.torch.benchmark(..., multiinterpreter=True)

This flag controls whether a fresh interpreter is used instead of forking. It defaults to False except with TensorFlow/Keras.

Benchmark on CPU or GPU#

When benchmarking on CPU or GPU, the API is slightly different. With CPU or GPU, there is no compiled model to benchmark, so instead we need to directly pass a reference to the model class that will be instantiated.

Note

GPU benchmarking is currently only available for PyTorch.

CPU:

cpu_reports = npf.cpu.benchmark(YourModelClass, ...)

GPU:

gpu_reports = npf.torch.benchmark(YourModelClass, ..., device_type="gpu")

Your model class will be instantiated in a subprocess, so there are some things to keep in mind.

  • Your model class must be defined at the top level inside a Python module
    • i.e. don’t place your model class definition inside a function or other nested scope

  • If your model class has special Python module dependencies, consider importing them inside your class __init__

  • If your model class expects constructor arguments, wrap your class so that it has no constructor arguments

Example of a wrapped model class for CPU/GPU benchmarking:

class ModelWrapper(torch.nn.Module):
   def __init__(self):
      super().__init__()
      from transformers import AutoModelForSequenceClassification
      model_name = "bert-base-cased"
      self.bert = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False)
      self.add_module(model_name, self.bert)

   def forward(self, *inputs):
      return self.bert(*inputs)


reports = npf.torch.benchmark(ModelWrapper, inputs, device_type="gpu")
