This document is relevant for: Inf2, Trn1, Trn1n

nki.benchmark#

nki.benchmark(kernel=None, **kwargs)[source]#

Benchmark a NKI kernel on a NeuronDevice by using nki.benchmark as a decorator. You must run this API on a Trn/Inf instance with NeuronDevices (v2 or beyond) attached and with aws-neuronx-tools installed on the host, which you can do with the following steps:

# on Ubuntu
sudo apt-get install aws-neuronx-tools=2.* -y

# on Amazon Linux
sudo yum install aws-neuronx-tools-2.* -y

This decorator invokes nki.baremetal under the hood to compile the NKI kernel into an executable for NeuronDevices (NEFF) and to collect an execution trace (NTFF). You may specify a path to save your NEFF file through the save_neff_name input parameter and a path to save your NTFF file through save_trace_name. See Profiling NKI kernels with Neuron Profile for more information on how to visualize the execution trace for profiling purposes.

Note

Similar to nki.baremetal, a function decorated with nki.benchmark expects numpy.ndarray objects as input/output tensors instead of ML framework tensor objects.

In addition to generating NEFF/NTFF files, this decorator also invokes neuron-bench to collect execution latency statistics of the NEFF file and prints the statistics to the console.

neuron-bench is a tool that launches the NEFF file on a NeuronDevice in a loop to collect end-to-end latency statistics. You may specify the number of warm-up iterations to skip in benchmarking with the warmup input parameter, and the number of benchmarking iterations with iters. Currently, nki.benchmark only supports benchmarking on a single NeuronCore, since NKI does not yet support collective compute. Note that neuron-bench measures not only the device latency but also the time taken to transfer data between host and device. However, the tool does not rely on any ML framework to launch the NEFF and therefore reports NEFF latency without any framework overhead.
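For instance, the following minimal sketch collects latency statistics only, without saving NEFF/NTFF files; the kernel body, tensor shapes, and the warmup/iters values here are illustrative:

from neuronxcc.nki import benchmark
import neuronxcc.nki.language as nl
import numpy as np

@benchmark(warmup=5, iters=50)
def nki_copy(in_tensor, out_tensor):
  # Load the input from device memory and store it back unchanged
  data = nl.load(in_tensor)
  nl.store(out_tensor, data)

x = np.random.random_sample([128, 512]).astype(np.float32)
y = np.empty_like(x)
nki_copy(x, y)  # latency statistics are printed to the console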

Parameters:
  • warmup – The number of iterations for warmup execution (10 by default).

  • iters – The number of iterations for benchmarking (100 by default).

  • save_neff_name – Save the compiled NEFF file if a name is specified (unspecified by default).

  • save_trace_name – Save the trace (profile) file if a name is specified (unspecified by default); at the moment, this requires that save_neff_name is either unspecified or specified as 'file.neff'.

  • additional_compile_opt – Additional Neuron compiler flags to pass in when compiling the kernel.

Returns:

A function object that wraps the decorated function. After invocation, a benchmark_result.nc_latency property is available on the returned function object. Calling get_latency_percentile(p) on this property returns the latency at the specified percentile in microseconds (us). Available percentiles: [0, 1, 10, 25, 50, 90, 99, 100]
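For example, assuming some kernel my_kernel (a placeholder name) has been decorated with nki.benchmark and invoked once, every available percentile can be queried as follows:

metrics = my_kernel.benchmark_result.nc_latency
for p in [0, 1, 10, 25, 50, 90, 99, 100]:
  print(f"p{p} latency = {metrics.get_latency_percentile(p)} us")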

Listing 13 An Example#
from neuronxcc.nki import benchmark
import neuronxcc.nki.language as nl
import numpy as np

@benchmark(warmup=10, iters=100, save_neff_name='file.neff', save_trace_name='profile.ntff')
def nki_tensor_tensor_add(a_tensor, b_tensor, c_tensor):
  # Load the input tensors from device memory into on-chip memory
  a = nl.load(a_tensor)
  b = nl.load(b_tensor)

  # Element-wise addition on-chip
  c = a + b

  # Store the result back to device memory
  nl.store(c_tensor, c)

# Prepare numpy input/output tensors on the host
a = np.zeros([128, 1024], dtype=np.float32)
b = np.random.random_sample([128, 1024]).astype(np.float32)
c = np.empty(shape=(128, 1024), dtype=np.float32)

# Invoke the decorated kernel: compiles, benchmarks, and saves NEFF/NTFF
nki_tensor_tensor_add(a, b, c)

metrics = nki_tensor_tensor_add.benchmark_result.nc_latency
print("latency.p50 = " + str(metrics.get_latency_percentile(50)))
print("latency.p99 = " + str(metrics.get_latency_percentile(99)))

This document is relevant for: Inf2, Trn1, Trn1n