This document is relevant for: Inf2, Trn1, Trn1n
nki.benchmark

nki.benchmark(kernel=None, **kwargs)
Benchmark a NKI kernel on a NeuronDevice by using nki.benchmark as a decorator. You must run this API on a Trn/Inf instance with NeuronDevices (v2 or beyond) attached and with aws-neuronx-tools installed on the host, using the following steps:

```
# on Ubuntu
sudo apt-get install aws-neuronx-tools=2.* -y

# on Amazon Linux
sudo yum install aws-neuronx-tools-2.* -y
```
This decorator invokes nki.baremetal under the hood to compile the NKI kernel into an executable for NeuronDevices (a NEFF file) and to collect an execution trace (an NTFF file). You may specify a path to save your NEFF file through the input parameter save_neff_name and a path to save your NTFF file through save_trace_name. See Profiling NKI kernels with Neuron Profile for more information on how to visualize the execution trace for profiling purposes.
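Once both files are saved, the trace can typically be opened in Neuron Profile with a command along the following lines (a sketch based on the standard neuron-profile workflow; exact flags may vary by tools version, so consult the profiling guide above):

```
# View the NTFF trace against its NEFF in neuron-profile (from aws-neuronx-tools).
neuron-profile view -n file.neff -s profile.ntff
```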
Note

Similar to nki.baremetal, the decorated function using nki.benchmark expects numpy.ndarray as input/output tensors instead of ML framework tensor objects.

In addition to generating NEFF/NTFF files, this decorator also invokes neuron-bench to collect execution latency statistics of the NEFF file and prints the statistics to the console. neuron-bench is a tool that launches the NEFF file on a NeuronDevice in a loop and collects end-to-end latency statistics. You may specify the number of warm-up iterations to skip in benchmarking through the input parameter warmup, and the number of benchmarking iterations through iters. Currently, nki.benchmark only supports benchmarking on a single NeuronCore, since NKI does not yet support collective compute. Note that neuron-bench measures not only the device latency but also the time taken to transfer data between host and device; however, the tool does not rely on any ML framework to launch the NEFF and therefore reports NEFF latency without any framework overhead.

- Parameters:
warmup – The number of iterations for warmup execution (10 by default).
iters – The number of iterations for benchmarking (100 by default).
save_neff_name – Save the compiled NEFF file if a name is specified (unspecified by default).
save_trace_name – Save the trace (profile) file if a name is specified (unspecified by default); at the moment, this requires that save_neff_name is either unspecified or specified as 'file.neff'.
additional_compile_opt – Additional Neuron compiler flags to pass in when compiling the kernel.
- Returns:
A function object that wraps the decorated function. A property benchmark_result.nc_latency is available after invocation. Calling get_latency_percentile(int) on the property returns the specified percentile latency in microseconds (us). Available percentiles: [0, 1, 10, 25, 50, 90, 99, 100]
```python
from neuronxcc.nki import benchmark
import neuronxcc.nki.language as nl
import numpy as np

@benchmark(warmup=10, iters=100, save_neff_name='file.neff', save_trace_name='profile.ntff')
def nki_tensor_tensor_add(a_tensor, b_tensor, c_tensor):
    # Load inputs from device memory, add them, and store the result.
    a = nl.load(a_tensor)
    b = nl.load(b_tensor)
    c = a + b
    nl.store(c_tensor, c)

a = np.zeros([128, 1024], dtype=np.float32)
b = np.random.random_sample([128, 1024]).astype(np.float32)
c = np.ndarray(shape=(128, 1024), dtype=np.float32)

# Compiles the kernel, runs it on a NeuronDevice, and prints latency statistics.
nki_tensor_tensor_add(a, b, c)

metrics = nki_tensor_tensor_add.benchmark_result.nc_latency
print("latency.p50 = " + str(metrics.get_latency_percentile(50)))
print("latency.p99 = " + str(metrics.get_latency_percentile(99)))
```
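To inspect the full latency distribution rather than a couple of individual percentiles, you can loop over the documented percentile values; a minimal sketch reusing the metrics object from the example above:

```python
# Print every percentile exposed by nc_latency (values are in microseconds).
for p in [0, 1, 10, 25, 50, 90, 99, 100]:
    print(f"latency.p{p} = {metrics.get_latency_percentile(p)} us")
```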