This document is relevant for: Inf2, Trn1, Trn1n

Profiling NKI kernels with Neuron Profile#

In this tutorial, we use Neuron Profile to view the execution trace of a NKI kernel captured on a NeuronCore. In doing so, we learn about:

  • Installation and usage of Neuron Profile.

  • Inspecting a detailed execution timeline of compute engine instructions and DMA engine activities generated from your NKI kernel.

As background, Neuron Profile visualizes where time is spent during kernel execution on NeuronDevices, which is crucial for identifying performance bottlenecks and optimization opportunities in your kernel. It produces runtime execution data for every instruction executed on each compute engine and for every data movement activity completed by the DMA engines. It also reports key performance metrics, such as compute engine and memory bandwidth utilization, so developers can quickly see the hardware efficiency their kernel achieves. Profiling typically has near-zero overhead thanks to the dedicated on-chip profiling hardware in NeuronDevices.

Install Neuron Profile#

Make sure you have the latest version of the aws-neuronx-tools package, which includes updated profiling support for NKI kernels. Neuron Profile is included in this package and is installed to /opt/aws/neuron/bin.

The aws-neuronx-tools package comes pre-installed on Neuron DLAMIs. For detailed installation instructions, see the Neuron Profile User Guide: Installation.

Profile a NKI Kernel#

Profile using neuron-profile capture#

To profile a NKI kernel, the required steps are: (1) enable NEURON_FRAMEWORK_DEBUG to tell the compiler to save the NEFF file, (2) execute the NKI kernel to generate the NEFF, and (3) run neuron-profile capture to generate a NTFF profile. Each step is described in more detail below.

We will profile a NKI kernel which computes the element-wise exponential of an input tensor of any 2D shape. The rest of this tutorial will use a performance profile generated from this kernel as an example. Full code of prof-kernel.py:

"""
Example kernel used to demonstrate Neuron Profile.
"""
import torch
import neuronxcc.nki.language as nl
from torch_neuronx import nki_jit
import math
import os
os.environ["NEURON_FRAMEWORK_DEBUG"] = "1"
os.environ["NEURON_CC_FLAGS"] = " --disable-dge "

@nki_jit
def tensor_exp_kernel_(in_tensor, out_tensor):
  """NKI kernel to compute elementwise exponential of an input tensor

  Args:
      in_tensor: an input tensor of ANY 2D shape (up to SBUF size)
      out_tensor: an output tensor of ANY 2D shape (up to SBUF size)
  """
  sz_p, sz_f = in_tensor.shape

  i_f = nl.arange(sz_f)[None, :]

  for p in nl.affine_range(math.ceil(sz_p / nl.tile_size.pmax)):
    # Generate tensor indices for the input/output tensors
    # pad index to pmax, for simplicity
    i_p = p * nl.tile_size.pmax + nl.arange(nl.tile_size.pmax)[:, None]

    # Load input data from external memory to on-chip memory
    # only read up to sz_p
    in_tile = nl.load(in_tensor[i_p, i_f], mask=(i_p<sz_p))

    # perform the computation
    out_tile = nl.exp(in_tile)

    # store the results back to external memory
    # only write up to sz_p
    nl.store(out_tensor[i_p, i_f], value=out_tile, mask=(i_p<sz_p))

if __name__ == "__main__":
  from torch_xla.core import xla_model as xm
  device = xm.xla_device()

  in_tensor = torch.rand((250, 512), dtype=torch.float32).to(device=device)
  out_tensor = torch.zeros((250, 512), dtype=torch.float32).to(device=device)

  tensor_exp_kernel_(in_tensor, out_tensor)
  print(f"output_nki={out_tensor}")
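To see how the loop in this kernel covers the 250-row example input, the tiling arithmetic can be reproduced in plain Python without NKI. This is only a sketch: it assumes nl.tile_size.pmax is 128, and tile_row_ranges is a hypothetical helper written for illustration, not part of NKI.

```python
import math

PMAX = 128  # assumption: the value of nl.tile_size.pmax


def tile_row_ranges(sz_p, pmax=PMAX):
    """Return (first_row, end_row) for the rows each loop iteration touches.

    Mirrors the kernel: row indices are padded to pmax per iteration, and
    the mask (i_p < sz_p) drops rows past the end of the tensor.
    """
    ranges = []
    for p in range(math.ceil(sz_p / pmax)):
        start = p * pmax
        end = min(start + pmax, sz_p)  # effect of mask=(i_p < sz_p)
        ranges.append((start, end))
    return ranges
```

Under that assumption, tile_row_ranges(250) gives two iterations: rows 0 to 127 in the first, and rows 128 to 249 in the second, where the remaining padded rows are masked out of the load and store.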

To profile this NKI kernel, follow these steps:

1. Enable Neuron debug output by setting the NEURON_FRAMEWORK_DEBUG environment variable. This will trigger the Neuron compiler to save the Neuron Executable File Format (NEFF) artifact to the current directory after compilation of your NKI kernel. The NEFF contains all hardware instructions required to execute your NKI kernel on a NeuronDevice, as well as metadata and debug info needed for profiling. For example, add the following lines to your NKI kernel source file:

import os
os.environ["NEURON_FRAMEWORK_DEBUG"] = "1"
os.environ["NEURON_CC_FLAGS"] = " --disable-dge "

Note

Use the --disable-dge flag to temporarily disable a new compiler feature that interferes with the display of DMA debugging information in neuron-profile. This is highly recommended to improve the NKI performance debugging experience until a software fix for this issue is released.

2. Compile your NKI kernel to create a NEFF in your current directory:

$ python3 prof-kernel.py

Note

Find your NEFF named similarly to MODULE_0_SyncTensorsGraph.13_12659246067793504316.neff.

3. Profile the NEFF. This profiling step executes the NEFF on the NeuronDevice and records a raw execution trace into a Neuron Trace File Format (NTFF) artifact.

$ neuron-profile capture -n <path_to_neff> -s profile.ntff --profile-nth-exec=2

This will save your NTFF profile to profile_exec_2.ntff.

Note

The --profile-nth-exec=2 option runs your NEFF twice on the NeuronDevice and outputs a NTFF profile for the second iteration. This is recommended to avoid the one-time warmup delays that can appear in the first iteration of execution.
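Steps 2 and 3 can also be scripted. The sketch below picks the most recently written NEFF in a directory and assembles the capture command from step 3. The helper names are hypothetical; the only neuron-profile flags used are the ones shown above, and picking the newest *.neff file is an assumption about which NEFF you want to profile.

```python
import glob
import os
import subprocess


def find_latest_neff(directory="."):
    """Return the most recently modified .neff file in directory, or None."""
    candidates = glob.glob(os.path.join(directory, "*.neff"))
    return max(candidates, key=os.path.getmtime) if candidates else None


def build_capture_command(neff_path, ntff_path="profile.ntff", nth_exec=2):
    """Assemble the neuron-profile capture command shown in step 3."""
    return [
        "neuron-profile", "capture",
        "-n", neff_path,
        "-s", ntff_path,
        f"--profile-nth-exec={nth_exec}",
    ]


def capture_latest(directory="."):
    """Run the capture on the newest NEFF; needs neuron-profile on PATH."""
    neff = find_latest_neff(directory)
    if neff is None:
        raise FileNotFoundError("No NEFF found; run the kernel with NEURON_FRAMEWORK_DEBUG=1 first.")
    subprocess.run(build_capture_command(neff), check=True)
```

Calling capture_latest() from the directory where prof-kernel.py was run should be equivalent to typing the command by hand, provided /opt/aws/neuron/bin is on your PATH.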

In View Neuron Profile UI, we will view the profile in a user-friendly format using the Neuron Profile UI.

Profile using nki.benchmark#

You may also use the nki.benchmark API to generate a NEFF and NTFF programmatically. One caveat is that nki.benchmark runs your NEFF without an ML framework, in nki.baremetal mode, so the input tensors to the kernel must be NumPy arrays instead of framework tensors such as torch.Tensor.

Below is an example NKI kernel decorated by nki.benchmark. Full code of prof-kernel-benchmark.py:

"""
Example kernel used to demonstrate Neuron Profile with nki.benchmark.
"""
from neuronxcc.nki import benchmark
from neuronxcc.nki.typing import tensor
import neuronxcc.nki.language as nl
import math

@benchmark(save_neff_name='file.neff', save_trace_name='profile.ntff')
def tensor_exp_kernel_(in_tensor, out_tensor):
  """NKI kernel to compute elementwise exponential of an input tensor
  Args:
      in_tensor: an input tensor of ANY 2D shape (up to SBUF size)
      out_tensor: an output tensor of ANY 2D shape (up to SBUF size)
  """
  sz_p, sz_f = in_tensor.shape
  i_f = nl.arange(sz_f)[None, :]
  for p in nl.affine_range(math.ceil(sz_p / nl.tile_size.pmax)):
    # Generate tensor indices for the input/output tensors
    # pad index to pmax, for simplicity
    i_p = p * nl.tile_size.pmax + nl.arange(nl.tile_size.pmax)[:, None]
    # Load input data from external memory to on-chip memory
    # only read up to sz_p
    in_tile = nl.load(in_tensor[i_p, i_f], mask=(i_p<sz_p))
    # perform the computation
    out_tile = nl.exp(in_tile)
    # store the results back to external memory
    # only write up to sz_p
    nl.store(out_tensor[i_p, i_f], value=out_tile, mask=(i_p<sz_p))

if __name__ == "__main__":
  tensor_exp_kernel_(tensor[[250, 512], nl.float32],
                     tensor[[250, 512], nl.float32])

To use nki.benchmark to create a NEFF file and NTFF profile in your current directory, execute the example NKI kernel with:

$ python3 prof-kernel-benchmark.py

In View Neuron Profile UI, we will view the profile in a user-friendly format using the Neuron Profile UI.

View Neuron Profile UI#

Neuron Profile has an interactive web-based UI for viewing execution traces. In this section we will open the Neuron Profile UI and view NKI-specific profiling information, which can be found in several places, including instruction hover details, instruction click details, search results, and box select results. This section assumes that you followed the previous steps to create a NEFF and NTFF.

To view the Neuron Profile web UI, execute the view command:

$ neuron-profile view -n <path_to_neff> -s <path_to_ntff> --db-bucket=my_kernel

The above command should print a URL that you can click to open the web UI:

View profile at http://localhost:3001/profile/my_kernel

Note

You must keep the view command running when viewing profiles.

Note

The --db-bucket=my_kernel argument is used to set a custom URL for the profile. Omitting this argument will generate a URL with a unique ID.

If neuron-profile view is run on a remote instance, you may need to use port forwarding to access the web UI. From your local machine, SSH to the remote instance and forward ports 3001 (the default neuron-profile HTTP server port) and 8086 (the default InfluxDB port). Then in the browser, go to localhost:3001 to view the profiles.

$ ssh <user>@<ip> -L 3001:localhost:3001 -L 8086:localhost:8086
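If the page does not load after setting up port forwarding, a quick check from your local machine is whether the two forwarded ports accept TCP connections. The following is a minimal sketch using only the Python standard library; port_is_open and check_forwarded_ports are hypothetical helper names, and the port numbers are the defaults named above.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_forwarded_ports(host="localhost", ports=(3001, 8086)):
    """Map each port to whether it is reachable.

    3001 is the default neuron-profile HTTP server port and 8086 the
    default InfluxDB port.
    """
    return {port: port_is_open(host, port) for port in ports}
```

A False entry in the result of check_forwarded_ports() suggests that the SSH tunnel is not up or that neuron-profile view is no longer running on the remote instance.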

../../_images/neuron-profile-ui-overview.png

Fig. 33 Screenshot of the Neuron Profile UI.#

If you hover over any engine instruction in the timeline with your mouse, you will see instruction details in a pop-up box.

../../_images/neuron-profile-instruction-hover.png

Fig. 34 Instruction hover details including the line of NKI source code that generated this instruction.#

If you click on any engine instruction in the timeline with your mouse, you will see instruction details in a panel below the timeline.

../../_images/neuron-profile-instruction-details.png

Fig. 35 Instruction click details including the line of NKI source code that generated this instruction.#

Box Select#

You can click and drag on the timeline to select a range of instructions using the Box Select functionality. A summary will be produced that includes which lines of NKI source code produced these instructions. This helps with understanding a portion of the timeline. Selecting a large number of instructions may take some time to retrieve from the database.

../../_images/neuron-profile-box-select-help.png

Fig. 38 Click on the “Box Select” button and then click and drag on a region of the timeline.#

../../_images/neuron-profile-box-select-results.png

Fig. 39 Box select results. The line of NKI source code that generated each instruction will appear in the box select summary.#

Note

An empty value for “nki_source_location” means that the instruction is not associated with a NKI source code line.

View NKI Source Code in Neuron Profile#

You may optionally include NKI source code file contents for display in Neuron Profile. This feature loads your NKI source code into an integrated code viewer, side-by-side with the execution timeline in the web UI. Including the source code makes it easier to navigate between the instruction trace and the NKI source code, and to track the version of the code that produced the profile. Note that even without uploading the source code to Neuron Profile, the NKI source filename and line number are always available in the instruction detail view, as discussed in View Neuron Profile UI.

To include NKI source code in the Neuron Profile UI, use the view command with the --nki-source-root argument to pass in the folder containing your NKI source code:

$ neuron-profile view -n <path_to_neff> -s <path_to_ntff> --nki-source-root /home/ubuntu/my_nki/ --db-bucket=my_kernel

To open the NKI source code viewer, click on an instruction that has a “Nki source location” field as shown in Fig. 35. In the instruction’s details panel, the “nki_source_location” field should appear as a link. Clicking on the link will open the NKI source code viewer and highlight the associated line of NKI code. Inside the source code viewer, you can also click on any line of NKI source code to search for all instructions that were generated by that line of code.

../../_images/neuron-profile-source-code-view.png

Fig. 40 NKI source code view.#
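When the view command is launched from a script, the optional arguments can be assembled conditionally. The sketch below uses only the flags shown in this tutorial (-n, -s, --db-bucket, --nki-source-root); build_view_command is a hypothetical helper, and checking up front that the source folder exists is a design choice of this sketch, not something neuron-profile is documented to require.

```python
import os


def build_view_command(neff_path, ntff_path, db_bucket=None, nki_source_root=None):
    """Assemble a neuron-profile view command as an argument list."""
    cmd = ["neuron-profile", "view", "-n", neff_path, "-s", ntff_path]
    if nki_source_root is not None:
        # Fail early on a mistyped folder instead of opening a UI without source.
        if not os.path.isdir(nki_source_root):
            raise FileNotFoundError(f"NKI source root not found: {nki_source_root}")
        cmd += ["--nki-source-root", nki_source_root]
    if db_bucket is not None:
        cmd.append(f"--db-bucket={db_bucket}")
    return cmd
```

The returned list can be passed to subprocess.run; remember that the view command must keep running while you browse the profile.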

View Neuron Profile output as JSON#

As an alternative to the Neuron Profile web UI, a JSON format output is available. The JSON output includes the profile summary and all events in the execution trace. To generate the JSON output, execute the following command:

$ neuron-profile view --output-format json --output-file profile.json -n <path_to_neff> -s <path_to_ntff>
$ cat profile.json
{
   "summary": [
      {
         "total_time": 0.017,
         "event_count": 11215,
         [...]
      }
   ],
   "instruction": [
      {
         "timestamp": 10261883214,
         "duration": 148,
         "label": "TensorMatrix",
         "opcode": "MATMUL",
         "nki_source_location": "prof-kernel.py:33",
         [...]
      },
      [...]
   ]
}
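The JSON output lends itself to scripted analysis. As a sketch, the following sums instruction durations per NKI source line. It assumes the real file is valid JSON with a top-level "instruction" list whose entries carry the "duration" and "nki_source_location" fields shown in the excerpt above; the function names are hypothetical, and the duration unit is whatever the profile reports.

```python
import json
from collections import defaultdict


def duration_by_source_line(profile):
    """Sum instruction durations per nki_source_location value."""
    totals = defaultdict(int)
    for event in profile.get("instruction", []):
        # An empty value means the instruction has no associated NKI source line.
        location = event.get("nki_source_location") or "(no NKI source line)"
        totals[location] += event.get("duration", 0)
    return dict(totals)


def load_profile(path="profile.json"):
    """Load the JSON file written by neuron-profile view --output-format json."""
    with open(path) as f:
        return json.load(f)
```

For example, duration_by_source_line(load_profile()) returns a dictionary that can be sorted by value to find the source lines that account for the most instruction time.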
