This document is relevant for: Inf1, Inf2, Trn1, Trn2

Neuron Profiler 2.0 (Beta) User Guide#

Overview#

Neuron Profiler 2.0 offers a user-friendly experience for capturing and analyzing application performance through both high-level system profiles and detailed device-level profiles. Users can profile their workloads using framework-specific APIs within their application code or by setting an environment variable before execution. This tool supports profiling for both single-node and distributed workloads, integrating with environments such as ParallelCluster and EKS. Once captured, profile results can be explored through multiple interfaces: the Neuron Profiler UI, the open-source trace viewer Perfetto, or by exporting to a human-readable JSON format. This flexibility in data capture and visualization enables users to gain comprehensive insights into their application’s performance across various scenarios and scales.

Note

Neuron Profiler 2.0 is a set of new features currently in beta that enhance and simplify the experience of capturing and viewing profiles. It is not a replacement for the existing Neuron Profiler, which is the feature set specifically for capturing and viewing device profiles.

Capturing profiles#

Neuron Profiler 2.0 offers several flexible options for capturing profiles. Users can either set an environment variable NEURON_RT_INSPECT_ENABLE or use the PyTorch or JAX profiling APIs from their application code for fine-grained control over which sections of their code are profiled. PyTorch and JAX users who prefer not to modify their application code can still enable profiling by setting the environment variable before running their application.

JAX User Experience#

JAX Setup#

Follow the JAX Setup instructions to install the required JAX Neuron Plugin and the latest Neuron Driver, Runtime and Tools packages.

JAX Profiler#

The JAX context-managed profiling API allows you to profile blocks of code. This will capture a system profile including a Neuron Runtime API trace and Python trace for your application code in the captured block. This will also capture device profiles for any compiled graphs (NEFFs) executed on NeuronCores within this block. To use the profiler, import the jax package:

import jax

Profiling is enabled for all code enclosed in the jax.profiler.trace context manager.
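For example, a minimal sketch, where run_my_workload() is a hypothetical stand-in for the code you want to profile:

import os
import jax

# All code inside this context is profiled; results are written to the
# directory passed to jax.profiler.trace (see the note below).
with jax.profiler.trace(os.environ["NEURON_RT_INSPECT_OUTPUT_DIR"]):
    run_my_workload()  # hypothetical stand-in for your application code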

Note

It is important to pass the output directory os.environ["NEURON_RT_INSPECT_OUTPUT_DIR"] to jax.profiler.trace and to run export NEURON_RT_INSPECT_OUTPUT_DIR=<your output directory> before enabling profiling. This ensures all captured profile data is saved to the correct output directory.

Custom Annotations in JAX#

To add custom annotations to blocks of code in your profile, you can use jax.profiler.TraceAnnotation. Annotation names can be created at runtime, such as in the example here using with jax.profiler.TraceAnnotation("my_label"+str(i)):. For more information on TraceAnnotations, see the official JAX documentation.
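For example, a minimal sketch that gives each loop iteration its own named span, where step() is a hypothetical stand-in for the work being annotated:

import jax

for i in range(10):
    # Each iteration appears as a separately named span in the profile
    with jax.profiler.TraceAnnotation("my_label" + str(i)):
        step()  # hypothetical stand-in for the code being annotated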

JAX Profiling using Environment Variable#

Instead of using the jax.profiler context manager, you can enable profiling for your entire application using an environment variable. This is desirable if you want to capture a profile without modifying your application code. To enable profiling, set the environment variables NEURON_RT_INSPECT_ENABLE=1 and NEURON_RT_INSPECT_OUTPUT_DIR=./output before running your application.

For example:

# make sure to remove the call to jax.profiler.trace from your Python script
NEURON_RT_INSPECT_ENABLE=1 NEURON_RT_INSPECT_OUTPUT_DIR=./output python jax_script.py

When using the NEURON_RT_INSPECT_ENABLE environment variable instead of jax.profiler, system profiles will not contain a framework and application code trace, only the Neuron Runtime API trace.

Do not set the NEURON_RT_INSPECT_ENABLE environment variable and use the jax.profiler within your application code at the same time. Use one or the other.

For more profiling options that can be set through environment variables, see the section Profile Capture Environment Variables.

Full JAX Example#

Create a file jax_script.py which performs repeated matrix multiplications distributed across Neuron devices.

from functools import partial
import os
import jax
import jax.numpy as jnp

from jax.sharding import Mesh, NamedSharding, PartitionSpec as P
from jax.experimental.shard_map import shard_map
from time import sleep

os.environ["XLA_FLAGS"] = "--xla_dump_hlo_snapshots --xla_dump_to=./dump"

jax.config.update("jax_default_prng_impl", "rbg")

mesh = Mesh(jax.devices(), ('i',))

def device_put(x, pspec):
    return jax.device_put(x, NamedSharding(mesh, pspec))

lhs_spec = P('i', None)
lhs = device_put(jax.random.normal(jax.random.key(0), (128, 128)), lhs_spec)

rhs_spec = P('i', None)
rhs = device_put(jax.random.normal(jax.random.key(1), (128, 16)), rhs_spec)

@jax.jit
@partial(shard_map, mesh=mesh, in_specs=(lhs_spec, rhs_spec), out_specs=rhs_spec)
def matmul_allgather(lhs_block, rhs_block):
    rhs = jax.lax.all_gather(rhs_block, 'i', tiled=True)
    return lhs_block @ rhs

with jax.profiler.trace(os.environ["NEURON_RT_INSPECT_OUTPUT_DIR"]):
    out = matmul_allgather(lhs, rhs)
    for i in range(10):
        with jax.profiler.TraceAnnotation("my_label"+str(i)):
            out = matmul_allgather(lhs, rhs)
        sleep(0.001)

expected = lhs @ rhs
with jax.default_device(jax.devices('cpu')[0]):
    equal = jnp.allclose(jax.device_get(out), jax.device_get(expected), atol=1e-3, rtol=1e-3)
    print("Tensors are the same") if equal else print("Tensors are different")

Set your profile output directory and run the script:

export NEURON_RT_INSPECT_OUTPUT_DIR=./output
python jax_script.py

PyTorch User Experience#

PyTorch Setup#

Follow the PyTorch Setup instructions to install the required PyTorch Neuron packages as well as the latest Neuron Driver, Runtime and Tools.

PyTorch Profiler#

The PyTorch context-managed profiling API allows you to profile blocks of code. This will capture a system profile including a Neuron Runtime API trace and Python trace for your application code in the captured block. This will also capture device profiles for any compiled graphs executed on NeuronCores within this block. To use the profiler, import it in your application:

from torch_neuronx.experimental import profiler

Then profile a block of code using:

with torch_neuronx.experimental.profiler.profile(
    port=9012,
    profile_type='system',
    target='neuron_profile_perfetto',
    output_dir=os.environ['NEURON_RT_INSPECT_OUTPUT_DIR'],
    ms_duration=30000) as profiler:

    # ... application code to profile goes here ...

After modifying your code to call the profiler, run your application as you normally would but set the environment variable NEURON_RT_INSPECT_OUTPUT_DIR to specify the output directory.

NEURON_RT_INSPECT_OUTPUT_DIR=./output python application.py

Note

It is essential to set output_dir=os.environ['NEURON_RT_INSPECT_OUTPUT_DIR'] when starting the profiler from your application code. This ensures that all profile data sources dump to the same output directory.

PyTorch Profiling using Environment Variable#

Instead of using the torch_neuronx.experimental.profiler.profile context manager, you can enable profiling for your entire application using an environment variable. This is desirable if you want to capture a profile without modifying your application code. To enable profiling, set the environment variables NEURON_RT_INSPECT_ENABLE=1 and NEURON_RT_INSPECT_OUTPUT_DIR=./output before running your application.

For example:

# make sure to remove the call to torch_neuronx.experimental.profiler.profile from your Python script
NEURON_RT_INSPECT_ENABLE=1 NEURON_RT_INSPECT_OUTPUT_DIR=./output python pytorch_script.py

When using the NEURON_RT_INSPECT_ENABLE environment variable instead of torch_neuronx.experimental.profiler.profile, system profiles will not contain a framework and application code trace, only the Neuron Runtime API trace.

Do not set the NEURON_RT_INSPECT_ENABLE environment variable and use the torch_neuronx.experimental.profiler.profile within your application code at the same time. Use one or the other.

For more profiling options that can be set through environment variables, see the section Profile Capture Environment Variables.

Full PyTorch Example#

Create a file train_torchrun_context.py with the following contents:

import os

import torch
import torch.nn as nn
import torch.nn.functional as F

# XLA imports
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.debug.profiler as xp

import torch_neuronx
from torch_neuronx.experimental import profiler

os.environ["NEURON_CC_FLAGS"] = "--cache_dir=./compiler_cache"

# Global constants
EPOCHS = 2

# Declare 3-layer MLP Model
class MLP(nn.Module):
    def __init__(self, input_size=10, output_size=2, layers=[5, 5]):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(input_size, layers[0])
        self.fc2 = nn.Linear(layers[0], layers[1])
        self.fc3 = nn.Linear(layers[1], output_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return F.log_softmax(x, dim=1)

def main():
    # Fix the random number generator seeds for reproducibility
    torch.manual_seed(0)

    # XLA: Specify XLA device (defaults to a NeuronCore on Trn1 instance)
    device = xm.xla_device()

    # Start the profiler context-manager
    with torch_neuronx.experimental.profiler.profile(
        port=9012,
        profile_type='system',
        target='neuron_profile_perfetto',
        output_dir=os.environ['NEURON_RT_INSPECT_OUTPUT_DIR'],
        ms_duration=30000) as profiler:

        # IMPORTANT: the model has to be transferred to XLA within
        # the context manager, otherwise profiling won't work
        model = MLP().to(device)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        loss_fn = torch.nn.NLLLoss()

        # start training loop
        print('----------Training ---------------')
        model.train()
        for epoch in range(EPOCHS):
            optimizer.zero_grad()
            train_x = torch.randn(1, 10).to(device)
            train_label = torch.tensor([1]).to(device)

            # forward
            loss = loss_fn(model(train_x), train_label)

            # back
            loss.backward()
            optimizer.step()

            # XLA: collect ops and run them in XLA runtime
            xm.mark_step()

    print('----------End Training ---------------')

if __name__ == '__main__':
    main()

Run this workload with the following command:

NEURON_RT_INSPECT_OUTPUT_DIR="output" python simple_demo.py

Non-framework Specific User Experience#

You can also control profiling with environment variables. This is useful when you can’t easily change your application code, such as when running an executable which calls the Neuron Runtime or in a containerized environment where the application code is built into the container image.

Profile Capture Environment Variables#

  • NEURON_RT_INSPECT_ENABLE: Set to 1 to enable system and device profiles. For control over which profile types are captured, use NEURON_RT_INSPECT_SYSTEM_PROFILE and NEURON_RT_INSPECT_DEVICE_PROFILE.

  • NEURON_RT_INSPECT_OUTPUT_DIR: The directory where captured profile data will be saved. Defaults to ./output.

  • NEURON_RT_INSPECT_SYSTEM_PROFILE: Set to 0 to disable the capture of system profiles. Defaults to 1 when NEURON_RT_INSPECT_ENABLE is set to 1.

  • NEURON_RT_INSPECT_DEVICE_PROFILE: Set to 0 to disable the capture of device profiles. Defaults to 1 when NEURON_RT_INSPECT_ENABLE is set to 1.

  • NEURON_RT_INSPECT_DURATION_NSEC: Duration in nanoseconds of the profile capture session. After this time, the profiler will detach from the running workload. A value of 0 will run the profiler for the entire duration of the application. Defaults to 0.

  • NEURON_RT_INSPECT_START_OFFSET_NSEC: Time in nanoseconds to wait between launching the workload and starting profile capture. A value of 0 starts profiling immediately. Defaults to 0.
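For example, to capture a 10-second profiling window starting 5 seconds after the workload launches (the durations here are chosen purely for illustration):

NEURON_RT_INSPECT_ENABLE=1 \
NEURON_RT_INSPECT_OUTPUT_DIR=./output \
NEURON_RT_INSPECT_START_OFFSET_NSEC=5000000000 \
NEURON_RT_INSPECT_DURATION_NSEC=10000000000 \
python app.py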

Example Capturing Profile of Application Using Environment Variables#

Instead of using the PyTorch or JAX profilers, you can profile your Python application (or any application calling the Neuron Runtime API) using environment variables.

NEURON_RT_INSPECT_ENABLE=1 NEURON_RT_INSPECT_OUTPUT_DIR=./output python app.py

See Profile Capture Environment Variables for other profiling options that can be set via environment variable.

Note

There is a known issue with PyTorch not dumping trace files to disk when not using the context-managed profiling API and instead using the NEURON_RT_INSPECT_ENABLE environment variable to enable profiling. This can be mitigated by setting export NEURON_RT_INSPECT_DUMP_PERIOD_SEC=30. This will trigger trace dump files to be written every 30 seconds. If your workload runs for less than 30 seconds, you may still not get trace dump files on disk.

Example Capturing Profile of nccom-test Using Environment Variables#

Profiling can be enabled using environment variables. For simplicity, we generate a Neuron workload using nccom-test, a benchmarking tool that is already available with the Neuron AMI.

export NEURON_RT_INSPECT_ENABLE=1
export NEURON_RT_INSPECT_OUTPUT_DIR=./output
nccom-test allr allg -b 512kb -e 512kb -r 32 -n 10 -d fp32 -w 1 -f 512

Note

If you have problems with nccom-test, add the --debug flag. If using a trn1.2xlarge instance, change -r 32 to -r 2 to use fewer NeuronCores.

To understand the profiling output, see the section neuron-profile inspect Output.

CLI reference for System Profiles#

In addition to controlling profiling with environment variables, you can use the neuron-profile inspect command line interface for profiling applications. This provides the same functionality as the environment variables but helps you avoid typos and invalid arguments, and provides a useful --help option that explains the available options.

Usage:
neuron-profile [OPTIONS] inspect [inspect-OPTIONS] [userscript...]

Application Options:
-v, --version                      Show version and exit

Help Options:
-h, --help                         Show this help message

[inspect command options]
    -o, --output-dir=              Output directory for the captured profile data, including system and device profiles (default: ./output)
    -s, --start-offset-ns=         Time in nanoseconds to wait between launching the workload and starting profile capture. A value of 0 starts profiling immediately.
                                    (default: 0)
    -d, --duration-ns=             Duration in nanoseconds of the profile capture session. After this time, the profiler will detach from the running workload. A
                                    value of 0 will run the profiler for the entire duration of the application. (default: 0)
    -n, --num-trace-events=        Maximum number of trace events to capture when profiling. Once hitting this limit, no new events are recorded
    -p, --dump-period-sec=         Period in seconds to dump the trace data to disk. A value of 0 will disable periodic dumping. Useful for seeing partial profiling
                                    results for long-running workloads. (default: 30)
        --capture-system-profiles  Disable capture of system profile data. Can reduce output size.
        --capture-device-profiles  Disable capture of device profile data. Can reduce output size.

[inspect command arguments]
userscript:                        Run command/script that launches a Neuron workload. E.g. 'python app.py' or './runscript.sh'

Example of using System Profiles CLI#

You can pass any script that generates a Neuron workload, such as a PyTorch script, to the System Profiles CLI. For simplicity, we generate a Neuron workload using nccom-test, a benchmarking tool that is already available with the Neuron AMI and the aws-neuronx-tools package.

ubuntu@ip-172-31-63-210:~$ neuron-profile inspect -o inspect-output-nccom-test nccom-test allg -b 512kb -e 512kb -r 32 -n 10 -d fp32 -w 1 -f 512
INFO[0000] Running command "nccom-test allg -b 512kb -e 512kb -r 32 -n 10 -d fp32 -w 1 -f 512" with profiling enabled
    size(B)    count(elems)    type    time:avg(us)    algbw(GB/s)    busbw(GB/s)
    524288          131072    fp32           24.15          21.71          21.03
Avg bus bandwidth:    21.0339GB/s

Note

If you have problems with nccom-test, add the --debug flag. If using a trn1.2xlarge instance, change -r 32 to -r 2 to use fewer NeuronCores.

neuron-profile inspect Output#

The command above traces the Neuron workload execution and writes the results to the inspect-output-nccom-test directory. The output directory contains a single NEFF file and a device profile (NTFF) for each NeuronCore that executed that NEFF, along with ntrace.pb and trace_info.pb files storing the system profile data. The output will look like the following:

ubuntu@ip-172-31-63-210:~$ tree inspect-output-nccom-test
inspect-output-nccom-test
└── i-012590440bb9fd263_pid_98399
    ├── 14382885777943380728_instid_0_vnc_0.ntff
    ├── 14382885777943380728_instid_0_vnc_1.ntff
    ├── 14382885777943380728_instid_0_vnc_10.ntff
    ├── 14382885777943380728_instid_0_vnc_11.ntff
    ...
    ├── 14382885777943380728_instid_0_vnc_8.ntff
    ├── 14382885777943380728_instid_0_vnc_9.ntff
    ├── cpu_util.pb
    ├── host_mem.pb
    ├── neff_14382885777943380728.neff
    ├── ntrace.pb
    └── trace_info.pb

2 directories, 74 files

To view a summary of the captured profile data, run the command:

neuron-profile view -d inspect-output-nccom-test --output-format summary-text

EKS User Experience#

Capturing a profile on EKS is most easily done by setting environment variables, as described in the section Non-framework Specific User Experience. By using environment variables, users do not need to change application code in their container image or modify their run commands.

Update the deployment yaml to include the NEURON_RT_INSPECT_ENABLE and NEURON_RT_INSPECT_OUTPUT_DIR environment variables. For distributed workloads, it’s important that NEURON_RT_INSPECT_OUTPUT_DIR points to a directory on a shared volume which all workers have access to.

apiVersion: v1
kind: Pod
metadata:
  name: trn1-mlp
spec:
  restartPolicy: Never
  schedulerName: default-scheduler
  nodeSelector:
    beta.kubernetes.io/instance-type: trn1.32xlarge
  containers:
    - name: trn1-mlp
      env:
        - name: NEURON_RT_INSPECT_ENABLE
          value: "1"
        - name: NEURON_RT_INSPECT_OUTPUT_DIR
          value: "/shared/output"
      command: ['torchrun']
      args:
        - '--nnodes=1'
        - '--nproc_per_node=32'
        - 'train_torchrun.py'
      image: ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPO}:mlp
      imagePullPolicy: IfNotPresent
      resources:
        limits:
          aws.amazon.com/neuron: 16

Note

EKS users running PyTorch and JAX applications are still free to change their application code and use the PyTorch or JAX Python profiling APIs if they want finer-grained control over profiling. However, using the environment variables conveniently allows profiling without modifying the container image or application code.

Processing and Viewing Profiles#

Users have three output options for interacting with their captured profiles:

  • Neuron Profiler UI - Neuron’s custom UI which allows easily drilling down to detailed device profiles from high level system profiles

  • Perfetto - Allows sharing profiles as a single file and viewing your profiles in the Perfetto UI at https://ui.perfetto.dev/

  • JSON - human-readable text output that enables simple scripting

Neuron Profiler UI#

To view a profile in the Neuron Profiler UI, run the following command to process the profile and launch the UI:

neuron-profile view -d ./output

To view profiles with the Neuron Profiler UI running locally, you will need to have InfluxDB installed on your system. To install and set up InfluxDB, follow the directions in the official Neuron Profile documentation.

Neuron Profiler System Profile UI#

The system profile timeline shows a trace of Neuron Runtime API calls, ML framework function calls, CPU utilization, and memory usage on each of the instances in your workload. The Neuron Runtime API trace is grouped by NeuronCore index and EC2 instance ID. For example, all events in the row labeled nrt-nc-003-i-0f207fb2a99bd2d08 are associated with NeuronCore 3 and instance i-0f207fb2a99bd2d08.

Framework function traces are grouped by thread ID and EC2 instance ID. For example, all events in the row framework-3266405268-i-0f207fb2a99bd2d08 are framework or application function calls made on thread 3266405268 running on instance i-0f207fb2a99bd2d08.

[Image: neuron-profiler2-annotate-system-ui]

Clicking on a trace event in the timeline shows an “Event attributes” view with a list of attributes associated with that event. For example, clicking on an nrt_execute event (the Neuron Runtime API call for executing a compiled model on a NeuronCore) will show attributes such as the Flop count (the number of floating point operations for a single execution of the model), the model name, and the NeuronCore index and EC2 instance ID associated with the function call.

[Image: neuron-profiler2-attributes-window]

Neuron Profiler 2.0 allows users to drill down from a system timeline to a device profile timeline in order to see a detailed view of hardware activity during the execution of a graph. To do this, select an nrt_execute event in the timeline, then in the “Event attributes” view select the “Open device profile” button under the Model Name attribute. This will open a new window with a device profile. For help understanding a device profile, see the documentation section “Understanding a Neuron Profile”.

[Image: neuron-profiler2-drilldown-device]

To see a list of all device profiles that were captured during your workload press the “Device Profiles” button at the bottom of the timeline. From this list you can see all unique compiled graphs (NEFFs) that were executed on NeuronCores during your workload. For each graph there is a link to a device profile that will show a detailed view of hardware activity on the NeuronCore during execution of this graph.

[Image: neuron-profiler2-device-profile-list]

Viewing Profiles with Perfetto#

Perfetto is an open-source trace analysis toolkit with a powerful UI for visualizing and analyzing trace data. Users of Neuron Profiler have the option of viewing their profiles in the Perfetto UI.

The --output-format perfetto option writes processed data to Perfetto’s native protobuf-based tracing format which can be visualized in the Perfetto UI at https://ui.perfetto.dev/.

Example:

neuron-profile view -d ./output --output-format perfetto

This will generate a system_profile.pftrace file for the system profile and a device_profile_model_<model_id>.pftrace file for each unique compiled model that was executed on a Neuron Device.

To view the system profile, go to https://ui.perfetto.dev/ and open the system_profile.pftrace file.

Note

When loading trace files in the Perfetto UI, your data is processed locally and not uploaded to Perfetto’s servers.

[Image: neuron-profiler2-perfetto-timeline]

To view a device profile go to https://ui.perfetto.dev/ and open the device_profile_model_<model_id>.pftrace file. This will show a detailed view of hardware activity on the NeuronCore during execution of this graph.

[Image: neuron-profiler2-perfetto-device-timeline]

Note

Your browser may run out of memory when viewing *.pftrace (Perfetto trace) files that are more than a few hundred MB. See the section Viewing Large Profiles in Perfetto for directions on how to view large traces using the trace processor.

Generating JSON Output From Profiles#

The --output-format json option writes processed profile data to human-readable JSON that can be used for scripting and manual inspection.

neuron-profile view -d ./output --output-format json

This will generate a system_profile.json file containing the system profile data and a device_profile_model_<model_id>.json file for each unique compiled model that was executed on a Neuron Device.

The system_profile.json file contains the following data types (a short parsing example follows the list):

  • trace_events: Neuron Runtime API trace events and Framework/Application trace events containing timestamps, durations, names, and the EC2 instance ID to differentiate between events from different compute nodes in a distributed workload.

{
    "Neuron_Runtime_API_Event": {
        "duration": 27094,
        "group": "nrt-nc-000",
        "id": 1,
        "instance_id": "i-0f207fb2a99bd2d08",
        "lnc_idx": "0",
        "name": "nrt_tensor_write",
        "parent_id": 0,
        "process_id": "1627711",
        "size": "4",
        "tensor_id": "4900392441224765051",
        "tensor_name": "_unknown_",
        "thread_id": 1627711,
        "timestamp": 1729888371056597613,
        "type": 11
    },
    "Framework_Event": {
        "duration": 3758079,
        "group": "framework-80375131",
        "instance_id": "i-0f207fb2a99bd2d08",
        "name": "PjitFunction(matmul_allgather)",
        "process_id": "701",
        "thread_id": 80375131,
        "timestamp": 1729888382798557372,
        "type": 99999
    }
}
  • mem_usage: sampled host memory usage

{
    "duration": 1,
    "instance_id": "i-0f207fb2a99bd2d08",
    "percent_usage": 9.728179797845964,
    "timestamp": 1729888369286687792,
    "usage": 51805806592
}
  • cpu_util: sampled CPU utilization. Results are provided per core and per EC2 instance involved in a distributed workload

{
    "cpu_id": "47",
    "duration": 1,
    "instance_id": "i-0f207fb2a99bd2d08",
    "timestamp": 1729888371287337243,
    "util": 2.3255813
}
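As a simple scripting example, the following minimal sketch sums the total time spent in nrt_execute calls. It assumes trace_events is a list of event objects with the fields shown above; adjust the field access to match the actual layout of your system_profile.json.

import json

# Load the processed system profile (default output location assumed)
with open("./output/system_profile.json") as f:
    profile = json.load(f)

# Sum the durations of all nrt_execute runtime API events; field names
# follow the trace_events examples shown above
total_ns = sum(
    event["duration"]
    for event in profile.get("trace_events", [])
    if event.get("name") == "nrt_execute"
)
print(f"Total time in nrt_execute: {total_ns} ns")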

Processing only system or device profiles#

To reduce processing time, you can skip processing of system or device profiles. This is useful when you are only interested in one type of profile, or when you want to start with a limited set of profiling data before exploring the full profile.

To skip processing of device profiles use the --ignore-device-profile option. To skip processing of system profiles use the --ignore-system-profile option. These options can be used with the --output-format values db (default), perfetto, or json.

For example:

neuron-profile view -d ./output --ignore-device-profile --output-format perfetto

This document is relevant for: Inf1, Inf2, Trn1, Trn2