.. _neuron-profiler-2-0-guide:

Neuron Profiler 2.0 (Beta) User Guide
=====================================

Overview
--------

Neuron Profiler 2.0 offers a user-friendly experience for capturing and analyzing application performance through both high-level system profiles and detailed device-level profiles. Users can profile their workloads using framework-specific APIs within their application code or by setting an environment variable before execution. This tool supports profiling for both single-node and distributed workloads, integrating with environments such as ParallelCluster and EKS. Once captured, profile results can be explored through multiple interfaces: the Neuron Profiler UI, the open-source trace viewer `Perfetto `_, or by exporting to a human-readable JSON format. This flexibility in data capture and visualization enables users to gain comprehensive insights into their application's performance across various scenarios and scales.

.. important::
   The Neuron Profiler will be replaced by the new Neuron Explorer in a future release. For more details and migration guidance, see :ref:`neuron-explorer-faq`.

.. note::
   Neuron Profiler 2.0 is a set of new features currently in beta that enhance and simplify the experience of capturing and viewing profiles. It is not a replacement of :ref:`Neuron Profiler `, which is the existing feature set specifically for capturing and viewing device profiles.

.. _system-profiles-overview:

Key benefits
~~~~~~~~~~~~

- End-to-end timing of model execution and a Neuron Runtime API trace across all workers, helping identify scheduling gaps, synchronization, and host/runtime overheads.
- No extra device memory usage by default, making system profiles ideal when device memory is limited or when only high-level insights are needed.
- Option to capture device profiles for individual models during your workload.
- Flexible capture and viewing: enable via environment variables or framework APIs; view in the Neuron Profiler UI, in Perfetto, or export as JSON.

Capturing profiles
------------------

Neuron Profiler 2.0 offers several flexible options for capturing profiles. Users can either set an environment variable ``NEURON_RT_INSPECT_ENABLE`` or use the PyTorch or JAX profiling APIs from their application code for fine-grained control over which sections of their code are profiled. PyTorch and JAX users who prefer not to modify their application code can still enable profiling by setting the environment variable before running their application.

JAX User Experience
-------------------

JAX Setup
~~~~~~~~~~~~

Follow the :ref:`JAX Setup ` instructions to install the required JAX Neuron Plugin and the latest Neuron Driver, Runtime and Tools packages.

JAX Profiler
~~~~~~~~~~~~

The JAX context-managed profiling API allows you to profile blocks of code. This will capture a system profile including a Neuron Runtime API trace and Python trace for your application code in the captured block. This will also capture device profiles for any compiled graphs (NEFFs) executed on NeuronCores within this block. To use the profiler, import the ``jax`` package.

.. code-block:: python

   import jax

Profiling is enabled for all code enclosed in the context when using ``with jax.profiler.trace(os.environ["NEURON_RT_INSPECT_OUTPUT_DIR"]):``

.. note::
   It is important to pass the output directory ``os.environ["NEURON_RT_INSPECT_OUTPUT_DIR"]`` to ``with jax.profiler.trace`` and run ``export NEURON_RT_INSPECT_OUTPUT_DIR=`` before enabling profiling. This ensures all captured profile data is saved to the correct output directory.

Custom Annotations in JAX
~~~~~~~~~~~~~~~~~~~~~~~~~

To add custom annotations to blocks of code in your profile, you can use ``jax.profiler.TraceAnnotation``.
Annotation names can be created at runtime, such as in the :ref:`example here ` using ``with jax.profiler.TraceAnnotation("my_label"+str(i)):``. For more information on TraceAnnotations, see the official `JAX documentation `_.

JAX Profiling using environment variable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Instead of using the ``jax.profiler`` context manager, you can enable profiling for your entire application using an environment variable. This is desirable if you want to capture a profile without modifying your application code. To enable profiling, set the environment variables ``NEURON_RT_INSPECT_ENABLE=1`` and ``NEURON_RT_INSPECT_OUTPUT_DIR=./output`` before running your application. For example:

.. code-block:: shell

   # Make sure to remove the "with jax.profiler.trace" call from the Python script
   NEURON_RT_INSPECT_ENABLE=1 NEURON_RT_INSPECT_OUTPUT_DIR=./output python jax_script.py

When using the ``NEURON_RT_INSPECT_ENABLE`` environment variable instead of ``jax.profiler``, system profiles will not contain a framework and application code trace, only a Neuron Runtime API trace.

Do not set the ``NEURON_RT_INSPECT_ENABLE`` environment variable and use the ``jax.profiler`` within your application code at the same time. Use one or the other.

For more profiling options that can be set through environment variables, see the section :ref:`Profile Capture Environment Variables `.

.. _neuron-profile-full-jax-example:

Full JAX Example
~~~~~~~~~~~~~~~~

Create a file ``jax_script.py`` which performs repeated matrix multiplications distributed across Neuron devices.

.. code-block:: python

   from functools import partial
   import os
   import jax
   import jax.numpy as jnp

   from jax.sharding import Mesh, NamedSharding, PartitionSpec as P
   from jax.experimental.shard_map import shard_map
   from time import sleep

   os.environ["XLA_FLAGS"] = "--xla_dump_hlo_snapshots --xla_dump_to=./dump"

   jax.config.update("jax_default_prng_impl", "rbg")

   mesh = Mesh(jax.devices(), ('i',))

   def device_put(x, pspec):
       return jax.device_put(x, NamedSharding(mesh, pspec))

   lhs_spec = P('i', None)
   lhs = device_put(jax.random.normal(jax.random.key(0), (128, 128)), lhs_spec)

   rhs_spec = P('i', None)
   rhs = device_put(jax.random.normal(jax.random.key(1), (128, 16)), rhs_spec)

   @jax.jit
   @partial(shard_map, mesh=mesh, in_specs=(lhs_spec, rhs_spec), out_specs=rhs_spec)
   def matmul_allgather(lhs_block, rhs_block):
       rhs = jax.lax.all_gather(rhs_block, 'i', tiled=True)
       return lhs_block @ rhs

   # Everything inside this context is captured in the profile
   with jax.profiler.trace(os.environ["NEURON_RT_INSPECT_OUTPUT_DIR"]):
       out = matmul_allgather(lhs, rhs)
       for i in range(10):
           # Annotation names are built at runtime: my_label0, my_label1, ...
           with jax.profiler.TraceAnnotation("my_label"+str(i)):
               out = matmul_allgather(lhs, rhs)
           sleep(0.001)

   expected = lhs @ rhs
   with jax.default_device(jax.devices('cpu')[0]):
       equal = jnp.allclose(jax.device_get(out), jax.device_get(expected), atol=1e-3, rtol=1e-3)
       print("Tensors are the same") if equal else print("Tensors are different")

Set your profile output directory and run the script:

.. code-block:: shell

   export NEURON_RT_INSPECT_OUTPUT_DIR=./output
   python jax_script.py

PyTorch User Experience
-----------------------

PyTorch Setup
~~~~~~~~~~~~~

Follow the :ref:`PyTorch Setup ` instructions to install the required PyTorch Neuron packages as well as the latest Neuron Driver, Runtime and Tools.

PyTorch Profiler
~~~~~~~~~~~~~~~~

The PyTorch context-managed profiling API allows you to profile blocks of code. This will capture a system profile including a Neuron Runtime API trace and Python trace for your application code in the captured block.
This will also capture device profiles for any compiled graphs executed on NeuronCores within this block. To use the profiler, import it in your application:

.. code-block:: python

   import os

   import torch_neuronx
   from torch_neuronx.experimental import profiler

Then profile a block of code using:

.. code-block:: python

   with torch_neuronx.experimental.profiler.profile(
           port=9012,
           profile_type='system',
           target='neuron_profile_perfetto',
           output_dir=os.environ['NEURON_RT_INSPECT_OUTPUT_DIR'],
           ms_duration=30000) as profiler:
       ...  # the code you want to profile goes here

After modifying your code to call the profiler, run your application as you normally would but set the environment variable ``NEURON_RT_INSPECT_OUTPUT_DIR`` to specify the output directory.

.. code-block:: shell

   NEURON_RT_INSPECT_OUTPUT_DIR=./output python application.py

.. note::
   It is essential to set ``output_dir=os.environ['NEURON_RT_INSPECT_OUTPUT_DIR']`` when starting the profiler from your application code. This ensures that all profile data sources dump to the same output directory.

PyTorch Profiling using Environment Variable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Instead of using the ``torch_neuronx.experimental.profiler.profile`` context manager, you can enable profiling for your entire application using an environment variable. This is desirable if you want to capture a profile without modifying your application code. To enable profiling, set the environment variables ``NEURON_RT_INSPECT_ENABLE=1`` and ``NEURON_RT_INSPECT_OUTPUT_DIR=./output`` before running your application. For example:

.. code-block:: shell

   # Make sure to remove the "with torch_neuronx.experimental.profiler.profile" call from the Python script
   NEURON_RT_INSPECT_ENABLE=1 NEURON_RT_INSPECT_OUTPUT_DIR=./output python pytorch_script.py

When using the ``NEURON_RT_INSPECT_ENABLE`` environment variable instead of ``torch_neuronx.experimental.profiler.profile``, system profiles will not contain a framework and application code trace, only a Neuron Runtime API trace.
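
As a rough illustration of the environment-variable approach described above, the following sketch launches an unmodified script with profiling enabled from a small Python wrapper. Only the two variable names come from this guide; the helper names ``build_profile_env`` and ``run_with_profiling`` are hypothetical and are not part of any Neuron package.

.. code-block:: python

   import os
   import subprocess
   import sys


   def build_profile_env(output_dir="./output", base_env=None):
       """Return a copy of the environment with Neuron profiling enabled.

       Hypothetical helper: it only sets the two variables named in this guide.
       """
       env = dict(os.environ if base_env is None else base_env)
       env["NEURON_RT_INSPECT_ENABLE"] = "1"
       env["NEURON_RT_INSPECT_OUTPUT_DIR"] = output_dir
       return env


   def run_with_profiling(script, output_dir="./output"):
       """Run a Python script as a child process with profiling enabled.

       Returns the exit code of the child process.
       """
       completed = subprocess.run(
           [sys.executable, script],
           env=build_profile_env(output_dir),
       )
       return completed.returncode

Calling ``run_with_profiling("pytorch_script.py")`` would then be roughly equivalent to the shell command shown earlier. Building the environment in a separate function keeps the parent process's own environment unchanged.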
Do not set the ``NEURON_RT_INSPECT_ENABLE`` environment variable and use the ``torch_neuronx.experimental.profiler.profile`` within your application code at the same time. Use one or the other.

For more profiling options that can be set through environment variables, see the section :ref:`Profile Capture Environment Variables `.

Full PyTorch Example
~~~~~~~~~~~~~~~~~~~~

Create a file ``train_torchrun_context.py`` with the following contents:

.. code-block:: python

   import os

   import torch
   import torch.nn as nn
   import torch.nn.functional as F

   # XLA imports
   import torch_xla
   import torch_xla.core.xla_model as xm
   import torch_xla.debug.profiler as xp

   import torch_neuronx
   from torch_neuronx.experimental import profiler

   os.environ["NEURON_CC_FLAGS"] = "--cache_dir=./compiler_cache"

   # Global constants
   EPOCHS = 2

   # Declare 3-layer MLP Model
   class MLP(nn.Module):
       def __init__(self, input_size=10, output_size=2, layers=[5, 5]):
           super(MLP, self).__init__()
           self.fc1 = nn.Linear(input_size, layers[0])
           self.fc2 = nn.Linear(layers[0], layers[1])
           self.fc3 = nn.Linear(layers[1], output_size)

       def forward(self, x):
           x = F.relu(self.fc1(x))
           x = F.relu(self.fc2(x))
           x = self.fc3(x)
           return F.log_softmax(x, dim=1)

   def main():
       # Fix the random number generator seeds for reproducibility
       torch.manual_seed(0)

       # XLA: Specify XLA device (defaults to a NeuronCore on Trn1 instance)
       device = xm.xla_device()

       # Start the profiler context-manager
       with torch_neuronx.experimental.profiler.profile(
               port=9012,
               profile_type='system',
               target='neuron_profile_perfetto',
               output_dir=os.environ['NEURON_RT_INSPECT_OUTPUT_DIR'],
               ms_duration=30000) as profiler:

           # IMPORTANT: the model has to be transferred to XLA within
           # the context manager, otherwise profiling won't work
           model = MLP().to(device)

           optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
           loss_fn = torch.nn.NLLLoss()

           # start training loop
           print('----------Training ---------------')
           model.train()
           for epoch in range(EPOCHS):
               optimizer.zero_grad()
               train_x = torch.randn(1, 10).to(device)
               train_label = torch.tensor([1]).to(device)

               # forward
               loss = loss_fn(model(train_x), train_label)

               # back
               loss.backward()
               optimizer.step()

               # XLA: collect ops and run them in XLA runtime
               xm.mark_step()

       print('----------End Training ---------------')

   if __name__ == '__main__':
       main()

Run this workload with the following command:

.. code-block:: shell

   NEURON_RT_INSPECT_OUTPUT_DIR="output" python train_torchrun_context.py

.. _neuron-profiler-non-framework-user-experience:

Non-framework Specific User Experience
--------------------------------------

You can also control profiling with environment variables. This is useful when you can’t easily change your application code, such as when running an executable which calls the Neuron Runtime or in a containerized environment where the application code is built into the container image.

.. _neuron-profiler-capture-environment-variables:

Profile Capture Environment Variables
--------------------------------------

.. _core-control-variables:

Core control variables
~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :widths: auto
   :header-rows: 1
   :align: left

   * - Variable
     - Description
     - Default behavior
   * - ``NEURON_RT_INSPECT_ENABLE``
     - Set to ``1`` to enable profiling
     - Enables system profiling and disables device profiling. To control which profile types are captured, see :ref:`Profile type selection `
   * - ``NEURON_RT_INSPECT_OUTPUT_DIR``
     - Directory for profile data output
     - Default directory for captured profile data is ``./output``

.. _profile-type-selection:

Profile type selection
~~~~~~~~~~~~~~~~~~~~~~~

.. note::
   When ``NEURON_RT_INSPECT_ENABLE`` is set to ``1``, ``NEURON_RT_INSPECT_SYSTEM_PROFILE`` is enabled by default (set to ``1``) and ``NEURON_RT_INSPECT_DEVICE_PROFILE`` is disabled by default (set to ``0``).

When ``NEURON_RT_INSPECT_ENABLE`` is set to ``1``, two different profile types are available:

.. list-table::
   :widths: auto
   :header-rows: 1
   :align: left

   * - Variable
     - Profile type
     - Description
     - Enable capture
     - Disable capture
   * - ``NEURON_RT_INSPECT_SYSTEM_PROFILE``
     - System-level
     - Captures runtime system events and operations
     - Set to ``1``
     - Set to ``0``
   * - ``NEURON_RT_INSPECT_DEVICE_PROFILE``
     - Device-level
     - Captures detailed NeuronCore hardware metrics
     - Set to ``1``
     - Set to ``0``

.. note::
   These variables have no effect if ``NEURON_RT_INSPECT_ENABLE`` is not set to ``1``.

.. _advanced-config-vars:

Advanced configuration
~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :widths: auto
   :header-rows: 1
   :align: left

   * - Variable
     - Profile type
     - Description
     - Default behavior
   * - ``NEURON_RT_INSPECT_SYS_TRACE_MAX_EVENTS_PER_NC``
     - System-level
     - Maximum trace events per NeuronCore before oldest events are overwritten
     - 1,000,000

.. note::
   Increasing the event limit will consume more host memory.

Example Capturing Profile of Application Using Environment Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Instead of using the PyTorch or JAX profilers you can profile your Python application (or any application calling the Neuron Runtime API) using environment variables.

.. code-block:: shell

   NEURON_RT_INSPECT_ENABLE=1 NEURON_RT_INSPECT_OUTPUT_DIR=./output python app.py

See :ref:`Profile Capture Environment Variables ` for other profiling options that can be set via environment variable.

Example Capturing Profile of nccom-test Using Environment Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Profiling can be enabled using environment variables. A quick way to generate a Neuron workload is :ref:`nccom-test `, a benchmarking tool that is already available with the Neuron AMI.

.. code-block:: shell

   export NEURON_RT_INSPECT_ENABLE=1
   export NEURON_RT_INSPECT_OUTPUT_DIR=./output
   nccom-test allr allg -b 512kb -e 512kb -r 32 -n 10 -d fp32 -w 1 -f 512

.. note::
   If you have problems with nccom-test, add the ``--debug`` flag. If using a trn1.2xlarge instance, change ``-r 32`` to ``-r 2`` to use fewer NeuronCores.

To understand the profiling output see this section: :ref:`Inspect Output `

CLI reference for System Profiles
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In addition to controlling profiling with environment variables, you can use the ``neuron-profile inspect`` command line interface for profiling applications. This provides the same functionality as the environment variables but helps you avoid typos and invalid arguments, and its ``--help`` option explains the available options.

.. code-block:: shell

   Usage:
     neuron-profile [OPTIONS] inspect [inspect-OPTIONS] [userscript...]

   Application Options:
     -v, --version                    Show version and exit

   Help Options:
     -h, --help                       Show this help message

   [inspect command options]
         -o, --output-dir=            Output directory for the captured profile data, including system and device profiles (default: ./output)
         -n, --num-trace-events=      Maximum number of trace events to capture when profiling. Once hitting this limit, no new events are recorded
             --capture-system-profiles  Disable capture of system profile data. Can reduce output size.
             --capture-device-profiles  Disable capture of device profile data. Can reduce output size.

   [inspect command arguments]
     userscript:                      Run command/script that launches a Neuron workload. E.g. 'python app.py' or './runscript.sh'

Example of using System Profiles CLI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Users can pass any script of their own that generates a Neuron workload, such as a PyTorch script, to the System Profiles CLI. A quick way to generate a Neuron workload is ``nccom-test``, a benchmarking tool that is already available with the Neuron AMI and the ``aws-neuronx-tools`` package.

.. code-block:: shell

   ubuntu@ip-172-31-63-210:~$ neuron-profile inspect -o inspect-output-nccom-test nccom-test allg -b 512kb -e 512kb -r 32 -n 10 -d fp32 -w 1 -f 512
   INFO[0000] Running command "nccom-test allg -b 512kb -e 512kb -r 32 -n 10 -d fp32 -w 1 -f 512" with profiling enabled
       size(B)    count(elems)    type    time:avg(us)    algbw(GB/s)    busbw(GB/s)
        524288          131072    fp32           24.15          21.71          21.03
   Avg bus bandwidth:    21.0339GB/s

.. note::
   If you have problems with nccom-test, add the ``--debug`` flag. If using a trn1.2xlarge instance, change ``-r 32`` to ``-r 2`` to use fewer NeuronCores.

.. _neuron-profiler-inspect-output:

``neuron-profile inspect`` Output
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The above command traces the execution of a Neuron workload and writes the results to the ``inspect-output-nccom-test`` directory. The output directory contains a single NEFF file and a device profile (NTFF) for each NeuronCore that executed that NEFF. It also contains the ``ntrace.pb`` and ``trace_info.pb`` files, which store the system profile data. The output looks like this:

.. code-block:: shell

   ubuntu@ip-172-31-63-210:~$ tree inspect-output-nccom-test
   inspect-output-nccom-test
   ├── i-012590440bb9fd263_pid_98399
   │   ├── 14382885777943380728_instid_0_vnc_0.ntff
   │   ├── 14382885777943380728_instid_0_vnc_1.ntff
   │   ├── 14382885777943380728_instid_0_vnc_10.ntff
   │   ├── 14382885777943380728_instid_0_vnc_11.ntff
   ...
   │   ├── 14382885777943380728_instid_0_vnc_8.ntff
   │   ├── 14382885777943380728_instid_0_vnc_9.ntff
   │   ├── cpu_util.pb
   │   ├── host_mem.pb
   │   ├── neff_14382885777943380728.neff
   │   ├── ntrace.pb
   │   └── trace_info.pb
   └──
   2 directories, 74 files

To view a summary of the captured profile data, run the command:

.. code-block:: shell

   neuron-profile view -d inspect-output-nccom-test --output-format summary-text

EKS User Experience
-------------------

Capturing a profile on EKS is most easily done by setting environment variables as described in the section :ref:`Non-framework specific User Experience `.
By using environment variables, users do not need to change application code in their container image or modify their run commands.

Update the deployment yaml to include the ``NEURON_RT_INSPECT_ENABLE`` and ``NEURON_RT_INSPECT_OUTPUT_DIR`` environment variables. For distributed workloads, it’s important that ``NEURON_RT_INSPECT_OUTPUT_DIR`` points to a directory on a shared volume which all workers have access to.

.. code-block:: yaml

   apiVersion: v1
   kind: Pod
   metadata:
     name: trn1-mlp
   spec:
     restartPolicy: Never
     schedulerName: default-scheduler
     nodeSelector:
       beta.kubernetes.io/instance-type: trn1.32xlarge
     containers:
       - name: trn1-mlp
         env:
           - name: NEURON_RT_INSPECT_ENABLE
             value: "1"
           - name: NEURON_RT_INSPECT_OUTPUT_DIR
             value: "/shared/output"
         command: ['torchrun']
         args:
           - '--nnodes=1'
           - '--nproc_per_node=32'
           - 'train_torchrun.py'
         image: ${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPO}:mlp
         imagePullPolicy: IfNotPresent
         resources:
           limits:
             aws.amazon.com/neuron: 16

.. note::
   EKS users running PyTorch and JAX applications are still free to change their application code and use the PyTorch or JAX Python profiling APIs if they want finer-grained control over profiling. However, using the environment variables conveniently allows profiling without modifying the container image or application code.

Processing and Viewing Profiles
-------------------------------

Users have three output options for interacting with their captured profiles:

* Neuron Profiler UI - Neuron’s custom UI which allows easily drilling down to detailed device profiles from high level system profiles
* Perfetto - Allows sharing profiles as a single file and viewing your profiles in the Perfetto UI at https://ui.perfetto.dev/
* JSON - human-readable text output that enables simple scripting

Neuron Profiler UI
~~~~~~~~~~~~~~~~~~

To view a profile in the Neuron Profiler UI, run the following command to process the profile and launch the UI:

.. code-block:: shell

   neuron-profile view -d ./output

To view profiles with the Neuron Profiler UI running locally you will need to have InfluxDB installed on your system. To install and set up InfluxDB, follow the :ref:`directions in the official Neuron Profile documentation `.

Neuron Profiler System Profile UI
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The system profile timeline shows a trace of Neuron Runtime API calls, ML framework function calls, CPU utilization, and memory usage on each of the instances in your workload. The Neuron Runtime API trace is grouped by NeuronCore index and EC2 instance ID. For example, all events in the row labeled nrt-nc-003-i-0f207fb2a99bd2d08 are associated with NeuronCore 3 and instance i-0f207fb2a99bd2d08. Framework function traces are grouped by thread ID and EC2 instance ID. For example, all events in the row framework-3266405268-i-0f207fb2a99bd2d08 are framework or application function calls made on thread 3266405268 running on instance i-0f207fb2a99bd2d08.

|neuron-profiler2-annotate-system-ui|

Clicking on a trace event in the timeline shows an “Event attributes” view with a list of attributes associated with that event. For example, clicking on an nrt_execute event (the Neuron Runtime API call for executing a compiled model on a NeuronCore) will show attributes such as Flop count (the number of floating point operations for a single execution of the model), the model name, and the NeuronCore index and EC2 instance ID associated with the function call.

|neuron-profiler2-attributes-window|

Neuron Profiler 2.0 allows users to drill down from a system timeline to a device profile timeline in order to see a detailed view of hardware activity during the execution of a graph. To do this, select an nrt_execute event in the timeline and in the “Event attributes” view select the "Open device profile" button under the Model Name attribute. This will open a new window with a device profile.
For help understanding a device profile, see the documentation section "Understanding a Neuron Profile".

|neuron-profiler2-drilldown-device|

To see a list of all device profiles that were captured during your workload, press the “Device Profiles” button at the bottom of the timeline. From this list you can see all unique compiled graphs (NEFFs) that were executed on NeuronCores during your workload. For each graph there is a link to a device profile that will show a detailed view of hardware activity on the NeuronCore during execution of this graph.

|neuron-profiler2-device-profile-list|

Viewing Profiles with Perfetto
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Perfetto is an open-source trace analysis toolkit with a powerful UI for visualizing and analyzing trace data. Users of Neuron Profiler have the option of viewing their profiles in the Perfetto UI.

The ``--output-format perfetto`` option writes processed data to Perfetto's native protobuf-based tracing format which can be visualized in the Perfetto UI at https://ui.perfetto.dev/.

Example:

.. code-block:: shell

   neuron-profile view -d ./output --output-format perfetto

This will generate a ``system_profile.pftrace`` file for the system profile and a ``device_profile_model_.pftrace`` file for each unique compiled model that was executed on a Neuron Device.

To view the system profile, go to https://ui.perfetto.dev/ and open the ``system_profile.pftrace`` file.

.. note::
   When loading trace files in the Perfetto UI, your data is processed locally and not uploaded to Perfetto’s servers.

|neuron-profiler2-perfetto-timeline|

To view a device profile, go to https://ui.perfetto.dev/ and open the ``device_profile_model_.pftrace`` file. This will show a detailed view of hardware activity on the NeuronCore during execution of this graph.

|neuron-profiler2-perfetto-device-timeline|

.. note::
   Your browser may run out of memory when viewing ``*.pftrace`` (Perfetto trace) files that are more than a few hundred MB. See the section :ref:`Viewing Large Profiles in Perfetto ` for directions on how to view large traces using the trace processor.

Perfetto Output View Options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When outputting to Perfetto it is possible to group your traces by different attributes. This is useful for larger profiles involving many NeuronCores and instances. The following options are available:

.. list-table:: Perfetto output view options
   :header-rows: 1
   :widths: 30 70

   * - CLI option
     - Description
   * - ``--system-trace-primary-group``
     - First-order grouping of trace events (maps to a Perfetto process / process group of rows). Provide a comma-delimited list of field names. Allowed fields: ``instance_id``, ``thread_id``, ``lnc_idx``, ``process_id``. Default: ``instance_id,process_id``.
   * - ``--system-trace-secondary-group``
     - Second-order grouping of trace events (maps to a Perfetto thread / single row). Provide a comma-delimited list of field names. Allowed fields: ``instance_id``, ``worker_gid``, ``thread_id``, ``lnc_idx``, ``process_id``. Default: ``worker_gid,lnc_idx,thread_id``.

For example, the following profile uses ``neuron-profile view --output-format=perfetto --system-trace-primary-group=instance_id,process_id --system-trace-secondary-group=lnc_idx,thread_id`` to group the system profile first by unique combinations of instance_id and process_id; within each of those groups, each row contains the events for one unique combination of lnc_idx and thread_id.

|neuron-profiler2-perfetto-grouping|

Grouping By Global Worker ID
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, Perfetto traces are grouped by ``worker_gid``, which is a unique global identifier for each NeuronCore across all instances in a distributed workload. When clicking on an event in the trace you will see fields for both ``lnc_idx`` (local NeuronCore index on that process) and ``worker_gid`` (global NeuronCore index across all instances).
It is possible for ``lnc_idx`` to be the same for different processes on the same instance or across different instances in a distributed workload. However, ``worker_gid`` is unique for each NeuronCore across all instances. The image below shows how to correlate the naming of tracks (rows) in the Perfetto UI to both ``lnc_idx`` and ``worker_gid``.

|neuron-profiler2-perfetto-gid|

Generating JSON Output From Profiles
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``--output-format json`` option writes processed profile data to human-readable JSON that can be used for scripting and manual inspection.

.. code-block:: shell

   neuron-profile view -d ./output --output-format json

This will generate a ``system_profile.json`` file containing the system profile data and a ``device_profile_model_.json`` file for each unique compiled model that was executed on a Neuron Device.

The ``system_profile.json`` file contains the following data types:

* ``trace_events``: Neuron Runtime API trace events and Framework/Application trace events containing timestamps, durations, names, and the EC2 instance ID to differentiate between events from different compute nodes in a distributed workload.

.. code-block:: json

   {
     "Neuron_Runtime_API_Event": {
       "duration": 27094,
       "group": "nrt-nc-000",
       "id": 1,
       "instance_id": "i-0f207fb2a99bd2d08",
       "lnc_idx": "0",
       "name": "nrt_tensor_write",
       "parent_id": 0,
       "process_id": "1627711",
       "size": "4",
       "tensor_id": "4900392441224765051",
       "tensor_name": "_unknown_",
       "thread_id": 1627711,
       "timestamp": 1729888371056597613,
       "type": 11
     },
     "Framework_Event": {
       "duration": 3758079,
       "group": "framework-80375131",
       "instance_id": "i-0f207fb2a99bd2d08",
       "name": "PjitFunction(matmul_allgather)",
       "process_id": "701",
       "thread_id": 80375131,
       "timestamp": 1729888382798557372,
       "type": 99999
     }
   }

* ``mem_usage``: sampled host memory usage

.. code-block:: json

   {
     "duration": 1,
     "instance_id": "i-0f207fb2a99bd2d08",
     "percent_usage": 9.728179797845964,
     "timestamp": 1729888369286687792,
     "usage": 51805806592
   }

* ``cpu_util``: sampled CPU utilization. Results are provided per core and per EC2 instance involved in a distributed workload.

.. code-block:: json

   {
     "cpu_id": "47",
     "duration": 1,
     "instance_id": "i-0f207fb2a99bd2d08",
     "timestamp": 1729888371287337243,
     "util": 2.3255813
   }

Processing only system or device profiles
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To reduce processing times it is possible to skip processing of system or device profiles. Sometimes users may only be interested in one or want to start with a limited set of profiling data before exploring the full profile.

To skip processing of device profiles use the ``--ignore-device-profile`` option. To skip processing of system profiles use the ``--ignore-system-profile`` option. These options can be used with the ``--output-format`` values ``db`` (default), ``perfetto``, or ``json``.

For example:

.. code-block:: shell

   neuron-profile view -d ./output --ignore-device-profile --output-format perfetto

.. _neuron-profiler-filtering-system-profiles:

Filtering System Profiles
--------------------------

This section explains how to filter system trace events to optimize memory usage, reduce output size, and speed up trace processing.

**Capture-time filtering** reduces memory usage and trace file size by only collecting specific events, but filtered data cannot be recovered later.

**Processing-time filtering** preserves the complete trace and allows flexible analysis with different filters, but requires more memory and storage during capture.

Capture-Time Filtering
~~~~~~~~~~~~~~~~~~~~~~

Configure filters before trace capture using environment variables or API functions. You can use NeuronCore filters to only capture events for specific NeuronCores (for example only events associated with NeuronCore 0 or all the NeuronCores on a specific NeuronDevice).
You can use event type filters to only capture specific events (for example model execute or collectives events). It is possible to combine both NeuronCore and event type filters.

Filtering by NeuronCore
^^^^^^^^^^^^^^^^^^^^^^^

If capture is enabled for a NeuronCore, a ring buffer is allocated in host memory for storing that core's events. Filtering by NeuronCore therefore decreases host memory usage during capture.

Default Behavior
"""""""""""""""""

By default, all visible NeuronCores are enabled for capture.

Using Environment Variables
"""""""""""""""""""""""""""

.. code-block:: shell

   # Filter to capture events only from NeuronCore 0
   export NEURON_RT_INSPECT_EVENT_FILTER_NC=0

   # Filter to capture events from NeuronCores 0, 2, and 4
   export NEURON_RT_INSPECT_EVENT_FILTER_NC=0,2,4

   # Filter to capture events from a range of NeuronCores (0 through 3)
   export NEURON_RT_INSPECT_EVENT_FILTER_NC=0-3

   # Reset to default behavior
   unset NEURON_RT_INSPECT_EVENT_FILTER_NC  # Back to capturing all visible cores

Using API Functions
"""""""""""""""""""

.. 
code-block:: c #include // Allocate and configure trace options nrt_sys_trace_config_t *config; nrt_sys_trace_config_allocate(&config); nrt_sys_trace_config_set_defaults(config); // Enable capture only for specific NeuronCores // Disable all cores since by default they are all enabled int num_cores = 128; for (int i=0; i // Get all available event types const char **event_types = nullptr; size_t count = 0; NRT_STATUS status = nrt_sys_trace_get_event_types(&event_types, &count); if (status == NRT_SUCCESS) { printf("Available event types:\n"); for (size_t i = 0; i < count; ++i) { printf(" %s\n", event_types[i]); } // Free the event types array for (size_t i = 0; i < count; ++i) { free((void*)event_types[i]); } free((void*)event_types); } Using Environment Variables """"""""""""""""""""""""""" The ``NEURON_RT_INSPECT_EVENT_FILTER_TYPE`` environment variable supports: * **Default**: If not set, all event types are captured * **Specific event types**: Use exact event names from ``nrt_sys_trace_get_event_types()`` * **Event categories**: Use ``hardware`` or ``software`` to filter by category * **Exclusion**: Use ``^`` prefix to exclude specific events from a category .. 
code-block:: shell # Filter to capture only specific event types export NEURON_RT_INSPECT_EVENT_FILTER_TYPE=nrt_load,nrt_execute,nc_exec_running # Filter to capture all hardware events export NEURON_RT_INSPECT_EVENT_FILTER_TYPE=hardware # Filter to capture all software events export NEURON_RT_INSPECT_EVENT_FILTER_TYPE=software # Filter to capture all hardware events EXCEPT cc_exec export NEURON_RT_INSPECT_EVENT_FILTER_TYPE=hardware,^cc_running # Filter to capture all software events EXCEPT nrt_load export NEURON_RT_INSPECT_EVENT_FILTER_TYPE=software,^nrt_load # Mix categories and specific events export NEURON_RT_INSPECT_EVENT_FILTER_TYPE=hardware,nrt_tensor_write,nrt_tensor_read # Reset to default behavior unset NEURON_RT_INSPECT_EVENT_FILTER_TYPE # Back to capturing all event types The ``hardware`` group contains events that are executed on the NeuronCore. These are ``nc_exec_running``, ``cc_running``, ``cc_exec_barrier``, ``numerical_err``, ``nrt_model_switch``, ``timestamp_sync_point``, ``hw_notify``. The ``software`` group contains all other events. Using API Functions """"""""""""""""""" Use the ``nrt_sys_trace_config_set_capture_enabled_for_event_type`` API to filter by event type. .. 
code-block:: c #include // Configure trace options nrt_sys_trace_config_t *config; nrt_sys_trace_config_allocate(&config); nrt_sys_trace_config_set_defaults(config); // By default, all event types are enabled // Disable specific event types (others remain enabled) nrt_sys_trace_config_set_capture_enabled_for_event_type(config, "device_exec", false); // Or disable all first, then enable only specific ones const char **all_event_types = nullptr; size_t all_count = 0; nrt_sys_trace_get_event_types(&all_event_types, &all_count); // Disable all event types first for (size_t i = 0; i < all_count; ++i) { nrt_sys_trace_config_set_capture_enabled_for_event_type(config, all_event_types[i], false); } // Enable only specific event types nrt_sys_trace_config_set_capture_enabled_for_event_type(config, "model_load", true); nrt_sys_trace_config_set_capture_enabled_for_event_type(config, "nrt_execute", true); // Verify which event types are enabled const char **enabled_types = nullptr; size_t enabled_count = 0; nrt_sys_trace_config_get_enabled_event_types(config, &enabled_types, &enabled_count); printf("Enabled event types: %zu\n", enabled_count); for (size_t i = 0; i < enabled_count; ++i) { printf(" %s\n", enabled_types[i]); } // Clean up memory (caller is responsible) for (size_t i = 0; i < enabled_count; ++i) { free((void*)enabled_types[i]); } free((void*)enabled_types); for (size_t i = 0; i < all_count; ++i) { free((void*)all_event_types[i]); } free((void*)all_event_types); // Start tracing nrt_sys_trace_start(config); // Your application code here... // Cleanup nrt_sys_trace_stop(); nrt_sys_trace_config_free(config); .. _neuron-profile-system-timestamp-adjustment: Adjusting Hardware Timestamps ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Hardware events executed on the NeuronCore use device-specific timestamps that are in a different time domain than CPU timestamps. 
To enable accurate correlation between hardware and software events in the JSON system trace output, the runtime automatically adjusts hardware event timestamps to the CPU time domain using synchronization point events.

How Timestamp Adjustment Works
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

System trace events are generated from multiple independent time domains: the host CPU and each ML accelerator device, each operating with its own clock. To align events from different domains, the runtime performs software-based time synchronization after event collection.

**Sync Point Events**: After each execution, a special ``timestamp_sync_point`` event captures nearly simultaneous timestamps from both the host CPU (``cpu_timestamp_ns``) and the device (``nc_timestamp_ns``). These sync point events are included in the returned event trace and serve as reference points for adjusting hardware event timestamps to the CPU domain. The sync point used to align hardware events is visible in the timeline.
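
The role of a sync point can be sketched in Python. This is a minimal illustration, not the runtime's implementation: ``SyncPoint`` and ``to_cpu_time`` are hypothetical names, the timestamp values are made up, and the sketch assumes that the elapsed device time since the sync point can be added directly to the sync point's CPU timestamp.

.. code-block:: python

    from dataclasses import dataclass


    @dataclass(frozen=True)
    class SyncPoint:
        """One timestamp_sync_point sample (hypothetical container)."""
        cpu_timestamp_ns: int  # host CPU clock at the sync point
        nc_timestamp_ns: int   # device clock at (nearly) the same instant


    def to_cpu_time(event_nc_timestamp_ns: int, sync: SyncPoint) -> int:
        """Map a device timestamp into the CPU time domain."""
        # Elapsed device time between the sync point and the event.
        delta_ns = event_nc_timestamp_ns - sync.nc_timestamp_ns
        # Apply the same elapsed time to the sync point's CPU timestamp.
        return sync.cpu_timestamp_ns + delta_ns


    # Made-up values: the event occurs 2500 ns after the sync point on the device.
    sync = SyncPoint(cpu_timestamp_ns=1_000_000_000, nc_timestamp_ns=5_000)
    print(to_cpu_time(7_500, sync))  # 1000002500

Because only the difference between two device timestamps is used, the absolute offset between the device clock and the CPU clock never needs to be known.
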
**Adjustment Algorithm**: For each hardware event, the runtime:

- Uses the sync point with the matching ``exec_id`` for that NeuronCore
- Calculates the time difference between the hardware event and the sync point (in device time)
- Applies that same time difference to the sync point's CPU timestamp
- Formula: ``adjusted_timestamp = sync_cpu_timestamp + (event_device_timestamp - sync_device_timestamp)``

Illustration::

              Sync_Point       HW_Event
                 │                │
                 ▼                ▼
    Device Time ─●────────────────●───>
                 |-------Δt------>|      - sync_device_timestamp and sync_cpu_timestamp occur ~simultaneously, though their clocks differ
    CPU Time ────●────────────────●───>  - Calc Δt = event_device_timestamp - sync_device_timestamp (elapsed time since sync point on device)
                 |-------Δt------>|      - Add Δt to sync_cpu_timestamp to get adjusted_timestamp

|neuron-profiler2-syncpoint-timeline|

**Hardware Events**: Hardware events that require timestamp adjustment include:

- ``nc_exec_running`` (NeuronCore execution start/stop)
- ``cc_running`` (collective communication execution)
- ``cc_exec_barrier`` (collective communication barriers)
- ``numerical_err`` (numerical errors)
- ``nc_model_switch`` (NeuronCore model switching)

Tips
^^^^

1. **Memory Optimization**: Use NeuronCore filtering to avoid allocating ring buffers for unused cores and decrease host memory usage. Use event type filters, NeuronCore filters, or both to decrease output trace size.
2. **Event Type Discovery**: Use ``nrt_sys_trace_get_event_types()`` to discover available event types.
3. **Category Filtering**: Use the ``hardware`` and ``software`` categories for broad filtering.
4. **Exclusion Filtering**: Use the ``^`` prefix to exclude specific events from categories.
5. **Combine Filters**: Use NeuronCore and event type filters together to reduce both host memory usage and trace size.

Processing-Time Filtering
~~~~~~~~~~~~~~~~~~~~~~~~~~

Apply filters when viewing or processing already captured profiles. This approach allows you to analyze the same trace data in different ways without recapturing.
The filters can be used with any ``neuron-profile`` output format, including ``--output-format json`` and ``--output-format perfetto``.

Filtering by NeuronCore
^^^^^^^^^^^^^^^^^^^^^^^

Use the ``--system-trace-filter-neuron-core`` argument to process events only for specific NeuronCores. The IDs are local to the instance and are not global IDs. If the ``--system-trace-filter-neuron-core`` argument is not set, events from all NeuronCores are included in the processed trace.

.. code-block:: shell

    # Filter by single neuron core
    neuron-profile view -d ./output --system-trace-filter-neuron-core "0" --output-format perfetto

    # Filter by multiple neuron cores
    neuron-profile view -d ./output --system-trace-filter-neuron-core "0,1,2,3" --output-format perfetto

Filtering by Event Type
^^^^^^^^^^^^^^^^^^^^^^^

Use the ``--system-trace-filter-event-type`` argument to process only specific trace event types. If the ``--system-trace-filter-event-type`` argument is not set, all event types are included in the processed trace.

.. code-block:: shell

    # Filter by single event type
    neuron-profile view -d ./output --system-trace-filter-event-type "nrt_execute" --output-format perfetto

    # Filter by multiple event types
    neuron-profile view -d ./output --system-trace-filter-event-type "nrt_execute,nrt_load" --output-format perfetto

Filtering by Instance ID
^^^^^^^^^^^^^^^^^^^^^^^^

Use the ``--system-trace-filter-instance-id`` argument to process events only for specific EC2 instances. If the ``--system-trace-filter-instance-id`` argument is not set, events from all instances are included in the processed trace.

.. code-block:: shell

    # Filter by single instance
    neuron-profile view -d ./output --system-trace-filter-instance-id "i-abc123" --output-format perfetto

    # Filter by multiple instances (comma-separated)
    neuron-profile view -d ./output --system-trace-filter-instance-id "i-abc123,i-def456,i-ghi789" --output-format perfetto

Troubleshooting
---------------

Incomplete JAX Profiles
~~~~~~~~~~~~~~~~~~~~~~~

If your JAX profile has fewer events than expected or lacks the Runtime API trace, check whether ``jax.profiler.stop_trace`` is being called inside a ``with jax.profiler.trace`` context block. This can stop tracing prematurely. Use ``jax.profiler.stop_trace`` only when profiling was started with ``jax.profiler.start_trace``, not when using the context-managed ``with jax.profiler.trace`` API.

Also, when using ``jax.profiler`` within your script, ensure that the environment variable ``NEURON_RT_INSPECT_ENABLE`` is not set to 1. Additionally, ensure that ``NEURON_RT_INSPECT_OUTPUT_DIR`` is set to the correct output directory and that this is the directory passed to ``with jax.profiler.trace``.

Dropped Events in System Profile
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When processing a system profile, you may see a warning indicating that some trace events were dropped during capture.

.. code-block:: shell

    WARN[0000] Warning: 1001 trace events were dropped during capture (stored 530560 out of 531561 total events). Consider increasing buffer size, reducing trace duration, or filtering events.

This means that during capture the trace event buffers filled up and the oldest events were overwritten. If you need to avoid dropping events for the full duration of your workload, consider the following adjustments:

* Increase buffer size by setting ``NEURON_RT_INSPECT_SYS_TRACE_MAX_EVENTS_PER_NC`` (see :ref:`Profile Capture Environment Variables `). This will increase host memory usage.
* Apply capture-time filters (NeuronCores / event types) (see :ref:`Filtering System Profiles `).
* Shorten the profiled region: limit the code span under the profiling context / runtime.

.. |neuron-profiler2-annotate-system-ui| image:: /images/neuron-profiler2-annotate-system-ui.png
.. |neuron-profiler2-attributes-window| image:: /images/neuron-profiler2-attributes-window.png
.. |neuron-profiler2-device-profile-list| image:: /images/neuron-profiler2-device-profile-list.png
.. |neuron-profiler2-drilldown-device| image:: /images/neuron-profiler2-drilldown-device.png
.. |neuron-profiler2-perfetto-timeline| image:: /images/neuron-profiler2-perfetto-timeline.png
.. |neuron-profiler2-perfetto-device-timeline| image:: /images/neuron-profiler2-perfetto-device-timeline.png
.. |neuron-profiler2-perfetto-grouping| image:: /images/neuron-profiler2-perfetto-grouping.png
.. |neuron-profiler2-syncpoint-timeline| image:: /images/neuron-profiler2-syncpoint-timeline.png
.. |neuron-profiler2-perfetto-gid| image:: /images/neuron-profiler2-perfetto-gid.png