This document is relevant for: Trn1

How to debug models in PyTorch Neuron (torch-neuronx)#

Torch-XLA evaluates operations lazily, which means it builds a symbolic graph in the background and the graph is executed in hardware only when the users request (print) for the output or a mark_step is encountered. To effectively debug training scripts with torch-xla, please use one of the approaches mentioned below:

Printing Metrics#

Torch-xla provides a utility that records metrics of different sections of the code. These metrics can help figure out things like: How much time is spent in compilation? How much time is spent in execution? To check the metrics:

  1. Import metrics: import torch_xla.debug.metrics as met

  2. Print metrics at the end of the step: print(met.metrics_report())

Printing metrics should produce an output that looks like this:

Metric: CompileTime
  TotalSamples: 1
  Accumulator: 09s969ms486.408us
  Percentiles: 1%=09s969ms486.408us; 5%=09s969ms486.408us; 10%=09s969ms486.408us; 20%=09s969ms486.408us; 50%=09s969ms486.408us; 80%=09s969ms486.408us; 90%=09s969ms486.408us; 95%=09s969ms486.408us; 99%=09s969ms486.408us
.....
Metric: ExecuteTime
  TotalSamples: 1
  Accumulator: 186ms062.970us
  Percentiles: 1%=186ms062.970us; 5%=186ms062.970us; 10%=186ms062.970us; 20%=186ms062.970us; 50%=186ms062.970us; 80%=186ms062.970us; 90%=186ms062.970us; 95%=186ms062.970us; 99%=186ms062.970us
....
Metric: TensorsGraphSize
  TotalSamples: 1
  Accumulator: 9.00
  Percentiles: 1%=9.00; 5%=9.00; 10%=9.00; 20%=9.00; 50%=9.00; 80%=9.00; 90%=9.00; 95%=9.00; 99%=9.00
Metric: TransferFromServerTime
  TotalSamples: 2
  Accumulator: 010ms130.597us
  ValueRate: 549ms937.108us / second
  Rate: 108.372 / second
  Percentiles: 1%=004ms948.602us; 5%=004ms948.602us; 10%=004ms948.602us; 20%=004ms948.602us; 50%=006ms181.995us; 80%=006ms181.995us; 90%=006ms181.995us; 95%=006ms181.995us; 99%=006ms181.995us
Metric: TransferToServerTime
  TotalSamples: 6
  Accumulator: 061ms698.791us
  ValueRate: 007ms731.182us / second
  Rate: 0.665369 / second
  Percentiles: 1%=006ms848.579us; 5%=006ms848.579us; 10%=006ms848.579us; 20%=007ms129.666us; 50%=008ms940.718us; 80%=008ms496.166us; 90%=024ms636.413us; 95%=024ms636.413us; 99%=024ms636.413us
Metric: TransferToServerTransformTime
  TotalSamples: 6
  Accumulator: 011ms835.717us
  ValueRate: 001ms200.844us / second
  Rate: 0.664936 / second
  Percentiles: 1%=108.403us; 5%=108.403us; 10%=108.403us; 20%=115.676us; 50%=167.399us; 80%=516.659us; 90%=010ms790.400us; 95%=010ms790.400us; 99%=010ms790.400us
.....
Counter: xla::_copy_from
  Value: 7
Counter: xla::addmm
  Value: 2
Counter: xla::empty
  Value: 5
Counter: xla::t
  Value: 2
....
Metric: XrtCompile
  TotalSamples: 1
  Accumulator: 09s946ms607.609us
  Mean: 09s946ms607.609us
  StdDev: 000.000us
  Percentiles: 25%=09s946ms607.609us; 50%=09s946ms607.609us; 80%=09s946ms607.609us; 90%=09s946ms607.609us; 95%=09s946ms607.609us; 99%=09s946ms607.609us
Metric: XrtExecute
  TotalSamples: 1
  Accumulator: 176ms932.067us
  Mean: 176ms932.067us
  StdDev: 000.000us
  Percentiles: 25%=176ms932.067us; 50%=176ms932.067us; 80%=176ms932.067us; 90%=176ms932.067us; 95%=176ms932.067us; 99%=176ms932.067us
Metric: XrtReadLiteral
  TotalSamples: 2
  Accumulator: 608.578us
  Mean: 304.289us
  StdDev: 067.464us
  Rate: 106.899 / second
  Percentiles: 25%=236.825us; 50%=371.753us; 80%=371.753us; 90%=371.753us; 95%=371.753us; 99%=371.753us

As seen, you can get useful information about graph compile times/execution times. You can also know which operators are present in the graph, which operators are run on the CPU and which operators are run on an XLA device. For example, operators that have a prefix aten:: would run on the CPU, since they do not have xla lowering. All operators with prefix xla:: would run on an XLA device. Note: aten operators that do not have xla lowering would result in a graph fragmentation and might end up slowing down the entire execution. If you encounter such operators, create a request for operator support.

Printing Tensors#

Users can print tensors in their script as below:

import os
import torch
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()
input1 = torch.randn(2,10).to(device)
# Defining 2 linear layers
linear1 = torch.nn.Linear(10,30).to(device)
linear2 = torch.nn.Linear(30,20).to(device)

# Running forward
output1 = linear1(input1)
output2 = linear2(output1)
print(output2)

Since torch-xla evaluates operations lazily, when you try to print output2 , the graph associated with the tensor would be evaluated. When a graph is evaluated, it is first compiled for the device and executed on the selected device. Note: Each tensor would have a graph associated with it and can result in graph compilations and executions. For example, in the above script, if you try to print output1 , a new graph is cut and you would see another evaluation. To avoid multiple evaluations, you can make use of mark_step (next section).

Use mark_step#

Torch-XLA provides an api called mark_step which evaluates a graph collected upto that point. While this is similar to printing of an output tensor wherein a graph is also evaluated, there is a difference. When an output tensor is printed, only the graph associated with that specific tensor is evaluated, whereas mark_step enables all the output tensors up to mark_step call to be evaluated in a single graph. Hence, any tensor print after mark_step would be effectively free of cost as the tensor values are already evaluated. Consider the example below:

import os
import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()
input1 = torch.randn(2,10).to(device)
# Defining 2 linear layers
linear1 = torch.nn.Linear(10,30).to(device)
linear2 = torch.nn.Linear(30,20).to(device)

# Running forward
output1 = linear1(input1)
output2 = linear2(output1)
xm.mark_step()
print(output2)
print(output1)
# Printing the metrics to check if compilation and execution occurred
print(met.metrics_report())

In the printed metrics, the number of compiles and executions is only 1, even though 2 tensors are printed. Hence, to avoid multiple graph evaluations, it is recommended that you visualize tensors after a mark_step . You can also make use of the add_step_closure api for this purpose. With this api, you pass in the tensors that needs to be visualized/printed. The added tensors would then be preserved in the graph and can be printed as part of the callback function passed to the api. Here is a sample usage: https://github.com/pytorch/xla/blob/master/test/test_train_mp_mnist.py#L133

Note: Graph compilations can take time as the compiler optimizes the graph to run on device.

Using Eager Debug Mode#

Eager debug mode provides a convenient utility to step through the code and evaluate operators one by one for correctness. Eager debug mode is useful to inspect your models the way you would do in eager-mode frameworks like PyTorch and Tensorflow. With Eager Debug Mode operations are executed eagerly. As soon as an operation is registered with torch-xla, it would be sent for compilation and execution. Since compiling a single operation, the time spent would be minimal. Moreover, the chances of hitting the framework compilation cache increases as models would have repeated operations throughout. Consider example 1 below:

# Example 1

import os
# You need to set this env variable before importing torch-xla
# to run in eager debug mode.
os.environ["NEURON_USE_EAGER_DEBUG_MODE"] = "1"

import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()
input1 = torch.randn(2,10).to(device)
# Defining 2 linear layers
linear1 = torch.nn.Linear(10,30).to(device)
linear2 = torch.nn.Linear(30,20).to(device)

# Running forward
output1 = linear1(input1)
output2 = linear2(output1)

# Printing the metrics to check if compilation and execution occurred
# Here, in the metrics you should notice that the XRTCompile and XRTExecute
# value is non-zero, even though no tensor is printed. This is because, each
# operation is executed eagerly.
print(met.metrics_report())

print(output2)
print(output1)
# Printing the metrics to check if compilation and execution occurred.
# Here the XRTCompile count should be same as the previous count.
# In other words, printing tensors did not incur any extra compile
# and execution of the graph
print(met.metrics_report())

As seen from the above scripts, each operator is evaluated eagerly and there is no extra compilation when output tensors are printed. Moreover, together with the on-disk Neuron persistent cache, eager debug mode only incurs one time compilation cost when the ops is first run. When the script is run again, the compiled ops will be pulled from the persistent cache. Any changes you make to the training script would result in the re-compilation of only the newly inserted operations. This is because each operation is compiled independently. Consider example 2 below:

# Example 2

import os
# You need to set this env variable before importing torch-xla
# to run in eager debug mode.
os.environ["NEURON_USE_EAGER_DEBUG_MODE"] = "1"

import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

os.environ['NEURON_CC_FLAGS'] = "--log_level=INFO"

device = xm.xla_device()
input1 = torch.randn(2,10).to(device)
# Defining 2 linear layers
linear1 = torch.nn.Linear(10,30).to(device)
linear2 = torch.nn.Linear(30,20).to(device)
linear3 = torch.nn.Linear(20,30).to(device)
linear4 = torch.nn.Linear(30,20).to(device)

# Running forward
output1 = linear1(input1)
output2 = linear2(output1)
output3 = linear3(output2)

# Note the number of compiles at this point and compare
# with the compiles in the next metrics print
print(met.metrics_report())

output4 = linear4(output3)
print(met.metrics_report())

Running the above example 2 script after running example 1 script, you may notice that from the start until the statement output2 = linear2(output1) , all the graphs would hit the persistent cache. Executing the line output3 = linear3(output2) would result in a new compilation for linear3 layer only because the layer configuration is new. Now, when we run output4 = linear4(output3) , you would observe no new compilation happens. This is because the graph for linear4 is same as the graph for linear2 and hence the compiled graph for linear2 is reused for linear4 by the framework’s internal cache.

Eager debug mode avoids the wait times involved with tensor printing because of larger graph compilation. It is designed only for debugging purposes, so when the training script is ready, please remove the NEURON_USE_EAGER_DEBUG_MODE environment variable from the script in order to obtain optimal performance.

By default, in eager debug mode the logging level in the Neuron compiler is set to error mode. Hence, no logs would be generated unless there is an error. Before your first print, if there are many operations that needs to be compiled, there might be a small delay. In case you want to check the logs, switch on the INFO logs for compiler using:

os.environ["NEURON_CC_FLAGS"] = "--log_level=INFO"

Profiling Model Run#

Profiling model run can help to identify different bottlenecks and resolve issues faster. You can profile different sections of the code to see which block is the slowest. To profile model run, you can follow the steps below:

  1. Add: import torch_xla.debug.profiler as xp

  2. Start server. This can be done by adding the following line after creating xla device: server = xp.start_server(9012)

  3. In a separate terminal, start tensorboard. The logdir should be in the same directory from which you run the script.

    Image: tensorboard.png

    Open the tensorboard on a browser. Go to profile section in the top right. Note: you may have to install the profile plugin using: pip install tensorboard-plugin-profile

  4. When you click on the profile, it should give an option to capture profile. Clicking on capture profile produces the following pop-up.

    Image: popup.png

    In the URL enter: localhost:9012 . Port in this URL should be same as the one you gave when starting the server in the script.

  5. Once done, click capture and it should automatically load the following page:

    Image: tb_1.png
  6. To check the profile for different blocks of code, head to trace_viewer under Tools (on the left column).

    Image: options.png
  7. It should show a profile that looks like this:

    Image: profile_large.png

Note: By default, torch-xla would time different blocks of code inside the library. However, you can also profile block of code in your scripts. This can be done by adding the code within a xp.Trace context as follows:

....
for epoch in range(total_epochs):
    inputs = torch.randn(1,10).to(device)
    labels = torch.tensor([1]).to(device)
    with xp.Trace("model_build"):
        loss = model(inputs, labels)
    with xp.Trace("loss_backward"):
        loss.backward()
....

It should produce a profile that has the model_build and loss_backward section timed. This way you can time any block of script for debugging.

Image: Screen profile_zoom.png

Note: If you are running your training script in a docker container, to view the tensorboard, you should launch the docker container using flag: --network host eg. docker run --network host my_image:my_tag

Snapshotting#

Snapshotting models can be used to dump debug information that can then be sent to the Neuron team. Neuron execution relies on a series of compiled graphs. Internally, graph HLOs are used as an intermediate representation which is then compiled. Then, during execution, the graph inputs are passed to the Neuron runtime, which produces outputs using the compiled graph. Snapshotting saves the inputs to a graph execution, executes the graphs, saves the outputs of the execution, and then bundles and dumps the inputs, outputs and graph HLO in one file. This is illustrated here:

Image: snapshot-diagram.png

This feature can be enabled using the following environment variables, which can be set at the beginning of your script as follows.

....
os.environ["XLA_FLAGS"] = " --xla_dump_to=dump"
os.environ["NEURON_FRAMEWORK_DEBUG"] = "1"
os.environ["NEURON_DUMP_HLO_SNAPSHOT"] = "1"
....

This set of environment variables will produce snapshots under the dump folder with the extensions .hlo.snapshot.pb or .decomposed_hlo_snapshot at every iteration. For example a file that looks like the following would be produced.

dump/module_SyncTensorsGraph.387.pid_12643.execution_7496.hlo_snapshot.pb

The dumping environment variable can be set and unset at specific iterations as shown in the following example.

....
for step in range(STEPS):
    if step == 20:
        os.environ["NEURON_DUMP_HLO_SNAPSHOT"] = "1"
    else:
        os.environ.pop('NEURON_DUMP_HLO_SNAPSHOT', None)
    train_x = torch.randn(BATCH_SIZE, 28, 28)
    train_x = train_x.to(device)
    loss = model(train_x)
    loss.backward()
    optimizer.step()
    xm.mark_step()
....

Additionally, we provide capabilities to snapshot graphs automatically. The environment variables above can be set as follows:

....
os.environ["XLA_FLAGS"] = " --xla_dump_to=dump"
os.environ["NEURON_FRAMEWORK_DEBUG"] = "1"
os.environ["NEURON_DUMP_HLO_SNAPSHOT"] = "ON_NRT_ERROR"
....

When unexpected errors such as a graph execution producing NaNs occurs, snapshots will be automatically produced and execution will be terminated. Occasionally, for larger models, automatic snapshotting may not capture snapshots due to the device memory being exhausted. In this case, the above flag can be set to os.environ["NEURON_DUMP_HLO_SNAPSHOT"] = "ON_NRT_ERROR_HYBRID", this will allocate memory for inputs on both the device and host memory. In some additional cases, this may still go out of memory and may need to be set to os.environ["NEURON_DUMP_HLO_SNAPSHOT"] = "ON_NRT_ERROR_CPU" to avoid allocating any memory on the device at all for automatic snapshotting.

Snapshot FAQs:#

When should I use this features?

This feature should be used when debugging errors that requires interfacing with and providing debug data to the Neuron team. Snapshotting may be redundant and unnecessary in some situations. For example, when only the model weights are necessary for debugging, methods such as checkpointing may be more convenient to use.

What sort of data is captured with these snapshots?

The type of data captured by these snapshots may include model graphs in HLO form, weights/parameters, optimizer states, intermediate tensors and gradients. This data may be considered sensitive and this should be taken into account before sending the data to the Neuron team.

What is the size of these snapshots?

The size of snapshots can be significant for larger models such as GPT or BERT with several GBs worth of data for larger graphs, so it is recommended to check that sufficient disk space exists before using snapshotting. In addition, limiting the amount of snapshots taken in a run will help to preserve disk space.

Will snapshotting add overhead to my execution?

Snapshotting does add a small overhead to the execution in most cases. This overhead can be significant if snapshots are dumped at every iteration. In order to alleviate some of this overhead, in the case that snapshotting is not necessary on all cores the following environment variable can be set to collect snapshots only on the first core. In addition, checkpointing in tandem with snapshotting can be useful to reduce overhead. A checkpoint close to the problem iteration can be captured, then execution resumed with snapshots enabled.

....
os.environ["NEURON_NC0_ONLY_SNAPSHOT"] = "1"
....

How can I share snapshots with the Neuron team?

These snapshots can be shared with the Neuron team via S3 bucket.

This document is relevant for: Trn1