This document is relevant for: Inf2, Trn1, Trn2, Trn3

How to migrate from implicit async mode to explicit async APIs#

Task overview#

This topic discusses how to migrate from the legacy implicit async execution mode (NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS) to the explicit async APIs (nrta_*) using the AWS Neuron SDK. The explicit async APIs provide fine-grained control over asynchronous execution, enabling higher device utilization through independent scheduling of compute, communication, and data transfer operations.

Prerequisites#

  • Neuron SDK version: Neuron SDK 2.29 or later with explicit async API support.

  • Familiarity with the synchronous NRT APIs: You should already have a working application using nrt_execute.

  • Header inclusion: Ensure your project includes nrt/nrt_async.h.

Note

If you are using Neuron exclusively through a framework (PyTorch Neuron, JAX Neuron, TensorFlow Neuron, etc.), the only action needed is to update the framework to a version that supports the explicit async APIs. The framework handles the runtime interaction on your behalf, and no application-level code changes are required.

Instructions#

1: Remove the implicit async environment variable

The legacy implicit async mode is controlled by the environment variable NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS. Remove it from your environment, launch scripts, and configuration files.

# Remove from your environment
unset NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS

With the explicit async APIs, inflight request depth is managed directly by your application logic rather than a global environment variable.

2: Replace nrt_execute with nrta_execute_schedule

The implicit mode made nrt_execute non-blocking. With explicit async, use nrta_execute_schedule which returns immediately and provides a sequence number for tracking completion.

// Before (implicit async): nrt_execute blocks or returns immediately depending on env var
NRT_STATUS ret = nrt_execute(model, input_set, output_set);

// After (explicit async): always non-blocking, returns a sequence number
NRT_STATUS exec_ret;
nrta_seq_t seq;
NRT_STATUS ret = nrta_execute_schedule(model, input_set, output_set, 0, &exec_ret, &seq);
if (ret == NRT_QUEUE_FULL) {
    // Queue is full — wait for completions before retrying
}

Note

The exec_ret parameter is populated with the execution result after the request completes on the device. You must retain this variable until completion is confirmed.

3: Add explicit completion tracking

Replace any implicit synchronization (which previously happened inside nrt_tensor_read/nrt_tensor_write or at queue-full boundaries) with explicit completion checks.

Option A: Polling

// Poll until a specific request completes
bool is_completed = false;
while (!is_completed) {
    nrta_is_completed(seq, &is_completed);
    usleep(1);
}

// Or check the last completed sequence on an execution unit
nrta_seq_t completed_seq;
nrta_get_sequence(lnc, NRTA_XU_COMPUTE, 0, &completed_seq);
if (completed_seq >= last_submitted_seq) {
    // All submitted work is done
}

Option B: Event-based (recommended for production)

#include <sys/eventfd.h>
#include <poll.h>

// Register an eventfd to be signaled on completion
int efd = eventfd(0, EFD_NONBLOCK);
nrta_event_register_seq_id_completion(seq, efd);

// Wait for signal via poll/epoll/select
struct pollfd pfd = { .fd = efd, .events = POLLIN };
poll(&pfd, 1, timeout_ms);

close(efd);

4: Implement per-request error handling

In implicit async mode, errors surfaced at the next blocking call. With explicit async, each scheduled request has its own NRT_STATUS* that is written upon completion.

static const int NUM_REQUESTS = 8;
NRT_STATUS exec_rets[NUM_REQUESTS];
nrta_seq_t req_seqs[NUM_REQUESTS];

// Submit multiple requests
for (int i = 0; i < NUM_REQUESTS; i++) {
    NRT_STATUS ret = nrta_execute_schedule(model, inputs, outputs, 0,
                                           &exec_rets[i], &req_seqs[i]);
    if (ret != NRT_SUCCESS) break;
}

// check completion
...

// After completion, check each request's status
for (int i = 0; i < NUM_REQUESTS; i++) {
    if (exec_rets[i] != NRT_SUCCESS) {
        fprintf(stderr, "Request %d failed: %d\n", i, exec_rets[i]);
    }
}

Note

If any schedule call returns NRT_EXEC_UNIT_UNRECOVERABLE, the execution unit has entered a fatal state. The application must reinitialize the runtime (typically by restarting the process).

5: Handle queue backpressure (NRT_QUEUE_FULL)

The implicit mode used NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS to cap concurrency. With explicit async, the runtime signals backpressure by returning NRT_QUEUE_FULL from nrta_execute_schedule when the execution unit’s queue cannot accept more requests. When this happens, your application must wait for at least one in-flight request to complete, then retry the schedule call.

NRT_STATUS exec_ret;
nrta_seq_t seq;
NRT_STATUS ret = nrta_execute_schedule(model, inputs, outputs, 0, &exec_ret, &seq);

if (ret == NRT_QUEUE_FULL) {
    // Wait for at least one completion before retrying
    nrta_seq_t completed;
    do {
        nrta_get_sequence(lnc, NRTA_XU_COMPUTE, 0, &completed);
    } while (completed < last_known_completed + 1);

    // Retry the schedule
    ret = nrta_execute_schedule(model, inputs, outputs, 0, &exec_ret, &seq);
}

Confirm your work#

To confirm you have successfully migrated to the explicit async APIs:

  1. Verify the environment variable is no longer set:

echo $NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS
# Should be empty
  1. Run your application and confirm that nrta_execute_schedule returns NRT_SUCCESS and sequence numbers increment monotonically.

  2. Verify that completion tracking works by checking that nrta_is_completed returns true after execution finishes, and that per-request NRT_STATUS values are NRT_SUCCESS.

Common issues#

Uh oh! Did you encounter an error or other issue while working through this task? Here are some commonly encountered issues and how to address them.

NRT_QUEUE_FULL returned from schedule calls

  • Possible solution: Your application is submitting work faster than the device can process it. Implement backpressure by waiting for completions (via polling or events) before submitting new requests. Reduce your inflight request cap.

Stale data read from tensors

  • Possible solution: You are reading tensor data before the execution that produces it has completed. Use nrta_is_completed or nrta_get_sequence to confirm the producing execution has finished before scheduling a tensor read.

NRT_EXEC_UNIT_UNRECOVERABLE

  • Possible solution: The execution unit has entered a fatal state due to a hardware error or timeout. Terminate and relaunch your application. In severe cases, the Neuron driver may need to be reloaded.

exec_ret status is not populated

  • Possible solution: The NRT_STATUS* passed to schedule calls is only written upon request completion, not at schedule time. Ensure you are checking the status only after confirming completion via nrta_is_completed, nrta_get_sequence, or an event notification.