This document is relevant for: Inf2, Trn1, Trn2, Trn3
How to migrate from implicit async mode to explicit async APIs
Task overview
This topic discusses how to migrate from the legacy implicit async execution mode
(NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS) to the explicit async APIs (nrta_*)
using the AWS Neuron SDK. The explicit async APIs provide fine-grained control over
asynchronous execution, enabling higher device utilization through independent scheduling
of compute, communication, and data transfer operations.
Prerequisites
Neuron SDK version: Neuron SDK 2.29 or later with explicit async API support.
Familiarity with the synchronous NRT APIs: You should already have a working application that uses nrt_execute.
Header inclusion: Ensure your project includes nrt/nrt_async.h.
Note
If you are using Neuron exclusively through a framework (PyTorch Neuron, JAX Neuron, TensorFlow Neuron, etc.), the only action needed is to update the framework to a version that supports the explicit async APIs. The framework handles the runtime interaction on your behalf, and no application-level code changes are required.
Instructions
1: Remove the implicit async environment variable
The legacy implicit async mode is controlled by the environment variable
NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS. Remove it from your environment,
launch scripts, and configuration files.
# Remove from your environment
unset NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS
With the explicit async APIs, inflight request depth is managed directly by your application logic rather than a global environment variable.
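Because the in-flight cap now lives in application code, one simple way to enforce it is a small counter that is checked before each schedule call and decremented after each confirmed completion. The following is a minimal sketch; the `inflight_window` type and its functions are hypothetical helper names, not part of the Neuron runtime, and the sketch is not thread-safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of the Neuron runtime): tracks how many
 * requests the application has scheduled but not yet seen complete. */
typedef struct {
    size_t max_inflight;  /* application-chosen cap */
    size_t inflight;      /* requests currently outstanding */
} inflight_window;

static void inflight_window_init(inflight_window *w, size_t max_inflight) {
    assert(w != NULL && max_inflight > 0);
    w->max_inflight = max_inflight;
    w->inflight = 0;
}

/* Call before scheduling a request; returns false when the cap is reached. */
static bool inflight_window_try_acquire(inflight_window *w) {
    if (w->inflight >= w->max_inflight) {
        return false;
    }
    w->inflight++;
    return true;
}

/* Call once per request after its completion has been confirmed. */
static void inflight_window_release(inflight_window *w) {
    assert(w->inflight > 0);
    w->inflight--;
}
```

An application would call `inflight_window_try_acquire` before each schedule call and `inflight_window_release` after each confirmed completion. If several threads submit work, the counter needs a mutex or atomic operations.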
2: Replace nrt_execute with nrta_execute_schedule
The implicit mode made nrt_execute non-blocking. With explicit async, use
nrta_execute_schedule instead. It returns immediately and provides a sequence number
that you use to track completion.
// Before (implicit async): nrt_execute blocks or returns immediately depending on env var
NRT_STATUS ret = nrt_execute(model, input_set, output_set);
// After (explicit async): always non-blocking, returns a sequence number
NRT_STATUS exec_ret;
nrta_seq_t seq;
NRT_STATUS ret = nrta_execute_schedule(model, input_set, output_set, 0, &exec_ret, &seq);
if (ret == NRT_QUEUE_FULL) {
    // Queue is full: wait for completions before retrying
}
Note
The exec_ret parameter is populated with the execution result after the request
completes on the device. You must retain this variable until completion is confirmed.
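The lifetime requirement is easy to violate when the status variable lives on the stack of a function that returns before the request completes. One way to avoid that is to keep the status and the sequence number together in a per-request record that the application frees only after completion. The sketch below illustrates the idea with stand-ins: `demo_status`, `demo_seq`, `demo_schedule`, and `demo_complete` are hypothetical mocks that only imitate the "status is written later through the pointer" behavior described above, so the example compiles without the Neuron headers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-ins for NRT_STATUS and nrta_seq_t (assumptions for this sketch). */
typedef int demo_status;
typedef uint64_t demo_seq;
#define DEMO_SUCCESS 0
#define DEMO_PENDING (-1)

/* Per-request record: owns the status slot for the whole request lifetime. */
typedef struct {
    demo_status exec_ret;
    demo_seq seq;
} request_ctx;

static demo_status *g_pending_status = NULL;  /* where the mock "device" writes */
static demo_seq g_next_seq = 1;               /* arbitrary starting value */

/* Mock scheduler: remembers the status pointer and returns immediately. */
static demo_status demo_schedule(demo_status *exec_ret, demo_seq *seq) {
    g_pending_status = exec_ret;
    *seq = g_next_seq++;
    return DEMO_SUCCESS;
}

/* Mock completion: writes the result through the retained pointer. */
static void demo_complete(void) {
    assert(g_pending_status != NULL);
    *g_pending_status = DEMO_SUCCESS;
}

/* Allocates the record on the heap so it outlives this function. */
static request_ctx *submit_request(void) {
    request_ctx *ctx = malloc(sizeof *ctx);
    if (ctx == NULL) {
        return NULL;
    }
    ctx->exec_ret = DEMO_PENDING;
    if (demo_schedule(&ctx->exec_ret, &ctx->seq) != DEMO_SUCCESS) {
        free(ctx);
        return NULL;
    }
    return ctx;  /* caller frees only after completion is confirmed */
}
```

A preallocated pool of records would serve the same purpose without a heap allocation per request.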
3: Add explicit completion tracking
Replace any implicit synchronization (which previously happened inside nrt_tensor_read/nrt_tensor_write
or at queue-full boundaries) with explicit completion checks.
Option A: Polling
#include <stdbool.h>
#include <unistd.h>  // usleep
// Poll until a specific request completes
bool is_completed = false;
while (!is_completed) {
    nrta_is_completed(seq, &is_completed);
    usleep(1);
}
// Or check the last completed sequence on an execution unit.
// last_submitted_seq is the sequence number returned by your most recent schedule call.
nrta_seq_t completed_seq;
nrta_get_sequence(lnc, NRTA_XU_COMPUTE, 0, &completed_seq);
if (completed_seq >= last_submitted_seq) {
    // All submitted work is done
}
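A polling loop without a deadline can spin forever if a request never completes. The sketch below shows one possible shape for a bounded poll with a growing sleep interval. It is generic: `is_done_fn` is a hypothetical callback type that a real application might implement as a thin wrapper around nrta_is_completed, and `countdown_done` is only a demonstration predicate.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical callback: returns true once the request is complete. */
typedef bool (*is_done_fn)(void *arg);

static long long now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Polls is_done until it returns true or timeout_ms elapses.
 * Sleeps between polls, doubling the interval from 1 us up to 1 ms. */
static bool poll_until_done(is_done_fn is_done, void *arg, int timeout_ms) {
    assert(is_done != NULL);
    const long long deadline = now_ns() + (long long)timeout_ms * 1000000LL;
    long sleep_ns = 1000;
    for (;;) {
        if (is_done(arg)) {
            return true;
        }
        if (now_ns() >= deadline) {
            return false;
        }
        struct timespec ts = { .tv_sec = 0, .tv_nsec = sleep_ns };
        nanosleep(&ts, NULL);
        sleep_ns *= 2;
        if (sleep_ns > 1000000) {
            sleep_ns = 1000000;
        }
    }
}

/* Demonstration predicate: reports completion after a fixed number of polls. */
static bool countdown_done(void *arg) {
    int *remaining = arg;
    if (*remaining > 0) {
        (*remaining)--;
        return false;
    }
    return true;
}
```

The backoff trades a little completion latency for lower CPU use; latency-sensitive applications may prefer a shorter cap or the event-based option.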
Option B: Event-based (recommended for production)
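#include <sys/eventfd.h>
#include <poll.h>
#include <unistd.h>  // close
// Register an eventfd to be signaled on completion
int efd = eventfd(0, EFD_NONBLOCK);
nrta_event_register_seq_id_completion(seq, efd);
// Wait for signal via poll/epoll/select (timeout_ms is chosen by your application)
struct pollfd pfd = { .fd = efd, .events = POLLIN };
int n = poll(&pfd, 1, timeout_ms);
if (n == 0) {
    // Timed out: the request has not completed yet
} else if (n < 0) {
    // poll failed; inspect errno
}
close(efd);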
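The wait side of the event-based approach can be exercised without a device, because an eventfd behaves the same no matter who signals it. The sketch below wraps the poll call in a hypothetical helper, `wait_for_event`, that distinguishes a signal from a timeout and from an error. Draining the counter after a signal is an assumption about how an application might reuse the descriptor; the source does not state whether the runtime expects this.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper: waits up to timeout_ms for efd to be signaled.
 * Returns 1 if signaled, 0 on timeout, -1 on error (see errno). */
static int wait_for_event(int efd, int timeout_ms) {
    assert(efd >= 0);
    struct pollfd pfd = { .fd = efd, .events = POLLIN };
    int rc = poll(&pfd, 1, timeout_ms);
    if (rc <= 0) {
        return rc;
    }
    /* Drain the 8-byte counter so the descriptor can be waited on again. */
    uint64_t count = 0;
    if (read(efd, &count, sizeof count) != (ssize_t)sizeof count) {
        return -1;
    }
    return 1;
}
```

In a test, writing a `uint64_t` value of 1 to the eventfd stands in for the completion signal that the runtime would deliver after nrta_event_register_seq_id_completion.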
4: Implement per-request error handling
In implicit async mode, errors surfaced at the next blocking call. With explicit async,
each scheduled request has its own NRT_STATUS* that is written upon completion.
#include <stdio.h>
enum { NUM_REQUESTS = 8 };  // Compile-time constant, usable as an array size in C
NRT_STATUS exec_rets[NUM_REQUESTS];
nrta_seq_t req_seqs[NUM_REQUESTS];
int num_submitted = 0;
// Submit multiple requests
for (int i = 0; i < NUM_REQUESTS; i++) {
    NRT_STATUS ret = nrta_execute_schedule(model, inputs, outputs, 0,
                                           &exec_rets[i], &req_seqs[i]);
    if (ret != NRT_SUCCESS) break;  // Request i was not scheduled; exec_rets[i] stays uninitialized
    num_submitted++;
}
// check completion
...
// After completion, check the status of each request that was actually submitted
for (int i = 0; i < num_submitted; i++) {
    if (exec_rets[i] != NRT_SUCCESS) {
        fprintf(stderr, "Request %d failed: %d\n", i, (int)exec_rets[i]);
    }
}
Note
If any schedule call returns NRT_EXEC_UNIT_UNRECOVERABLE, the execution unit has
entered a fatal state. The application must reinitialize the runtime (typically by
restarting the process).
5: Handle queue backpressure (NRT_QUEUE_FULL)
The implicit mode used NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS to cap concurrency.
With explicit async, the runtime signals backpressure by returning NRT_QUEUE_FULL from
nrta_execute_schedule when the execution unit’s queue cannot accept more requests.
When this happens, your application must wait for at least one in-flight request to
complete, then retry the schedule call.
NRT_STATUS exec_ret;
nrta_seq_t seq;
NRT_STATUS ret = nrta_execute_schedule(model, inputs, outputs, 0, &exec_ret, &seq);
if (ret == NRT_QUEUE_FULL) {
    // Wait for at least one completion before retrying.
    // last_known_completed is the latest completed sequence number your application has observed.
    nrta_seq_t completed;
    do {
        nrta_get_sequence(lnc, NRTA_XU_COMPUTE, 0, &completed);
    } while (completed <= last_known_completed);
    last_known_completed = completed;
    // Retry the schedule
    ret = nrta_execute_schedule(model, inputs, outputs, 0, &exec_ret, &seq);
}
Confirm your work
To confirm you have successfully migrated to the explicit async APIs:
Verify the environment variable is no longer set:
echo $NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS
# Should be empty
Run your application and confirm that nrta_execute_schedule returns NRT_SUCCESS and sequence numbers increment monotonically.
Verify that completion tracking works by checking that nrta_is_completed returns true after execution finishes, and that per-request NRT_STATUS values are NRT_SUCCESS.
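The environment check can also be made at application startup, which catches launch scripts that still export the legacy variable. The following is a minimal sketch using only the C standard library; `env_var_is_set` and `warn_if_legacy_async_env` are hypothetical helper names, and treating an empty value as "not set" is an assumption that mirrors the echo check above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns true if the variable exists and has a non-empty value. */
static bool env_var_is_set(const char *name) {
    assert(name != NULL);
    const char *value = getenv(name);
    return value != NULL && value[0] != '\0';
}

/* Prints a warning and returns true if the legacy variable is still set. */
static bool warn_if_legacy_async_env(void) {
    static const char *const legacy = "NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS";
    if (!env_var_is_set(legacy)) {
        return false;
    }
    fprintf(stderr,
            "warning: %s is still set; remove it before using the explicit async APIs\n",
            legacy);
    return true;
}
```

Calling `warn_if_legacy_async_env()` once, early in `main`, is enough; whether to continue or abort when it returns true is an application decision.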
Common issues
Uh oh! Did you encounter an error or other issue while working through this task? Here are some commonly encountered issues and how to address them.
NRT_QUEUE_FULL returned from schedule calls
Possible solution: Your application is submitting work faster than the device can process it. Implement backpressure by waiting for completions (via polling or events) before submitting new requests. If the queue fills up frequently, lower the number of requests your application keeps in flight.
Stale data read from tensors
Possible solution: You are reading tensor data before the execution that produces it has completed. Use nrta_is_completed or nrta_get_sequence to confirm the producing execution has finished before scheduling a tensor read.
NRT_EXEC_UNIT_UNRECOVERABLE
Possible solution: The execution unit has entered a fatal state due to a hardware error or timeout. Terminate and relaunch your application. In severe cases, the Neuron driver may need to be reloaded.
exec_ret status is not populated
Possible solution: The NRT_STATUS* passed to schedule calls is only written upon request completion, not at schedule time. Ensure you are checking the status only after confirming completion via nrta_is_completed, nrta_get_sequence, or an event notification.