Troubleshooting Guide for MXNet-Neuron

Inference Runtime Error

Out-of-memory error when calling Symbol API bind() too many times

If you see an out-of-memory error when using the Symbol API’s bind() function, ensure that bind() is called only once for each desired model instance. For example, on inf1.xlarge, use the Symbol API to create 4 parallel instances of a model that was compiled to 1 NeuronCore (--neuroncore-pipeline-cores=1), each bound to a different mx.neuron(i) context, where i is the NeuronCore Group index ranging from 0 to 3. Then use 4 threads to feed the 4 instances in parallel. For example:

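import os
from concurrent import futures

import mxnet as mx

NUM_PARALLEL = 4  # inf1.xlarge: 4 NeuronCores, one model instance per core
# recfile_base (path to the RecordIO image file) is assumed to be defined earlier.

# Create NUM_PARALLEL NeuronCore Groups of one NeuronCore each.
os.environ['NEURONCORE_GROUP_SIZES'] = ','.join('1' for _ in range(NUM_PARALLEL))

# One data iterator per model instance, each reading its own partition.
data_iter = []
for i in range(NUM_PARALLEL):
    data_iter.append(mx.io.ImageRecordIter(
        path_imgrec=recfile_base, data_shape=(3, 224, 224), batch_size=1,
        num_parts=NUM_PARALLEL, part_index=i))

sym, args, auxs = mx.model.load_checkpoint('resnet-50_compiled', 0)

# Call bind() exactly once per model instance, each on its own NeuronCore Group.
exec_list = []
for i in range(NUM_PARALLEL):
    exe = sym.bind(ctx=mx.neuron(i), args=args, aux_states=auxs, grad_req='null')
    exec_list.append(exe)

def single_thread_infer(i):
    for batch in data_iter[i]:
        img = batch.data[0]
        label = batch.label
        feed_dict = {'data': img}
        exe = exec_list[i]
        exe.copy_params_from(feed_dict)  # load the input into the bound executor
        exe.forward()
        out = exe.outputs[0]

# Feed the instances in parallel, one thread per instance.
future_list = []
with futures.ThreadPoolExecutor(max_workers=NUM_PARALLEL) as executor:
    for i in range(NUM_PARALLEL):
        future_list.append(executor.submit(single_thread_infer, i))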
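
One caveat with the thread-pool pattern above: an exception raised inside a worker thread is stored on its future and only surfaces when result() is called, so a failing instance can go unnoticed. The following is a minimal sketch of collecting per-instance outcomes. The names fake_infer and run_all are hypothetical; fake_infer stands in for single_thread_infer so that the snippet runs without Neuron hardware.

```python
from concurrent import futures

NUM_PARALLEL = 4

def fake_infer(i):
    # Hypothetical stand-in for single_thread_infer(i); instance 2 fails on purpose.
    if i == 2:
        raise RuntimeError("instance %d failed" % i)
    return i

def run_all(worker, num_parallel):
    """Run worker(i) on num_parallel threads; return (results, errors) keyed by index."""
    results, errors = {}, {}
    with futures.ThreadPoolExecutor(max_workers=num_parallel) as executor:
        future_to_index = {executor.submit(worker, i): i for i in range(num_parallel)}
        for future in futures.as_completed(future_to_index):
            i = future_to_index[future]
            try:
                # result() re-raises any exception raised inside the worker.
                results[i] = future.result()
            except Exception as err:
                errors[i] = err
    return results, errors

results, errors = run_all(fake_infer, NUM_PARALLEL)
```

Here errors maps the index of each failed instance to its exception, which makes it easier to tell a single bad instance apart from a problem that affects all of them.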

Inference crashed with MXNetError: InferShapeKeyword argument name xyz not found

If inference fails with an MXNetError like the following:

mxnet.base.MXNetError: [11:55:39] src/c_api/ InferShapeKeyword argument name xyz not found."

The error message is followed by a list of “Candidate arguments”. This list shows all the input argument names that the model knows about, and ‘xyz’ is not among them. To fix this, remove the entry xyz from the feed dictionary.
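
As an illustration of the fix, the sketch below drops feed-dictionary entries whose names are not among the candidate arguments. The helper filter_feed_dict and the sample names are hypothetical; with MXNet, the candidate names could presumably be taken from sym.list_arguments(), but that is an assumption and not something this guide prescribes.

```python
def filter_feed_dict(feed_dict, candidate_names):
    """Split feed_dict into entries the model knows about and unknown keys."""
    known = set(candidate_names)
    kept = {name: value for name, value in feed_dict.items() if name in known}
    dropped = sorted(name for name in feed_dict if name not in known)
    return kept, dropped

# Hypothetical values for illustration only.
candidates = ['data', 'fc1_weight', 'fc1_bias']
feed, dropped = filter_feed_dict({'data': [0.0], 'xyz': [1.0]}, candidates)
```

Logging the dropped names instead of discarding them silently helps catch a misspelled input name, which would otherwise lead to a missing-input error later.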

Inference crashed at mx.nd.waitall() with MXNetError: Check failed: bin.dtype() == mshadow::kUint8

This error occurs when executing the Symbol API’s forward function followed by mx.nd.waitall(): an MXNetError exception is raised with ‘Check failed: bin.dtype() == mshadow::kUint8’.

Inference crashed with NRTD error 1002

During inference, the user may encounter an error with details “[NRTD:infer_wait] error: 1002”:

mxnet.base.MXNetError: [11:26:56] src/operator/subgraph/neuron/./neuron_util.h:1175: Check failed: rsp_wait.status().code() == 0 || rsp_wait.status().code() == 1003: Failed
Infer Wait with Neuron-RTD Error. Neuron-RTD Status Code: 1002, details: "[NRTD:infer_wait] error: 1002

Runtime errors are listed in Neuron Runtime return codes. In particular, 1002 means that invalid input was submitted to infer, e.g. some input tensors are missing or have incorrect sizes. Examine /var/log/syslog for more details on the error. For example, you may see:

Oct 30 19:13:39 ip-172-31-93-131 nrtd[1125]: [TDRV:io_queue_prepare_input_nonhugetlb] Unexpected input size, for data00, expected: 2097152, received: 33554432

This means that the input tensor size is larger than what the model was compiled for (i.e. the example input tensor shapes passed during compilation).
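
A size mismatch like this can be caught before the request reaches the runtime by comparing byte sizes on the client side. The sketch below is only illustrative: the shapes are hypothetical and float32 input (4 bytes per element) is assumed. They were chosen so that the sizes match the log line above, where the received size is 16 times the expected size; a batch of 16 sent to a model compiled for batch size 1 is one possible cause, not the only one.

```python
import math

def tensor_nbytes(shape, itemsize=4):
    """Size in bytes of a dense tensor; itemsize=4 assumes float32."""
    return math.prod(shape) * itemsize

def check_input(compiled_shape, actual_shape, itemsize=4):
    """Raise ValueError if the input size differs from the compiled size."""
    expected = tensor_nbytes(compiled_shape, itemsize)
    received = tensor_nbytes(actual_shape, itemsize)
    if expected != received:
        raise ValueError(
            "Unexpected input size, expected: %d, received: %d" % (expected, received))

# Hypothetical shapes for illustration only.
compiled_shape = (1, 128, 64, 64)
actual_shape = (16, 128, 64, 64)
expected_bytes = tensor_nbytes(compiled_shape)
received_bytes = tensor_nbytes(actual_shape)
```

Calling check_input(compiled_shape, actual_shape) with these values raises a ValueError, so the mismatch shows up in the application instead of in /var/log/syslog.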

Multi-Model Server

Failed to create NEURONCORE Group with GRPC Error. Status Error: 14, Error message: “Connect Failed”

If the client is unable to start workers and you get a message that MMS is unable to create NeuronCore Group, please check that Neuron RTD is running (neuron-rtd process).

"code": 500,
"type": "InternalServerException",
"message": "Failed to start workers"
2019-10-23 19:56:23,187 [INFO ] W-9000-squeezenet_v1.1_compiled-stdout - [19:56:23] src/operator/subgraph/inferentia/./inferentia_util.h:218: Check failed: status.ok() Failed to create NeuronCore Group with GRPC Error. Status Error: 14, Error message: "Connect Failed"

Multiple MMS workers die with “Backend worker process die.” message

If you run inference with MMS and get multiple “Backend worker process die” messages, ensure that the number of workers (“initial_workers”) passed when loading the model is less than or equal to the number of NeuronCores available divided by the number of NeuronCores required by the model.

Backend worker process die.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/mxnet/symbol/", line 1524, in simple_bind
    ctypes.byref(exe_handle)))
  File "/usr/local/lib/python3.6/site-packages/mxnet/", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [00:26:32] src/operator/subgraph/neuron/./neuron_util.h:221: Check failed: 0 == create_eg_rsp.status().code() Failed to create NeuronCore Group with KRTD Error. KRTD Status Code: 4, details: ""

As indicated in Application Note: Performance Tuning, for greater flexibility you can use NEURONCORE_GROUP_SIZES to specify the grouping of NeuronCores into Neuron devices, each device consisting of one or more NeuronCores. Each worker takes one device. The total number of NeuronCores taken by all the workers must be less than or equal to the total number of NeuronCores visible to neuron-rtd. This limit must hold at full load, when MMS has scaled up to max_workers. Additionally, to properly assign the model to a Neuron device, the environment variable NEURONCORE_GROUP_SIZES must be set within the model server class (i.e. in the example above). For example, add the following line within the model server class for a model compiled to 1 NeuronCore:

os.environ['NEURONCORE_GROUP_SIZES'] = '1'

More information about setting the max_worker limit can be found in the MMS Management API Documentation. For example, to run up to 4 workers on inf1.xlarge, where 4 NeuronCores are available to Neuron-RTD by default, set max_worker to 4:

curl -v -X PUT "http://localhost:8081/models/squeezenet_v1.1_compiled?min_worker=1&max_worker=4"
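
The worker limit described in this section is simple arithmetic: the number of visible NeuronCores divided by the number of NeuronCores each model instance needs, rounded down. The sketch below computes that limit and builds the corresponding scale-workers URL. The helper names are hypothetical, and the host, port, and model name are simply the ones used in the curl example above.

```python
from urllib.parse import urlencode

def max_workers_for(total_cores, cores_per_model):
    """Largest worker count whose NeuronCore usage fits in total_cores."""
    if cores_per_model < 1:
        raise ValueError("cores_per_model must be at least 1")
    return total_cores // cores_per_model

def scale_workers_url(base_url, model_name, min_worker, max_worker):
    """Build the management URL; urlencode joins the parameters with '&'."""
    query = urlencode({'min_worker': min_worker, 'max_worker': max_worker})
    return '%s/models/%s?%s' % (base_url, model_name, query)

# inf1.xlarge: 4 NeuronCores visible, model compiled to 1 NeuronCore.
limit = max_workers_for(4, 1)
url = scale_workers_url('http://localhost:8081', 'squeezenet_v1.1_compiled', 1, limit)
```

For a model compiled to 2 NeuronCores on the same instance, max_workers_for(4, 2) gives 2, so requesting 4 workers would exceed the available NeuronCores.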

MMS throws a “mxnet.base.MXNetError: array::at” error

If you see “mxnet.base.MXNetError: array::at” when running MMS, check that the NDArray and Gluon APIs are not used, as they are not supported in MXNet-Neuron.

[INFO ] W-9000-squeezenet_v1.1_compiled-stdout - array::at
[INFO ] W-9000-squeezenet_v1.1_compiled - Backend response time: 30
[INFO ] W-9000-squeezenet_v1.1_compiled-stdout - Traceback (most recent call last):
[INFO ] W-9000-squeezenet_v1.1_compiled-stdout -   File "/tmp/models/6606fa046f68a34df87f15362a7a2d9a49749878/", line 82, in handle
[INFO ] W-9000-squeezenet_v1.1_compiled-stdout -     data = self.inference(data)
[INFO ] W-9000-squeezenet_v1.1_compiled-stdout -   File "/tmp/models/6606fa046f68a34df87f15362a7a2d9a49749878/", line 153, in inference
[INFO ] W-9000-squeezenet_v1.1_compiled-stdout -     d.wait_to_read()
[INFO ] W-9000-squeezenet_v1.1_compiled-stdout -   File "/home/user/regression_venv_p3.6/lib/python3.6/site-packages/mxnet/ndarray/", line 1819, in wait_to_read
[INFO ] W-9000-squeezenet_v1.1_compiled-stdout -     check_call(_LIB.MXNDArrayWaitToRead(self.handle))
[INFO ] W-9000-squeezenet_v1.1_compiled-stdout -   File "/home/user/regression_venv_p3.6/lib/python3.6/site-packages/mxnet/", line 253, in check_call
[INFO ] W-9000-squeezenet_v1.1_compiled-stdout -     raise MXNetError(py_str(_LIB.MXGetLastError()))
[INFO ] W-9000-squeezenet_v1.1_compiled-stdout - mxnet.base.MXNetError: array::at
[INFO ] W-9000-squeezenet_v1.1_compiled-stdout - Invoking custom service failed.

MXNet Model Server is not able to clean up Neuron RTD states after model is unloaded

NOTE: This issue is resolved in the version released on 11/17/2020.

MXNet Model Server is not able to clean up Neuron RTD states after model is unloaded (deleted) from model server. Restarting the model server may fail with “Failed to create NEURONCORE_GROUP” error:

mxnet.base.MXNetError: [00:26:59] src/operator/subgraph/neuron/./neuron_util.h:348: Check failed:    0 == create_eg_rsp.status().code(): Failed to create NEURONCORE_GROUP with Neuron-RTD Error. Neuron-RTD Status Code: 9, details: ""

The workaround is to run “/opt/aws/neuron/bin/neuron-cli reset” to clear Neuron RTD states after all models are unloaded and the server is shut down, before restarting the model server.