Neuron Runtime Troubleshooting on Inf1
This document explains how to fix issues you might encounter while using Neuron Runtime 2.x or above. For each issue, it describes what happened and what can potentially correct it.
aws-neuron-dkms is a driver package that must be compiled during
installation. The compilation requires kernel headers for the instance’s
kernel. uname -r reports the kernel version of the instance.
In some cases, the installed kernel headers might be newer than the
instance’s kernel itself.
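A quick way to see whether headers exist for the running kernel is sketched below. It assumes the common convention that build files for a kernel are reachable through /lib/modules/<release>/build. The function name and its optional root-prefix argument are hypothetical, added only to make the check easy to try out.

```shell
# Sketch: succeed if build files (headers) exist for the given kernel release.
# Assumption: headers are reachable through /lib/modules/<release>/build.
# The optional second argument (a root prefix) is hypothetical, for testing only.
headers_present_for() {
    release="$1"
    root="${2:-}"
    # -e follows symlinks, so a dangling "build" link counts as missing.
    [ -e "${root}/lib/modules/${release}/build" ]
}

# Usage on the instance:
#   headers_present_for "$(uname -r)" || echo "no headers for $(uname -r)"
```

If the check fails for the release printed by uname -r, the driver build was most likely skipped for that kernel.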
Look in the aws-neuron-dkms installation log for a message like the following:
Building for 4.14.193-149.317.amzn2.x86_64
Module build for kernel 4.14.193-149.317.amzn2.x86_64 was skipped since the kernel headers for this kernel does not seem to be installed.
If the installation log is not available, check whether the module is loaded:
$ lsmod | grep neuron
If the command produces no output, the installation failed.
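The same check can be scripted so that it returns an exit status instead of relying on visual inspection. The sketch below reads lsmod-style output on stdin and looks only at the module-name column. The function name is illustrative, and matching on the substring "neuron" simply mirrors the grep above.

```shell
# Sketch: succeed if lsmod-style output on stdin lists a module whose name
# contains "neuron". Only the first column (module name) is inspected, so a
# match in the "Used by" column of another module is ignored.
neuron_module_loaded() {
    awk '$1 ~ /neuron/ { found = 1 } END { exit found ? 0 : 1 }'
}

# Usage on the instance:
#   lsmod | neuron_module_loaded && echo "driver loaded" || echo "driver not loaded"
```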
Stop all applications using the NeuronCores.
Uninstall aws-neuron-dkms:
sudo apt remove aws-neuron-dkms
or
sudo yum remove aws-neuron-dkms
Install kernel headers for the current kernel:
sudo apt install -y linux-headers-$(uname -r)
or
sudo yum install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
Install aws-neuron-dkms:
sudo apt install aws-neuron-dkms
or
sudo yum install aws-neuron-dkms
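The steps above can be collected into a small dry-run helper that prints the commands for one package manager without executing anything. This is only a sketch: the function name and its arguments are illustrative and not part of any Neuron tool, and the commands themselves are the ones listed above.

```shell
# Sketch: print (not run) the reinstall commands for one package manager.
# Arguments: $1 = "apt" or "yum", $2 = kernel release (e.g. output of uname -r).
neuron_dkms_reinstall_cmds() {
    pkg_mgr="$1"
    kernel="$2"
    case "$pkg_mgr" in
        apt)
            echo "sudo apt remove aws-neuron-dkms"
            echo "sudo apt install -y linux-headers-${kernel}"
            echo "sudo apt install aws-neuron-dkms"
            ;;
        yum)
            echo "sudo yum remove aws-neuron-dkms"
            echo "sudo yum install -y kernel-devel-${kernel} kernel-headers-${kernel}"
            echo "sudo yum install aws-neuron-dkms"
            ;;
        *)
            echo "unsupported package manager: ${pkg_mgr}" >&2
            return 1
            ;;
    esac
}

# Usage on the instance (review the output before running it):
#   neuron_dkms_reinstall_cmds yum "$(uname -r)"
```

Printing the commands first makes it easy to confirm that the kernel release in the header package names matches the running kernel.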
Neuron Runtime requires the Neuron Driver (aws-neuron-dkms package) to access Neuron devices. If the driver is not installed, Neuron Runtime will not be able to access the Neuron devices and will fail with an error message in the console and syslog.
If aws-neuron-dkms is not installed, the error message looks like the following:
2021-Aug-11 18:38:27.0917 13713:13713 ERROR NRT:nrt_init Unable to determine Neuron Driver version. Please check aws-neuron-dkms package is installed.
If aws-neuron-dkms is installed but does not support the latest runtime, the error message looks like the following:
2021-Aug-11 19:18:21.0661 24616:24616 ERROR NRT:nrt_init This runtime requires Neuron Driver version 2.0 or greater. Please upgrade aws-neuron-dkms package.
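When scanning logs, the two nrt_init messages above can be told apart by their text. The sketch below maps a log line on stdin to the action each message implies. The function name and the labels it prints (install-driver, upgrade-driver, unknown) are made up for this example.

```shell
# Sketch: classify an nrt_init error line read from stdin.
# The printed labels are hypothetical, chosen for this example only.
classify_nrt_init_error() {
    while IFS= read -r line; do
        case "$line" in
            *"Unable to determine Neuron Driver version"*)
                echo "install-driver"; return 0 ;;
            *"This runtime requires Neuron Driver version"*)
                echo "upgrade-driver"; return 0 ;;
        esac
    done
    echo "unknown"
    return 1
}

# Usage on the instance (the log path is an assumption and varies by distribution):
#   grep 'NRT:nrt_init' /var/log/syslog | classify_nrt_init_error
```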
When using any supported framework from Neuron SDK version 2.5.0 with Neuron Driver (aws-neuron-dkms) version 2.4 or older, Neuron Runtime returns the following error message:
2022-Dec-01 09:34:12.0559 138:138 ERROR HAL:aws_hal_tpb_pooling_write_profile failed programming the engine
A NeuronCore cannot be shared between two applications. Once an application starts using a NeuronCore, all other applications trying to use that NeuronCore fail during runtime initialization with the following message in the console and in syslog:
2021-Aug-27 23:22:12.0323 28078:28078 ERROR NRT:nrt_allocate_neuron_cores NeuronCore(s) not available - Requested:nc1-nc1 Available:0
While loading a model (NEFF), Neuron Runtime checks version compatibility. If the NEFF version is incompatible with the Runtime, the model load fails with the following error message:
NEFF version mismatch supported: 1.1 received: 2.0
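The two versions in this message indicate which side is behind. The sketch below extracts them and compares them. It assumes both values are single dotted numbers as in the sample, that a higher received version means the NEFF came from a compiler newer than the Runtime supports, and that a sort supporting -V (version sort, as in GNU coreutils) is available. The function name and output labels are illustrative.

```shell
# Sketch: read a "NEFF version mismatch" line on stdin and compare the versions.
# Assumptions: dotted numeric versions; sort -V (version sort) is available.
neff_mismatch_hint() {
    line="$(cat)"
    supported="$(printf '%s\n' "$line" | sed -n 's/.*supported: *\([0-9.]*\).*/\1/p')"
    received="$(printf '%s\n' "$line" | sed -n 's/.*received: *\([0-9.]*\).*/\1/p')"
    # Give up if either version could not be extracted.
    [ -n "$supported" ] && [ -n "$received" ] || return 1
    newest="$(printf '%s\n%s\n' "$supported" "$received" | sort -V | tail -n 1)"
    if [ "$supported" = "$received" ]; then
        echo "versions-match"
    elif [ "$newest" = "$received" ]; then
        echo "neff-newer-than-runtime"
    else
        echo "neff-older-than-runtime"
    fi
}
```

A result of neff-newer-than-runtime suggests the Runtime is the component to update; the opposite result suggests recompiling the model with a newer compiler.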
Use compatible versions of Neuron Compiler and Runtime. Updating both Neuron Compiler and Neuron Runtime to the latest version is the simplest solution. If updating one of the two is not an option, refer to the Neuron Runtime release notes to determine NEFF version support.
While loading a model (NEFF), Neuron Runtime reserves both device and host memory for storing the weights, ifmap and ofmap of the model. Memory consumption differs from model to model. If Neuron Runtime is unable to allocate memory, the model load fails with the following message in syslog:
kernel: [XXXXX] neuron:mc_alloc: device mempool [0:0] total 1073741568 occupied 960539030 needed 1272 available 768
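The needed and available figures in this message show how far the allocation fell short. The sketch below extracts them and prints the difference. It assumes the fields appear as name/integer pairs exactly as in the sample line, and the function name is illustrative.

```shell
# Sketch: print needed, available and their difference for each
# neuron:mc_alloc line on stdin.
# Assumption: fields appear as "<name> <integer>" pairs, as in the sample.
mc_alloc_shortfall() {
    awk '/neuron:mc_alloc/ {
        needed = 0; available = 0
        for (i = 1; i < NF; i++) {
            if ($i == "needed")    needed = $(i + 1)
            if ($i == "available") available = $(i + 1)
        }
        printf "needed=%d available=%d shortfall=%d\n", needed, available, needed - available
    }'
}

# Usage on the instance:
#   dmesg | mc_alloc_shortfall
```

For the sample message, the helper prints needed=1272 available=768 shortfall=504.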
The NEFF requires more NeuronCores than available on the instance.
Check for error messages in syslog similar to:
NRT: 26638:26638 ERROR TDRV:db_vtpb_get_mla_and_tpb Could not find VNC id n
NRT: 26638:26638 ERROR NMGR:dlr_kelf_stage Failed to create shared io
NRT: 26638:26638 ERROR NMGR:stage_kelf_models Failed to stage graph: kelf-a.json to NeuronCore
NRT: 26638:26638 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: xxxxxxx, err: 2
Neuron devices detect any NaN generated during execution and report it. If Neuron Runtime sees that NaNs were generated, it fails the execution request with a numerical error and the following message:
nrtd[nnnnn]: .... Error notifications found on NC .... INFER_ERROR_SUBTYPE_NUMERICAL