.. _container-troubleshooting: Troubleshooting Neuron Containers ================================= This document aims to provide more information on how to fix issues you might encounter while using the Neuron Containers. For each issue we will provide an explanation of what happened and what can potentially correct the issue. If your issue is not listed below or you have a more nuanced problem, contact us via `issues `__ posted to this repo, the `AWS Neuron developer forum `__, or through AWS support. Neuron Container includes the following Neuron Components. For issues relating to these components inside the container refer the individual component troubleshooting guides :ref:`general-troubleshooting` * Neuron Runtime/Driver * Pytorch/Tenosrflow/MXNet frameworks * Libfabric/EFA The following are container specific issues Neuron Device Not found ----------------------- The neuron container expects the neuron devices to be exposed to the container as referenced in :ref:`container-devices`. Please look at the container logs to see messages like below :: 2022-Sep-08 17:55:23.0768 19:19 ERROR TDRV:tdrv_get_dev_info No neuron device available If the above message is seen then devices are not exposed to container Solution '''''''' * Refer :ref:`container-devices` and make sure the devices are exposed to container * If specific cores are being used refer :ref:`container-cores` and make sure the cores are exposed to container * In kubernetes environment refer :ref:`k8s-specify-devices` or :ref:`k8s-specify-cores` to make sure neuron devices/cores are there in pods container spec Contiguous Device ID's ----------------------- Neuron runtime expects the inferentia/trainium device id's to be contigious. If the device id's are not contiguous you might see error messages like below :: 2022-Sep-08 21:52:11.0307 7:7 ERROR TDRV:tdrv_init_mla_phase1 Could not open the nd1 :: 2022-Sep-08 23:00:05.0667 8:8 ERROR NRT:nrt_allocate_neuron_cores Neuron cores are not contiguous Solution '''''''' * In the docker run command make sure the devices specified using --device are all contiguous * If oci neuron hook is used and the env variable AWS_NEURON_VISIBLE_DEVICES is used then make sure the devices specified are all contiguous * In kubernetes environment with just the neuron device plugin running there is no guarantee that the devices allocated will be contiguous. Make sure to run the neuron scheduler extension as specified in :ref:`neuron-k8-scheduler-ext`