This document is relevant for: Inf1, Trn1

Troubleshooting Neuron Containers#

This document describes how to fix issues you might encounter while using Neuron Containers. For each issue, it explains what happened and what can potentially correct it.

If your issue is not listed below or you have a more nuanced problem, contact us via issues posted to this repo, the AWS Neuron developer forum, or through AWS support.

A Neuron Container includes the following Neuron components. For issues relating to these components inside the container, refer to the individual component troubleshooting guides (see Troubleshooting Guide):

  • Neuron Runtime/Driver

  • PyTorch/TensorFlow/MXNet frameworks

  • Libfabric/EFA

The following are container-specific issues.

Neuron Device Not found#

The Neuron container expects the Neuron devices to be exposed to the container, as described in Devices.

Check the container logs for messages like the following:

2022-Sep-08 17:55:23.0768    19:19    ERROR  TDRV:tdrv_get_dev_info                       No neuron device available

If this message appears, the devices are not exposed to the container.
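As an illustration, the log check can be automated. The following is a minimal sketch, not part of the Neuron tooling: it assumes the container logs have already been captured as text (for example from `docker logs`) and that lines follow the layout of the sample message in this section. The function name `find_missing_device_errors` and the regular expression are hypothetical.

```python
import re

# Assumed line layout, inferred from the sample message in this section:
# <date> <time>  <pid>:<tid>  <LEVEL>  <COMPONENT>:<function>  <message>
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+\s+\S+)\s+(?P<pid_tid>\d+:\d+)\s+(?P<level>[A-Z]+)\s+"
    r"(?P<component>\w+):(?P<function>\w+)\s+(?P<message>.*)$"
)


def find_missing_device_errors(log_text):
    """Return parsed log entries that report no Neuron device being available."""
    hits = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and "No neuron device available" in match.group("message"):
            hits.append(match.groupdict())
    return hits
```

Any entry returned by this helper suggests that the devices were not passed through to the container.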

Solution#

  • Refer to Devices and make sure the devices are exposed to the container.

  • If specific cores are being used, refer to Cores and make sure the cores are exposed to the container.

  • In a Kubernetes environment, refer to k8s-specify-devices or k8s-specify-cores to make sure the Neuron devices/cores are present in the pod's container spec.
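One way to confirm that the devices are actually exposed is to list the device nodes from inside the running container. The sketch below assumes that Neuron devices appear as `/dev/neuron<N>` (for example `/dev/neuron0`); the helper name `visible_neuron_device_ids` is hypothetical and the script is not part of the Neuron SDK.

```python
import glob
import re


def visible_neuron_device_ids(dev_dir="/dev"):
    """Return the sorted IDs of device nodes named neuron<N> under dev_dir.

    Assumption: Neuron devices are exposed as <dev_dir>/neuron<N>.
    """
    ids = []
    for path in glob.glob(f"{dev_dir}/neuron*"):
        name = path.rsplit("/", 1)[-1]
        match = re.fullmatch(r"neuron(\d+)", name)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)


if __name__ == "__main__":
    ids = visible_neuron_device_ids()
    if ids:
        print("Neuron devices visible in this container:", ids)
    else:
        print("No Neuron devices are visible in this container.")
```

Running this inside the container and getting an empty list is consistent with the "No neuron device available" error.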

Contiguous Device IDs#

The Neuron runtime expects the Inferentia/Trainium device IDs to be contiguous. If the device IDs are not contiguous, you might see error messages like the following:

2022-Sep-08 21:52:11.0307     7:7     ERROR  TDRV:tdrv_init_mla_phase1                    Could not open the nd1
2022-Sep-08 23:00:05.0667     8:8     ERROR   NRT:nrt_allocate_neuron_cores               Neuron cores are not contiguous

Solution#

  • In the docker run command, make sure the devices specified using --device are all contiguous.

  • If the OCI Neuron hook is used together with the environment variable AWS_NEURON_VISIBLE_DEVICES, make sure the devices specified are all contiguous.

  • In a Kubernetes environment with only the Neuron device plugin running, there is no guarantee that the allocated devices will be contiguous. Make sure to run the Neuron scheduler extension as specified in neuron-k8-scheduler-ext.
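A device list can be validated for contiguity before the container is launched. The following is a minimal sketch under stated assumptions: the value of AWS_NEURON_VISIBLE_DEVICES is taken to be a comma-separated list of integer device IDs, devices are assumed to be named `/dev/neuron<N>`, and all function names are hypothetical.

```python
def parse_device_ids(value):
    """Parse a comma-separated list of device IDs such as "0,1,2".

    Assumption: the value contains only integer IDs separated by commas.
    """
    return sorted(int(part) for part in value.split(",") if part.strip())


def is_contiguous(ids):
    """Return True if the IDs form an unbroken run such as 2, 3, 4."""
    ordered = sorted(set(ids))
    return bool(ordered) and ordered[-1] - ordered[0] + 1 == len(ordered)


def docker_device_args(ids):
    """Build --device arguments for docker run, rejecting non-contiguous IDs.

    Assumption: device nodes are named /dev/neuron<N>.
    """
    if not is_contiguous(ids):
        raise ValueError(f"device IDs are not contiguous: {sorted(set(ids))}")
    return [f"--device=/dev/neuron{i}" for i in sorted(set(ids))]
```

For example, `docker_device_args(parse_device_ids("0,1"))` yields arguments for devices 0 and 1, while a list such as `"0,2"` raises an error before the container starts.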
