This document is relevant for: Trn1, Trn1n

Run Training in PyTorch Neuron Container#

Overview#

This tutorial demonstrates how to run a PyTorch container on a Trainium instance.

By the end of this tutorial you will be able to run a simple MLP training job inside the container.

You will use a trn1.2xlarge instance to test your Docker configuration for Trainium.

To list the Neuron devices available on your instance, run ls /dev/neuron*.
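As a minimal sketch of the device check above, the following shell function prints each Neuron device node it finds and prints nothing if there are none. The function name list_neuron_devices and the NEURON_DEV_DIR override are hypothetical and exist only for illustration; on a real instance the devices are under /dev.

```shell
# Print one line per Neuron device node; print nothing if none exist.
# NEURON_DEV_DIR is a hypothetical override used for illustration only;
# by default the function looks under /dev.
list_neuron_devices() {
    dir="${NEURON_DEV_DIR:-/dev}"
    for dev in "$dir"/neuron*; do
        # When nothing matches, the glob stays unexpanded, so check existence.
        [ -e "$dev" ] && echo "$dev"
    done
    return 0
}

list_neuron_devices
```

If the function prints nothing, no Neuron device is visible on the host, and the container will not see one either.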

Setup Environment#

  1. Launch a Trn1 instance.
  2. Set up the Docker environment according to Tutorial Docker environment setup.
  3. A sample Dockerfile for torch-neuron can be found here: Dockerfile for Application Container. This Dockerfile needs the MLP training script found here: Simple MLP train script.

With both files in the same directory, build the image with the following command:

docker build . -f Dockerfile.pt -t neuron-container:pytorch

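Run the following command to start the container and launch the training script. The --device=/dev/neuron0 option exposes the host's Neuron device to the container: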

docker run -it --name pt-cont --net=host --device=/dev/neuron0 neuron-container:pytorch python3 /opt/ml/mlp_train.py
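The command above passes a single device, /dev/neuron0. On an instance that exposes more than one Neuron device, docker run accepts the --device option multiple times, once per device. The sketch below builds those flags from whatever ls /dev/neuron* would report. The helper name neuron_device_flags is hypothetical, and the script only prints the resulting command rather than running it.

```shell
# Build one --device flag for each Neuron device node found in a directory.
# The directory argument defaults to /dev; it is a parameter only so the
# helper can be tried against a scratch directory.
neuron_device_flags() {
    dir="${1:-/dev}"
    flags=""
    for dev in "$dir"/neuron*; do
        # Skip the unexpanded glob when no device matches.
        [ -e "$dev" ] || continue
        flags="$flags --device=$dev"
    done
    # Drop the leading space left by the first append.
    echo "${flags# }"
}

# Print the command that would be run, reusing the image name and script
# path from this tutorial. Remove the leading "echo" to actually run it.
echo docker run -it --name pt-cont --net=host $(neuron_device_flags) neuron-container:pytorch python3 /opt/ml/mlp_train.py
```

On a host with only /dev/neuron0, the printed command matches the one shown above.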
