This document is relevant for: Trn1, Trn1n
Run Training in PyTorch Neuron Container#
Overview#
This tutorial demonstrates how to run a PyTorch container on a Trainium instance.
By the end of this tutorial you will be able to run a simple MLP training job using the container.
You will use a trn1.2xlarge instance to test your Docker configuration for Trainium.
To list the Neuron devices available on your instance, run the following command: ls /dev/neuron*
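The device listing can also be turned into a count, which is handy when deciding how many devices to pass to a container. The following is a minimal sketch, not part of the Neuron tooling: the helper name count_neuron_devices and its optional directory argument are hypothetical, and it assumes every entry matching neuron* in the directory is a Neuron device node.

```shell
# Hypothetical helper: count entries named neuron* in a directory.
# The directory defaults to /dev; the argument exists only to make
# the function easy to try against a scratch directory.
count_neuron_devices() {
  local dev_dir="${1:-/dev}"
  local count=0
  local dev
  for dev in "$dev_dir"/neuron*; do
    # An unmatched glob stays literal, so skip paths that do not exist.
    [ -e "$dev" ] || continue
    count=$((count + 1))
  done
  printf '%s\n' "$count"
}

# Print the number of Neuron devices found under /dev.
count_neuron_devices
```

On an instance with no Neuron devices (or no driver installed) this prints 0, which is a quick signal that the container would have nothing to attach to.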
Setup Environment#
1. Launch a Trn1 instance.
2. Set up the Docker environment according to Tutorial Docker environment setup.
3. A sample Dockerfile for torch-neuron can be found in Dockerfile for Application Container. This Dockerfile needs the MLP training script found in Simple MLP train script.
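With the Dockerfile (saved as Dockerfile.pt, the name the command below expects) and the training script in the same directory, build the image with the following command: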
docker build . -f Dockerfile.pt -t neuron-container:pytorch
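Before building, it can help to confirm that the build context actually contains both files. The sketch below is illustrative only: the helper name check_build_context is hypothetical, and it assumes the training script is saved locally as mlp_train.py (the name it has inside the container in this tutorial); adjust the file names if yours differ.

```shell
# Hypothetical helper: verify that a directory holds the files the build needs.
# Assumes the Dockerfile is named Dockerfile.pt and the script mlp_train.py.
check_build_context() {
  local dir="${1:-.}"
  local f
  for f in Dockerfile.pt mlp_train.py; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      return 1
    fi
  done
  return 0
}

# Example usage (commented out so the snippet has no side effects):
# check_build_context . && docker build . -f Dockerfile.pt -t neuron-container:pytorch
```

Chaining the check with && means docker build only starts when both files are present, so a missing script is reported up front instead of partway through the build.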
Run the following command to start the container:
docker run -it --name pt-cont --net=host --device=/dev/neuron0 neuron-container:pytorch python3 /opt/ml/mlp_train.py
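The command above exposes a single device, /dev/neuron0, which matches a trn1.2xlarge. On an instance that reports more devices under /dev/neuron*, each one needs its own --device flag. The following sketch only prints the resulting command rather than running it; the helper name neuron_run_command is hypothetical, and it assumes devices are numbered consecutively from 0.

```shell
# Hypothetical helper: print a docker run command with one --device flag
# per Neuron device, assuming devices are /dev/neuron0 .. /dev/neuron(N-1).
neuron_run_command() {
  local count="$1"
  local flags=""
  local i=0
  while [ "$i" -lt "$count" ]; do
    flags="$flags --device=/dev/neuron$i"
    i=$((i + 1))
  done
  printf 'docker run -it --name pt-cont --net=host%s neuron-container:pytorch python3 /opt/ml/mlp_train.py\n' "$flags"
}

# With a count of 1 this reproduces the single-device command from this tutorial.
neuron_run_command 1
```

Printing the command first makes it easy to review the flags before starting the container. Whether the sample training script makes use of more than one device is not covered here.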