
Run Training in PyTorch Neuron Container


This document is relevant for: Trn1, Trn2


Overview

This tutorial demonstrates how to run a PyTorch container on a Trainium instance.

By the end of this tutorial, you will be able to run a simple MLP training job using the container.

You will use a trn1.2xlarge to test your Docker configuration for Trainium.

To list the Neuron devices available on your instance, run the command ls /dev/neuron*.
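
The device check above can be wrapped in a small helper so a script can count devices before starting a container. This is a minimal sketch: the function name list_neuron_devices and its optional directory argument are illustrative assumptions, not part of the Neuron tooling.

```shell
# List Neuron device nodes under a directory (default: /dev), one per line.
# list_neuron_devices is a hypothetical helper name used for illustration.
list_neuron_devices() {
    local dev_dir="${1:-/dev}"
    local dev
    for dev in "$dev_dir"/neuron*; do
        # When the glob matches nothing, the pattern is left unexpanded; skip it.
        [ -e "$dev" ] || continue
        printf '%s\n' "$dev"
    done
}

# Count the devices found on this machine.
count=$(list_neuron_devices | wc -l)
echo "Found ${count} Neuron device(s)"
```

If the count is zero, the Neuron driver is probably not installed or the instance type has no Neuron devices, so the container would have nothing to attach to.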

Setup Environment

1. Launch a Trn1 Instance
- Follow the instructions at Launch an Amazon EC2 Instance to launch an instance. When choosing the instance type in the EC2 console, make sure to select a Trn1 instance type, such as the trn1.2xlarge used in this tutorial.
- For more information about instance sizes and pricing, see the Trn1 web page, Inf2 web page, and Inf1 web page.
- Select your Amazon Machine Image (AMI) of choice. Note that Neuron supports Amazon Linux 2 AMI (HVM) - Kernel 5.10.
- When launching a Trn1 instance, set the primary EBS volume size to at least 512 GB.
- After launching the instance, follow the instructions in Connect to your instance to connect to it.
Note

If you encounter a connectivity issue during model loading on a Trn1 instance running Ubuntu, the likely cause is an Ubuntu limitation with multiple network interfaces. To resolve it, follow the steps mentioned here.

Users are strongly encouraged to launch instances with a DLAMI, since DLAMIs include the required fix.

2. Set up the Docker environment according to Tutorial Docker environment setup.

3. A sample Dockerfile for torch-neuron can be found at Dockerfile for Application Container. This Dockerfile needs the MLP training script found at Simple MLP train script.

With both files in the same directory, build the image with the following command:

docker build . -f Dockerfile.pt -t neuron-container:pytorch
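
Before building, it can help to confirm that the build directory actually contains both files. The sketch below assumes the training script is saved locally as mlp_train.py (inferred from the path used inside the container) and that the Dockerfile is named Dockerfile.pt as in the build command. The helper name check_build_context is hypothetical.

```shell
# Verify that a directory holds the files the build command expects.
# Assumption: the training script is saved locally as mlp_train.py;
# adjust the names below if your copies differ.
check_build_context() {
    local dir="${1:-.}"
    local f missing=0
    for f in Dockerfile.pt mlp_train.py; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    # Return 0 when every file is present, 1 otherwise.
    return "$missing"
}

if check_build_context .; then
    echo "Build context looks complete."
else
    echo "Add the missing files before running docker build." >&2
fi
```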

Run the following command to start the container:

docker run -it --name pt-cont --net=host --device=/dev/neuron0 neuron-container:pytorch python3 /opt/ml/mlp_train.py
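
The command above exposes only /dev/neuron0 to the container. On an instance that exposes more than one device node, each device needs its own --device flag. The following sketch generates those flags from whatever ls /dev/neuron* would list and prints the resulting command without running it. The helper name neuron_device_flags is hypothetical; the image name, container name, and script path are taken from the command above.

```shell
# Emit one --device flag per Neuron device node found under a directory.
# neuron_device_flags is a hypothetical helper, not part of Docker or Neuron.
neuron_device_flags() {
    local dev_dir="${1:-/dev}"
    local dev flags=""
    for dev in "$dev_dir"/neuron*; do
        # Skip the unexpanded pattern when no device nodes exist.
        [ -e "$dev" ] || continue
        flags="${flags:+$flags }--device=$dev"
    done
    printf '%s\n' "$flags"
}

# Print the command instead of running it, so it can be reviewed first.
# The unquoted substitution is intentional: each flag must be a separate word.
echo docker run -it --name pt-cont --net=host $(neuron_device_flags) \
    neuron-container:pytorch python3 /opt/ml/mlp_train.py
```

Remove the leading echo to launch the container once the printed command looks right.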
