This document is relevant for: Inf2, Trn1, Trn1n
Train your model on EC2#
Description#
You can use a single Trn1 instance as a development environment to compile and train Neuron models. In this developer flow, you provision an EC2 Trn1 instance using a Deep Learning AMI (DLAMI) and execute both steps of the development flow, compilation and training, on the same instance. The DLAMI comes pre-packaged with the Neuron frameworks, compiler, and required runtimes to complete the flow. Development happens through Jupyter notebooks or over a secure shell (SSH) connection in a terminal. Follow the steps below to set up your environment.
Setup Environment#
1. Launch a Trn1 Instance#
Follow the instructions at launch an Amazon EC2 Instance to launch a Trn1 instance. When choosing the instance type in the EC2 console, make sure to select the correct instance type. For more information about Trn1 instance sizes and pricing, see the Trn1 web page.
Select your Amazon Machine Image (AMI) of choice. Neuron supports the Ubuntu 18 AMI and the Amazon Linux 2 AMI; you can also choose an Ubuntu 18 or Amazon Linux 2 Deep Learning AMI (DLAMI).
When launching a Trn1, please adjust your primary EBS volume size to a minimum of 512GB.
After launching the instance, follow the instructions in Connect to your instance to connect to it.
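The console steps above can also be scripted. The following is a minimal sketch, assuming the AWS CLI is installed and configured. The function name `print_trn1_launch_cmd`, the AMI ID, the key pair name, the default instance size (`trn1.2xlarge`), and the root device name (`/dev/sda1`, which depends on the AMI) are illustrative assumptions, not values taken from this page. The function only prints the command, so nothing is launched until you review and run it yourself.

```shell
#!/usr/bin/env bash
# Sketch: print an AWS CLI command that launches a Trn1 instance with a
# 512 GB primary EBS volume. All argument values are placeholders.

print_trn1_launch_cmd() {
  local ami_id="$1"                         # AMI ID (placeholder; pick a DLAMI or base AMI)
  local key_name="$2"                       # name of an existing EC2 key pair
  local instance_type="${3:-trn1.2xlarge}"  # assumed default size; see the Trn1 web page
  local root_device="${4:-/dev/sda1}"       # assumed root device name; depends on the AMI
  local volume_gb=512                       # minimum primary EBS volume size from this page

  # The block device mapping is single-quoted in the output so that the
  # printed command can be pasted into a shell unchanged.
  printf "aws ec2 run-instances --image-id %s --instance-type %s --key-name %s --block-device-mappings '%s'\n" \
    "$ami_id" "$instance_type" "$key_name" \
    "DeviceName=${root_device},Ebs={VolumeSize=${volume_gb}}"
}

# Example: print the command for review (does not contact AWS).
print_trn1_launch_cmd "ami-EXAMPLE" "my-key-pair"
```

Check the root device name of your chosen AMI before running the printed command; if it does not match, the 512 GB setting may be applied to a different volume than the primary one.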
Note
If you experience connectivity issues while loading a model on a Trn1 instance running Ubuntu, the likely cause is an Ubuntu limitation with multiple network interfaces. To solve this problem, follow the steps mentioned here.
Using a DLAMI to launch instances is strongly recommended, because DLAMIs include the required fix.
2. Set up a development environment#
Enable PyTorch-Neuron#
Note
The instructions on this page apply only to setting up Neuron components on a Linux host running an Ubuntu or Amazon Linux AMI.
On Ubuntu:

# Configure Linux for Neuron repository updates
. /etc/os-release
sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <<EOF
deb https://apt.repos.neuron.amazonaws.com ${VERSION_CODENAME} main
EOF
wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | sudo apt-key add -

# Update OS packages
sudo apt-get update -y

# Install git
sudo apt-get install git -y

# Install OS headers
sudo apt-get install linux-headers-$(uname -r) -y

# Remove preinstalled packages and install Neuron Driver and Runtime
sudo apt-get remove aws-neuron-dkms -y
sudo apt-get remove aws-neuronx-dkms -y
sudo apt-get remove aws-neuronx-oci-hook -y
sudo apt-get remove aws-neuronx-runtime-lib -y
sudo apt-get remove aws-neuronx-collectives -y
sudo apt-get install aws-neuronx-dkms=2.* -y
sudo apt-get install aws-neuronx-oci-hook=2.* -y
sudo apt-get install aws-neuronx-runtime-lib=2.* -y
sudo apt-get install aws-neuronx-collectives=2.* -y

# Install EFA Driver (only required for multi-instance training)
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
wget https://efa-installer.amazonaws.com/aws-efa-installer.key && gpg --import aws-efa-installer.key
cat aws-efa-installer.key | gpg --fingerprint
wget https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz.sig && gpg --verify ./aws-efa-installer-latest.tar.gz.sig
tar -xvf aws-efa-installer-latest.tar.gz
cd aws-efa-installer && sudo bash efa_installer.sh --yes
cd
sudo rm -rf aws-efa-installer-latest.tar.gz aws-efa-installer

# Remove preinstalled packages and install Neuron Tools
sudo apt-get remove aws-neuron-tools -y
sudo apt-get remove aws-neuronx-tools -y
sudo apt-get install aws-neuronx-tools=2.* -y

export PATH=/opt/aws/neuron/bin:$PATH

# Install Python venv and activate Python virtual environment to install
# Neuron pip packages.
sudo apt install python3.8-venv
python3.8 -m venv aws_neuron_venv_pytorch
source aws_neuron_venv_pytorch/bin/activate
pip install -U pip

# Install wget, awscli
pip install wget
pip install awscli

# Install packages from repos
python -m pip config set global.extra-index-url "https://pip.repos.neuron.amazonaws.com"

# Install Python packages - Transformers package is needed for BERT
python -m pip install torch-neuronx=="1.11.0.1.*" "neuronx-cc==2.*"
On Amazon Linux 2:

# Configure Linux for Neuron repository updates
sudo tee /etc/yum.repos.d/neuron.repo > /dev/null <<EOF
[neuron]
name=Neuron YUM Repository
baseurl=https://yum.repos.neuron.amazonaws.com
enabled=1
metadata_expire=0
EOF
sudo rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

# Install OS headers
sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y

# Update OS packages
sudo yum update -y

# Install git
sudo yum install git -y

# Remove preinstalled packages and install Neuron Driver and Runtime
sudo yum remove aws-neuron-dkms -y
sudo yum remove aws-neuronx-dkms -y
sudo yum remove aws-neuronx-oci-hook -y
sudo yum remove aws-neuronx-runtime-lib -y
sudo yum remove aws-neuronx-collectives -y
sudo yum install aws-neuronx-dkms-2.* -y
sudo yum install aws-neuronx-oci-hook-2.* -y
sudo yum install aws-neuronx-runtime-lib-2.* -y
sudo yum install aws-neuronx-collectives-2.* -y

# Install EFA Driver (only required for multi-instance training)
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
wget https://efa-installer.amazonaws.com/aws-efa-installer.key && gpg --import aws-efa-installer.key
cat aws-efa-installer.key | gpg --fingerprint
wget https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz.sig && gpg --verify ./aws-efa-installer-latest.tar.gz.sig
tar -xvf aws-efa-installer-latest.tar.gz
cd aws-efa-installer && sudo bash efa_installer.sh --yes
cd
sudo rm -rf aws-efa-installer-latest.tar.gz aws-efa-installer

# Remove preinstalled packages and install Neuron Tools
sudo yum remove aws-neuron-tools -y
sudo yum remove aws-neuronx-tools -y
sudo yum install aws-neuronx-tools-2.* -y

export PATH=/opt/aws/neuron/bin:$PATH

# Install Python venv and activate Python virtual environment to install
# Neuron pip packages.
python3.7 -m venv aws_neuron_venv_pytorch
source aws_neuron_venv_pytorch/bin/activate
python -m pip install -U pip

# Install wget, awscli
pip install wget
pip install awscli

# Install packages from repos
python -m pip config set global.extra-index-url "https://pip.repos.neuron.amazonaws.com"

# Install Python packages - Transformers package is needed for BERT
python -m pip install torch-neuronx=="1.11.0.1.*" "neuronx-cc==2.*"
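Once the installation commands have completed, a quick sanity check can confirm that the expected commands are reachable. The sketch below is an illustration rather than part of the Neuron setup: `report_missing` is a hypothetical helper name, and it assumes that `neuron-ls` (from the tools package added to `PATH` above) and `python` (from the activated virtual environment) are the commands worth checking. The import check assumes the `torch-neuronx` package is importable as `torch_neuronx`.

```shell
#!/usr/bin/env bash
# Sketch: report which of the given commands are not found on PATH.

report_missing() {
  local missing=0
  local cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd"
      missing=1
    fi
  done
  # Return 0 when every command was found, 1 otherwise.
  return "$missing"
}

# Example check: run inside the activated virtual environment.
if report_missing neuron-ls python; then
  echo "Neuron tools and Python are on PATH"
  # Assumed import name for the torch-neuronx package.
  python -c "import torch_neuronx" || echo "torch_neuronx is not importable"
fi
```

If `neuron-ls` is reported missing in a new shell session, the `export PATH=/opt/aws/neuron/bin:$PATH` line may need to be run again or added to your shell profile.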
3. Set up Jupyter notebook#
To develop from a Jupyter notebook, see Jupyter Notebook QuickStart.
You can also run a Jupyter notebook as a script. First enable the ML framework Conda or Python environment of your choice, then see Running Jupyter Notebook as script for instructions.
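One common way to run a notebook as a script is to convert it with `jupyter nbconvert --to script` and then execute the result. The sketch below assumes Jupyter is installed in the active environment, that the notebook uses a Python kernel, and that the converted `.py` file is written next to the notebook. The function names and the notebook name are hypothetical, and the linked instructions may describe a different procedure. Notebooks that rely on IPython magics may need `ipython` instead of `python` to run.

```shell
#!/usr/bin/env bash
# Sketch: convert a notebook to a Python script and run it.

# Derive the script file name expected for a Python notebook.
script_name_for() {
  printf '%s.py\n' "${1%.ipynb}"
}

# Convert the notebook, then run the generated script if conversion succeeded.
run_notebook_as_script() {
  local notebook="$1"
  jupyter nbconvert --to script "$notebook" && python "$(script_name_for "$notebook")"
}

# Example (requires Jupyter): run_notebook_as_script my_training.ipynb
script_name_for my_training.ipynb   # prints: my_training.py
```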