Setup
The Neuronx Distributed Training framework is built on top of NeuronxDistributed (NxD), the NeMo libraries, and PyTorch-Lightning. The guide below provides step-by-step instructions on how to set up the environment to run training with the NeuronX Distributed Training framework.
Set up a Python Virtual Environment
First, set up a virtual environment for development. This can be done using the commands below:
python3 -m venv env
source env/bin/activate
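The same environment can also be created from Python with the standard-library venv module, which is handy inside provisioning scripts. The sketch below is illustrative only: the helper names create_env and in_virtualenv are hypothetical, and activation still has to happen in the shell with the source command above.

```python
import sys
import venv
from pathlib import Path


def create_env(path="env", with_pip=True):
    """Create a virtual environment at `path`, similar to `python3 -m venv env`."""
    venv.create(path, with_pip=with_pip)
    return Path(path)


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix
```

Calling in_virtualenv() at the top of a setup script is a cheap way to avoid installing the packages below into the system interpreter by accident.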
Installing Neuron Dependencies
Install the Neuron packages using the command:
pip install --upgrade neuronx-cc==2.* torch-neuronx torchvision neuronx_distributed --extra-index-url https://pip.repos.neuron.amazonaws.com
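After the install finishes, it can help to confirm which versions pip actually resolved. The following is a minimal sketch using importlib.metadata from the standard library; the package list is copied from the command above, and the helper name installed_versions is hypothetical.

```python
from importlib import metadata

# Distribution names taken from the pip command above.
NEURON_PACKAGES = ["neuronx-cc", "torch-neuronx", "torchvision", "neuronx_distributed"]


def installed_versions(names=NEURON_PACKAGES):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```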
Building Apex
Since NxD Training is built on top of NeMo, we have to install NeMo's dependencies too. One of these is the Apex library, which NeMo uses for a few of its fused module implementations.
Note
NeMo used to rely on Apex for all distributed training APIs. Because NxD now serves that purpose, this framework uses Apex only minimally. It remains a dependency because some minor imports inside NeMo break without it. Hence, we build a slim CPU-only version of Apex using the instructions below:
Clone the Apex repo into ~/apex:
git clone https://github.com/ericharper/apex.git ~/apex
cd ~/apex
git checkout nm_v1.14.0
Replace the contents of setup.py with the following:
import sys
import warnings
import os
from packaging.version import parse, Version
from setuptools import setup, find_packages
import subprocess
import torch
from torch.utils.cpp_extension import BuildExtension, CppExtension, CUDAExtension, CUDA_HOME, load

# Slim CPU-only build: only the Python packages are bundled, no C++/CUDA extensions.
setup(
    name="apex",
    version="0.1",
    packages=find_packages(
        exclude=("build", "csrc", "include", "tests", "dist", "docs", "examples", "apex.egg-info",)
    ),
    install_requires=["packaging>20.6",],
    description="PyTorch Extensions written by NVIDIA",
)
Install Python dependencies:
pip install packaging wheel
Build the wheel using the command:
python setup.py bdist_wheel
After this, you should see the wheel in dist/. You can use it for installation in the next section. Then leave the apex directory with cd .. before continuing.
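The exact wheel filename depends on the version and platform tags that the build produces, so a script may prefer to look it up rather than hard-code it. A minimal sketch, assuming the wheel lands in a dist directory and its name starts with "apex-" (the helper find_wheels is hypothetical):

```python
from pathlib import Path


def find_wheels(dist_dir="dist", prefix="apex-"):
    """Return the wheel files in `dist_dir` whose names start with `prefix`, sorted by name."""
    return sorted(Path(dist_dir).glob(f"{prefix}*.whl"))
```

For example, find_wheels(Path.home() / "apex" / "dist") would list the wheel built above, and the resulting path can be passed to pip install.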
Installing the requirements
Download requirements.txt using the command:
wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed-training/master/requirements.txt
We can now install the dependencies of the library using the following command:
pip install -r requirements.txt ~/apex/dist/apex-0.1-py3-none-any.whl
Installing Neuronx Distributed Training framework
To install the library, run the following command:
pip install neuronx_distributed_training --extra-index-url https://pip.repos.neuron.amazonaws.com
Common failures during installation
This section covers common failures you may see during setup and how to resolve them.
ModuleNotFoundError: No module named 'Cython'
You may have to install Cython explicitly using
pip install Cython
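To catch this before a long install run fails, a script can check up front whether the module is importable. This is a small sketch using importlib.util.find_spec; the helper name missing_modules is hypothetical, and it is intended for top-level module names only.

```python
from importlib import util


def missing_modules(names=("Cython",)):
    """Return the top-level module names from `names` that cannot be imported."""
    return [name for name in names if util.find_spec(name) is None]
```

An empty list means every listed module was found in the current environment.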
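Error while building youtokentome
If you get an error that says Python.h file not found, you may have to install python-dev and recreate the virtual env. On Debian or Ubuntu releases where python-dev is unavailable, the Python 3 headers are typically provided by the python3-dev package instead. To install python-dev, you can use the command:
sudo apt-get install python-dev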
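Whether the headers are present can be checked from Python before attempting the build. The sketch below asks sysconfig for the interpreter's include directory and looks for Python.h there; the helper names are hypothetical, and the check only reflects the interpreter that runs it.

```python
import sysconfig
from pathlib import Path


def python_header_path():
    """Return the path where this interpreter expects to find Python.h."""
    return Path(sysconfig.get_paths()["include"]) / "Python.h"


def python_header_available():
    """Return True if Python.h exists in the interpreter's include directory."""
    return python_header_path().is_file()
```

If python_header_available() returns False inside the virtual env, installing the development package and recreating the env as described above is the likely fix.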
Mismatched torch and torch-xla version
When you see an error that looks like:
ImportError: env/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c109TupleTypeC1ESt6vectorINS_4Type24SingletonOrSharedTypePtrIS2_EESaIS4_EENS_8optionalINS_13QualifiedNameEEESt10shared_ptrINS_14FunctionSchemaEE
It indicates that the major versions of torch and torch-xla don't match.
Note
If you reinstall torch, make sure to also install the corresponding torchvision version; otherwise the two packages will conflict.
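A quick comparison of the two version strings can confirm this diagnosis. The sketch below compares the leading components of each version; comparing major.minor (parts=2) is an assumption that is stricter than the "major versions" wording above, and the helper names are hypothetical.

```python
def release_prefix(version, parts=2):
    """Return the first `parts` numeric components, ignoring a local suffix such as '+cpu'."""
    public = version.split("+", 1)[0]
    return tuple(public.split(".")[:parts])


def same_release(torch_version, xla_version, parts=2):
    """Return True when both versions share the same leading components."""
    return release_prefix(torch_version, parts) == release_prefix(xla_version, parts)
```

The inputs can be read with importlib.metadata.version("torch") and importlib.metadata.version("torch-xla"), which avoids importing the packages that fail to load.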
Torchvision version error
The error below indicates an incorrect torchvision version. If installing torch==2.1, install torchvision==0.16 (the compatibility table in the torchvision README shows which version of torchvision is compatible with which version of torch).
ValueError: Could not find the operator torchvision::nms. Please make sure you have already registered the operator
and (if registered from C++) loaded it via torch.ops.load_library.
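The pairing can also be checked programmatically. In this sketch the lookup table contains only the torch 2.1 / torchvision 0.16 pair mentioned above; any further entries would need to come from the torchvision compatibility table. The names EXPECTED_TORCHVISION, minor, and torchvision_matches are hypothetical.

```python
# Only the pairing stated in this guide; extend from the torchvision compatibility table.
EXPECTED_TORCHVISION = {"2.1": "0.16"}


def minor(version):
    """Reduce a version such as '2.1.0+cpu' to its 'major.minor' form."""
    return ".".join(version.split("+", 1)[0].split(".")[:2])


def torchvision_matches(torch_version, torchvision_version, table=EXPECTED_TORCHVISION):
    """Return True/False for known torch versions, or None when the pairing is not in the table."""
    expected = table.get(minor(torch_version))
    if expected is None:
        return None
    return minor(torchvision_version) == expected
```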
Matplotlib lock error
If you see the error below:
TimeoutError: Lock error: Matplotlib failed to acquire the following lock file
This error means that compute/worker nodes are contending for access to the matplotlib cache, which causes the timeout. To resolve it, run python -c 'import matplotlib.pyplot as plt' as part of your setup. This creates the matplotlib cache ahead of time and avoids the race condition.
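The same warm-up can be done from a Python setup script instead of the shell one-liner. This is a minimal sketch: the helper name warm_matplotlib_cache is hypothetical, and selecting the Agg backend is an added assumption so that the import also works on nodes without a display.

```python
import importlib


def warm_matplotlib_cache():
    """Import matplotlib.pyplot once so its cache is built; return False if matplotlib is absent."""
    try:
        matplotlib = importlib.import_module("matplotlib")
    except ImportError:
        return False
    matplotlib.use("Agg")  # non-interactive backend, no display required
    importlib.import_module("matplotlib.pyplot")
    return True
```

Running this once per node before the training workers start should have the same effect as the command above.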