This document is relevant for: Inf2, Trn1, Trn1n

Neuron Custom C++ Operators in MLP Training#

In this tutorial we’ll demonstrate how to prepare a PyTorch model that contains a custom operator (i.e., a CppExtension) for Neuron compilation to run on Trainium EC2 instances. To learn more about Neuron CustomOps, see Neuron Custom C++ Operators [Beta]. For a deeper dive into MNIST or multi-layer perceptron models, see the Multi-Layer Perceptron Training Tutorial. This tutorial assumes the reader is familiar with PyTorch Custom Extensions.

Setup Environment and Download Examples#

Before running the tutorial please follow the installation instructions at:

Note

The name of aws-neuronx-gpsimd-customop has been changed to aws-neuronx-gpsimd-customop-lib as of the Neuron 2.10 release.

Note

Custom C++ Operators are supported as of Neuron SDK Version 2.7 as a beta feature. As such, this feature is not installed by default; additional tooling and library packages (RPM and DEB) are required.

For AL2023 only, the following packages need to be installed as dependencies:

sudo yum install libnsl
sudo yum install libxcrypt-compat

On AL2 and AL2023, the Custom C++ Operators tooling and library packages can be installed with the following commands:

sudo yum remove python3-devel -y
sudo yum remove aws-neuronx-gpsimd-tools-0.* -y
sudo yum remove aws-neuronx-gpsimd-customop-lib-0.* -y

sudo yum install python3-devel -y
sudo yum install aws-neuronx-gpsimd-tools-0.* -y
sudo yum install aws-neuronx-gpsimd-customop-lib-0.* -y

On Ubuntu, the same packages can be installed with the following commands:

sudo apt-get remove python3-dev -y
sudo apt-get remove aws-neuronx-gpsimd-tools=0.* -y
sudo apt-get remove aws-neuronx-gpsimd-customop-lib=0.* -y

sudo apt-get install python3-dev -y
sudo apt-get install aws-neuronx-gpsimd-tools=0.* -y
sudo apt-get install aws-neuronx-gpsimd-customop-lib=0.* -y

For all the commands below, make sure you are in the virtual environment that you created during the installation steps above before you run the commands:

source ~/aws_neuron_venv_pytorch/bin/activate

Install dependencies for PyTorch Custom Extensions in your environment by running:

pip install regex
pip install ninja

The ninja package is only needed for the reference CPU example. It is not needed by Neuron to run on Trainium instances.

To download the source code for this tutorial, do:

git clone https://github.com/aws-neuron/aws-neuron-samples.git
cd aws-neuron-samples/torch-neuronx/training/customop_mlp

In the customop_mlp directory there are two subdirectories. The pytorch directory contains an example model and training script using a custom operator that runs using the cpu device with standard PyTorch APIs and libraries (i.e., not specific to AWS/Neuron). The neuron directory contains a version of the same model and training script with the custom operator ported to Neuron to run on Trn1 using the XLA device.

Basic PyTorch Custom Relu Operator#

For the next few sections we’ll review the example model in the pytorch directory. This is a condensed and simplified explanation of PyTorch C++ extensions; for more details, see the PyTorch documentation. In my_ops.py we implement a custom ReLU activation op as a torch.autograd.Function so that we can use it in a training loop:

import torch

torch.ops.load_library('librelu.so')

class Relu(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return torch.ops.my_ops.relu_forward(input)

    @staticmethod
    def backward(ctx, grad):
        input, = ctx.saved_tensors
        return torch.ops.my_ops.relu_backward(grad, input), None

Notice that we first load librelu.so using the load_library API, and then call the relu_forward and relu_backward functions from our library within the relevant static methods.

We implemented these two library functions in the relu.cpp file:

torch::Tensor relu_forward(const torch::Tensor& t_in) {
    ...
    t_out_acc[i][j] = t_in_acc[i][j] > 0.0 ? t_in_acc[i][j] : 0.0;
    ...
}

torch::Tensor relu_backward(const torch::Tensor& t_grad, const torch::Tensor& t_in) {
    ...
    t_out_acc[i][j] = t_in_acc[i][j] > 0.0 ? t_grad_acc[i][j] : 0.0;
    ...
}

TORCH_LIBRARY(my_ops, m) {
    m.def("relu_forward", &relu_forward);
    m.def("relu_backward", &relu_backward);
}
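
The elided code allocates the output tensor, sets up 2-D float accessors, and loops over every element. In plain PyTorch terms, the two kernels compute the following (a reference sketch for clarity only, not code from the sample):

import torch

def relu_forward_reference(t_in: torch.Tensor) -> torch.Tensor:
    # Forward: keep positive elements, zero out the rest
    return torch.where(t_in > 0.0, t_in, torch.zeros_like(t_in))

def relu_backward_reference(t_grad: torch.Tensor, t_in: torch.Tensor) -> torch.Tensor:
    # Backward: the gradient passes through only where the input was positive
    return torch.where(t_in > 0.0, t_grad, torch.zeros_like(t_grad))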

We then build them into a library using the PyTorch C++ extension APIs in the build.py script:

import os
import torch.utils.cpp_extension

torch.utils.cpp_extension.load(
    name='librelu',
    sources=['relu.cpp'],
    is_python_module=False,
    build_directory=os.getcwd()
)

Run python build.py to produce the librelu.so library.
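
As a quick sanity check (hypothetical, not part of the sample), you can compare the freshly built op against PyTorch’s built-in ReLU on a small 2-D float tensor, which is the layout the accessor-based kernels expect:

import torch
import my_ops   # loads librelu.so and defines the Relu autograd function

# Direct call into the library should match the built-in ReLU
x = torch.randn(4, 4)
assert torch.allclose(torch.ops.my_ops.relu_forward(x), torch.relu(x))

# Forward and backward through the autograd wrapper
x = torch.randn(4, 4, requires_grad=True)
my_ops.Relu.apply(x).sum().backward()
assert torch.allclose(x.grad, (x > 0.0).float())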

Multi-layer perceptron MNIST model#

In model.py, we define the multi-layer perceptron (MLP) MNIST model with 3 linear layers and a custom ReLU activation, followed by a log-softmax layer. The relevant custom changes in the model.py file are the my_ops import and the my_ops.Relu.apply calls in the forward pass:

import torch
import torch.nn as nn
from torch.nn import functional as F
import my_ops

# Declare 3-layer MLP for MNIST dataset
class MLP(nn.Module):
    def __init__(self, input_size = 28 * 28, output_size = 10, layers = [120, 84]):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(input_size, layers[0])
        self.fc2 = nn.Linear(layers[0], layers[1])
        self.fc3 = nn.Linear(layers[1], output_size)

    def forward(self, x):
        f1 = self.fc1(x)
        r1 = my_ops.Relu.apply(f1)
        f2 = self.fc2(r1)
        r2 = my_ops.Relu.apply(f2)
        f3 = self.fc3(r2)
        return torch.log_softmax(f3, dim=1)
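
For illustration (a hypothetical snippet, not part of the sample), the model can be exercised on a fake MNIST-shaped batch once librelu.so has been built; note that images must be flattened to input_size before the first linear layer:

import torch
from model import MLP   # model.py above; it imports my_ops, so librelu.so must be present

model = MLP()
images = torch.rand(32, 1, 28, 28)          # fake MNIST batch
logits = model(images.view(32, 28 * 28))    # flatten to [batch, 784]
print(logits.shape)                         # torch.Size([32, 10])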

Training the MLP model on CPU#

In the train_cpu.py script we load the MNIST train dataset, instantiate the MLP model, and use device='cpu' to execute on the host CPU. Expected CPU output:

----------Training ---------------
Train throughput (iter/sec): 286.96994718801335
Final loss is 0.1040
----------End Training ---------------
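
For reference, the core of such a CPU training loop looks roughly like the sketch below (simplified; it assumes torchvision is installed, and it omits the throughput and loss reporting of the actual train_cpu.py):

import torch
from torchvision import datasets, transforms
from model import MLP

device = 'cpu'
train_set = datasets.MNIST('./data', train=True, download=True,
                           transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = MLP().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.NLLLoss()        # pairs with the log_softmax output of the model

model.train()
for images, labels in loader:
    optimizer.zero_grad()
    output = model(images.view(images.size(0), -1).to(device))
    loss = loss_fn(output, labels.to(device))
    loss.backward()
    optimizer.step()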

Neuron Relu CustomOp#

Now switch to the neuron directory. To migrate our PyTorch CustomOp to Neuron, we have to make a few small changes. First, we create a new shape.cpp file to implement the shape functions required by XLA (see Neuron Custom C++ Operators Developer Guide [Beta] for details). We also replace the TORCH_LIBRARY API with NEURON_LIBRARY.

torch::Tensor relu_fwd_shape(torch::Tensor t_in) {
    torch::Tensor t_out = torch::zeros(t_in.sizes(), torch::kFloat);
    return t_out;
}

torch::Tensor relu_bwd_shape(torch::Tensor t_grad, torch::Tensor t_in) {
    torch::Tensor t_out = torch::zeros(t_in.sizes(), torch::kFloat);
    return t_out;
}

NEURON_LIBRARY(my_ops, m) {
    m.def("relu_forward", &relu_fwd_shape, "relu_forward");
    m.def("relu_backward", &relu_bwd_shape, "relu_backward");
}

We then build the library using the torch_neuronx package in build.py:

import os
from torch_neuronx.xla_impl import custom_op

custom_op.load(
    name='relu',
    compute_srcs=['relu.cpp'],
    shape_srcs=['shape.cpp'],
    build_directory=os.getcwd()
)

Notice that we specify the relu.cpp and shape.cpp files separately. This is because the shape functions are compiled with an x86 compiler and run on the host during XLA compilation, while the compute functions are compiled for the NeuronCore device and executed during the training loop. Running build.py produces a librelu.so just as in the CPU example, but this time the compute code is compiled to execute on the NeuronCore.

In our my_ops.py file we use the torch_neuronx API to load our new library, and then execute our CustomOp exactly the same way we did before:

import torch
import torch_neuronx
from torch_neuronx.xla_impl import custom_op

custom_op.load_library('librelu.so')

class Relu(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return torch.ops.my_ops.relu_forward(input)

    @staticmethod
    def backward(ctx, grad):
        input, = ctx.saved_tensors
        return torch.ops.my_ops.relu_backward(grad, input), None
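
On a Trn1 instance, the op can then be exercised on the XLA device like any other operator. A minimal sketch (assuming torch_neuronx and torch_xla are installed and the Neuron build.py above has produced librelu.so):

import torch
import torch_xla.core.xla_model as xm
import my_ops   # loads librelu.so through torch_neuronx and defines Relu

device = xm.xla_device()
x = torch.randn(4, 4, device=device)
y = my_ops.Relu.apply(x)   # lowered to the Neuron CustomOp at XLA compile time
xm.mark_step()             # compile and execute the pending graph
print(y.cpu())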

Training the MLP model on Trainium#

In the train.py script we modify the CPU training script train_cpu.py to run with PyTorch Neuron (torch_xla). Expected output on a trn1 instance:

----------Training ---------------
2023-02-02 22:46:58.000299: INFO ||NCC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/USER_neuroncc-2.0.0.8683a0+c94c3936c/MODULE_4447837791278761679/MODULE_0_SyncTensorsGraph.329_4447837791278761679_ip-172-31-38-167.us-west-2.compute.internal-49ad7ade-14011-5f3bf523d8788/1650ba41-bcfd-4d15-9038-16d391c4a57c/MODULE_0_SyncTensorsGraph.329_4447837791278761679_ip-172-31-38-167.us-west-2.compute.internal-49ad7ade-14011-5f3bf523d8788.neff. Exiting with a successfully compiled graph
2023-02-02 22:46:58.000433: INFO ||NCC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/USER_neuroncc-2.0.0.8683a0+c94c3936c/MODULE_16964505026440903899/MODULE_1_SyncTensorsGraph.401_16964505026440903899_ip-172-31-38-167.us-west-2.compute.internal-4d0cabba-14011-5f3bf529794a3/23d74230-59dd-4347-b247-fa98aed416bd/MODULE_1_SyncTensorsGraph.401_16964505026440903899_ip-172-31-38-167.us-west-2.compute.internal-4d0cabba-14011-5f3bf529794a3.neff. Exiting with a successfully compiled graph
Train throughput (iter/sec): 117.47151142662648
Final loss is 0.1970
----------End Training ---------------
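
The device-related changes relative to the CPU script follow the standard PyTorch/XLA pattern, roughly as sketched below (simplified; see train.py in the samples repository for the complete script):

import torch
import torch_xla.core.xla_model as xm
from torchvision import datasets, transforms
from model import MLP

device = xm.xla_device()             # NeuronCore exposed as an XLA device

train_set = datasets.MNIST('./data', train=True, download=True,
                           transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = MLP().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.NLLLoss()

model.train()
for images, labels in loader:
    optimizer.zero_grad()
    output = model(images.view(images.size(0), -1).to(device))
    loss = loss_fn(output, labels.to(device))
    loss.backward()
    optimizer.step()
    xm.mark_step()                   # trigger XLA graph compilation and execution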

This document is relevant for: Inf2, Trn1, Trn1n