This document is relevant for: Inf2, Trn1, Trn2
LayerNorm#
In this tutorial, we implement a kernel to perform LayerNorm of a 2D tensor, as described in Layer Normalization. LayerNorm is a common normalization mechanism used in Transformer models, similar to RMSNorm. However, LayerNorm requires more vector operations, which makes it more challenging to achieve good compute efficiency on the Vector Engine. Along the way, we will revisit the key concepts from the RMSNorm tutorial and additionally learn about:
Using nki.isa APIs to efficiently compute mean and variance, and minimize the number of traversals over input data by combining multiple vector instructions into one
Taking surrounding compute into consideration when deciding tensor layouts
Before diving into LayerNorm for a 2D tensor, let's go over the LayerNorm operator for a 1D vector \(y\), defined as below:

\[
y = \frac{x - \mathbb{E}[x]}{\sqrt{\mathrm{var}[x] + \epsilon}} * \gamma + \beta
\]
The parameters are:
\(x\): Input 1D vector
\(y\): Output 1D vector, same shape as \(x\)
\(\mathbb{E}[x]\): Mean of \(x\)
\(\mathrm{var}[x]\): Variance of \(x\)
\(\epsilon\): A small constant scalar for numerical stability
\(\gamma\), \(\beta\): LayerNorm affine transform parameters, each with the same shape as \(x\)
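As a quick sanity check of the formula above, here is a minimal NumPy sketch of 1D LayerNorm (an illustration for this tutorial only; the vector length and random values are arbitrary assumptions, not part of the NKI kernels):

import numpy as np

def layernorm_1d(x, gamma, beta, epsilon=1e-5):
    # E[x] and var[x] over the single dimension (population variance)
    mean = x.mean()
    var = x.var()
    # Normalize, then apply the affine transform
    return (x - mean) / np.sqrt(var + epsilon) * gamma + beta

x = np.random.rand(1024)
gamma = np.random.rand(1024)
beta = np.random.rand(1024)
y = layernorm_1d(x, gamma, beta)   # output has the same shape as x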
In Transformer models, we typically need to perform LayerNorm on a 2D input tensor (with shape [sequence_length, hidden_size]), where the first dimension corresponds to the number of tokens currently being processed and the second dimension is the embedding dimension of each token. Different tokens (i.e., rows of the [sequence_length, hidden_size] 2D tensor) undergo 1D LayerNorm independently. Therefore, we need to calculate a different mean and variance for each row and broadcast (i.e., share) the same \(\gamma\), \(\beta\) parameters across the rows.
The figure below visualizes the tensor shapes involved in 2D LayerNorm, where input_tensor is the 2D input tensor and gamma_vector and beta_vector are the affine transform parameters:
Compared to RMSNorm, LayerNorm requires calculating mean and variance, instead of a simple square and summation. Also, LayerNorm performs two instances of free-axis broadcast and two instances of partition-axis broadcast, while RMSNorm requires one instance of each. Therefore, LayerNorm involves substantially more computation (vector operations in particular) than RMSNorm.
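To make the row-wise independence and the two kinds of broadcast concrete, here is a minimal NumPy sketch of 2D LayerNorm (a reference illustration with arbitrary shapes, not the NKI kernel itself):

import numpy as np

def layernorm_2d(input_tensor, gamma_vector, beta_vector, epsilon=1e-5):
    # One mean/variance per row (token); keepdims=True lets them broadcast along the free axis
    mean = input_tensor.mean(axis=1, keepdims=True)   # shape [sequence_length, 1]
    var = input_tensor.var(axis=1, keepdims=True)     # shape [sequence_length, 1]
    normalized = (input_tensor - mean) / np.sqrt(var + epsilon)
    # gamma_vector/beta_vector have shape [hidden_size] and are shared across all rows
    return normalized * gamma_vector + beta_vector

x = np.random.rand(300, 1000)
gamma = np.random.rand(1000)
beta = np.random.rand(1000)
out = layernorm_2d(x, gamma, beta)   # same shape as x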
Implement NKI kernel#
Next, we will present two versions of LayerNorm implementation, starting from a naive version using nki.language APIs and ending with an optimized version using nki.isa APIs.
Version 1: nki.language APIs only#
import neuronxcc.nki as nki
import neuronxcc.nki.language as nl
import neuronxcc.nki.isa as nisa
import numpy as np
import math

@nki.jit
def nki_layernorm_kernel_v1(input_tensor, epsilon, gamma_vector, beta_vector):
  """Computes LayerNorm.
  Used nki.language APIs only.
  """
  output_tensor = nl.ndarray(input_tensor.shape, dtype=input_tensor.dtype,
                             buffer=nl.shared_hbm)

  # Ensure that the shapes of tensors match
  assert input_tensor.shape[1] == gamma_vector.shape[0] == beta_vector.shape[0]

  # Generate tile indices for loading/storing data
  i_p_io = nl.arange(nl.tile_size.pmax)[:, None]
  i_f_io = nl.arange(input_tensor.shape[1])[None, :]
  i_p_param = nl.arange(1)[:, None]

  # Number of rows in the input tensor
  num_rows = input_tensor.shape[0]

  # Load gamma and beta, which will be reused across rows/tiles of input_tensor
  gamma_sb = nl.load(gamma_vector.reshape((1, gamma_vector.shape[0]))[i_p_param, i_f_io])
  beta_sb = nl.load(beta_vector.reshape((1, beta_vector.shape[0]))[i_p_param, i_f_io])

  # Broadcast the gamma and beta to match the dimensions of the tiles
  gamma_sb_bcast = gamma_sb.broadcast_to((nl.tile_size.pmax, gamma_vector.shape[0]))
  beta_sb_bcast = beta_sb.broadcast_to((nl.tile_size.pmax, beta_vector.shape[0]))

  # Tile partition dimension of the input tensor by nl.tile_size.pmax
  for i in nl.affine_range(math.ceil(input_tensor.shape[0]/nl.tile_size.pmax)):
    # Load input tile
    input_sb = nl.load(input_tensor[i * nl.tile_size.pmax + i_p_io, i_f_io],
                       mask=(i * nl.tile_size.pmax + i_p_io < num_rows))

    # Compute mean and variance
    mean = nl.mean(input_sb, axis=1)
    # Trick to calculate var with mean: mean(x^2) - mean(x)^2
    var = nl.mean(nl.square(input_sb), axis=1) - mean * mean

    # Normalize the input by shifting with the mean
    # and scaling with rsqrt of variance and epsilon
    shift_scale_tensor = (input_sb - mean) * nl.rsqrt(var + epsilon)

    # Scale the normalized tile using gamma and add beta
    output_sb = shift_scale_tensor * gamma_sb_bcast + beta_sb_bcast

    nl.store(output_tensor[i * nl.tile_size.pmax + i_p_io, i_f_io], value=output_sb,
             mask=(i * nl.tile_size.pmax + i_p_io < num_rows))

  return output_tensor
- To adhere to NKI's tile-size considerations (Tile Size Considerations), we limit the partition axis size of the input_tensor tile to 128 (nl.tile_size.pmax).
- Load gamma and beta, and perform the partition-axis broadcast: the multiplication with shift_scale_tensor requires broadcasting gamma and beta across all partitions (broadcast_to() API).
- The trip count of the compute loop is math.ceil(input_tensor.shape[0]/nl.tile_size.pmax). In cases where input_tensor.shape[0] is not a multiple of nl.tile_size.pmax, we can disable ineffectual data movement or compute in the last iteration using the mask field.
- Within the compute loop:
  - We load one tile of input_tensor with shape (nl.tile_size.pmax, input_tensor.shape[1]) using the nl.load API. We guard the loading boundary by specifying mask=(i * nl.tile_size.pmax + i_p_io < input_tensor.shape[0]), which ensures we don't access out-of-bound memory when the number of rows in input_tensor is not a multiple of nl.tile_size.pmax.
  - Compute the mean and variance using nki.language.mean (a short numerical check of the variance identity used here follows this list).
  - Normalize one tile of input_tensor using mean and variance. The variance is preprocessed using nki.language.rsqrt.
  - Scale the normalized tile using gamma and add beta.
  - Finally, we store the normalized tile back into HBM using the nl.store API. We guard the store boundary similarly to the load boundary using the mask field.
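The variance computation in version 1 relies on the identity var(x) = E[x^2] - (E[x])^2. The short NumPy check below verifies it (illustration only; as a general numerical caveat, this formulation can lose precision in low-precision dtypes when the mean is large relative to the spread of the data):

import numpy as np

# Verify the identity used in version 1: var(x) = mean(x^2) - mean(x)^2 (population variance)
x = np.random.rand(128, 1000)                       # 128 rows, as in one SBUF tile
var_direct = x.var(axis=1)
var_trick = (x ** 2).mean(axis=1) - x.mean(axis=1) ** 2
assert np.allclose(var_direct, var_trick)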
Next, we will optimize the above implementation using nki.isa APIs in version 2.
Version 2: nki.isa APIs to calculate mean/variance and perform shift/scale#
@nki.jit
def nki_layernorm_kernel_v2(input_tensor, epsilon, gamma_vector, beta_vector):
  """Computes LayerNorm.
  Used nki.isa APIs to calculate mean/variance and perform shift/scale.
  """
  output_tensor = nl.ndarray(input_tensor.shape, dtype=input_tensor.dtype,
                             buffer=nl.shared_hbm)

  # Ensure that the shapes of tensors match
  assert input_tensor.shape[1] == gamma_vector.shape[0] == beta_vector.shape[0]

  # Generate tile indices for loading/storing data
  i_p_io = nl.arange(nl.tile_size.pmax)[:, None]
  i_f_io = nl.arange(input_tensor.shape[1])[None, :]
  i_p_param = nl.arange(1)[:, None]

  # Number of rows in the input tensor
  num_rows = input_tensor.shape[0]

  # Load gamma and beta, which will be reused across rows/tiles of input_tensor
  gamma_sb = nl.load(gamma_vector.reshape((1, gamma_vector.shape[0]))[i_p_param, i_f_io])
  beta_sb = nl.load(beta_vector.reshape((1, beta_vector.shape[0]))[i_p_param, i_f_io])

  # Broadcast the gamma and beta to match the dimensions of the tiles
  gamma_sb_bcast = gamma_sb.broadcast_to((nl.tile_size.pmax, gamma_vector.shape[0]))
  beta_sb_bcast = beta_sb.broadcast_to((nl.tile_size.pmax, beta_vector.shape[0]))

  # Tile partition dimension of the input tensor by nl.tile_size.pmax
  for i in nl.affine_range(math.ceil(input_tensor.shape[0]/nl.tile_size.pmax)):
    # Load input tile
    input_sb = nl.load(input_tensor[i * nl.tile_size.pmax + i_p_io, i_f_io],
                       mask=(i * nl.tile_size.pmax + i_p_io < num_rows))

    # Tile free dimension of the input tensor by nl.tile_size.bn_stats_fmax,
    # as bn_stats has a free dimension size limit
    i_f_bn = nl.arange(nl.tile_size.bn_stats_fmax)[None, :]
    i_f_stats = nl.arange(6)[None, :]
    num_bn_stats = math.ceil(input_tensor.shape[1]/nl.tile_size.bn_stats_fmax)
    stats_results = nl.ndarray((nl.tile_size.pmax, 6*num_bn_stats), dtype=np.float32)
    for j in nl.affine_range(num_bn_stats):
      stats_results[i_p_io, j * 6 + i_f_stats] = nisa.bn_stats(
          input_sb[i_p_io, j * nl.tile_size.bn_stats_fmax + i_f_bn],
          mask=(j * nl.tile_size.bn_stats_fmax + i_f_bn < input_tensor.shape[1]),
          dtype=np.float32)

    # Aggregate bn_stats results to compute mean and var
    i_f_aggr = nl.arange(6*num_bn_stats)[None, :]
    mean_var = nisa.bn_aggr(stats_results[i_p_io, i_f_aggr])
    mean = mean_var[i_p_io, 0]
    var = mean_var[i_p_io, 1]

    # Get reciprocal of sqrt(var + epsilon)
    scale_var = nl.rsqrt(var + epsilon)

    # Put the shift and scale together so they map to a single tensor_scalar
    # instruction with two ALU ops: shift_scale_tensor = (input_sb - mean) * scale_var
    shift_scale_tensor = nisa.tensor_scalar(data=input_sb, op0=np.subtract,
                                            operand0=mean,
                                            op1=np.multiply,
                                            operand1=scale_var)

    # Scale the normalized tile using gamma and add beta
    output_sb = shift_scale_tensor * gamma_sb_bcast + beta_sb_bcast

    nl.store(output_tensor[i * nl.tile_size.pmax + i_p_io, i_f_io], value=output_sb,
             mask=(i * nl.tile_size.pmax + i_p_io < num_rows))

  return output_tensor
- Considering the free dimension size limit of nki.isa.bn_stats, which is 512 (nl.tile_size.bn_stats_fmax), the trip count of the bn_stats compute loop is math.ceil(input_tensor.shape[1]/nl.tile_size.bn_stats_fmax).
- We use nki.isa.bn_stats and nki.isa.bn_aggr to calculate the mean and variance (the sketch after this list illustrates the idea of merging per-chunk statistics).
- We use nki.isa.tensor_scalar to perform the shift (subtract the mean) and scale (multiply by the reciprocal square root of variance plus epsilon) in a single instruction.
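Conceptually, nki.isa.bn_stats produces partial statistics for each chunk of up to nl.tile_size.bn_stats_fmax elements, and nki.isa.bn_aggr merges them into one mean/variance per partition. The exact six-element output layout of bn_stats is a hardware detail that we do not reproduce here; the NumPy sketch below only illustrates the general idea of merging per-chunk counts, means, and variances (a standard parallel-variance merge, shown purely as an illustration):

import numpy as np

def merge_chunk_stats(chunks):
    """Merge per-chunk (count, mean, M2) statistics into a global mean/variance."""
    count, mean, m2 = 0, 0.0, 0.0              # M2 = sum of squared deviations from the mean
    for c in chunks:
        n_c = c.size
        mean_c = c.mean()
        m2_c = c.var() * n_c
        delta = mean_c - mean
        total = count + n_c
        mean = mean + delta * n_c / total      # running mean update
        m2 = m2 + m2_c + delta ** 2 * count * n_c / total
        count = total
    return mean, m2 / count

row = np.random.rand(4096)
chunks = np.split(row, 8)                      # e.g., 8 chunks of 512 elements each
mean, var = merge_chunk_stats(chunks)
assert np.allclose([mean, var], [row.mean(), row.var()])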
Performance of Version 1 and Version 2#
Let’s assume the data type for the kernel is float32 and that the SBUF partition is sufficiently large to hold the intermediate data simultaneously without significant spilling.
Define the variable N = input_tensor.shape[1].
- Compute mean and variance:
  - Version 1: The performance cost of the mean calculation is N Vector Engine cycles, and the variance calculation costs N Scalar Engine + 2N Vector Engine cycles.
  - Version 2: By replacing these calculations with the bn_stats and bn_aggr APIs, the cost is roughly reduced to N Vector Engine cycles (ignoring the cost of nki.isa.bn_aggr, assuming N is sufficiently large).
- Perform shift and scale in a single instruction:
  - Version 1: The shift/scale calculation requires two small instructions (nl.rsqrt(var + epsilon)) and two instructions that each iterate over N elements per partition (shift and scale, 2N Vector Engine cycles).
  - Version 2: By replacing these calculations with the tensor_scalar API, the cost is reduced to N Vector Engine cycles (a back-of-envelope tally of these modeled costs follows this list).
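Putting these modeled costs together for just the statistics and shift/scale portions (a back-of-envelope sketch under the cost model above; it deliberately ignores the DMA loads/stores and the gamma/beta affine step, which are identical in both versions, so the end-to-end gain is smaller than this ratio alone would suggest):

# Back-of-envelope tally of the modeled Vector Engine cycles per partition
N = 1000                         # free dimension size, matching the measurement below

v1_cycles = N + 2 * N + 2 * N    # mean (N) + variance (2N) + shift/scale (2N)
v2_cycles = N + N                # bn_stats/bn_aggr (~N) + fused tensor_scalar (N)

print(f"modeled cycles per partition: v1 = {v1_cycles}, v2 = {v2_cycles}")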
The latency measured on Trn1 using an input tensor of shape (300, 1000) shows a 14.9% improvement of version 2 over version 1:
>>>> Running version v1.
Latency results are:
NCLatency:
p0 = 2306us
p1 = 2306us
p10 = 2308us
p25 = 2309us
p50 = 2311us
p90 = 2313us
p99 = 2314us
p100 = 2314us
>>>> Running version v2.
Latency results are:
NCLatency:
p0 = 1963us
p1 = 1963us
p10 = 1965us
p25 = 1966us
p50 = 1969us
p90 = 1972us
p99 = 1974us
p100 = 1975us
Launching kernel and testing correctness#
Below is a reference PyTorch implementation of LayerNorm, which we use to verify the NKI kernel output:
import torch
from torch_xla.core import xla_model as xm
import argparse
import os

os.environ["NEURON_FRAMEWORK_DEBUG"] = "1"

# Reference torch implementation
def layernorm_layer(input_tensor, epsilon, gamma_vector, beta_vector):
  # Compute the mean and variance of the input tensor along the last dimension
  mean = input_tensor.mean(dim=-1, keepdim=True)
  variance = input_tensor.var(dim=-1, keepdim=True, unbiased=False)
  # Subtract the mean from the input and divide by the square root of the variance plus epsilon
  normalized_input = (input_tensor - mean) / torch.sqrt(variance + epsilon)
  # Apply the affine transformation
  normalized_input = normalized_input * gamma_vector + beta_vector
  return normalized_input

def parse_args():
  parser = argparse.ArgumentParser(
    """Run LayerNorm pytorch implementation.
    """)
  parser.add_argument("--nrows",
                      default=4*1024,
                      type=int,
                      help="""The number of input rows""")
  parser.add_argument("--ncols",
                      default=8*1024,
                      type=int,
                      help="""The number of input columns""")
  parser.add_argument("--version",
                      default="v1",
                      choices=["v1", "v2"],
                      help="Test versions")
  args = parser.parse_args()
  return args


from neuronxcc.nki.docs.examples.layernorm.layernorm_nki_kernel import nki_layernorm_kernel_v1, \
  nki_layernorm_kernel_v2

if __name__ == "__main__":
  args = parse_args()
  func_dict = {"v1": nki_layernorm_kernel_v1,
               "v2": nki_layernorm_kernel_v2,
               }

  device = xm.xla_device()
  num_rows = args.nrows
  num_cols = args.ncols

  # Generate toy example
  input_tensor = torch.rand((num_rows, num_cols), dtype=torch.float32)
  gamma_vector = torch.rand((num_cols), dtype=torch.float32)
  beta_vector = torch.rand((num_cols), dtype=torch.float32)
  epsilon = 1e-5

  # Compute torch layernorm layer in cpu
  output_torch = layernorm_layer(input_tensor, epsilon, gamma_vector, beta_vector)

  # Copy tensors to NeuronDevice
  input_tensor = input_tensor.to(device=device)
  gamma_vector = gamma_vector.to(device=device)
  beta_vector = beta_vector.to(device=device)

  print(f">>>> Running version {args.version}.")
  func = func_dict[args.version]

  # Compute NKI layernorm kernel in NeuronDevice
  xm.mark_step()
  output_nki = func(input_tensor, epsilon, gamma_vector, beta_vector)
  xm.mark_step()
  output_nki = output_nki.to(device='cpu')

  # Accuracy check : Compare the output tensors
  allclose = torch.allclose(output_torch, output_nki, atol=1e-3, rtol=1e-2)
  if allclose:
    print("NKI and Torch match")
  else:
    print("NKI and Torch differ")
Download All Source Code#
Click the links to download source code of the kernels and the testing code discussed in this tutorial.
PyTorch reference implementation:
layernorm_torch.py
Two versions of NKI kernels:
layernorm_nki_kernel.py
You can also view the source code in the GitHub repository nki_samples.
Example usage of the scripts#
Performance mode
Check the performance numbers for nki_layernorm_kernel_v1 and nki_layernorm_kernel_v2, and generate NEFF files for profiling:
python3 layernorm_nki_kernel.py --mode perfs
Accuracy mode
Check NKI kernel accuracy against PyTorch implementation:
python3 layernorm_torch.py --version v1
python3 layernorm_torch.py --version v2
Check the optimized LayerNorm kernel (nki_layernorm_kernel_v2) accuracy against nki_layernorm_kernel_v1:
python3 layernorm_nki_kernel.py --mode accuracy
Input tensor size
python3 layernorm_torch.py --nrows 4096 --ncols 8192
python3 layernorm_nki_kernel.py --nrows 4096 --ncols 8192
This document is relevant for: Inf2, Trn1, Trn2