This document is relevant for: Trn2, Trn3

Conv1D Kernel API Reference#

Implements 1D convolution on the tensor engine, using a replication strategy for efficient computation.

The kernel supports:

Arbitrary stride, padding, and dilation values
Optional bias addition
Activation function fusion
LNC sharding on the output channel dimension

Intended usage range:

Kernel size (K): 1 to 128
Sequence length (L): 1 to 4096
Input channels (C_in): 1 to 4096
Output channels (C_out): 1 to 4096
Batch size (B): Any positive integer

Background#

The conv1d kernel applies 1D convolution filters across the input sequence dimension. To keep the tensor engine well utilized, it uses a replication strategy that stacks multiple filter positions along the partition dimension.
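The idea behind stacking filter positions can be illustrated on the host with NumPy. The sketch below is not the kernel's implementation: the function names are hypothetical, it handles a single batch element with stride 1, no padding, and no dilation, and it assumes the cross-correlation convention (no filter flip), which this page does not state. It only shows that stacking the K shifted input views along the contraction axis turns K small matrix multiplications into one larger one.

```python
import numpy as np

def conv1d_direct(x, w):
    """Direct form: one [C_in x C_out] matmul per filter position, accumulated.

    x: [C_in, L], w: [K, C_in, C_out]; stride 1, no padding, no dilation.
    Assumes cross-correlation (no filter flip).
    """
    K, C_in, C_out = w.shape
    L_out = x.shape[1] - K + 1
    out = np.zeros((C_out, L_out), dtype=x.dtype)
    for k in range(K):
        out += w[k].T @ x[:, k:k + L_out]
    return out

def conv1d_replicated(x, w):
    """Replicated form: stack the K shifted views of x along the channel
    (contraction) axis so that a single matmul covers all filter positions."""
    K, C_in, C_out = w.shape
    L_out = x.shape[1] - K + 1
    # [K * C_in, L_out]: block k holds x shifted left by k positions.
    x_rep = np.concatenate([x[:, k:k + L_out] for k in range(K)], axis=0)
    # [K * C_in, C_out]: filter positions flattened in the same block order.
    w_rep = w.reshape(K * C_in, C_out)
    return w_rep.T @ x_rep

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)).astype(np.float32)   # C_in=4, L=16
w = rng.standard_normal((3, 4, 8)).astype(np.float32)  # K=3, C_out=8

# Both forms compute the same result up to floating-point rounding.
assert np.allclose(conv1d_direct(x, w), conv1d_replicated(x, w), atol=1e-4)
```

In this sketch the contraction axis grows from C_in to K * C_in, which is why a small C_in benefits most: more of the contraction dimension is filled per matmul.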

API Reference#

Source code for this kernel API can be found at: conv1d.py

conv1d#

nkilib.experimental.conv.conv1d(x_in: nl.ndarray, filters: nl.ndarray, bias: Optional[nl.ndarray] = None, stride: int = 1, padding: tuple[int, int] = (0, 0), dilation: int = 1, activation_fn: Optional[ActFnType] = None, lnc_shard: bool = False) → nl.ndarray

1D convolution using the tensor engine with a replication strategy.

Parameters:

x_in (nl.ndarray) – [B, C_in, L], Input tensor on HBM.
filters (nl.ndarray) – [K, C_in, C_out], Convolution filter weights on HBM.
bias (Optional[nl.ndarray]) – [C_out], Optional bias tensor on HBM. Default None.
stride (int) – Stride for convolution. Must be >= 1. Default 1.
padding (tuple[int, int]) – Tuple of (pad_left, pad_right). Both values must be non-negative. Default (0, 0).
dilation (int) – Dilation factor for dilated convolution. Must be >= 1. Default 1.
activation_fn (Optional[ActFnType]) – Optional activation function to fuse. Default None.
lnc_shard (bool) – If True, shard computation across LNC cores along the C_out dimension. Default False.

Returns:

[B, C_out, L_out], Output tensor on HBM, where L_out = (L + pad_left + pad_right - dilation * (K - 1) - 1) // stride + 1

Return type:

nl.ndarray

Notes:

All input tensors (x_in, filters, and bias, if given) must have the same dtype
The C_in dimension of x_in must match the C_in dimension of filters
Uses a replication strategy to stack K filter positions along the partition dimension
Partition alignment rules limit how many of the K filter positions can be replicated, depending on the C_in tile size
Memory management uses SbufManager with multi-buffering for efficiency
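When validating kernel output, a host-side reference can be useful. The following NumPy sketch mirrors the documented shapes and the dtype and channel checks listed above. It is an assumption-laden illustration rather than part of nkilib: the name conv1d_reference is hypothetical, it assumes zero padding and the cross-correlation convention (neither is stated on this page), and it omits activation_fn and lnc_shard.

```python
import numpy as np

def conv1d_reference(x_in, filters, bias=None, stride=1, padding=(0, 0), dilation=1):
    """Host-side reference for x_in [B, C_in, L] and filters [K, C_in, C_out].

    Hypothetical helper; assumes zero padding and no filter flip.
    """
    B, C_in, L = x_in.shape
    K, C_in_f, C_out = filters.shape
    if C_in != C_in_f:
        raise ValueError("C_in of x_in must match C_in of filters")
    if x_in.dtype != filters.dtype or (bias is not None and bias.dtype != x_in.dtype):
        raise TypeError("x_in, filters, and bias must share one dtype")
    pad_left, pad_right = padding
    if stride < 1 or dilation < 1 or pad_left < 0 or pad_right < 0:
        raise ValueError("stride and dilation must be >= 1; padding must be >= 0")

    L_out = (L + pad_left + pad_right - dilation * (K - 1) - 1) // stride + 1
    if L_out < 1:
        raise ValueError("filter does not fit in the padded input")

    # Zero-pad the sequence axis only.
    x_pad = np.pad(x_in, ((0, 0), (0, 0), (pad_left, pad_right)))
    # Accumulate in float64 on the host, then cast back to the input dtype.
    out = np.zeros((B, C_out, L_out), dtype=np.float64)
    for k in range(K):
        start = k * dilation
        stop = start + (L_out - 1) * stride + 1
        window = x_pad[:, :, start:stop:stride]          # [B, C_in, L_out]
        out += np.einsum("bcl,co->bol", window, filters[k])
    if bias is not None:
        out += bias[None, :, None]
    return out.astype(x_in.dtype)

x = np.ones((1, 2, 5), dtype=np.float32)
w = np.ones((3, 2, 4), dtype=np.float32)
y = conv1d_reference(x, w, stride=2, padding=(1, 1))
print(y.shape)  # (1, 4, 3)
```

Comparing against such a reference should use a tolerance suited to the kernel's dtype, since accumulation order and precision on the device may differ from the host.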

Dimensions:

B: Batch size
C_in: Number of input channels
C_out: Number of output channels
K: Kernel size (number of filter positions)
L: Input sequence length
L_out: Output sequence length = (L + pad_left + pad_right - dilation * (K - 1) - 1) // stride + 1
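
The L_out formula can be evaluated ahead of time to size output buffers. The helper below is a small sketch in plain Python; the function name is hypothetical and not part of nkilib.

```python
def conv1d_output_length(L, K, stride=1, padding=(0, 0), dilation=1):
    """Return L_out for the documented formula (hypothetical helper)."""
    pad_left, pad_right = padding
    return (L + pad_left + pad_right - dilation * (K - 1) - 1) // stride + 1

# "Same" length: K=3 with one element of padding on each side.
print(conv1d_output_length(4096, 3, padding=(1, 1)))            # 4096
# Stride 2 halves the output length.
print(conv1d_output_length(4096, 3, stride=2, padding=(1, 1)))  # 2048
# Dilation 2 widens a K=5 filter to span 9 input positions.
print(conv1d_output_length(128, 5, dilation=2))                 # 120
```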
