This document is relevant for: Trn1, Trn2, Trn3

Conv1D Kernel API Reference

Implements 1D convolution on the tensor engine, using a replication strategy for efficient computation.

The kernel supports:

  • Arbitrary stride, padding, and dilation values

  • Optional bias addition

  • Activation function fusion

  • LNC sharding on the output channel dimension

Intended usage range:

  • Kernel size (K): 1 to 128

  • Sequence length (L): 1 to 4096

  • Input channels (C_in): 1 to 4096

  • Output channels (C_out): 1 to 4096

  • Batch size (B): Any positive integer

Background

The conv1d kernel applies 1D convolution filters across the input sequence dimension. It uses a replication strategy to efficiently utilize the tensor engine by stacking multiple filter positions along the partition dimension.
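The idea behind stacking filter positions can be illustrated in plain NumPy. This is only a sketch of the general technique, not the kernel's actual tiling: it assumes stride 1, no padding, no dilation, a single batch element, and the cross-correlation convention common in deep learning frameworks. The function names are hypothetical. Stacking K shifted copies of the input along the channel axis turns K small matrix multiplies (contraction size C_in) into one larger multiply (contraction size K * C_in).

```python
import numpy as np


def conv1d_by_replication(x, filters):
    """x: [C_in, L], filters: [K, C_in, C_out] -> [C_out, L_out]."""
    K, C_in, C_out = filters.shape
    L_out = x.shape[1] - K + 1
    # Stack K shifted views of the input along the channel axis: [K * C_in, L_out].
    stacked_x = np.concatenate([x[:, k:k + L_out] for k in range(K)], axis=0)
    # Flatten the filter taps in the same (k, c) order: [K * C_in, C_out].
    stacked_w = filters.reshape(K * C_in, C_out)
    # A single matmul now contracts over all taps and input channels at once.
    return stacked_w.T @ stacked_x


def conv1d_tap_by_tap(x, filters):
    """Same result computed with one small matmul per filter position."""
    K, C_in, C_out = filters.shape
    L_out = x.shape[1] - K + 1
    out = np.zeros((C_out, L_out), dtype=x.dtype)
    for k in range(K):
        out += filters[k].T @ x[:, k:k + L_out]
    return out
```

Both functions return the same values; the replicated form simply presents more rows to each matrix multiply, which is presumably why the kernel favors it when C_in is small relative to the partition dimension.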

API Reference

Source code for this kernel API can be found at: conv1d.py

conv1d

nkilib.experimental.conv.conv1d(x_in: nl.ndarray, filters: nl.ndarray, bias: Optional[nl.ndarray] = None, stride: int = 1, padding: tuple[int, int] = (0, 0), dilation: int = 1, activation_fn: Optional[ActFnType] = None, lnc_shard: bool = False) -> nl.ndarray

1D convolution on the tensor engine using a replication strategy.

Parameters:
  • x_in (nl.ndarray) – [B, C_in, L], Input tensor on HBM.

  • filters (nl.ndarray) – [K, C_in, C_out], Convolution filter weights on HBM.

  • bias (Optional[nl.ndarray]) – [C_out], Optional bias tensor on HBM. Default None.

  • stride (int) – Stride for convolution. Must be >= 1. Default 1.

  • padding (tuple[int, int]) – Tuple of (left_pad, right_pad). Must be non-negative. Default (0, 0).

  • dilation (int) – Dilation factor for dilated convolution. Must be >= 1. Default 1.

  • activation_fn (Optional[ActFnType]) – Optional activation function to fuse. Default None.

  • lnc_shard (bool) – If True, shard computation across LNC cores on C_out dimension. Default False.
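Sharding on the C_out dimension works because each output channel depends only on its own slice of the filters and bias. The NumPy sketch below illustrates that independence; it is not the kernel's implementation. The helper names and the `num_shards=2` default are assumptions made for illustration, and the convolution helper is restricted to stride 1 with no padding or dilation.

```python
import numpy as np


def conv1d_valid(x, filters):
    """x: [B, C_in, L], filters: [K, C_in, C_out] -> [B, C_out, L_out]."""
    K, C_in, C_out = filters.shape
    L_out = x.shape[2] - K + 1
    out = np.zeros((x.shape[0], C_out, L_out), dtype=x.dtype)
    for k in range(K):
        out += np.einsum("bcl,co->bol", x[:, :, k:k + L_out], filters[k])
    return out


def conv1d_sharded_on_c_out(x, filters, bias=None, num_shards=2):
    """Compute each C_out slice independently, then concatenate on axis 1."""
    C_out = filters.shape[2]
    pieces = []
    for idx in np.array_split(np.arange(C_out), num_shards):
        # Each shard sees the full input but only its own output channels.
        part = conv1d_valid(x, filters[:, :, idx])
        if bias is not None:
            part = part + bias[idx][None, :, None]
        pieces.append(part)
    return np.concatenate(pieces, axis=1)
```

Because every shard reads the whole input, this split duplicates input traffic but needs no cross-shard reduction.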

Returns:

[B, C_out, L_out], Output tensor on HBM where L_out = (L + pad_left + pad_right - dilation * (K - 1) - 1) // stride + 1

Return type:

nl.ndarray
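For checking results on the host, a NumPy reference with the documented tensor layouts can be handy. The sketch below is an assumption-laden stand-in, not the kernel: `conv1d_reference` is a hypothetical name, it uses the cross-correlation convention, zero padding, and omits activation fusion and sharding. It assumes the arguments yield L_out >= 1.

```python
import numpy as np


def conv1d_reference(x_in, filters, bias=None, stride=1, padding=(0, 0), dilation=1):
    """x_in: [B, C_in, L], filters: [K, C_in, C_out] -> [B, C_out, L_out]."""
    B, C_in, L = x_in.shape
    K, C_in_f, C_out = filters.shape
    assert C_in == C_in_f, "input channels must match filter channels"
    pad_left, pad_right = padding
    # Zero-pad only the sequence axis.
    x_pad = np.pad(x_in, ((0, 0), (0, 0), (pad_left, pad_right)))
    L_out = (L + pad_left + pad_right - dilation * (K - 1) - 1) // stride + 1
    out = np.zeros((B, C_out, L_out), dtype=x_in.dtype)
    for k in range(K):
        start = k * dilation
        stop = start + (L_out - 1) * stride + 1
        taps = x_pad[:, :, start:stop:stride]  # [B, C_in, L_out]
        out += np.einsum("bcl,co->bol", taps, filters[k])
    if bias is not None:
        out += bias[None, :, None]
    return out
```

As a small usage example, an input of `[1, 2, 3, 4, 5]` with a two-tap all-ones filter, `stride=2`, and `padding=(1, 0)` produces three outputs, one for each window `[0, 1]`, `[2, 3]`, `[4, 5]` of the padded sequence.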

Notes:

  • All input tensors (x_in, filters, bias) must have the same dtype

  • Input channels C_in must match filter channels

  • Uses replication strategy to stack K filter positions along partition dimension

  • Partition alignment rules limit K replication factor based on C_in tile size

  • Memory management uses SbufManager with multi-buffering for efficiency
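The dtype and shape requirements above, together with the parameter constraints, can be checked on the host before launching the kernel. The helper below is a hypothetical sketch that uses NumPy arrays as stand-ins for the HBM tensors; the name `check_conv1d_args` and the error messages are not part of the library.

```python
import numpy as np


def check_conv1d_args(x_in, filters, bias=None, stride=1, padding=(0, 0), dilation=1):
    """Raise ValueError if the arguments violate the documented constraints."""
    if x_in.ndim != 3 or filters.ndim != 3:
        raise ValueError("x_in must be [B, C_in, L] and filters [K, C_in, C_out]")
    _, c_in, _ = x_in.shape
    _, c_in_f, c_out = filters.shape
    if c_in != c_in_f:
        raise ValueError(f"C_in mismatch: input has {c_in}, filters have {c_in_f}")
    if filters.dtype != x_in.dtype:
        raise ValueError("x_in and filters must have the same dtype")
    if bias is not None:
        if bias.shape != (c_out,):
            raise ValueError(f"bias must have shape ({c_out},)")
        if bias.dtype != x_in.dtype:
            raise ValueError("bias must have the same dtype as x_in")
    if stride < 1:
        raise ValueError("stride must be >= 1")
    if len(padding) != 2 or min(padding) < 0:
        raise ValueError("padding must be (left_pad, right_pad), both non-negative")
    if dilation < 1:
        raise ValueError("dilation must be >= 1")
```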

Dimensions:

  • B: Batch size

  • C_in: Number of input channels

  • C_out: Number of output channels

  • L: Input sequence length

  • L_out: Output sequence length = (L + pad_left + pad_right - dilation * (K - 1) - 1) // stride + 1
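The L_out formula is easy to evaluate ahead of time when sizing output buffers. The helper below is a direct transcription of the formula; the function name is hypothetical.

```python
def conv1d_output_length(L, K, stride=1, padding=(0, 0), dilation=1):
    """Return L_out for the given input length and convolution parameters."""
    pad_left, pad_right = padding
    # A dilated filter spans dilation * (K - 1) + 1 input positions.
    effective_kernel = dilation * (K - 1) + 1
    return (L + pad_left + pad_right - effective_kernel) // stride + 1


print(conv1d_output_length(4096, 3, padding=(1, 1)))            # 4096
print(conv1d_output_length(4096, 3, stride=2, padding=(1, 1)))  # 2048
print(conv1d_output_length(10, 3, dilation=2))                  # 6
```

With K = 3 and padding (1, 1) at stride 1, the output keeps the input length; raising the stride to 2 halves it.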
