This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3

Work with NEFF Files#

NEFF Architecture#

Overview#

A NEFF (Neuron Executable File Format) is a Neuron Runtime executable file, generated by the Neuron compiler, that describes a compute graph (typically a neural network model). Although each NEFF is a single file, internally it is a tarball (preceded by a header) containing all the metadata needed to run the described compute graph.

Packaging#

A NEFF is a file consisting of a header prepended to a tarball. Unpacking a NEFF and examining its contents is as simple as stripping the header from the file and untarring the remaining buffer. The Neuron devtools suite includes a neuron-packager tool that can unpack a NEFF:

neuron-packager unpack file.neff
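The header-stripping step described above can also be sketched in a few lines of Python. This is an illustrative sketch, not the neuron-packager implementation. It assumes the header fields are little-endian, that the header_size field sits at byte offset 8 (directly after the 8-byte pkg_version field in the struct shown under NEFF Header), and that the tarball may or may not be compressed. The function names split_neff and list_neff_members are hypothetical.

```python
import io
import struct
import tarfile

# header_size follows the 8-byte pkg_version field in neff_header_t.
HEADER_SIZE_OFFSET = 8


def split_neff(blob: bytes) -> tuple[bytes, bytes]:
    """Split a NEFF image into (header, tarball) using its header_size field."""
    # Assumption: fields are stored little-endian.
    (header_size,) = struct.unpack_from("<Q", blob, HEADER_SIZE_OFFSET)
    return blob[:header_size], blob[header_size:]


def list_neff_members(blob: bytes) -> list[str]:
    """Return the member names of the tarball that follows the header."""
    _, tarball = split_neff(blob)
    # Mode "r:*" lets tarfile detect whether the archive is compressed.
    with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:*") as tar:
        return tar.getnames()
```

Listing member names rather than extracting them keeps the sketch free of filesystem side effects. To inspect a real file, pass the bytes read from it, for example `list_neff_members(open("file.neff", "rb").read())`.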

NEFF Header#

The NEFF header is a 1024-byte buffer prepended to the NEFF tarball:

typedef struct neff_header {
    uint64_t pkg_version;
    uint64_t header_size;
    uint64_t data_size;
    uint64_t neff_version_major;
    uint64_t neff_version_minor;
    uint8_t neff_build_version[128];
    uint32_t num_tpb;
    uint8_t hash[32];
    uint8_t uuid[16];
    char name[256];
    uint32_t requested_tpb_count;
    uint8_t tpb_per_node[64];
    uint64_t feature_bits;
    uint32_t lnc_size;
    uint8_t pad[468];
    uint8_t data[];
} neff_header_t;

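Its fields are described below. The symbolic array sizes used in the descriptions (NEFF_HEADER_HASH_SZ, NEFF_HEADER_UUID_SZ, NEFF_HEADER_NAME_SZ, MAX_NODES) correspond to the literal sizes 32, 16, 256, and 64 in the struct above: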

uint64_t pkg_version
Tool version used to create this NEFF
uint64_t header_size
Number of bytes contained in this header
uint64_t data_size
Size in bytes of the NEFF contents
uint64_t neff_version_major
NEFF major version
uint64_t neff_version_minor
NEFF minor version
uint8_t neff_build_version[128]
Build version information
uint32_t num_tpb
Total number of TPBs required for efficient execution (each subgraph, or SG, gets its own TPB)
uint8_t hash[NEFF_HEADER_HASH_SZ]
Hash of the package: SHA-256 or MD5, depending on the pkg_version
uint8_t uuid[NEFF_HEADER_UUID_SZ]
Unique identifier for the NEFF
char name[NEFF_HEADER_NAME_SZ]
Name of the NEFF
uint32_t requested_tpb_count
How many TPBs were requested during compilation
uint8_t tpb_per_node[MAX_NODES]
Number of required TPBs per kelf node in the graph, 1 byte per node
uint64_t feature_bits
Bits representing individual incompatible NEFF features for fine-grained compatibility checking
uint32_t lnc_size
Logical core size required to run this NEFF
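As a cross-check of the layout above, the struct can be mirrored with Python's struct module. This is a sketch under two assumptions that the text does not state: the fields are little-endian, and the compiler inserts no padding between them. The second assumption is consistent with the declared field sizes, which sum to exactly 1024 bytes and leave every field naturally aligned. The names NEFF_HEADER, FIELDS, and parse_neff_header are hypothetical.

```python
import struct

# Mirrors neff_header_t field by field; "468x" skips the pad bytes.
# Assumption: little-endian, no compiler-inserted padding.
NEFF_HEADER = struct.Struct("<5Q128sI32s16s256sI64sQI468x")

FIELDS = (
    "pkg_version", "header_size", "data_size",
    "neff_version_major", "neff_version_minor",
    "neff_build_version", "num_tpb", "hash", "uuid", "name",
    "requested_tpb_count", "tpb_per_node", "feature_bits", "lnc_size",
)


def parse_neff_header(buf: bytes) -> dict:
    """Decode the leading bytes of a NEFF into a field-name -> value dict."""
    header = dict(zip(FIELDS, NEFF_HEADER.unpack_from(buf)))
    # Assumption: name is a NUL-terminated string; keep the text before it.
    header["name"] = header["name"].split(b"\0", 1)[0].decode("utf-8", "replace")
    return header
```

`NEFF_HEADER.size` evaluates to 1024, matching the documented header size. The byte-array fields (neff_build_version, hash, uuid, tpb_per_node) are returned as raw bytes, since their encoding is not described here.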

Tarball#

The NEFF tarball, when unpacked, consists of top-level JSON files describing the graph as a whole, plus one directory for each partitioned subgraph.

Components#

Subgraphs (sg00 … sgN)#

A subgraph is a directory in the unpacked NEFF containing the files that describe the computation and resources needed to run that subgraph. When the NEFF is loaded, each subgraph declared in the NEFF is loaded onto its own TPB. In the past, on INF1, a NEFF could contain multiple subgraphs, with data passed between them to improve model throughput. This feature was called serial TPB. It is not needed, and does not exist, on architectures after INF1. Today, multiple subgraphs in a NEFF are tied to the logical core feature.
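To see which files belong to which subgraph, the member paths of an unpacked NEFF can be grouped by their subgraph directory. The sketch below assumes subgraph directories are named sg followed by digits, as the heading suggests. The function name group_by_subgraph and the file name top_level.json in the usage note are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: subgraph directories are named "sg" followed by digits.
SUBGRAPH_PATH = re.compile(r"^(?:\./)?(sg\d+)/(.+)$")


def group_by_subgraph(member_names: list[str]) -> dict[str, list[str]]:
    """Map each subgraph directory to the relative paths found inside it."""
    groups = defaultdict(list)
    for name in member_names:
        match = SUBGRAPH_PATH.match(name)
        if match:  # top-level files do not match and are skipped
            groups[match.group(1)].append(match.group(2))
    return dict(groups)
```

For example, `group_by_subgraph(["top_level.json", "sg00/def.json", "sg01/def.json"])` returns one entry for sg00 and one for sg01, each listing def.json.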

def.json#

The def.json file is the starting file for any subgraph. At its top level, it will point the runtime to the engine JSONs and the engine binaries as well as declare queue sets and variables used by the subgraph to move and hold data.

Queue Sets#

Queue sets declared in def.json will be mapped to physical HW queues by the runtime during model load. These queue sets will be used to move data during NEFF execution and are declared in the dma_queue object. Each queue set is a JSON object that can contain the following fields:

type (required)
- Type: string
- Description: What this queue set will be used for
- Valid values: in, out, data, embedding_update, dynamic
- Supported architectures: all
num_queues (optional)
- Type: int
- Description: Number of HW queues to reserve for this queue set. More HW queues allow the NEFF program to use multiple DMA engines to transfer data
- Restrictions: On INF1, this field must be 1. On non-INF1 platforms it must be 16 or less
- Default: 1
- Supported architectures: all
owner (optional)
- Type: string
- Description: Engine that owns this queue set. When a queue set is assigned to an engine, the owning engine performs the DMA triggers for the queue set
- Supported architectures: all
pinned (optional)
- Type: bool
- Description: The queue is used to move data to the TPB’s state buffer (SB) during model start. Once the data has been moved to the SB, the NEFF never writes to those buffers, “pinning” the data in the SB
- Default: false
- Supported architectures: INF1
queue_instances (optional)
- Type: [string]
- Description: Set of DMA rings that can be swapped in/out during model execution
- Supported architectures: all architectures except for INF1
semaphore_set (optional)
- Type: [int]
- Description: A list of semaphores used by the queue set to signal data transfer completion
- Supported architectures: all
semaphore (optional)
- Type: int
- Description: A single semaphore used by a single queue to signal data transfer completion
- Supported architectures: all
fabric_path (optional)
- Type: string
- Description: Which pathway the DMA queue should take to move data
- Valid values: main, alt
- Default: “main”
- Supported architectures: all
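The restrictions listed above can be expressed as a small checker. This is a sketch for illustration, not the runtime's validation logic. It assumes each queue set is available as a Python dict (for example, one value parsed from the dma_queue object) and that a queue set needs at least one queue. The name check_queue_set and the is_inf1 flag are hypothetical.

```python
VALID_QUEUE_TYPES = {"in", "out", "data", "embedding_update", "dynamic"}
VALID_FABRIC_PATHS = {"main", "alt"}
MAX_QUEUES = 16  # documented limit on non-INF1 platforms


def check_queue_set(name: str, queue_set: dict, is_inf1: bool = False) -> list[str]:
    """Return a list of problems found in one queue set (empty if none)."""
    problems = []
    if queue_set.get("type") not in VALID_QUEUE_TYPES:
        problems.append(f"{name}: missing or invalid type")
    num_queues = queue_set.get("num_queues", 1)  # documented default
    limit = 1 if is_inf1 else MAX_QUEUES
    # Assumption: at least one queue is always required.
    if not 1 <= num_queues <= limit:
        problems.append(f"{name}: num_queues must be between 1 and {limit}")
    if queue_set.get("fabric_path", "main") not in VALID_FABRIC_PATHS:
        problems.append(f"{name}: fabric_path must be main or alt")
    if is_inf1 and "queue_instances" in queue_set:
        problems.append(f"{name}: queue_instances is not supported on INF1")
    if not is_inf1 and queue_set.get("pinned", False):
        problems.append(f"{name}: pinned is only supported on INF1")
    return problems
```

A queue set such as `{"type": "data", "num_queues": 4}` passes on non-INF1 platforms but is flagged when `is_inf1=True`, because INF1 requires num_queues to be 1.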

Variables#

Variables are buffers allocated on device that can be referenced by the NEFF to read data from and write data to during execution. Variables are declared in the var object in def.json and each variable is a JSON object that can contain the following fields:

type (required)
- Type: string
- Description: What type of data this variable contains
- Valid values: state-buffer; input; output; file (HBM); tmp-buf (HBM, private per-NEFF scratchpad allocation); virtual (HBM, shared scratchpad variables); pointer (HBM); dge-table
- Supported architectures: all
var_id (required)
- Type: int
- Description: Unique ID to reference this variable with
- Restrictions: Must be unique to this variable
- Supported architectures: all
size (required)
- Type: int
- Description: Size in bytes of the variable
- Supported architectures: all
alignment (optional)
- Type: int
- Description: Physical address alignment for this variable
- Restrictions: Must be a power of two
- Default: 0
- Supported architectures: all
fabric_path (optional)
- Type: string
- Description: Fabric path to place this variable on
- Default: “main”
- Supported architectures: all
file_name (optional)
- Type: string
- Description: File to load variable data from. Can point to .npy files or raw binary data (any file without a .npy extension)
- Restrictions: Only used with variable type file
- Supported architectures: all
backing_variable_off (optional)
- Type: int
- Description: The offset inside the shared scratchpad space allocated by Runtime
- Restrictions: Only used with variable type virtual
- Supported architectures: all
referenced_var_id (optional)
- Type: int
- Description: var_id of the variable whose address will be placed in this pointer variable
- Restrictions: Only used with variable type pointer
- Supported architectures: all
list (optional)
- Type: [int]
- Description: List of var_ids to populate the table with
- Restrictions: Used with variable type dge-table
- Supported architectures: all
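The field rules above lend themselves to a simple consistency check. The sketch below is illustrative only. It assumes the var object has been parsed into a dict mapping variable names to variable objects, and that an alignment of 0 (the documented default) means no alignment requirement. The names check_variables and is_power_of_two are hypothetical.

```python
REQUIRED_FIELDS = ("type", "var_id", "size")

# Optional fields that the text ties to one specific variable type.
TYPE_SPECIFIC_FIELDS = {
    "file_name": "file",
    "backing_variable_off": "virtual",
    "referenced_var_id": "pointer",
    "list": "dge-table",
}


def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


def check_variables(variables: dict) -> list[str]:
    """Return a list of problems found across all declared variables."""
    problems = []
    seen_ids = {}
    for name, var in variables.items():
        for field in REQUIRED_FIELDS:
            if field not in var:
                problems.append(f"{name}: missing required field {field}")
        var_id = var.get("var_id")
        if var_id is not None:
            if var_id in seen_ids:
                problems.append(f"{name}: var_id {var_id} also used by {seen_ids[var_id]}")
            else:
                seen_ids[var_id] = name
        alignment = var.get("alignment", 0)
        # Assumption: 0 means "no alignment requirement".
        if alignment != 0 and not is_power_of_two(alignment):
            problems.append(f"{name}: alignment must be a power of two")
        for field, required_type in TYPE_SPECIFIC_FIELDS.items():
            if field in var and var.get("type") != required_type:
                problems.append(f"{name}: {field} is only used with type {required_type}")
    return problems
```

Collecting every problem in a list, instead of raising on the first one, makes it easier to review a whole def.json in one pass.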

{ENGINE}.json#

There is one engine JSON file for each of the TPB’s engines. Each file describes the DMA descriptors triggered by that engine to move data during execution, along with some extra engine-specific metadata.

DMA Descriptors#

Each engine JSON file contains a list of JSON objects describing the DMA data movements triggered by the engine. This list is stored under the dma key. Each object in the list has the following fields:

id (required)
- Type: int
- Description: Identifier to map this descriptor to a trigger in the engine binary. Other descriptors with the same ID in the same function call must have the same trigger amounts
- Supported architectures: all
queue (required if instance_name is empty)
- Type: string
- Description: Name of the queue set this descriptor will be placed on
- Restrictions: instance_name field takes precedence over this field
- Supported architectures: all
instance_name (required if queue field is empty)
- Type: string
- Description: Name of the queue set instance this descriptor will be placed on
- Restrictions: Takes precedence over queue field
- Supported architectures: everything but INF1
function_start (optional)
- Type: string
- Description: Names the function that will trigger this descriptor and all other descriptors after it belonging to the same queue set until the next function_start for the queue set is hit. Used in the call graph flow feature of the compiler
- Restrictions: Must name a valid function declared in the engine binary
- Default: “”
- Supported architectures: everything but INF1
section_start_desc (optional)
- Type: bool
- Description: If this field is true, the runtime will place this descriptor on the first queue in the queue set
- Default: false
- Supported architectures: everything but INF1
event (optional)
- Type: int
- Description: Event to set after this descriptor has been executed
- Supported architectures: all
semaphore (optional)
- Type: int
- Description: Semaphore to increment after this descriptor has been executed. If the queue set has multiple queues, the semaphore is incremented by num_queues when the transfer is complete
- Supported architectures: all
remote_semaphores (optional)
- Type: [int]
- Description: Semaphore(s) of the other NeuronCore/TPB to increment when the LNC size is 2
- Supported architectures: Trn2 and above
desc (required)
Contains the following sub-fields:
- op (optional)
  
  Type: string
  
  Description: Op for the DMA engine to perform for this transfer
  
  Valid values: fma, cast, add, min, max, transpose, copy
  
  Default: “copy”
  
  Supported architectures: everything but INF1
- from/to (to is always required; from is required for non-CCE descriptors)
  
  Type: string
  
  Description: Which variable to read from/write to
  
  Restrictions: Must be a variable declared in def.json
  
  Supported architectures: all
- from_off/to_off (required)
  
  Type: int
  
  Description: Offset in variable to read/write
  
  Supported architectures: all
- from_steps/to_steps (required)
  
  Type: Array[int]
  
  Description: Access pattern steps for variable. All elements are in denominations of bytes. The first element - corresponding to the innermost/fastest-growing dimension - is usually 1 to indicate that successive bytes must be copied
  
  Restrictions: Max array size of 4; array-length must match {from, to}_sizes
  
  Supported architectures: all
- from_sizes/to_sizes (required)
  
  Type: Array[int]
  
  Description: Access pattern sizes for variable. The first element - corresponding to the innermost/fastest-growing dimension - is in denomination of bytes. All other elements are counts of number of elements in those dimensions
  
  Restrictions: Max array-size of 4; array length must match {from, to}_steps
  
  Supported architectures: all
- from_dtype/to_dtype (optional)
  
  Type: string
  
  Description: Dtype of variable
  
  Valid values: float8e3, float8e4, float8e5, float16, float32, float32r, bfloat16, uint8, uint16, uint32, uint64, int8, int16, int32, int64
  
  Default: “uint8”
  
  Supported architectures: everything but INF1
- num_tiling_dimensions (optional)
  
  Type: int
  
  Description: Number of dimensions used to tile the DMA descriptor; number of dimensions used for a single tile
  
  Supported architectures: all
- from_arr (required for CCE descriptors, replaces from* fields)
  
  Type: [objects]
  
  Description: List of source tensors to perform the CCE op on; fields are the “from” parts of a DMA descriptor
  
  Restrictions: Cannot have more than 16 source tensors (length of from_arr <= 16)
  
  Supported architectures: everything but INF1
FMA only fields:
- scale_dtype (required with fma op)
  
  Type: string
  
  Description: Data type of the scale constant
  
  Valid values: float32
  
  Default: “float32”
- scale (optional)
  
  Type: double
  
  Description: Scale for data being moved
  
  Restrictions: Only valid on “fma” type descriptors
  
  Default: 1.0
  
  Supported architectures: everything but INF1
Min/Max only fields:
- constant_dtype (optional)
  
  Type: string
  
  Description: Data type of the “min”/“max” constant
  
  Valid values: float32, int32, uint32
- constant (required if constant_dtype specified)
  
  Type: double, int, or uint
  
  Description: Constant to start the “min”/“max” operation with
  
  Restrictions: Only valid on “min”/“max” type descriptors; will be ignored if constant_dtype is not specified
  
  Supported architectures: everything but INF1
Transpose only fields:
- transpose_shape (optional)
  
  Type: [int]
  
  Description: Shape to transpose the data to
  
  Restrictions: Number of elements must be XPOSE_NUM_DIMS (4)
  
  Supported architectures: everything but INF1
- transpose_element_size (optional)
  
  Type: int
  
  Description: Size of a single element of the transpose
  
  Supported architectures: everything but INF1
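The interaction between the offset, steps, and sizes fields of a descriptor is easiest to see by enumerating the byte offsets one side of a descriptor touches. The sketch below reflects one reading of the field descriptions, not a confirmed specification: every step is in bytes, the first size counts bytes in the innermost dimension, and each remaining size counts iterations of its dimension. The function names are hypothetical.

```python
from itertools import product
from math import prod


def access_pattern_offsets(off: int, steps: list[int], sizes: list[int]):
    """Yield the byte offsets touched by one side of a DMA descriptor."""
    if len(steps) != len(sizes) or not 1 <= len(steps) <= 4:
        raise ValueError("steps and sizes must have equal length, at most 4")
    outer_first_sizes = list(reversed(sizes))
    outer_first_steps = list(reversed(steps))
    # Iterate outermost dimension first so the innermost varies fastest.
    for index in product(*(range(n) for n in outer_first_sizes)):
        yield off + sum(i * s for i, s in zip(index, outer_first_steps))


def access_pattern_num_bytes(sizes: list[int]) -> int:
    """Total bytes covered: innermost byte count times the outer counts."""
    return prod(sizes)
```

With `off=0`, `steps=[1, 16]`, and `sizes=[4, 2]`, the pattern reads 4 contiguous bytes, advances 16 bytes from the start of the row, and reads 4 more, giving the offsets 0, 1, 2, 3, 16, 17, 18, 19, which is 8 bytes in total.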

Activation.json#

In addition to DMA descriptors, the Activation.json file also contains metadata on the PWP tables used by the NEFF. The field activation_function_sets lists the activation function sets used during execution of the NEFF. Each “activation function set” will point to activation table metadata contained in the subgraph’s directory.

Example:

"activation_function_sets": [
    "reciprocal_sqrt_and_small",
    "natural_log_exp_and_others",
    "reciprocal_and_small",
    "gelu_and_others"
]
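Reading this field needs only the standard json module. The sketch below assumes the fragment above appears as a key of the top-level object in Activation.json, and that a file without the key should be treated as using no activation function sets. The function name is hypothetical.

```python
import json


def activation_function_sets(activation_json_text: str) -> list[str]:
    """Return the activation function set names listed in Activation.json."""
    doc = json.loads(activation_json_text)
    # Assumption: a missing key means no activation function sets are used.
    return list(doc.get("activation_function_sets", []))
```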

DVE.json#

DVE.json contains information about the DVE tables used by the NEFF. In DVE.json, the DVE tables are stored under the dve_tables key. More information on how this works can be found in the Loadable DVE doc.