This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3
Work with NEFF Files#
NEFF Architecture#
Overview#
A NEFF (Neuron Executable File Format) file is a Neuron Runtime executable generated by the Neuron compiler that describes a compute graph (typically a neural network model). While each NEFF is always a single file, it is essentially a tarball of all the metadata needed to run the described compute graph.
Packaging#
At its core, the NEFF is just a file with a header prepended onto a tarball. Unpacking the NEFF and examining its contents is as straightforward as stripping the header from the file and untarring the header-stripped buffer. The Neuron devtools suite includes a neuron-packager tool that can be used to unpack a NEFF:
neuron-packager unpack file.neff
NEFF Header#
The NEFF header is a 1024-byte buffer prepended onto the NEFF tarball:
typedef struct neff_header {
    uint64_t pkg_version;
    uint64_t header_size;
    uint64_t data_size;
    uint64_t neff_version_major;
    uint64_t neff_version_minor;
    uint8_t  neff_build_version[128];
    uint32_t num_tpb;
    uint8_t  hash[32];
    uint8_t  uuid[16];
    char     name[256];
    uint32_t requested_tpb_count;
    uint8_t  tpb_per_node[64];
    uint64_t feature_bits;
    uint32_t lnc_size;
    uint8_t  pad[468];
    uint8_t  data[];
} neff_header_t;
Its contents are described below:
uint64_t pkg_version: Tool version used to create this NEFF
uint64_t header_size: Number of bytes contained in this header
uint64_t data_size: Size in bytes of the NEFF contents
uint64_t neff_version_major: NEFF major version
uint64_t neff_version_minor: NEFF minor version
uint8_t neff_build_version[128]: Build version information
uint32_t num_tpb: Total number of TPBs required for efficient execution (all SGs get their own TPB)
uint8_t hash[NEFF_HEADER_HASH_SZ]: Hash of the package, sha256 or md5 depending on the pkg_version
uint8_t uuid[NEFF_HEADER_UUID_SZ]: Unique identifier for the NEFF
char name[NEFF_HEADER_NAME_SZ]: Name of the NEFF
uint32_t requested_tpb_count: How many TPBs were requested during compilation
uint8_t tpb_per_node[MAX_NODES]: Number of required TPBs per kelf node in the graph, 1 byte per node
uint64_t feature_bits: Bits representing individual incompatible NEFF features for fine-grained compatibility checking
uint32_t lnc_size: Logical core size required to run this NEFF
Tarball#
The NEFF tarball, when unpacked, consists of top-level JSON files describing the graph as a whole and partitioned subgraphs.
Components#
Subgraphs (sg00 … sgN)#
A subgraph is a directory in the unpacked NEFF that contains the files describing the computation and the resources needed to run that subgraph. When the NEFF is loaded, each subgraph declared in the NEFF is loaded onto its own TPB. In the past on INF1, a NEFF could contain multiple subgraphs, with data passed between subgraphs to improve model throughput; this feature was called serial TPB. Serial TPB does not exist on architectures after INF1 because it is no longer needed. Today, multiple subgraphs in a NEFF are tied to the logical core feature.
def.json#
The def.json file is the starting file for any subgraph. At its top level, it points the runtime to the engine JSONs and the engine binaries, and it declares the queue sets and variables used by the subgraph to move and hold data.
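To sketch the overall shape (all names and values below are placeholders, and a real def.json also carries engine JSON/binary references and other top-level metadata not shown), the dma_queue and var objects described in the following sections sit at the top level of def.json roughly like this; the assumption that queue sets and variables are keyed by their names is inferred from descriptors referring to them by name and may not match the exact compiler output:
{
    "dma_queue": {
        "q_sg_in": { "type": "in" }
    },
    "var": {
        "sg00_input0": { "type": "input", "var_id": 0, "size": 6144 }
    }
}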
Queue Sets#
Queue sets declared in def.json will be mapped to physical HW queues by the runtime during model load. These queue sets will be used to move data during NEFF execution and are declared in the dma_queue object. Each queue set is a JSON object that can contain the following fields:
type (required)
Type: string
Description: What this queue set will be used for
Valid values: in, out, data, embedding_update, dynamic
Supported architectures: all

num_queues (optional)
Type: int
Description: Number of HW queues to reserve for this queue set. More HW queues allow the NEFF program to use multiple DMA engines to transfer data
Restrictions: On INF1, this field must be 1. On non-INF1 platforms it must be 16 or less
Default: 1
Supported architectures: all

owner (optional)
Type: string
Description: Engine that owns this queue set. When a queue set is assigned to an engine, the owning engine performs the DMA triggers for the queue set
Supported architectures: all

pinned (optional)
Type: bool
Description: Queue is used to move data to the TPB’s state buffer during model start. Once the data is moved to SB, the NEFF will never write to the buffers, “pinning” the data to SB
Default: false
Supported architectures: INF1

queue_instances (optional)
Type: [string]
Description: Set of DMA rings that can be swapped in/out during model execution
Supported architectures: all architectures except for INF1

semaphore_set (optional)
Type: [int]
Description: A list of semaphores used by the queue set to signal data transfer completion
Supported architectures: all

semaphore (optional)
Type: int
Description: Single semaphore used by a single queue to signal data transfer completion
Supported architectures: all

fabric_path (optional)
Type: string
Description: Which pathway the DMA queue should take to move data
Valid values: main, alt
Default: “main”
Supported architectures: all
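For illustration, queue-set declarations inside the dma_queue object might look like the following sketch; the queue-set names, semaphore numbers, and queue counts are placeholders rather than values taken from a real NEFF:
"dma_queue": {
    "q_sg_in": {
        "type": "in",
        "num_queues": 1,
        "semaphore": 4,
        "fabric_path": "main"
    },
    "q_sg_out": {
        "type": "out",
        "num_queues": 2,
        "semaphore_set": [5, 6]
    }
}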
Variables#
Variables are buffers allocated on device that can be referenced by the NEFF to read data from and write data to during execution. Variables are declared in the var object in def.json and each variable is a JSON object that can contain the following fields:
type (required)
Type: string
Description: What type of data this variable contains
Valid values:
  state-buffer
  input
  output
  file (HBM)
  tmp-buf (HBM) - private-per-NEFF scratchpad allocation
  virtual (HBM) - shared scratchpad variables
  pointer (HBM)
  dge-table
Supported architectures: all

var_id (required)
Type: int
Description: Unique ID to reference this variable with
Restrictions: Must be unique to this variable
Supported architectures: all

size (required)
Type: int
Description: Size in bytes of the variable
Supported architectures: all

alignment (optional)
Type: int
Description: Physical address alignment for this variable
Restrictions: Must be a power of two
Default: 0
Supported architectures: all

fabric_path (optional)
Type: string
Description: Fabric path to place this variable on
Default: “main”
Supported architectures: all

file_name (optional)
Type: string
Description: File to load variable data from. Can point to .npy files or raw binary data (any file without a .npy extension)
Restrictions: Only used with variable type file
Supported architectures: all

backing_variable_off (optional)
Type: int
Description: The offset inside the shared scratchpad space allocated by the runtime
Restrictions: Only used with variable type virtual
Supported architectures: all

referenced_var_id (optional)
Type: int
Description: var_id of the variable whose address will be placed in this pointer variable
Restrictions: Only used with variable type pointer
Supported architectures: all

list (optional)
Type: [int]
Description: List of var_ids to populate the table with
Restrictions: Used with variable type dge-table
Supported architectures: all
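As a sketch (variable names, IDs, sizes, and alignment are placeholders, and the assumption that entries are keyed by variable name mirrors the dma_queue sketch above), declarations in the var object could look like this:
"var": {
    "sg00_input0": {
        "type": "input",
        "var_id": 0,
        "size": 6144
    },
    "sg00_scratch": {
        "type": "tmp-buf",
        "var_id": 1,
        "size": 16384,
        "alignment": 4096
    }
}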
{ENGINE}.json#
There is one engine JSON file for each of the TPB’s engines. This JSON describes the DMA descriptors triggered by the engine to move data during execution, as well as some extra engine-specific metadata.
DMA Descriptors#
Each engine JSON file contains a list, indexed by the dma key, of objects describing the DMA data movements triggered by the engine. Each object in the list can contain the following fields:
id (required)
Type: int
Description: Identifier to map this descriptor to a trigger in the engine binary. Other descriptors with the same ID in the same function call must have the same trigger amounts
Supported architectures: all

queue (required if instance_name is empty)
Type: string
Description: Name of the queue set this descriptor will be placed on
Restrictions: The instance_name field takes precedence over this field
Supported architectures: all

instance_name (required if queue field is empty)
Type: string
Description: Name of the queue set instance this descriptor will be placed on
Restrictions: Takes precedence over the queue field
Supported architectures: everything but INF1

function_start (optional)
Type: string
Description: Names the function that will trigger this descriptor and all other descriptors after it belonging to the same queue set, until the next function_start for the queue set is hit. Used in the call graph flow feature of the compiler
Restrictions: Must name a valid function declared in the engine binary
Default: “”
Supported architectures: everything but INF1

section_start_desc (optional)
Type: bool
Description: If this field is true, the runtime will place this descriptor on the first queue in the queue set
Default: false
Supported architectures: everything but INF1

event (optional)
Type: int
Description: Event to set after this descriptor has been executed
Supported architectures: all

semaphore (optional)
Type: int
Description: Semaphore to increment after this descriptor has been executed. If there are multiple queues in the queue set, the semaphore will be incremented by the num_queues amount when the transfer is complete
Supported architectures: all

remote_semaphores (optional)
Type: [int]
Description: Semaphore(s) of the other NeuronCore/TPB to increment in case of LNC size 2
Supported architectures: Trn2 and above
desc (required)
Contains the following sub-fields:

op (optional)
Type: string
Description: Op for the DMA engine to perform for this transfer
Valid values: fma, cast, add, min, max, transpose, copy
Default: “copy”
Supported architectures: everything but INF1

from / to (to is always required; from is required for non-CCE descriptors)
Type: string
Description: Which variable to read from/write to
Restrictions: Must be a variable declared in def.json
Supported architectures: all

from_off / to_off (required)
Type: int
Description: Offset in the variable to read/write
Supported architectures: all

from_steps / to_steps (required)
Type: Array[int]
Description: Access pattern steps for the variable. All elements are in denominations of bytes. The first element - corresponding to the innermost/fastest-growing dimension - is usually 1 to indicate that successive bytes must be copied
Restrictions: Max array size of 4; array length must match {from, to}_sizes
Supported architectures: all

from_sizes / to_sizes (required)
Type: Array[int]
Description: Access pattern sizes for the variable. The first element - corresponding to the innermost/fastest-growing dimension - is in denomination of bytes. All other elements are counts of the number of elements in those dimensions
Restrictions: Max array size of 4; array length must match {from, to}_steps
Supported architectures: all

from_dtype / to_dtype (optional)
Type: string
Description: Dtype of the variable
Valid values: float8e3, float8e4, float8e5, float16, float32, float32r, bfloat16, uint8, uint16, uint32, uint64, int8, int16, int32, int64
Default: “uint8”
Supported architectures: everything but INF1

num_tiling_dimensions (optional)
Type: int
Description: Number of dimensions used to tile the DMA descriptor; number of dimensions used for a single tile
Supported architectures: all

from_arr (required for CCE descriptors, replaces from* fields)
Type: [objects]
Description: List of source tensors to perform the CCE op on; fields are the “from” parts of a DMA descriptor
Restrictions: Cannot have more than 16 source tensors (length of from_arr <= 16)
Supported architectures: everything but INF1
FMA only fields:

scale_dtype (required with fma op)
Type: string
Description: Data type of the scale constant
Valid values: float32
Default: “float32”

scale (optional)
Type: double
Description: Scale for data being moved
Restrictions: Only valid on “fma” type descriptors
Default: 1.0
Supported architectures: everything but INF1

Min/Max only fields:

constant_dtype (optional)
Type: string
Description: Datatype of the “min”/“max” constant
Valid values: float32, int32, uint32

constant (required if constant_dtype specified)
Type: double, int, or uint
Description: Constant to start the “min”/“max” operation with
Restrictions: Only valid on “min”/“max” type descriptors; will be ignored if constant_dtype is not specified
Supported architectures: everything but INF1

Transpose only fields:

transpose_shape (optional)
Type: [int]
Description: Shape to transpose the data to
Restrictions: Number of elements must be XPOSE_NUM_DIMS (4)
Supported architectures: everything but INF1

transpose_element_size (optional)
Type: int
Description: Size of a single element of the transpose
Supported architectures: everything but INF1
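Putting these pieces together, a single copy descriptor in an engine JSON’s dma list might look like the sketch below; the queue-set and variable names reuse the illustrative declarations from the def.json sections above, and the offsets, access patterns, and semaphore number are placeholders:
"dma": [
    {
        "id": 0,
        "queue": "q_sg_in",
        "semaphore": 4,
        "desc": {
            "op": "copy",
            "from": "sg00_input0",
            "from_off": 0,
            "from_steps": [1, 1536],
            "from_sizes": [1536, 4],
            "from_dtype": "float32",
            "to": "sg00_scratch",
            "to_off": 0,
            "to_steps": [1, 1536],
            "to_sizes": [1536, 4],
            "to_dtype": "float32"
        }
    }
]
Read the access pattern as four rows of 1536 contiguous bytes each, i.e. one contiguous 6144-byte transfer expressed as a two-dimensional pattern.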
Activation.json#
In addition to DMA descriptors, the Activation.json file also contains metadata on the PWP tables used by the NEFF. The field activation_function_sets lists the activation function sets used during execution of the NEFF. Each “activation function set” will point to activation table metadata contained in the subgraph’s directory.
Example:
"activation_function_sets": [
"reciprocal_sqrt_and_small",
"natural_log_exp_and_others",
"reciprocal_and_small",
"gelu_and_others"
]
DVE.json#
DVE.json contains info about the DVE tables used by the NEFF. In DVE.json, the DVE tables are indexed by the dve_tables key. More info on how this works can be found in the Loadable DVE doc.
Example:
"dve_tables": [
{
"control_table": "default_control_table.bin",
"datapath_table": "default_datapath_table.bin",
"opcode_table": "default_opcode_table.bin"
}
]
Constants#
Constants are data files placed directly in the subgraph directory. These files can be referenced from variables declared in def.json. At model load time, the contents of each file are written into the variable declared for it. For most files the data is written as raw binary; for .npy files, the file is parsed and the numpy array data is written to the buffer. Constant files are pointed to by the file_name field in var declarations.
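For example (the file and variable names are hypothetical), a weights file sitting in the subgraph directory could be bound to a file-type variable in def.json as shown below; at load time the runtime would parse the .npy file and write its array data into the variable’s buffer:
"sg00_weights0": {
    "type": "file",
    "var_id": 2,
    "size": 8192,
    "file_name": "sg00_weights0.npy"
}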
This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3