This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3

Work with NEFF Files#

NEFF Architecture#

Overview#

A NEFF (Neuron Executable File Format) is a Neuron Runtime executable file, generated by the Neuron compiler, that describes a compute graph (typically a neural network model). Each NEFF is a single file: a tarball containing all the metadata needed to run the described compute graph, with a header prepended.

Packaging#

A NEFF is a file with a header prepended onto a tarball. To unpack a NEFF and examine its contents, strip the header from the file and untar the remaining buffer. The Neuron devtools suite includes a neuron-packager tool that unpacks a NEFF:

neuron-packager unpack file.neff
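The manual route described above (strip the header, then untar the rest) can be sketched in a few lines of Python. This is an illustration only, not how neuron-packager is implemented. The function name unpack_neff is hypothetical, the 1024-byte header length is taken from the NEFF Header section, and the sketch lets tarfile detect the compression because this document does not say whether the tarball is compressed.

```python
import io
import tarfile

NEFF_HEADER_SIZE = 1024  # header length stated in the NEFF Header section


def unpack_neff(neff_bytes: bytes, dest_dir: str) -> list[str]:
    """Strip the fixed-size header and extract the tarball that follows it."""
    if len(neff_bytes) <= NEFF_HEADER_SIZE:
        raise ValueError("file is too small to hold a NEFF header and a tarball")
    payload = neff_bytes[NEFF_HEADER_SIZE:]
    # "r:*" lets tarfile detect the compression, if any (assumption: the
    # document does not state whether the tarball is compressed).
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r:*") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names
```

Only run a sketch like this on NEFF files from a trusted source, since extracting an arbitrary tarball can write outside the destination directory.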

NEFF Header#

The NEFF header is a 1024-byte buffer prepended onto the NEFF tarball:

typedef struct neff_header {
    uint64_t pkg_version;
    uint64_t header_size;
    uint64_t data_size;
    uint64_t neff_version_major;
    uint64_t neff_version_minor;
    uint8_t neff_build_version[128];
    uint32_t num_tpb;
    uint8_t hash[32];
    uint8_t uuid[16];
    char name[256];
    uint32_t requested_tpb_count;
    uint8_t tpb_per_node[64];
    uint64_t feature_bits;
    uint32_t lnc_size;
    uint8_t pad[468];
    uint8_t data[];
} neff_header_t;

Its contents are described below:

  • uint64_t pkg_version

    Tool version used to create this NEFF

  • uint64_t header_size

    Number of bytes contained in this header

  • uint64_t data_size

    Size in bytes of the NEFF contents

  • uint64_t neff_version_major

    NEFF major version

  • uint64_t neff_version_minor

    NEFF minor version

  • uint8_t neff_build_version[128]

    Build version information

  • uint32_t num_tpb

    Total number of TPBs required for efficient execution (each subgraph, or SG, gets its own TPB)

  • uint8_t hash[NEFF_HEADER_HASH_SZ]

    Hash of the package; SHA-256 or MD5, depending on pkg_version

  • uint8_t uuid[NEFF_HEADER_UUID_SZ]

    Unique identifier for the NEFF

  • char name[NEFF_HEADER_NAME_SZ]

    Name of the NEFF

  • uint32_t requested_tpb_count

    How many TPBs were requested during compilation

  • uint8_t tpb_per_node[MAX_NODES]

    Number of required TPBs per kelf node in the graph, 1 byte per node

  • uint64_t feature_bits

    Bits representing individual incompatible NEFF features for fine-grained compatibility checking

  • uint32_t lnc_size

    Logical core size required to run this NEFF
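As a rough illustration of the layout above, the header can be decoded with Python's struct module. The field order and widths come from neff_header_t. The little-endian byte order, the absence of compiler-inserted padding, and the treatment of name and neff_build_version as NUL-terminated strings are assumptions of this sketch, and parse_neff_header is a hypothetical helper name.

```python
import struct

# Field order and widths follow neff_header_t. "<" (little-endian, no implicit
# padding) is an assumption about how the header is serialized.
NEFF_HEADER_FORMAT = "<5Q128sI32s16s256sI64sQI468s"
NEFF_HEADER_FIELDS = (
    "pkg_version", "header_size", "data_size",
    "neff_version_major", "neff_version_minor", "neff_build_version",
    "num_tpb", "hash", "uuid", "name",
    "requested_tpb_count", "tpb_per_node", "feature_bits", "lnc_size", "pad",
)
# The declared fields add up to exactly the documented 1024 bytes.
assert struct.calcsize(NEFF_HEADER_FORMAT) == 1024


def _c_string(raw: bytes) -> str:
    """Decode a fixed-size buffer as a NUL-terminated string (assumption)."""
    return raw.split(b"\0", 1)[0].decode("utf-8", "replace")


def parse_neff_header(buf: bytes) -> dict:
    """Decode the first 1024 bytes of a NEFF into a field-name -> value dict."""
    values = struct.unpack_from(NEFF_HEADER_FORMAT, buf, 0)
    header = dict(zip(NEFF_HEADER_FIELDS, values))
    header.pop("pad")  # padding carries no information
    header["name"] = _c_string(header["name"])
    header["neff_build_version"] = _c_string(header["neff_build_version"])
    return header
```

A caller could compare header["header_size"] against 1024 and header["data_size"] against the number of bytes remaining in the file as a basic sanity check before untarring.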

Tarball#

The NEFF tarball, when unpacked, consists of top-level JSON files that describe the graph as a whole, plus the partitioned subgraphs, each in its own directory.

Components#

Subgraphs (sg00 … sgN)#

A subgraph is a directory in the unpacked NEFF containing the files that describe the computation and resources needed to run one subgraph. When the NEFF is loaded, each subgraph declared in it is loaded onto its own TPB. On INF1, a NEFF could contain multiple subgraphs, with data passed between them to improve model throughput; this feature was called serial TPB. Serial TPB is not needed and does not exist on architectures after INF1, where multiple subgraphs in a NEFF are instead tied to the logical core feature.

def.json#

The def.json file is the starting file for any subgraph. At its top level, it will point the runtime to the engine JSONs and the engine binaries as well as declare queue sets and variables used by the subgraph to move and hold data.

Queue Sets#

Queue sets declared in def.json will be mapped to physical HW queues by the runtime during model load. These queue sets will be used to move data during NEFF execution and are declared in the dma_queue object. Each queue set is a JSON object that can contain the following fields:

  • type (required)
    • Type: string

    • Description: What this queue set will be used for

    • Valid values: in, out, data, embedding_update, dynamic

    • Supported architectures: all

  • num_queues (optional)
    • Type: int

    • Description: Number of HW queues to reserve for this queue set. More HW queues allow the NEFF program to use multiple DMA engines to transfer data

    • Restrictions: On INF1, this field must be 1. On non-INF1 platforms it must be 16 or less

    • Default: 1

    • Supported architectures: all

  • owner (optional)
    • Type: string

    • Description: Engine that owns this queue set. When queue set is assigned to an engine, the owning engine will perform the DMA triggers of the queue set

    • Supported architectures: all

  • pinned (optional)
    • Type: bool

    • Description: Queue is used to move data to the TPB’s state buffer (SB) during model start. Once the data is in SB, the NEFF never writes to those buffers, which “pins” the data to SB

    • Default: false

    • Supported architectures: INF1

  • queue_instances (optional)
    • Type: [string]

    • Description: Set of DMA rings that can be swapped in/out during model execution

    • Supported architectures: all architectures except for INF1

  • semaphore_set (optional)
    • Type: [int]

    • Description: A list of semaphores used by the queue set to signal data transfer completion

    • Supported architectures: all

  • semaphore (optional)
    • Type: int

    • Description: Single semaphore used by single queue to signal data transfer completion

    • Supported architectures: all

  • fabric_path (optional)
    • Type: string

    • Description: Which pathway the DMA queue should take to move data

    • Valid values: main, alt

    • Default: “main”

    • Supported architectures: all
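The restrictions listed above can be turned into a small checker. This is a minimal sketch, not the runtime's validation logic. It assumes each queue set has been read from the dma_queue object into a Python dict, and that num_queues must be at least 1 (the document only states the upper bound of 16). The function name check_queue_set and the arch_is_inf1 flag are hypothetical.

```python
QUEUE_TYPES = {"in", "out", "data", "embedding_update", "dynamic"}
FABRIC_PATHS = {"main", "alt"}


def check_queue_set(name: str, qs: dict, arch_is_inf1: bool = False) -> list[str]:
    """Return a list of problems found in one queue set declaration."""
    problems = []
    if qs.get("type") not in QUEUE_TYPES:
        problems.append(f"{name}: missing or invalid type {qs.get('type')!r}")

    num_queues = qs.get("num_queues", 1)  # documented default
    if arch_is_inf1:
        if num_queues != 1:
            problems.append(f"{name}: num_queues must be 1 on INF1")
    elif not 1 <= num_queues <= 16:  # lower bound of 1 is an assumption
        problems.append(f"{name}: num_queues must be 16 or less")

    if qs.get("fabric_path", "main") not in FABRIC_PATHS:
        problems.append(f"{name}: fabric_path must be 'main' or 'alt'")

    # Architecture-specific fields, per the "Supported architectures" entries.
    if not arch_is_inf1 and "pinned" in qs:
        problems.append(f"{name}: pinned is only supported on INF1")
    if arch_is_inf1 and "queue_instances" in qs:
        problems.append(f"{name}: queue_instances is not supported on INF1")
    return problems
```

For example, a hypothetical queue set {"type": "data", "num_queues": 4} passes on non-INF1 platforms but is flagged when arch_is_inf1 is True.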

Variables#

Variables are buffers allocated on device that can be referenced by the NEFF to read data from and write data to during execution. Variables are declared in the var object in def.json and each variable is a JSON object that can contain the following fields:

  • type (required)
    • Type: string

    • Description: What type of data this variable contains

    • Valid values: state-buffer; input; output; file (HBM); tmp-buf (HBM, private per-NEFF scratchpad allocation); virtual (HBM, shared scratchpad variables); pointer (HBM); dge-table

    • Supported architectures: all

  • var_id (required)
    • Type: int

    • Description: Unique ID to reference this variable with

    • Restrictions: Must be unique to this variable

    • Supported architectures: all

  • size (required)
    • Type: int

    • Description: Size in bytes of the variable

    • Supported architectures: all

  • alignment (optional)
    • Type: int

    • Description: Physical address alignment for this variable

    • Restrictions: Must be a power of two

    • Default: 0

    • Supported architectures: all

  • fabric_path (optional)
    • Type: string

    • Description: Fabric path to place this variable on

    • Default: “main”

    • Supported architectures: all

  • file_name (optional)
    • Type: string

    • Description: File to load variable data from. Can point to .npy files or raw binary data (any file without a .npy extension)

    • Restrictions: Only used with variable type file

    • Supported architectures: all

  • backing_variable_off (optional)
    • Type: int

    • Description: The offset inside the shared scratchpad space allocated by Runtime

    • Restrictions: Only used with variable type virtual

    • Supported architectures: all

  • referenced_var_id (optional)
    • Type: int

    • Description: var_id of the variable whose address will be placed in this pointer variable

    • Restrictions: Only used with variable type pointer

    • Supported architectures: all

  • list (optional)
    • Type: [int]

    • Description: List of var_ids to populate the table with

    • Restrictions: Used with variable type dge-table

    • Supported architectures: all
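The field rules above lend themselves to a consistency check across all variables in a subgraph. The sketch below is illustrative only. It assumes the var object has been loaded as a dict that maps a variable name to its declaration (descriptors refer to variables by string, but the exact JSON shape is not shown in this document), and check_variables is a hypothetical helper name.

```python
VAR_TYPES = {"state-buffer", "input", "output", "file",
             "tmp-buf", "virtual", "pointer", "dge-table"}
# Optional field -> the variable type it is documented for.
TYPE_SPECIFIC_FIELDS = {
    "file_name": "file",
    "backing_variable_off": "virtual",
    "referenced_var_id": "pointer",
    "list": "dge-table",
}


def check_variables(variables: dict) -> list[str]:
    """Return a list of problems found in a name -> declaration mapping."""
    problems = []
    seen_ids = {}
    for name, var in variables.items():
        for field in ("type", "var_id", "size"):
            if field not in var:
                problems.append(f"{name}: missing required field {field}")
        if "type" in var and var["type"] not in VAR_TYPES:
            problems.append(f"{name}: invalid type {var['type']!r}")
        var_id = var.get("var_id")
        if var_id in seen_ids:
            problems.append(f"{name}: var_id {var_id} already used by {seen_ids[var_id]}")
        elif var_id is not None:
            seen_ids[var_id] = name
        alignment = var.get("alignment", 0)  # 0 is the documented default
        if alignment != 0 and alignment & (alignment - 1):
            problems.append(f"{name}: alignment {alignment} is not a power of two")
        for field, wanted in TYPE_SPECIFIC_FIELDS.items():
            if field in var and var.get("type") != wanted:
                problems.append(f"{name}: {field} is documented for type {wanted}")
    # pointer and dge-table variables must refer to declared var_ids.
    for name, var in variables.items():
        refs = list(var.get("list", []))
        if "referenced_var_id" in var:
            refs.append(var["referenced_var_id"])
        for ref in refs:
            if ref not in seen_ids:
                problems.append(f"{name}: references unknown var_id {ref}")
    return problems
```

Running it on a hypothetical pair of declarations, a file variable with var_id 0 and a pointer variable whose referenced_var_id is 0, returns an empty list.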

{ENGINE}.json#

Each of the TPB’s engines has its own engine JSON file. This file describes the DMA descriptors triggered by the engine to move data during execution, as well as some extra engine-specific metadata.

DMA Descriptors#

In each engine JSON file, there is a list of JSON objects describing DMA data movements triggered by the engine. This list is indexed by the dma key. Each object in the list is a JSON object with the following fields:

  • id (required)
    • Type: int

    • Description: Identifier to map this descriptor to a trigger in the engine binary. Other descriptors with the same ID in the same function call must have the same trigger amounts

    • Supported architectures: all

  • queue (required if instance_name is empty)
    • Type: string

    • Description: Name of the queue set this descriptor will be placed on

    • Restrictions: instance_name field takes precedence over this field

    • Supported architectures: all

  • instance_name (required if queue field is empty)
    • Type: string

    • Description: Name of the queue set instance this descriptor will be placed on

    • Restrictions: Takes precedence over queue field

    • Supported architectures: everything but INF1

  • function_start (optional)
    • Type: string

    • Description: Names the function that will trigger this descriptor and all other descriptors after it belonging to the same queue set until the next function_start for the queue set is hit. Used in the call graph flow feature of the compiler

    • Restrictions: Must name a valid function declared in the engine binary

    • Default: “”

    • Supported architectures: everything but INF1

  • section_start_desc (optional)
    • Type: bool

    • Description: If this field is true, the runtime will place this descriptor on the first queue in the queue set

    • Default: false

    • Supported architectures: everything but INF1

  • event (optional)
    • Type: int

    • Description: Event to set after this descriptor has been executed

    • Supported architectures: all

  • semaphore (optional)
    • Type: int

    • Description: Semaphore to increment after this descriptor has been executed. If there are multiple queues in the queue set, the semaphore will be incremented by num_queues when the transfer is complete

    • Supported architectures: all

  • remote_semaphores (optional)
    • Type: [int]

    • Description: Semaphore(s) of other NeuronCore/TPB to increment in case of LNC size 2

    • Supported architectures: Trn2 and above

  • desc (required)

    Contains the following sub-fields:

    • op (optional)
      • Type: string

      • Description: Op for the DMA engine to perform for this transfer

      • Valid values: fma, cast, add, min, max, transpose, copy

      • Default: “copy”

      • Supported architectures: everything but INF1

    • from/to (to is always required; from is required for non-CCE descriptors)
      • Type: string

      • Description: Which variable to read from/write to

      • Restrictions: Must be a variable declared in def.json

      • Supported architectures: all

    • from_off/to_off (required)
      • Type: int

      • Description: Offset in variable to read/write

      • Supported architectures: all

    • from_steps/to_steps (required)
      • Type: Array[int]

      • Description: Access pattern steps for variable. All elements are expressed in bytes. The first element, which corresponds to the innermost (fastest-growing) dimension, is usually 1 to indicate that successive bytes must be copied

      • Restrictions: Max array size of 4; array-length must match {from, to}_sizes

      • Supported architectures: all

    • from_sizes/to_sizes (required)
      • Type: Array[int]

      • Description: Access pattern sizes for variable. The first element, which corresponds to the innermost (fastest-growing) dimension, is expressed in bytes. All other elements are element counts for their dimensions

      • Restrictions: Max array-size of 4; array length must match {from, to}_steps

      • Supported architectures: all

    • from_dtype/to_dtype (optional)
      • Type: string

      • Description: Dtype of variable

      • Valid values: float8e3, float8e4, float8e5, float16, float32, float32r, bfloat16, uint8, uint16, uint32, uint64, int8, int16, int32, int64

      • Default: “uint8”

      • Supported architectures: everything but INF1

    • num_tiling_dimensions (optional)
      • Type: int

      • Description: Number of dimensions used to tile the DMA descriptor; number of dimensions used for a single tile

      • Supported architectures: all

    • from_arr (required for CCE descriptors, replaces from* fields)
      • Type: [objects]

      • Description: List of source tensors to perform the CCE op on; fields are the “from” parts of a DMA descriptor

      • Restrictions: Cannot have more than 16 source tensors (length of from_arr <= 16)

      • Supported architectures: everything but INF1

    FMA only fields:

    • scale_dtype (required with fma op)
      • Type: string

      • Description: Data type of the scale constant

      • Valid values: float32

      • Default: “float32”

    • scale (optional)
      • Type: double

      • Description: Scale for data being moved

      • Restrictions: Only valid on “fma” type descriptors

      • Default: 1.0

      • Supported architectures: everything but INF1

    Min/Max only fields:

    • constant_dtype (optional)
      • Type: string

      • Description: Datatype of “min”/“max” constant

      • Valid values: float32, int32, uint32

    • constant (required if constant_dtype specified)
      • Type: double, int, or uint

      • Description: Constant to start the “min”/“max” operation with

      • Restrictions: Only valid on “min”/“max” type descriptors; will be ignored if constant_dtype is not specified

      • Supported architectures: everything but INF1

    Transpose only fields:

    • transpose_shape (optional)
      • Type: [int]

      • Description: Shape to transpose the data to

      • Restrictions: Number of elements must be XPOSE_NUM_DIMS (4)

      • Supported architectures: everything but INF1

    • transpose_element_size (optional)
      • Type: int

      • Description: Size of a single element of the transpose

      • Supported architectures: everything but INF1
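To make the from_off/to_off, *_steps and *_sizes fields more concrete, the sketch below expands one access pattern into the byte offsets it touches inside a variable. It encodes one reading of the field descriptions (element 0 is the innermost dimension, with its size counted in bytes and all steps in bytes) and is not taken from the runtime. The function name access_pattern_offsets is hypothetical.

```python
from itertools import product


def access_pattern_offsets(offset: int, steps: list[int], sizes: list[int]) -> list[int]:
    """Expand an access pattern into the byte offsets it visits, in order."""
    if len(steps) != len(sizes):
        raise ValueError("steps and sizes must have the same length")
    if len(steps) > 4:
        raise ValueError("access patterns have at most 4 dimensions")
    # Element 0 is the innermost dimension. product() varies its last
    # argument fastest, so the dimensions are listed outermost-first.
    dims = list(zip(steps, sizes))[::-1]
    offsets = []
    for index in product(*(range(size) for _, size in dims)):
        offsets.append(offset + sum(i * step for i, (step, _) in zip(index, dims)))
    return offsets


# Three rows of 4 contiguous bytes, with rows 64 bytes apart.
rows = access_pattern_offsets(0, steps=[1, 64], sizes=[4, 3])
# rows == [0, 1, 2, 3, 64, 65, 66, 67, 128, 129, 130, 131]
```

Under this reading, a descriptor's source and destination patterns would each expand to the same number of bytes when no dtype conversion is involved, which is an inference rather than a documented rule.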

Activation.json#

In addition to DMA descriptors, the Activation.json file also contains metadata on the PWP tables used by the NEFF. The field activation_function_sets lists the activation function sets used during execution of the NEFF. Each “activation function set” will point to activation table metadata contained in the subgraph’s directory.

Example:

"activation_function_sets": [
    "reciprocal_sqrt_and_small",
    "natural_log_exp_and_others",
    "reciprocal_and_small",
    "gelu_and_others"
]

DVE.json#

DVE.json contains information about the DVE tables used by the NEFF. In DVE.json, the DVE tables are indexed by the dve_tables key. More information on how this works can be found in the Loadable DVE doc.

Example:

"dve_tables": [
    {
        "control_table": "default_control_table.bin",
        "datapath_table": "default_datapath_table.bin",
        "opcode_table": "default_opcode_table.bin"
    }
]

Constants#

Constants are data files placed directly in the subgraph directory. A variable declared in def.json references a constant file through its file_name field. At model load time, the contents of the file are written into that variable: .npy files are parsed and only the numpy array data is written to the buffer, while all other files are written as raw binary.
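The two load paths can be illustrated with a small standard-library sketch that returns the bytes that would end up in the variable. It is not the runtime's loader. It relies on the public .npy layout (a 6-byte magic string, a 2-byte version, a header length, a text header, then the array data), ignores details such as fortran_order, and uses the hypothetical function name constant_payload.

```python
import struct

NPY_MAGIC = b"\x93NUMPY"


def constant_payload(file_name: str, content: bytes) -> bytes:
    """Return the bytes to copy into a 'file' variable for one constant file."""
    if not file_name.endswith(".npy"):
        return content  # raw binary is copied as-is
    if content[:6] != NPY_MAGIC:
        raise ValueError(f"{file_name} is not a valid .npy file")
    major = content[6]
    if major == 1:
        # Format 1.0 stores the header length as a little-endian uint16.
        (header_len,) = struct.unpack_from("<H", content, 8)
        data_start = 10 + header_len
    else:
        # Formats 2.0 and 3.0 store it as a little-endian uint32.
        (header_len,) = struct.unpack_from("<I", content, 8)
        data_start = 12 + header_len
    # Everything after the text header is the array data.
    return content[data_start:]
```

With this split, a variable's size would be compared against the length of the array data for .npy files rather than the length of the file on disk.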
