This document is relevant for: Inf2, Trn1, Trn1n

NeuronCore Allocation and Model Placement for Inference (torch-neuronx)#

This programming guide describes how to allocate NeuronCores to processes and how to place models onto specific NeuronCores. The models in this guide are expected to have been traced with torch_neuronx.trace().

Warning

This guide is not applicable to NeuronCore placement using XLA LazyTensor device execution. See: Comparison of Traced Inference versus XLA Lazy Tensor Inference (torch-neuronx)

In order of precedence, the recommendation is to use the following placement techniques:

  1. For nearly all regular models, default core placement should be used to take control of all cores for a single process.

  2. For applications using multiple processes, default core placement should be used in conjunction with NEURON_RT_NUM_CORES (Default Core Allocation & Placement).

  3. For more granular control, the beta explicit placement APIs may be used (Explicit Core Placement [Beta]).

The following guide will assume a machine with 8 NeuronCores:

  • NeuronCores will use the notation nc0, nc1, etc.

  • Models will use the notation m0, m1 etc.

NeuronCores and model allocations will be displayed in the following format:

  nc0   nc1   nc2   nc3   nc4   nc5   nc6   nc7
  (each column is a NeuronCore; entries show the model placed on that core,
   and each diagram notes which cores are allocated to which process)

The actual cores that are visible to the process can be adjusted according to the NeuronX Runtime Configuration.

Unlike torch-neuron (with neuron-cc) on Inf1 instances, torch-neuronx (with neuronx-cc) does not support NeuronCore Pipeline. This simplifies model core allocation, since a traced model will not span multiple NeuronCores.

Default Core Allocation & Placement#

The most basic requirement of an inference application is to be able to place a single model on a single NeuronCore. More complex applications may use multiple NeuronCores or even multiple processes each executing different models. The important thing to note about designing an inference application is that a single NeuronCore will always be allocated to a single process. Processes do not share NeuronCores. Different configurations can be used to ensure that an application process has enough NeuronCores allocated to execute its model(s):

  • Default: A process will attempt to take ownership of all NeuronCores visible on the instance. This should be used when an instance is only running a single inference process since no other process will be allowed to take ownership of any NeuronCores.

  • NEURON_RT_NUM_CORES: Specify the number of NeuronCores to allocate to the process. This places no restrictions on which NeuronCores will be used, however, the resulting NeuronCores will always be contiguous. This should be used in multi-process applications where each process should only use a subset of NeuronCores.

  • NEURON_RT_VISIBLE_CORES: Specifies exactly which NeuronCores are allocated to the process by index. Similar to NEURON_RT_NUM_CORES, this can be used in multi-process applications where each process should only use a subset of NeuronCores. This provides more fine-grained control over the exact NeuronCores that are allocated to a given process.

See the NeuronX Runtime Configuration for more environment variable details.
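To illustrate the range syntax these variables accept, the sketch below parses a NEURON_RT_VISIBLE_CORES-style specification such as '4-5' into a list of core indices. The parse_visible_cores helper is a hypothetical illustration of the 'N' / 'M-N' format, not part of the Neuron runtime.

```python
import os

def parse_visible_cores(spec):
    """Hypothetical helper: expand a NEURON_RT_VISIBLE_CORES-style
    spec such as '4-5' or '3' into a list of NeuronCore indices."""
    if '-' in spec:
        start, end = spec.split('-')
        return list(range(int(start), int(end) + 1))
    return [int(spec)]

os.environ['NEURON_RT_VISIBLE_CORES'] = '4-5'
cores = parse_visible_cores(os.environ['NEURON_RT_VISIBLE_CORES'])
print(cores)  # [4, 5]
```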

Example: Default#

Python Script:

import torch
import torch_neuronx

m0 = torch.jit.load('model.pt')  # Loads to nc0
m1 = torch.jit.load('model.pt')  # Loads to nc1
  nc0   nc1   nc2   nc3   nc4   nc5   nc6   nc7
  m0    m1    idle  idle  idle  idle  idle  idle
  (all NeuronCores allocated to the process)

With no environment configuration, the process will take ownership of all NeuronCores. In this example, only two of the NeuronCores are used by the process and the remaining are allocated but left idle.

Example: NEURON_RT_NUM_CORES#

Environment Setup:

export NEURON_RT_NUM_CORES='2'

Python Script:

import torch
import torch_neuronx

m0 = torch.jit.load('model.pt')  # Loads to nc0
m1 = torch.jit.load('model.pt')  # Loads to nc1
  nc0   nc1  |  nc2 - nc7
  m0    m1   |  unallocated
  (nc0-nc1 allocated to the process)

Since there is no other process on the instance, only the first 2 NeuronCores will be acquired by the process. Models load in a simple linear order to the least used NeuronCores.
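The "least used NeuronCore" rule can be sketched with a small simulation (plain Python only; this is an illustration of the placement behavior described above, not Neuron runtime code): each load goes to the allocated core carrying the fewest models, with ties broken by the lowest core index.

```python
def place_model(core_load):
    """Return the index of the least-used core (lowest model count,
    ties broken by lowest index), and record the placement."""
    core = min(range(len(core_load)), key=lambda i: core_load[i])
    core_load[core] += 1
    return core

# Two allocated cores (NEURON_RT_NUM_CORES='2'): loads alternate.
core_load = [0, 0]
print(place_model(core_load))  # 0  (m0 loads to nc0)
print(place_model(core_load))  # 1  (m1 loads to nc1)
print(place_model(core_load))  # 0  (a third model would wrap to nc0)
```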

Example: NEURON_RT_VISIBLE_CORES#

Environment Setup:

export NEURON_RT_VISIBLE_CORES='4-5'

Python Script:

import torch
import torch_neuronx

m0 = torch.jit.load('model.pt')  # Loads to nc4
m1 = torch.jit.load('model.pt')  # Loads to nc5
  nc0 - nc3     |  nc4   nc5  |  nc6 - nc7
  unallocated   |  m0    m1   |  unallocated
  (nc4-nc5 allocated to the process)

Unlike NEURON_RT_NUM_CORES, setting the visible NeuronCores allows the process to take ownership of a specific contiguous set of cores. This gives an application more fine-grained control over where models are placed.

Example: Multiple Processes#

Environment Setup:

export NEURON_RT_NUM_CORES='2'

Python Script:

import torch
import torch_neuronx

m0 = torch.jit.load('model.pt')  # Loads to nc0
m1 = torch.jit.load('model.pt')  # Loads to nc1

In this example, if the script is run twice, the following allocations will be made:

  nc0   nc1  |  nc2   nc3  |  nc4 - nc7
  m0    m1   |  m0    m1   |  unallocated
  (proc0)       (proc1)

Note that each process will take ownership of as many NeuronCores as is specified by the NEURON_RT_NUM_CORES configuration.
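One way to launch such a multi-process setup is to start each worker with the environment variable already set. The sketch below sets NEURON_RT_NUM_CORES in the child environment and launches two workers; a trivial inline child that echoes the variable stands in for the inference script above, so the launch pattern runs without Neuron hardware.

```python
import os
import subprocess
import sys

# Each worker inherits NEURON_RT_NUM_CORES='2' and would independently
# acquire two NeuronCores from the runtime. The inline child below is a
# stand-in that simply echoes the variable it received.
env = dict(os.environ, NEURON_RT_NUM_CORES='2')
child = [sys.executable, '-c',
         "import os; print(os.environ['NEURON_RT_NUM_CORES'])"]

workers = [subprocess.Popen(child, env=env, stdout=subprocess.PIPE, text=True)
           for _ in range(2)]
outputs = [w.communicate()[0].strip() for w in workers]
print(outputs)  # ['2', '2']
```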

Explicit Core Placement [Beta]#

The torch_neuronx framework provides explicit core placement APIs; details can be found in the PyTorch NeuronX NeuronCore Placement APIs [Beta] documentation.

Example: Manual Core Selection#

The most direct usage of the placement APIs is to manually select the start NeuronCore that each model is loaded to.

Environment Setup:

export NEURON_RT_NUM_CORES='4'

Python Script:

import torch
import torch_neuronx

# NOTE: Order of loads does NOT matter
with torch_neuronx.experimental.neuron_cores_context(start_nc=3):
    m0 = torch.jit.load('model.pt')  # Loads to nc3

with torch_neuronx.experimental.neuron_cores_context(start_nc=0, nc_count=2):
    m1 = torch.jit.load('model.pt')  # Loads replicas to nc0 and nc1

example = torch.rand(1, 3, 224, 224)

m1(example)  # Executes on nc0
m1(example)  # Executes on nc1

m0(example)  # Executes on nc3
m0(example)  # Executes on nc3
m0(example)  # Executes on nc3
  nc0   nc1   nc2   nc3  |  nc4 - nc7
  m1    m1    idle  m0   |  unallocated
  (nc0-nc3 allocated to the process)

Example: Automatic Multicore#

Using explicit core placement, it is possible to replicate a model to multiple NeuronCores simultaneously. This means that a single model object within Python can utilize all available NeuronCores (or all NeuronCores allocated to the process).

Environment Setup:

export NEURON_RT_NUM_CORES='8'

Python Script:

import torch
import torch_neuronx

with torch_neuronx.experimental.multicore_context():
    m0 = torch.jit.load('model.pt')  # Loads replications to nc0-nc7

example = torch.rand(1, 3, 224, 224)

m0(example)  # Executes on nc0
m0(example)  # Executes on nc1
  nc0   nc1   nc2   nc3   nc4   nc5   nc6   nc7
  m0    m0    m0    m0    m0    m0    m0    m0
  (all NeuronCores allocated to the process)

To make full use of a model that has been loaded to multiple NeuronCores, multiple threads should be used to run inferences in parallel.
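A common pattern is a thread pool sized to the number of replicas: each thread submits inferences against the same model object, and the runtime distributes them across the replicated cores. The sketch below uses a stand-in callable in place of the loaded TorchScript model so the threading pattern is runnable without Neuron hardware.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_CORES = 8  # matches the replicas created under multicore_context()

def model(x):
    # Stand-in for m0(example); a real Neuron inference releases the
    # GIL while the NeuronCore executes, so threads genuinely overlap.
    return x * 2

inputs = list(range(32))
with ThreadPoolExecutor(max_workers=NUM_CORES) as pool:
    results = list(pool.map(model, inputs))

print(results[:4])  # [0, 2, 4, 6]
```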
