This document is relevant for: Inf1
PyTorch Neuron (torch-neuron) Core Placement#
This programming guide describes the techniques and APIs available for allocating NeuronCores to a process and placing models onto specific NeuronCores. In order of precedence, the current recommendation is to use the following placement techniques:
For most regular models, use default core placement in conjunction with NEURON_RT_NUM_CORES (Default Core Allocation & Placement).
For more specific core placement of NeuronCore Pipelined models, use NEURONCORE_GROUP_SIZES (NEURONCORE_GROUP_SIZES).
Finally, for even more granular control, use the beta explicit placement APIs (Explicit Core Placement [Beta]).
The following guide will assume a machine with 8 NeuronCores:
NeuronCores will use the notation nc0, nc1, etc.
NeuronCore Groups will use the notation ncg0, ncg1, etc.
Models will use the notation m0, m1, etc.
NeuronCores, NeuronCore Groups, and model allocations will be displayed in the following format:
Note that the actual cores that are visible to the process can be adjusted according to the NeuronX Runtime Configuration.
NeuronCore Pipeline#
A key concept to understand the intent behind certain core placement strategies is NeuronCore Pipelining (See NeuronCore Pipeline). NeuronCore Pipelining allows a model to be automatically split into pieces and executed on different NeuronCores.
For most models only 1 NeuronCore will be required for execution. A model will only require more than one NeuronCore when using NeuronCore Pipeline. When model pipelining is enabled, the model is split between multiple NeuronCores and data is transferred between them. For example, if the compiler flag --neuroncore-pipeline-cores 4 is used, this splits the model into 4 pieces to be executed on 4 separate NeuronCores.
Default Core Allocation & Placement#
The most basic requirement of an inference application is to be able to place a single model on a single NeuronCore. More complex applications may use multiple NeuronCores or even multiple processes each executing different models. The important thing to note about designing an inference application is that a single NeuronCore will always be allocated to a single process. Processes do not share NeuronCores. Different configurations can be used to ensure that an application process has enough NeuronCores allocated to execute its model(s):
Default: A process will attempt to take ownership of all NeuronCores visible on the instance. This should be used when an instance is only running a single inference process since no other process will be allowed to take ownership of any NeuronCores.
NEURON_RT_NUM_CORES: Specifies the number of NeuronCores to allocate to the process. This places no restrictions on which NeuronCores will be used; however, the resulting NeuronCores will always be contiguous. This should be used in multi-process applications where each process should only use a subset of NeuronCores.
NEURON_RT_VISIBLE_CORES: Specifies exactly which NeuronCores are allocated to the process by index. Similar to NEURON_RT_NUM_CORES, this can be used in multi-process applications where each process should only use a subset of NeuronCores. This provides more fine-grained control over the exact NeuronCores that are allocated to a given process.
NEURONCORE_GROUP_SIZES: Specifies a number of NeuronCore Groups which are allocated to the process. This is described in more detail in the NEURONCORE_GROUP_SIZES section.
See the NeuronX Runtime Configuration for more environment variable details.
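As a rough illustration of how these configurations differ, the sketch below models which NeuronCore indices a process would end up owning. It is not the runtime's implementation: the helper name visible_core_indices is hypothetical, the precedence between the two variables is an assumption, only the single-index and "start-end" forms of NEURON_RT_VISIBLE_CORES are handled, and NEURON_RT_NUM_CORES is assumed to start at nc0 because no other process is running.

```python
import os


def visible_core_indices(env=None, total_cores=8):
    """Return the NeuronCore indices a process would own (simplified model)."""
    env = os.environ if env is None else env
    visible = env.get("NEURON_RT_VISIBLE_CORES")
    if visible:
        # Accept a single index ("3") or an inclusive range ("4-5").
        first, _, last = visible.partition("-")
        start = int(first)
        end = int(last) if last else start
        return list(range(start, end + 1))
    num = env.get("NEURON_RT_NUM_CORES")
    if num:
        # Assumption: no other process is running, so the block starts at nc0.
        return list(range(int(num)))
    # Default: take ownership of every NeuronCore on the instance.
    return list(range(total_cores))


print(visible_core_indices({}))                                  # all 8 cores
print(visible_core_indices({"NEURON_RT_NUM_CORES": "2"}))        # nc0-nc1
print(visible_core_indices({"NEURON_RT_VISIBLE_CORES": "4-5"}))  # nc4-nc5
```

Passing a plain dictionary instead of the real environment keeps the sketch easy to experiment with on a machine without Neuron hardware.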
Example: Default#
Python Script:
import torch
import torch_neuron
m0 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc0
m1 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc1
With no environment configuration, the process will take ownership of all NeuronCores. In this example, only two of the NeuronCores are used by the process and the remaining are allocated but left idle.
Example: NEURON_RT_NUM_CORES#
Environment Setup:
export NEURON_RT_NUM_CORES='2'
Python Script:
import torch
import torch_neuron
m0 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc0
m1 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc1
Since there is no other process on the instance, only the first 2 NeuronCores will be acquired by the process. Models load in a simple linear order to the least used NeuronCores.
Example: NEURON_RT_VISIBLE_CORES#
Environment Setup:
export NEURON_RT_VISIBLE_CORES='4-5'
Python Script:
import torch
import torch_neuron
m0 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc4
m1 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc5
Unlike NEURON_RT_NUM_CORES, setting the visible NeuronCores allows the process to take control of a specific contiguous set. This gives an application more fine-grained control over where models will be placed.
Example: Overlapping Models#
Environment Setup:
export NEURON_RT_VISIBLE_CORES='0-1'
Python Script:
import torch
import torch_neuron
m0 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc0
m1 = torch.jit.load('model-with-2-neuron-pipeline-cores.pt') # Loads to nc0-nc1
m2 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc1
This shows how models may share NeuronCores but the default model placement will attempt to evenly distribute NeuronCore usage rather than overlapping all models on a single NeuronCore.
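The placements in this example are consistent with a simple "least used" rule. The sketch below is only a toy model of that behavior, not the torch-neuron implementation: the function place_least_used is hypothetical, and both the scoring (sum of models already loaded on each candidate window of cores) and the tie-break (lowest index wins) are assumptions chosen for illustration.

```python
def place_least_used(usage, cores_needed):
    """Pick the contiguous window with the lowest total load; return its indices.

    usage is a list where usage[i] is the number of models already on core i.
    """
    windows = range(len(usage) - cores_needed + 1)
    # min() returns the first minimum, so ties go to the lowest start index.
    start = min(windows, key=lambda s: sum(usage[s:s + cores_needed]))
    for i in range(start, start + cores_needed):
        usage[i] += 1
    return list(range(start, start + cores_needed))


usage = [0, 0]  # two visible cores: nc0 and nc1
placements = [place_least_used(usage, n) for n in (1, 2, 1)]
print(placements)  # [[0], [0, 1], [1]], matching m0, m1 and m2 above
```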
Example: Multiple Processes#
Environment Setup:
export NEURON_RT_NUM_CORES='2'
Python Script:
import torch
import torch_neuron
m0 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc0
m1 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc1
In this example, if the script is run twice, the following allocations will be made:
Note that each process will take ownership of as many NeuronCores as is specified by the NEURON_RT_NUM_CORES configuration.
NEURONCORE_GROUP_SIZES#
Important
Explicit core placement should only be used when a specific performance goal is required. By default torch-neuron places models on the least used NeuronCores. This should be optimal for most applications.
Secondly, NEURONCORE_GROUP_SIZES is being deprecated in a future release and should be avoided in favor of newer placement methods. Use NEURON_RT_NUM_CORES or NEURON_RT_VISIBLE_CORES with default placement if possible (See Default Core Allocation & Placement).
In the current release of the Neuron SDK, the most well-supported method of placing models onto specific NeuronCores is to use the NEURONCORE_GROUP_SIZES environment variable. This will define a set of “NeuronCore Groups” for the application process.
NeuronCore Groups are contiguous sets of NeuronCores that are allocated to a given process. Creating groups allows an application to ensure that a model has a defined set of NeuronCores that will always be allocated to it.
Note that NeuronCore Groups can be used to allocate non-pipelined models (those requiring exactly 1 NeuronCore) to specific NeuronCores but this is not the primary intended use. The intended use of NeuronCore Groups is to ensure pipelined models (those requiring >1 NeuronCore) have exclusive access to a specific set of contiguous NeuronCores.
In the cases where models are being used without NeuronCore Pipeline, the general recommendation is to use default placement (See Default Core Allocation & Placement).
The following section demonstrates how NEURONCORE_GROUP_SIZES can be used and the issues that may arise.
Example: Single NeuronCore Group#
In the example where one model requires 4 NeuronCores, the correct environment configuration would be:
Environment Setup:
export NEURONCORE_GROUP_SIZES='4'
Python Script:
import torch
import torch_neuron
m0 = torch.jit.load('model-with-4-neuron-pipeline-cores.pt') # Loads to nc0-nc3
This is the most basic usage of a NeuronCore Group. The environment setup causes the process to take control of 4 NeuronCores and then the script loads a model compiled with a NeuronCore Pipeline size of 4 to the first group.
Example: Multiple NeuronCore Groups#
With more complicated configurations, the intended use of NEURONCORE_GROUP_SIZES is to create 1 Group per model with the correct size to ensure that the models are placed on the intended NeuronCores. Similarly, the environment would need to be configured to create a NeuronCore Group for each model:
Environment Setup:
export NEURONCORE_GROUP_SIZES='3,4,1'
Python Script:
import torch
import torch_neuron
m0 = torch.jit.load('model-with-3-neuron-pipeline-cores.pt') # Loads to nc0-nc2
m1 = torch.jit.load('model-with-4-neuron-pipeline-cores.pt') # Loads to nc3-nc6
m2 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc7
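To make the mapping from group sizes to NeuronCores concrete, the following sketch expands a NEURONCORE_GROUP_SIZES string into the contiguous core indices of each group. The helper group_core_ranges is hypothetical and assumes the groups are laid out back to back starting at nc0, which is what the comments in the example above imply when no other process holds any cores.

```python
def group_core_ranges(group_sizes):
    """Map a NEURONCORE_GROUP_SIZES string to contiguous core index lists."""
    ranges = []
    next_core = 0
    for size in (int(s) for s in group_sizes.split(",")):
        ranges.append(list(range(next_core, next_core + size)))
        next_core += size
    return ranges


groups = group_core_ranges("3,4,1")
for index, cores in enumerate(groups):
    # Prints one line per group, e.g. "ncg0: nc0-nc2"
    print(f"ncg{index}: nc{cores[0]}-nc{cores[-1]}")
```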
Issue: Overlapping Models with Differing Model Sizes#
When multiple models are loaded to a single NeuronCore Group, this can cause unintended inefficiencies. A single model is only intended to span a single NeuronCore Group. Applications with many models of varying sizes can be restricted by NeuronCore Group configurations since the most optimal model layout may require more fine-grained controls.
Environment Setup:
export NEURONCORE_GROUP_SIZES='2,2'
Python Script:
import torch
import torch_neuron
m0 = torch.jit.load('model-with-2-neuron-pipeline-cores.pt') # Loads to nc0-nc1
m1 = torch.jit.load('model-with-2-neuron-pipeline-cores.pt') # Loads to nc2-nc3
m2 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc0
m3 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc2
m4 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc0
Here the NEURONCORE_GROUP_SIZES does not generate an optimal layout because placement strictly follows the layout of NeuronCore Groups. A potentially more optimal layout would be to place m4 onto nc1. In this case, since a pipelined model will not be able to have exclusive access to a set of NeuronCores, the default NeuronCore placement (no NeuronCore Groups specified) would more evenly distribute the models.
Also note here that this is an example of where the order of model loads affects which model is assigned to which NeuronCore Group. If the order of the load statements is changed, models may be assigned to different NeuronCore Groups.
Issue: Incompatible Model Sizes#
Another problem occurs when attempting to place a model which does not evenly fit into a single group:
Environment Setup:
export NEURONCORE_GROUP_SIZES='2,2'
Python Script:
import torch
import torch_neuron
m0 = torch.jit.load('model-with-2-neuron-pipeline-cores.pt') # Loads to nc0-nc1
m1 = torch.jit.load('model-with-2-neuron-pipeline-cores.pt') # Loads to nc2-nc3
m2 = torch.jit.load('model-with-3-neuron-pipeline-cores.pt') # Loads to nc0-nc2
The model will be placed across NeuronCore Groups since there is no obvious group to assign the model to according to the environment variable configuration. Depending on the individual model and application requirements, the placement here may not be optimal.
Issue: Multiple Model Copies#
It is common in inference serving applications to use multiple replicas of a single model across different NeuronCores. This allows the hardware to be fully utilized to maximize throughput. In this scenario, when using NeuronCore Groups, the only way to replicate a model on multiple NeuronCores is to create a new model object. In the example below, 4 model loads are performed to place a model in each NeuronCore Group.
Environment Setup:
export NEURONCORE_GROUP_SIZES='2,2,2,2'
Python Script:
import torch
import torch_neuron
models = list()
for _ in range(4):
    # Each load is placed in the next NeuronCore Group
    model = torch.jit.load('model-with-2-neuron-pipeline-cores.pt')
    models.append(model)
The largest consequence of this type of model allocation is that the application code is responsible for routing inference requests to models. There are a variety of ways to implement the inference switching but in all cases routing logic needs to be implemented in the application code.
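One of the simplest routing strategies is round-robin dispatch. The sketch below is one possible approach, not something provided by torch-neuron: the class RoundRobinRouter is hypothetical, and plain Python callables stand in for the loaded model replicas so that the example runs without Neuron hardware. It does not address concurrency.

```python
import itertools


class RoundRobinRouter:
    """Send each request to the next model replica in turn."""

    def __init__(self, models):
        self._models = itertools.cycle(models)

    def __call__(self, *inputs):
        model = next(self._models)
        return model(*inputs)


# Stand-ins for the four loaded replicas; each returns (replica index, input).
replicas = [lambda x, i=i: (i, x) for i in range(4)]
router = RoundRobinRouter(replicas)
results = [router(request) for request in ("a", "b", "c", "d", "e")]
print(results)  # the fifth request wraps around to replica 0
```

In an application, the list of loaded models from the example above would take the place of the stand-in callables.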
Issue Summary#
The use of NEURONCORE_GROUP_SIZES has the following problems:
Variable Sized Models: Models which require crossing NeuronCore Group boundaries may be placed poorly. This means the group configuration limits the sizes of models that can be loaded.
Model Load Order: Models are loaded to NeuronCore Groups greedily. This means that the order of model loads can potentially negatively affect application performance by causing unintentional overlap.
Implicit Placement: NeuronCore Groups cannot be explicitly chosen in the application code.
Manual Replication: When loading multiple copies of a model to different NeuronCore Groups, this requires that multiple model handles are used.
Explicit Core Placement [Beta]#
To address the limitations of NEURONCORE_GROUP_SIZES, a new set of APIs has been added which allows specific NeuronCores to be chosen by the application code. These can be found in the torch_neuron_core_placement_api documentation.
Example: Manual Core Selection#
The most direct usage of the placement APIs is to manually select the start NeuronCore that each model is loaded to. This will automatically use as many NeuronCores as are necessary for that model (1 for most models, >1 for NeuronCore Pipeline models).
Environment Setup:
export NEURON_RT_NUM_CORES='4'
Python Script:
import torch
import torch_neuron
# NOTE: Order of loads does NOT matter
with torch_neuron.experimental.neuron_cores_context(2):
    m1 = torch.jit.load('model-with-2-neuron-pipeline-cores.pt') # Loads to nc2-nc3
with torch_neuron.experimental.neuron_cores_context(0):
    m2 = torch.jit.load('model-with-3-neuron-pipeline-cores.pt') # Loads to nc0-nc2
with torch_neuron.experimental.neuron_cores_context(0):
    m0 = torch.jit.load('model-with-2-neuron-pipeline-cores.pt') # Loads to nc0-nc1
with torch_neuron.experimental.neuron_cores_context(3):
    m3 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads to nc3
Note that this directly solves the NEURONCORE_GROUP_SIZES issues of:
Variable Sized Models: Now since models are directly placed on the NeuronCores requested by the application, there is no disconnect between the model sizes and NeuronCore Group sizes.
Model Load Order: Since the NeuronCores are explicitly selected, there is no need to be careful about the order in which models are loaded since they can be placed deterministically regardless of the load order.
Implicit Placement: Similarly, explicit placement means there is no chance that a model will end up being allocated to an incorrect NeuronCore Group.
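Because the start NeuronCore and the pipeline size fully determine where a model lands, the resulting layout can be worked out ahead of time. The sketch below does that bookkeeping in plain Python. The helper cores_for_model is hypothetical, and raising ValueError for an out-of-range request is an assumption of this sketch, not a description of how the placement API reports errors.

```python
def cores_for_model(start_nc, pipeline_cores, allocated_cores):
    """Return the core indices a model would occupy, or raise if it does not fit."""
    end_nc = start_nc + pipeline_cores  # exclusive upper bound
    if start_nc < 0 or end_nc > allocated_cores:
        raise ValueError(
            f"nc{start_nc}-nc{end_nc - 1} is outside the {allocated_cores} allocated cores"
        )
    return list(range(start_nc, end_nc))


# (name, start_nc, pipeline cores), in the same load order as the example above
loads = [("m1", 2, 2), ("m2", 0, 3), ("m0", 0, 2), ("m3", 3, 1)]
layout = {name: cores_for_model(start, size, 4) for name, start, size in loads}
print(layout)
```

Reordering the entries in loads leaves each model's cores unchanged, which is the load-order independence described above.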
Example: Automatic Multicore#
Using explicit core placement, it is possible to replicate a model to multiple NeuronCores simultaneously. This means that a single model object within Python can utilize all available NeuronCores (or all NeuronCores allocated to the process).
Environment Setup:
export NEURON_RT_NUM_CORES='8'
Python Script:
import torch
import torch_neuron
with torch_neuron.experimental.multicore_context():
    m0 = torch.jit.load('model-with-1-neuron-pipeline-cores.pt') # Loads replications to nc0-nc7
This addresses the last NEURONCORE_GROUP_SIZES issue of:
Manual Replication: Since models can be automatically replicated to multiple NeuronCores, this means that applications no longer need to implement routing logic and perform multiple loads.
This API has a secondary benefit that the exact same loading logic can be used on an inf1.xlarge or an inf1.6xlarge. In either case, it will use all of the NeuronCores that are visible to the process. This means that no special logic needs to be coded for different instance types.
Example: Explicit Replication#
Replication is also possible with the neuron_cores_context() API. The number of replications is chosen by replications = floor(nc_count / cores_per_model).
Environment Setup:
export NEURON_RT_NUM_CORES='8'
Python Script:
import torch
import torch_neuron
with torch_neuron.experimental.neuron_cores_context(start_nc=2, nc_count=4):
    m0 = torch.jit.load('model-with-2-neuron-pipeline-cores.pt') # Loads replications to nc2-nc5
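The replication formula can be checked with a few lines of arithmetic. The sketch below applies replications = floor(nc_count / cores_per_model) and lists the cores each replica would cover. The helper replica_core_ranges is hypothetical, and packing the replicas back to back from start_nc is an assumption that is consistent with the nc2-nc5 comment in the example.

```python
def replica_core_ranges(start_nc, nc_count, cores_per_model):
    """List the core indices used by each replica inside the requested span."""
    replications = nc_count // cores_per_model  # floor(nc_count / cores_per_model)
    return [
        list(range(start_nc + r * cores_per_model,
                   start_nc + (r + 1) * cores_per_model))
        for r in range(replications)
    ]


replicas = replica_core_ranges(start_nc=2, nc_count=4, cores_per_model=2)
print(replicas)  # two replicas covering nc2-nc3 and nc4-nc5
```

When nc_count is not a multiple of cores_per_model, the floor leaves the remaining cores in the span without a replica. For example, 8 cores with a 3-core model gives 2 replicas and 2 unused cores.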