PyTorch Neuron neuron_parallel_compile CLI (torch-neuronx)
PyTorch Neuron performs just-in-time compilation of graphs during execution. At every step, a graph is traced; if the traced graph differs from previous executions, it is compiled by the Neuron compiler. For large models, the compilation time for each graph can be high. Moreover, because of JIT, all of these graphs are compiled sequentially, incurring a large compilation penalty.
To reduce this compilation time during execution, the neuron_parallel_compile
utility is provided as part of the PyTorch Neuron installation. The
neuron_parallel_compile utility will extract graphs from a trial run of your script,
perform parallel pre-compilation of the graphs, and populate the Neuron Cache
on disk with the compiled graphs. Your trial run should be limited to a few steps
(e.g. 10-15), enough for the utility to extract the different graphs needed for
full execution. To run the utility:
neuron_parallel_compile <run commands>
<run commands> are the commands for a short run (e.g. 10
steps) that traces the training loops for pre-compilation. An example
run command is
torchrun --nproc_per_node=2 <train script>, where the
train script accepts a
--steps_this_run option to limit the number of run steps:
neuron_parallel_compile torchrun --nproc_per_node=2 <train script> --steps_this_run=10
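The example above assumes the training script exposes a --steps_this_run flag. A minimal sketch of that pattern is below; the function and argument names (other than --steps_this_run itself) are illustrative, not part of the torch-neuronx API, and the training-step body is stubbed out.

```python
# Sketch of a train script that honors --steps_this_run, so that
# neuron_parallel_compile only traces the handful of steps needed to
# extract all graphs during the trial run.
import argparse


def train(args):
    steps_done = 0
    for step in range(args.max_steps):
        # ... forward pass, backward pass, optimizer step would go here ...
        steps_done += 1
        if steps_done >= args.steps_this_run:
            break  # stop early during the precompilation trial run
    return steps_done


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_steps", type=int, default=100000)
    parser.add_argument("--steps_this_run", type=int, default=None,
                        help="limit the number of steps (used with neuron_parallel_compile)")
    args = parser.parse_args(argv)
    if args.steps_this_run is None:
        args.steps_this_run = args.max_steps  # no limit unless the flag is given
    return train(args)


if __name__ == "__main__":
    main()
```

With this in place, the trial run can be launched as shown above, and the unlimited run simply omits the flag.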
NOTE: To avoid a hang during a
neuron_parallel_compile run, please make sure to use xm.save
instead of torch.save to save checkpoints.
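A minimal sketch of a checkpoint helper that follows this note: it uses xm.save when torch_xla is available (xm.save synchronizes workers before writing, which is what avoids the hang), and here falls back to a plain stdlib pickle purely so the sketch stays self-contained. In a real training script the fallback would be torch.save; the helper name is illustrative.

```python
# Checkpoint helper: prefer xm.save under torch-neuronx / neuron_parallel_compile.
import pickle


def save_checkpoint(state, path):
    try:
        import torch_xla.core.xla_model as xm  # present in torch-neuronx environments
        xm.save(state, path)  # rendezvous across workers, then save; safe under precompile
    except ImportError:
        # Illustration-only fallback so this sketch runs anywhere;
        # a real script would call torch.save(state, path) here.
        with open(path, "wb") as f:
            pickle.dump(state, f)


# Usage (the fallback path runs where torch_xla is not installed):
save_checkpoint({"step": 10}, "ckpt.pkl")
```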
You may notice that the output from the model is invalid when you use
neuron_parallel_compile. This is because, when you initiate your training
run command with
neuron_parallel_compile, the utility runs your command
with certain environment variables that put your training script into graph
extraction mode. In this mode, no real execution is performed and the outputs
are invalid. After
neuron_parallel_compile finishes compilation of all graphs, it will copy
all the compilation results into the Neuron Cache.
Now if you run your script (without
neuron_parallel_compile), it will be faster
since the compiled graphs are already cached:
torchrun --nproc_per_node=2 <train script>
Note: Except for the option that limits the number of run steps (such as
--steps_this_run), the other options in
<run commands> must match between the pre-compilation run and the
actual run. If this is not the case, you may see additional compilations during the training
run because new graphs are generated, resulting in cache misses.
There may also be additional compilations due to unreached execution paths (when an execution path is not reached in the first few steps of graph extraction), or due to changes in parameters such as the number of data-parallel workers.
Two additional environment variables are provided:
NEURON_PARALLEL_COMPILE_MAX_RETRIES: Sets the maximum number of retries when using the
neuron_parallel_compile tool. If set to N, the tool will retry compilation N more time(s) if the first graph compilation failed. Example: set NEURON_PARALLEL_COMPILE_MAX_RETRIES=1 when precompiling on trn1.2xlarge, where host memory and CPU resources are limited. Default is 0.
NEURON_IGNORE_TRAINING_SCRIPT_ERROR_AND_COMPILE: When using neuron_parallel_compile, if you want to ignore an error in the training script and still compile the accumulated HLO graphs, you can do so by setting this environment variable. Example: if NEURON_IGNORE_TRAINING_SCRIPT_ERROR_AND_COMPILE=1 is set when using
neuron_parallel_compile, a crash in the training script is ignored and the graphs collected up to the crash are compiled.
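Putting the two variables together, a precompilation invocation might look like the following (train.py and its flags are placeholders for your own script; requires a torch-neuronx environment):

```shell
# Retry a failed graph compilation once, and compile whatever graphs were
# collected even if the training script crashes during extraction.
NEURON_PARALLEL_COMPILE_MAX_RETRIES=1 \
NEURON_IGNORE_TRAINING_SCRIPT_ERROR_AND_COMPILE=1 \
neuron_parallel_compile torchrun --nproc_per_node=2 train.py --steps_this_run=10
```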