Migrating from NeMo to Neuronx Distributed Training#
Neuronx Distributed Training (NxDT) is built on top of NeMo 1.14. The framework reuses modules from NeMo and exposes them through a similar config interface.
Note
At the moment, NxDT only supports training decoder-only LLM models.
This document describes how to migrate a NeMo training workload to an NxDT training workload.
Model Integration#
Model already exists in the NxDT model hub
If the model you want to train is already included in the NxDT model hub, and the training workflow (e.g., pre-training, fine-tuning) is supported in NxDT, migrate your NeMo YAML configuration file to an NxDT YAML file by following the mapping table in the Config Mapping section below.
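For orientation, here is a minimal sketch of how an NxDT training YAML is typically organized. The top-level sections (trainer, exp_manager, distributed_strategy, data, model) follow the Config Mapping table in this document, but the individual keys and values shown are placeholders rather than the authoritative schema; copy the real keys from the NxDT model hub config for your model.

```yaml
# Illustrative NxDT config skeleton. Section names follow the Config Mapping
# table below; individual keys/values are placeholders, not a complete schema.
name: llama_pretraining          # experiment name
trainer:
  devices: 32                    # devices per node
  num_nodes: 1
  max_steps: 10000
exp_manager:                     # logging and checkpointing (replaces enable_checkpointing)
  exp_dir: ./results
  create_checkpoint_callback: true
distributed_strategy:            # parallelism settings, moved out of NeMo's model config
  tensor_model_parallel_size: 8
  pipeline_model_parallel_size: 1
data:                            # dataset/dataloader/tokenizer settings, moved out of model
  global_batch_size: 256
  micro_batch_size: 1
model:                           # model architecture (and optimizer, as in NeMo)
  encoder_seq_length: 4096
```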
Custom/New Model
If your model is not part of the NxDT model hub, please use the guide Integrating a New Model.
Dataloader Integration#
Dataloader already exposed via one of the NxDT configs
In this case, please map the NeMo YAML config parameters to NxDT config parameters using the mapping table provided in the Config Mapping section below.
Custom/New Dataloader
If the dataloader is not part of the hub, please use the guide Integrating a new dataset/dataloader.
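As a hedged example, the batch-size and tokenizer parameters that NeMo keeps under the model config map to the data config in NxDT (per the Config Mapping table). The tokenizer sub-key below is an illustrative assumption, so check the dataset/dataloader guide and the hub configs for the exact schema.

```yaml
# Sketch of the NxDT data config. micro_batch_size, global_batch_size and
# tokenizer are listed in the Config Mapping table as moving here from NeMo's
# model config; the tokenizer sub-key is an assumption for illustration only.
data:
  micro_batch_size: 1            # batch size per device
  global_batch_size: 256         # effective batch size across data-parallel ranks
  tokenizer:
    type: /path/to/tokenizer     # hypothetical key; use the keys from your hub config
```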
Optimizer/LR Scheduler Integration#
Since NxDT is built on top of NeMo, all the optimizers/LR schedulers provided by NeMo can be enabled from the config.
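For example, a NeMo-style optim block like the one below should carry over with the same keys; the optimizer/scheduler names and values here are placeholders following NeMo v1.14 conventions rather than an exhaustive list of what NxDT supports.

```yaml
# NeMo-style optimizer and LR-scheduler block. NxDT reuses NeMo's optimizer/
# scheduler configuration, so a block of this shape maps across unchanged.
# Names and values are placeholders.
optim:
  name: adamw
  lr: 3.0e-4
  weight_decay: 0.01
  betas: [0.9, 0.95]
  sched:
    name: CosineAnnealing
    warmup_steps: 500
    min_lr: 3.0e-5
```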
Optimal Partitioning#
NxDT is built on top of NeuronxDistributed (NxD) primitives and exposes different model parallelism techniques. All of them can be configured using the distributed_strategy config.
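As a sketch, the parallelism degrees that NeMo keeps under the model config (see the Config Mapping table) are grouped under distributed_strategy in NxDT; the values below are placeholders to show the shape of the section.

```yaml
# Parallelism settings grouped under distributed_strategy in NxDT.
# Degrees shown are placeholders; choose values that match your cluster and model.
distributed_strategy:
  tensor_model_parallel_size: 8
  pipeline_model_parallel_size: 4
  virtual_pipeline_model_parallel_size: 1
  sequence_parallel: true
```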
Fusions/kernels#
The fused kernels available in the NeMo config are not available in NxDT, because NeMo's fused kernels are built specifically for GPUs. Neuron has a different set of kernels that can be enabled from the config. Also, since Neuron uses a graph-based approach, the compiler can optimize some of the modules and perform fusions wherever required.
Checkpoint Saving/loading#
NeMo combines the model weights, optimizer states, and other state_dicts into a single state_dict and dumps a file of the form `tp_rank_0*_pp_rank_00*/model_optim_rng.ckpt`. With NxDT, the model state_dict and the optimizer state are saved separately: the model state_dict is saved as `model/dp_rank_00_tp_rank_00_pp_rank_00.pt` and the optimizer state as `optim/dp_rank_00_tp_rank_00_pp_rank_00.pt`. This is done mainly so that, when zero1 is used, each DP rank can save its own optimizer shard.

NxDT does not support `.nemo`-style checkpoint saving. If you have a `.nemo` checkpoint, you have to unpack it yourself and build a checkpoint conversion script to load it into NxDT.

In NeMo, when pipeline parallelism is used, each pipeline stage creates an independent model. For example, with a 32-layer model and PP=4, NeMo creates 4 chunks of 8 layers each, so each PP rank has a `model_state_dict` with keys numbered layer 0 through layer 7. In NxDT, the model is created as a whole and then sharded, so the original layer numbers are preserved. To move such checkpoints, you would have to write a checkpoint conversion script similar to the existing NeMo-to-NxDT checkpoint conversion.
For a more detailed mapping of NeMo parameters to NxDT parameters, follow the guide Config Mapping.
Config Mapping#
Here is a detailed mapping of all the parameters in the config file. The mapping below uses the Llama example across both the NeMo and NxDT frameworks; the same mapping holds for other models as well.
| NeMo param | NxDT param mapping | Comments |
|---|---|---|
| name | name | |
| restore_from_path | Not supported | |
| trainer | | |
| devices | devices | |
| num_nodes | num_nodes | |
| accelerator | Not required | We made the default TPU, which maps to Neuron internally, so users no longer have to add it. |
| precision | Replaced by the precision config | There is a separate precision config to control the precision of the model and optimizer. |
| logger | Not required | We set the default value of logger to False. |
| enable_checkpointing | Separate exp_manager config | All checkpointing is controlled by the exp_manager config. |
| use_distributed_sampler | Not supported | |
| max_epochs | max_epochs | |
| max_steps | max_steps | |
| log_every_n_steps | log_every_n_steps | |
| val_check_interval | val_check_interval | |
| limit_val_batches | limit_val_batches | |
| limit_test_batches | limit_test_batches | |
| accumulate_grad_batches | Removed | This is configured automatically based on the global batch size, micro batch size, and distributed config. |
| gradient_clip_val | gradient_clip_val | |
| benchmark | Not supported | |
| enable_model_summary | Not supported | |
| exp_manager | | |
| log_local_rank_0_only | log_local_rank_0_only | |
| create_tensorboard_logger | create_tensorboard_logger | |
| explicit_log_dir | explicit_log_dir | |
| exp_dir | exp_dir | |
| name | name | |
| create_wandb_logger | Not supported | This was not supported under NNM either; we have removed this argument from NxDT. |
| wandb_logger_kwargs | Not supported | |
| resume_if_exists | resume_if_exists | |
| resume_ignore_no_checkpoint | resume_ignore_no_checkpoint | |
| create_checkpoint_callback | create_checkpoint_callback | |
| checkpoint_callback_params | checkpoint_callback_params | |
| model | | |
| mcore_gpt | Not supported | NxDT has its own implementation of megatron_gpt_model, based on NeMo v1.14. |
| tensor_model_parallel_size | Moved to distributed_strategy | All the parallelism configs are moved to the distributed_strategy config. |
| pipeline_model_parallel_size | Moved to distributed_strategy | |
| virtual_pipeline_model_parallel_size | Moved to distributed_strategy | |
| sequence_parallel | Moved to distributed_strategy | |
| micro_batch_size | Moved to data | All the dataset/dataloader/tokenizer configuration is now part of a separate config called data. |
| global_batch_size | Moved to data | |
| tokenizer | Moved to data | |
| data | Moved to the top-level data config | The entire data config is moved out of the model config. |
| encoder_seq_length | encoder_seq_length | |
| max_position_embeddings | max_position_embeddings | |
| make_vocab_size_divisible_by | make_vocab_size_divisible_by | |
| pre_process | Not supported | NxDT by default adds an embedding layer at the start of the transformer block. |
| post_process | Not supported | NxDT by default adds an LM head at the end of the transformer block. |
| persist_layer_norm | persist_layer_norm | |
| share_embeddings_and_output_weights | share_embeddings_and_output_weights | |
| position_embedding_type | position_embedding_type | |
| rotary_percentage | rotary_percentage | |
| transformer_block_type | transformer_block_type | |
| has_bias | has_bias | |
| num_query_groups | Not required | Query group attention can be configured through the NxDT model config. |
| native_amp_init_scale | Not required | |
| native_amp_growth_interval | Not required | GPU optimizations that were not supported in NNM have been removed from NxDT. The Neuron compiler handles most of these fusion ops on its own; for attention and softmax, Neuron uses NKI kernels and custom ops to implement them. |
| hysteresis | Not required | |
| fp32_residual_connection | Not required | |
| fp16_lm_cross_entropy | Not required | |
| megatron_amp_O2 | Not required | |
| grad_div_ar_fusion | Not required | |
| gradient_accumulation_fusion | Not required | |
| bias_activation_fusion | Not required | |
| bias_dropout_add_fusion | Not required | |
| masked_softmax_fusion | | |
| seed | seed, moved out of model to the same level as the model config | |
| resume_from_checkpoint | | |
| use_cpu_initialization | use_cpu_initialization | |
| onnx_safe | Not supported | This was not supported under NNM either; we have removed this argument from NxDT. |
| apex_transformer_log_level | Not supported | |
| gradient_as_bucket_view | Not supported | |
| sync_batch_comm | Not supported | |
| activations_checkpoint_granularity | activations_checkpoint_granularity | By default, NxDT checkpoints the attention module for selective checkpointing and a single layer for full checkpointing. |
| activations_checkpoint_method | Not supported | |
| activations_checkpoint_num_layers | Not supported | |
| num_micro_batches_with_partial_activation_checkpoints | Not supported | |
| activations_checkpoint_layers_per_pipeline | Not supported | |
| disable_layer_norm_checkpointing | Not supported | |
| transformer_engine | Not supported | This is specifically built for NVIDIA GPUs. |
| fp8 | Not supported | fp8 training is not supported on Neuron (both NNM and NxDT). |
| fp8_e4m3 | Not supported | |
| fp8_hybrid | Not supported | |
| fp8_margin | Not supported | |
| use_emha | Not supported | |
| nsys_profile | Not supported | This is specifically built for NVIDIA GPUs. |
| optim | optim | |
Note
For parameters that are not supported by NxDT, please create a feature request with the specific use case for the parameter, if needed.