Migrating from NeMo to Neuronx Distributed Training#

Neuronx Distributed Training (NxDT) is built on top of NeMo 1.14. The framework reuses modules from NeMo and exposes them through a similar config interface.

Note

At the moment, NxDT only supports training decoder-only LLM models.

This document describes how to migrate a NeMo training workload to an NxDT training workload.

Model Integration#

Model already exists in the NxDT model hub

If the model you want to train is already included in the NxDT model hub, and the training workflow (e.g., pre-training, fine-tuning) is supported in NxDT, migrate your NeMo YAML configuration file to an NxDT YAML file by following the mapping table in Config Mapping.
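As a minimal, hedged sketch of what this migration looks like, the fragment below maps a NeMo trainer block to NxDT using only the keys listed in the Config Mapping table later in this document; all values are placeholders, and the exact set of keys depends on your workload.

```yaml
# Illustrative sketch only -- values are placeholders; see the Config Mapping table.
trainer:
  devices: 32
  num_nodes: 4
  max_epochs: -1
  max_steps: 10000
  log_every_n_steps: 10
  val_check_interval: 1000
  limit_val_batches: 1
  limit_test_batches: 1
  gradient_clip_val: 1.0
  # NeMo keys such as accelerator, precision, logger, enable_checkpointing, and
  # accumulate_grad_batches are dropped here: they are handled by NxDT defaults,
  # the precision config, and the exp_manager config (see the mapping table).
```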

Custom/New Model

If your model is not part of the NxDT model hub, please use the guide Integrating a New Model.

Dataloader Integration#

Dataloader already exposed via one of the NxDT configs

In this case, please map the NeMo YAML config parameters to NxDT config parameters using the mapping table provided in Config Mapping.
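As a hedged sketch of this remapping, the batch-size and tokenizer settings that live under model in NeMo move into the separate data config in NxDT (the values below are placeholders):

```yaml
# Illustrative sketch only -- values are placeholders.
data:
  micro_batch_size: 1        # was model.micro_batch_size in NeMo
  global_batch_size: 1024    # was model.global_batch_size in NeMo
  tokenizer:                 # was model.tokenizer in NeMo
    # carry over your existing NeMo tokenizer sub-keys here
```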

Custom/New Dataloader

If the dataloader is not part of the hub, please use the guide Integrating a new dataset/dataloader.

Optimizer/LR Scheduler Integration#

Since NxDT is built on top of NeMo, all the optimizers/LR schedulers provided by NeMo can be enabled from the config.
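For example, an optim block in the style of a typical NeMo Llama config should carry over largely unchanged; the optimizer and scheduler names and the values below are placeholders, not a prescription.

```yaml
# Illustrative sketch only -- optimizer, scheduler, and values are placeholders.
optim:
  name: adamw
  lr: 3.0e-4
  weight_decay: 0.01
  betas: [0.9, 0.95]
  sched:
    name: CosineAnnealing
    warmup_steps: 500
    min_lr: 3.0e-5
```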

Optimal Partitioning#

NxDT is built on top of NeuronxDistributed (NxD) primitives and exposes different model parallelism techniques. All of them can be configured using the distributed_strategy config.
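Below is a minimal sketch of the distributed_strategy block using only the keys named in the Config Mapping table; the values are placeholders, additional options (for example ZeRO-1, mentioned in the checkpoint section below) are not shown, and where the block sits in the config hierarchy should be confirmed against an NxDT example config.

```yaml
# Illustrative sketch only -- values are placeholders.
distributed_strategy:
  tensor_model_parallel_size: 8
  pipeline_model_parallel_size: 4
  virtual_pipeline_model_parallel_size: 1
  sequence_parallel: true
```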

Fusions/kernels#

The fused kernels exposed through the NeMo config are not available in NxDT, because they are built specifically for GPUs. Neuron has its own set of kernels that can be enabled from the config. Also, since Neuron uses a graph-based approach, the compiler can optimize some of the modules and perform fusions wherever required.
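As a sketch, the only fusion named in the Config Mapping table below is the softmax fusion (masked_softmax_fusion maps to fusions.softmax). The flag value and the position of the fusions block in the config hierarchy are assumptions here; consult an NxDT example config for the supported set of kernels.

```yaml
# Illustrative sketch only -- the flag value is a placeholder.
fusions:
  softmax: true
```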

Checkpoint Saving/loading#

  1. NeMo combines the model weights, optimizer, and other state_dicts into a single state_dict and dumps a file of the form tp_rank_0*_pp_rank_00*/model_optim_rng.ckpt. With NxDT, the model state_dict and the optimizer state are saved separately: the model state_dict is saved in a folder of the form model/dp_rank_00_tp_rank_00_pp_rank_00.pt and the optimizer is saved into a separate folder as optim/dp_rank_00_tp_rank_00_pp_rank_00.pt (the two layouts are sketched after this list). This is done mainly so that when ZeRO-1 is used, each DP rank can save its own optimizer shard.

  2. NxDT doesn’t support .nemo style checkpoint saving. If users have a .nemo checkpoint, they would have to unpack it themselves and build a checkpoint conversion script to load the checkpoint into NxDT.

  3. In NeMo, when pipeline parallelism is used, each pipeline stage creates an independent model. For example, with a 32-layer model and PP=4, NeMo creates 4 chunks of 8 layers each, and every PP rank has a model_state_dict with keys numbered layer 0-7. In NxDT, the model is created as a whole and then sharded, so the original layer numbers are preserved on each PP rank.

  4. One would therefore have to write a checkpoint conversion script, similar to the checkpoint conversion from NeMo to NxDT.
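The two on-disk layouts from item 1 can be summarized as follows. This is not a config file, just YAML used to sketch the directory structures described above; the NeMo rank digits are left as wildcards, as in the original path pattern.

```yaml
# Sketch of the checkpoint layouts, not a configuration file.
nemo_checkpoint:
  "tp_rank_0*_pp_rank_00*":
    - model_optim_rng.ckpt                  # weights + optimizer + RNG state in one file
nxdt_checkpoint:
  model:
    - dp_rank_00_tp_rank_00_pp_rank_00.pt   # model state_dict per rank
  optim:
    - dp_rank_00_tp_rank_00_pp_rank_00.pt   # optimizer shard per rank (enables ZeRO-1 sharding)
```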

For a more detailed mapping of NeMo parameters to NxDT parameters, follow the guide Config Mapping.

Config Mapping#

Here is a detailed mapping of all the parameters in the config file. The mapping below uses the Llama example from both the NeMo and NxDT frameworks; the same mapping applies to other models as well.

| NeMo param | NxDT param mapping | Comments |
|------------|--------------------|----------|
| name | name | |
| restore_from_path | Not supported | |
| **trainer** | | |
| devices | devices | |
| num_nodes | num_nodes | |
| accelerator | Not required | The default is TPU, which maps to Neuron internally, so users no longer have to set it. |
| precision | Replaced by precision_config | There is a separate precision config to control the precision of the model and optimizer. |
| logger | Not required | The default value of logger is False. |
| enable_checkpointing | Separate exp_manager config | All checkpointing is controlled by the exp_manager config. |
| use_distributed_sampler | Not supported | |
| max_epochs | max_epochs | |
| max_steps | max_steps | |
| log_every_n_steps | log_every_n_steps | |
| val_check_interval | val_check_interval | |
| limit_val_batches | limit_val_batches | |
| limit_test_batches | limit_test_batches | |
| accumulate_grad_batches | Removed | Configured automatically from global_batch_size, micro_batch_size, and the distributed config. |
| gradient_clip_val | gradient_clip_val | |
| benchmark | Not supported | |
| enable_model_summary | Not supported | |
| **exp_manager** | | |
| log_local_rank_0_only | log_local_rank_0_only | |
| create_tensorboard_logger | create_tensorboard_logger | |
| explicit_log_dir | explicit_log_dir | |
| exp_dir | exp_dir | |
| name | name | |
| create_wandb_logger | Not supported | This was not supported under NNM either; the argument has been removed from NxDT. |
| wandb_logger_kwargs | Not supported | |
| resume_if_exists | resume_if_exists | |
| resume_ignore_no_checkpoint | resume_ignore_no_checkpoint | |
| create_checkpoint_callback | create_checkpoint_callback | |
| checkpoint_callback_params | checkpoint_callback_params | |
| **model** | | |
| mcore_gpt | Not supported | NxDT has its own implementation of megatron_gpt_model, based on NeMo v1.14. |
| tensor_model_parallel_size | distributed_strategy.tensor_model_parallel_size | All parallelism settings are moved to the distributed_strategy config. |
| pipeline_model_parallel_size | distributed_strategy.pipeline_model_parallel_size | |
| virtual_pipeline_model_parallel_size | distributed_strategy.virtual_pipeline_model_parallel_size | |
| sequence_parallel | distributed_strategy.sequence_parallel | |
| micro_batch_size | data.micro_batch_size | All dataset/dataloader/tokenizer configuration is now part of a separate config called data. |
| global_batch_size | data.global_batch_size | |
| tokenizer | data.tokenizer | |
| data | Moved to data at the same level as model | The entire data key now controls a DataModule and is placed at the same level as the model key in the config structure. |
| encoder_seq_length | encoder_seq_length | |
| max_position_embeddings | max_position_embeddings | |
| make_vocab_size_divisible_by | make_vocab_size_divisible_by | |
| pre_process | Not supported | NxDT adds an embedding layer at the start of the transformer block by default. |
| post_process | Not supported | NxDT adds an LM head at the end of the transformer block by default. |
| persist_layer_norm | persist_layer_norm | |
| share_embeddings_and_output_weights | share_embeddings_and_output_weights | |
| position_embedding_type | position_embedding_type | |
| rotary_percentage | rotary_percentage | |
| transformer_block_type | transformer_block_type | |
| has_bias | has_bias | |
| num_query_groups | Not required | Grouped-query attention can be configured using the num_kv_heads parameter. |
| native_amp_init_scale | Not required | Applies to native_amp_init_scale through bias_dropout_add_fusion: these GPU-specific optimizations were not supported in NNM and have been removed from NxDT. Most of these fusion ops are handled by the Neuron compiler on its own; for attention and softmax, Neuron uses NKI kernels and custom ops. |
| native_amp_growth_interval | Not required | |
| hysteresis | Not required | |
| fp32_residual_connection | Not required | |
| fp16_lm_cross_entropy | Not required | |
| megatron_amp_O2 | Not required | |
| grad_div_ar_fusion | Not required | |
| gradient_accumulation_fusion | Not required | |
| bias_activation_fusion | Not required | |
| bias_dropout_add_fusion | Not required | |
| masked_softmax_fusion | fusions.softmax | |
| seed | seed (moved out of model to the same level as model) | |
| resume_from_checkpoint | exp_manager.resume_from_checkpoint | |
| use_cpu_initialization | use_cpu_initialization | |
| onnx_safe | Not supported | This was not supported under NNM either; the argument has been removed from NxDT. |
| apex_transformer_log_level | Not supported | |
| gradient_as_bucket_view | Not supported | |
| sync_batch_comm | Not supported | |
| activations_checkpoint_granularity | activations_checkpoint_granularity | By default, NxDT checkpoints the attention module for selective checkpointing and a single layer for full checkpointing. |
| activations_checkpoint_method | Not supported | |
| activations_checkpoint_num_layers | Not supported | |
| num_micro_batches_with_partial_activation_checkpoints | Not supported | |
| activations_checkpoint_layers_per_pipeline | Not supported | |
| disable_layer_norm_checkpointing | Not supported | |
| transformer_engine | Not supported | This is built specifically for NVIDIA GPUs. |
| fp8 | Not supported | fp8 training is not supported on Neuron (both NNM and NxDT). |
| fp8_e4m3 | Not supported | |
| fp8_hybrid | Not supported | |
| fp8_margin | Not supported | |
| use_emha | Not supported | |
| nsys_profile | Not supported | This is built specifically for NVIDIA GPUs. |
| optim | optim | |

Note

For parameters that are not supported by NxDT, please create a feature request with the specific use case for the parameter, if needed.