This document is relevant for: Trn1, Trn2, Trn3
NxD Training#
This section contains the technical documentation specific to the NxD Training library included with the Neuron SDK.
What is NxD Training?#
NxD Training (NeuronX Distributed Training) is a PyTorch library for end-to-end distributed training on AWS Trainium instances. It combines ease of use with powerful features built on top of the NxD Core library, offering turnkey support for model pre-training, supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT) using LoRA.
With NxD Training, developers can:
Train large-scale models with turnkey workflows for pre-training, SFT, and PEFT (LoRA)
Leverage distributed strategies including Data Parallelism, Tensor Parallelism, Sequence Parallelism, Pipeline Parallelism, and ZeRO-1
Use PyTorch Lightning integration for organized training code
Access ready-to-use model samples based on HuggingFace and Megatron-LM formats
Manage experiments with integrated checkpointing, logging, and S3 storage support
Choose from three usage interfaces: YAML configuration files, PyTorch Lightning APIs, or NxD Core primitives
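To make one of the strategies above concrete, here is a minimal pure-Python sketch of the idea behind ZeRO-1: each data-parallel rank keeps a full copy of the parameters but owns the optimizer state for only its shard of them. This is a conceptual illustration only, not the NxD Training API; all names are ours.

```python
# Conceptual sketch of ZeRO-1 optimizer-state sharding: optimizer state
# is partitioned across data-parallel ranks instead of replicated.
# Pure Python for illustration -- not the NxD Training API.

def shard_indices(num_params: int, num_ranks: int, rank: int) -> range:
    """Return the contiguous slice of parameter indices owned by `rank`."""
    base, rem = divmod(num_params, num_ranks)
    start = rank * base + min(rank, rem)
    size = base + (1 if rank < rem else 0)
    return range(start, start + size)

num_params, num_ranks = 10, 4
owned = [shard_indices(num_params, num_ranks, r) for r in range(num_ranks)]

# Every parameter index is owned by exactly one rank, so optimizer-state
# memory is split roughly evenly across the data-parallel group.
assert sorted(i for shard in owned for i in shard) == list(range(num_params))
```

In a real run, each rank would update only its owned shard during the optimizer step and then gather the updated parameters from the other ranks.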
NxD Training is compatible with training platforms like NVIDIA’s NeMo (except for Trainium-specific features) and is available on GitHub as both a pip wheel and source code.
Usage Interfaces#
NxD Training provides three interfaces to meet different developer needs:
YAML Configuration Files: High-level access for distributed training with minimal code changes
PyTorch Lightning APIs: Standardized training workflows with NxD Core primitives
NxD Core Primitives: Low-level APIs for custom model integration and advanced use cases
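As a rough illustration of the YAML interface, a run might be described declaratively along these lines. The field names below are hypothetical and shown only to convey the shape of such a file; consult the NxD Training samples on GitHub for the exact schema.

```yaml
# Illustrative sketch only -- field names are hypothetical.
trainer:
  max_steps: 1000
  devices: 32
distributed_strategy:
  tensor_model_parallel_size: 8
  pipeline_model_parallel_size: 1
  zero1: true
```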
NxD Training documentation#