This document is relevant for: Trn1, Trn2, Trn3

NxD Training#

This section contains the technical documentation specific to the NxD Training library included with the Neuron SDK.

What is NxD Training?#

NxD Training (NeuronX Distributed Training) is a PyTorch library for end-to-end distributed training on AWS Trainium instances. It combines ease of use with powerful features built on top of the NxD Core library, offering turnkey support for model pre-training, supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT) using LoRA.

With NxD Training, developers can:

  • Train large-scale models with turnkey workflows for pre-training, SFT, and PEFT (LoRA)

  • Leverage distributed strategies including Data Parallelism, Tensor Parallelism, Sequence Parallelism, Pipeline Parallelism, and ZeRO-1

  • Use PyTorch Lightning integration for organized training code

  • Access ready-to-use model samples based on HuggingFace and Megatron-LM formats

  • Manage experiments with integrated checkpointing, logging, and S3 storage support

  • Choose from three usage interfaces: YAML configuration files, PyTorch Lightning APIs, or NxD Core primitives
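As a flavor of the highest-level interface, a pre-training run combining several of the strategies above might be described in a single YAML file along these lines. This is a hedged sketch: the key names follow NeMo-style conventions and may not match the released NxD Training schema exactly, and the S3 bucket is a placeholder. Consult the YAML configuration developer guide for the authoritative schema.

```yaml
# Illustrative sketch only -- key names are NeMo-style conventions,
# not a guaranteed match for the released NxD Training schema.
trainer:
  max_steps: 10000
  devices: 32                      # NeuronCores used for training

distributed_strategy:
  tensor_model_parallel_size: 8    # Tensor Parallelism degree
  pipeline_model_parallel_size: 1  # Pipeline Parallelism degree
  sequence_parallel: true          # shard activations along the sequence dim
  zero1: true                      # ZeRO-1 optimizer-state sharding

exp_manager:
  exp_dir: ./results                     # local checkpoints and logs
  checkpoint_dir: s3://my-bucket/ckpts   # optional S3 storage (placeholder)
```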

NxD Training is compatible with training platforms such as NVIDIA’s NeMo (except for Trainium-specific features) and is available on GitHub as both a pip wheel and source code.

Usage Interfaces#

NxD Training provides three interfaces to meet different developer needs:

  • YAML Configuration Files: High-level access for distributed training with minimal code changes

  • PyTorch Lightning APIs: Intermediate-level access through standardized Lightning training workflows built on NxD Core primitives

  • NxD Core Primitives: Low-level APIs for custom model integration and advanced use cases
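To make the low-level end of this spectrum more concrete, the sketch below illustrates, in plain Python, the kind of bookkeeping that ZeRO-1 performs under the hood: each data-parallel rank owns optimizer state for only its shard of the parameters rather than replicating state for all of them. No NxD or PyTorch APIs are used here; the function and its partitioning scheme are purely illustrative, not the library's actual implementation.

```python
# Illustrative sketch of ZeRO-1 style optimizer-state sharding.
# No NxD APIs are used; this only demonstrates the partitioning idea.

def shard_params(num_params: int, dp_size: int, rank: int) -> range:
    """Return the parameter indices owned by `rank` when `num_params`
    parameters are split as evenly as possible across `dp_size` ranks."""
    base, rem = divmod(num_params, dp_size)
    start = rank * base + min(rank, rem)
    length = base + (1 if rank < rem else 0)
    return range(start, start + length)

num_params, dp_size = 10, 4
shards = [shard_params(num_params, dp_size, r) for r in range(dp_size)]

# Every parameter is owned by exactly one rank, so optimizer state
# (e.g. Adam moments) is stored once across the data-parallel group.
owned = sorted(i for s in shards for i in s)
assert owned == list(range(num_params))
assert max(len(s) for s in shards) - min(len(s) for s in shards) <= 1
```

After each step, ranks update only their own shard and then all-gather the refreshed parameters, which is why ZeRO-1 cuts optimizer-state memory by roughly the data-parallel degree while leaving the forward and backward passes unchanged.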

NxD Training documentation#

Overview

Learn about NxD Training architecture, key features, and usage interfaces for distributed training on AWS Trainium.

Setup

Step-by-step instructions for installing and configuring NxD Training on Trainium instances.

Tutorials

Hands-on tutorials for training various models including Llama, GPT, and BERT with different parallelism strategies.

Developer Guides

In-depth guides for model integration, YAML configuration, migration from NeMo/NNM, and advanced training workflows.

API Reference

Comprehensive API documentation for NxD Training modules, configuration options, and programming interfaces.

Application Notes

Detailed application notes on distributed strategies, optimization techniques, and best practices for training.

Misc Resources

Known issues, troubleshooting guides, and other helpful resources for working with NxD Training.

NxD Training Release Notes

Review the latest updates, new features, and bug fixes in NxD Training releases.
