This document is relevant for: Inf2, Trn1, Trn2, Trn3
NxD Core for Training
NeuronX Distributed Core (NxD Core) is a package that supports different distributed training mechanisms for Neuron devices. It provides XLA-friendly implementations of some of the most popular distributed training techniques. As model sizes scale, fitting a model on a single device becomes impossible, so model sharding techniques are needed to partition the model across multiple devices.
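As a minimal, framework-agnostic illustration of why sharding works (a NumPy sketch, not the Neuron APIs; all dimensions here are made up), a linear layer's weight matrix can be split column-wise across devices. Each device then holds and computes only its slice of the output, and concatenating the slices reproduces the unsharded result:

```python
import numpy as np

# Hypothetical dimensions; real model weights are far larger,
# which is what makes per-device sharding necessary.
batch, d_in, d_out, n_devices = 4, 8, 16, 2
rng = np.random.default_rng(0)

x = rng.standard_normal((batch, d_in))
w = rng.standard_normal((d_in, d_out))

# Column-wise sharding: each "device" holds d_out / n_devices columns of w.
shards = np.split(w, n_devices, axis=1)

# Each device computes its slice of the output independently.
partial_outputs = [x @ w_shard for w_shard in shards]

# Gathering (concatenating) the slices reproduces the full result.
y_parallel = np.concatenate(partial_outputs, axis=1)
y_full = x @ w
assert np.allclose(y_parallel, y_full)
```

In a real tensor-parallel run, each shard lives on a separate Neuron device and the gather is a collective communication step; the arithmetic, however, is exactly this.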
About NeuronX-Distributed (NxD) for Training
NeuronX Distributed (NxD Core) provides fundamental building blocks for running advanced distributed workloads on AWS Inferentia and Trainium instances. These building blocks include parallel linear layers that enable distributed training and inference, a model builder that compiles PyTorch modules into Neuron models, and more.
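The parallel linear layers mentioned above typically come in two complementary flavors in tensor-parallel schemes: column-parallel (output features sharded) and row-parallel (input features sharded). As a hedged NumPy sketch of the row-parallel case (illustrative only, not the NxD implementation), each device computes a partial matmul over its slice of the input features, and summing the partials, an all-reduce in a real distributed run, recovers the full output:

```python
import numpy as np

# Hypothetical dimensions chosen for illustration.
batch, d_in, d_out, n_devices = 4, 8, 16, 2
rng = np.random.default_rng(1)

x = rng.standard_normal((batch, d_in))
w = rng.standard_normal((d_in, d_out))

# Row-parallel sharding: split w along the input (row) dimension,
# and split the activation x to match.
w_shards = np.split(w, n_devices, axis=0)
x_shards = np.split(x, n_devices, axis=1)

# Each device computes a partial product over its input-feature slice.
partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]

# Summing the partials (an all-reduce across devices in practice)
# recovers the unsharded output.
y_parallel = sum(partials)
assert np.allclose(y_parallel, x @ w)
```

Chaining a column-parallel layer into a row-parallel one is a common pattern, since it defers the communication step to a single all-reduce at the end of the pair.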
The NeuronX Distributed Training (NxD Training) library is a collection of open-source tools and libraries designed to help customers train PyTorch models on AWS Trainium instances. It combines ease of use with access to features built on top of the NxD Core library. Except for a few Trainium-specific features, NxD Training is compatible with training platforms like NVIDIA's NeMo.
NxD Core for Training Documentation
Setup
Install PyTorch Neuron on Trn1 to create a PyTorch environment. It is recommended to work in a Python virtual environment (such as venv) to avoid package installation issues.
You can install the neuronx-distributed package using the following command:
python -m pip install neuronx_distributed --extra-index-url https://pip.repos.neuron.amazonaws.com
App Notes
API Reference Guide
Developer Guide
Tutorials