Developer Guides

Comprehensive guides for using NxD Inference (neuronx-distributed-inference) to deploy and optimize machine learning models on AWS Inferentia and AWS Trainium accelerators. These guides cover model onboarding, performance tuning, quantization, vLLM integration, and other advanced features to help you get the most out of AWS Neuron hardware.

Accuracy Evaluation with Datasets

Guide for evaluating model accuracy on benchmark datasets to verify output quality.
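
The general pattern the guide covers can be sketched in a few lines. Everything here is illustrative: generate_fn stands in for whatever NxD Inference generation entry point you use, and it is an assumption, not an NxDI API.

    from datasets import load_dataset

    def evaluate_exact_match(generate_fn, limit=100):
        # Illustrative only: generate_fn is a stand-in for your NxD
        # Inference generation call, not an NxDI API.
        dataset = load_dataset("gsm8k", "main", split="test")
        correct = 0
        for example in dataset.select(range(limit)):
            prediction = generate_fn(example["question"])
            # Real evaluations usually normalize and extract the final
            # answer before comparing.
            if prediction.strip() == example["answer"].strip():
                correct += 1
        return correct / limit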

Custom Quantization

Guide for applying custom quantization techniques to reduce model memory footprint and improve performance.
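
As a rough sketch, quantization in NxD Inference is driven through configuration. The field names below (quantized, quantization_dtype, quantization_type, quantized_checkpoints_path) are assumptions based on recent NxDI releases; confirm them against the guide for your version.

    from neuronx_distributed_inference.models.config import NeuronConfig

    # Assumed field names; check the guide for your NxDI version.
    neuron_config = NeuronConfig(
        tp_degree=32,
        quantized=True,
        quantized_checkpoints_path="/path/to/quantized/checkpoint",
        quantization_dtype="int8",
        quantization_type="per_tensor_symmetric",
    )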

Disaggregated Inference

Guide for using disaggregated inference architecture that separates prefill and decode phases for improved performance.

Feature Guide

Overview of NxD Inference features and configuration options for optimizing model deployment.
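
Most features are enabled through NeuronConfig. Here is a hedged sketch of common knobs, with illustrative values; the field names follow recent NxDI releases and may differ in yours.

    from neuronx_distributed_inference.models.config import (
        NeuronConfig,
        OnDeviceSamplingConfig,
    )

    neuron_config = NeuronConfig(
        tp_degree=32,              # tensor-parallel degree across NeuronCores
        batch_size=2,
        max_context_length=1024,   # longest prefill the compiled model accepts
        seq_len=2048,              # total tokens per sequence (prefill + decode)
        on_device_sampling_config=OnDeviceSamplingConfig(top_k=1),
        enable_bucketing=True,     # compile multiple sequence-length buckets
    )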

How to Use FPEM

Guide for using FPEM with NxD Inference.

LLM Inference Benchmarking Guide

Guide for benchmarking LLM inference performance to optimize deployment configurations.
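
Before reaching for the full tooling the guide describes, a plain Python timing loop is often enough for first numbers. This generic sketch measures latency percentiles for any generate_fn callable; it is not NxDI's benchmarking utility.

    import statistics
    import time

    def benchmark(generate_fn, prompt, n_requests=50):
        latencies = []
        for _ in range(n_requests):
            start = time.perf_counter()
            generate_fn(prompt)
            latencies.append(time.perf_counter() - start)
        latencies.sort()
        return {
            "p50_s": statistics.median(latencies),
            "p99_s": latencies[int(0.99 * (len(latencies) - 1))],
            "throughput_rps": n_requests / sum(latencies),
        }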

Migrate from TNX to NxDI

Guide for migrating from Transformers NeuronX to NxD Inference with step-by-step instructions.

Model Reference

Reference for production-ready models supported by NxD Inference and their configuration options.

MoE Architecture Deep Dive

Deep dive into Mixture of Experts (MoE) architecture implementation in NxD Inference.
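
The core routing idea the deep dive builds on fits in a few lines. This is illustrative PyTorch, not NxDI's implementation: each token picks its top-k experts and combines their outputs with renormalized router probabilities.

    import torch
    import torch.nn.functional as F

    def route_tokens(hidden, router_weight, top_k=2):
        # hidden: [tokens, hidden_dim]; router_weight: [hidden_dim, n_experts]
        probs = F.softmax(hidden @ router_weight, dim=-1)
        # Each token is dispatched to its top_k experts; the selected
        # probabilities become combination weights.
        weights, expert_ids = torch.topk(probs, top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        return weights, expert_ids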

NxD Examples Migration Guide

Guide for migrating existing NxD Core inference examples to the NxD Inference library.

Onboarding Models

Guide for onboarding new models to NxD Inference with detailed implementation steps.
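
A central onboarding step is swapping dense layers for NxD tensor-parallel layers. The sketch below uses ColumnParallelLinear and RowParallelLinear from the neuronx-distributed package; the surrounding MLP is illustrative, not an excerpt from the guide.

    import torch.nn as nn
    from neuronx_distributed.parallel_layers.layers import (
        ColumnParallelLinear,
        RowParallelLinear,
    )

    class NeuronMLP(nn.Module):
        def __init__(self, hidden_size, intermediate_size):
            super().__init__()
            # Column-parallel up-projection: each rank holds a slice of
            # the output features, so activations stay sharded.
            self.up_proj = ColumnParallelLinear(
                hidden_size, intermediate_size, bias=False, gather_output=False
            )
            # Row-parallel down-projection: partial sums are all-reduced
            # back into a full hidden state.
            self.down_proj = RowParallelLinear(
                intermediate_size, hidden_size, bias=False, input_is_parallel=True
            )
            self.act = nn.SiLU()

        def forward(self, x):
            return self.down_proj(self.act(self.up_proj(x)))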

Performance CLI Parameters

Guide to the command-line parameters that control performance tuning during model compilation and execution.

vLLM User Guide (Legacy)

Guide for using the legacy vLLM v0.x integration with NxD Inference for LLM inference and serving.
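
For orientation, offline inference through the v0.x integration looks roughly like this. The device="neuron" and tensor_parallel_size arguments follow the upstream vLLM Neuron examples; any environment variables needed to select NxD Inference as the backend depend on your versions, so treat this as a sketch.

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Meta-Llama-3.1-8B",
        device="neuron",          # v0.x Neuron device selection
        tensor_parallel_size=32,
        max_num_seqs=4,
        max_model_len=2048,
    )
    outputs = llm.generate(
        ["The capital of France is"],
        SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64),
    )
    for out in outputs:
        print(out.outputs[0].text)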

vLLM User Guide v1

Guide for using vLLM v1.x with NxD Inference for efficient LLM inference and serving.

Weights Sharding Guide

Guide for using weight sharding to distribute model parameters across multiple devices.
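
Conceptually, sharding splits each weight tensor so that every rank loads only its slice. This toy sketch shows the idea for a column-parallel split; NxD Inference performs the real sharding for you when it prepares a checkpoint.

    import torch

    def shard_column_parallel(weight, tp_degree):
        # weight: [out_features, in_features]; each rank keeps a
        # contiguous slice of the output features.
        assert weight.shape[0] % tp_degree == 0, "out_features must divide evenly"
        return list(torch.chunk(weight, tp_degree, dim=0))

    shards = shard_column_parallel(torch.randn(4096, 4096), tp_degree=8)
    print([tuple(s.shape) for s in shards])  # 8 shards of (512, 4096)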

Writing Tests

Guide for writing tests for NxD Inference models to ensure accuracy and performance.
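
A common pattern is asserting that on-device logits stay within tolerance of a CPU golden. This pytest sketch is illustrative, not NxDI's test utilities; build_neuron_model and build_cpu_model are hypothetical fixtures you would supply.

    import pytest
    import torch

    @pytest.mark.parametrize("prompt_len", [16, 128])
    def test_logits_match_cpu_golden(build_neuron_model, build_cpu_model, prompt_len):
        # Hypothetical fixtures return callables mapping input_ids to logits.
        input_ids = torch.randint(0, 32000, (1, prompt_len))
        expected = build_cpu_model(input_ids)
        actual = build_neuron_model(input_ids)
        # Loose tolerance: bf16 on-device math will not match fp32 exactly.
        assert torch.allclose(actual, expected, atol=1e-2, rtol=1e-2)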