Developer Guides
Comprehensive guides for using NxD Inference (neuronx-distributed-inference) to deploy and optimize machine learning models on AWS Inferentia and AWS Trainium accelerators. These guides cover model onboarding, performance optimization, quantization techniques, integration with vLLM, and other advanced features to help you maximize the performance of your models on AWS Neuron hardware.
The guides below walk through each of these topics in detail.
Accuracy Evaluation with Datasets
Guide for evaluating model accuracy using datasets to ensure model quality and performance.
Custom Quantization
Guide for implementing custom quantization techniques to optimize model size and performance.
Disaggregated Inference
Guide for using disaggregated inference architecture that separates prefill and decode phases for improved performance.
Feature Guide
Overview of NxD Inference features and configuration options for optimizing model deployment.
How to Use FPEM
Guide for using the Fast Parameter-Efficient Module (FPEM) for model fine-tuning.
LLM Inference Benchmarking Guide
Guide for benchmarking LLM inference performance to optimize deployment configurations.
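As a generic illustration of the metrics such a benchmark typically reports (not NxD Inference's actual benchmarking API), the sketch below measures time-to-first-token and decode throughput for a hypothetical token-streaming `generate_fn`; substitute your model's streaming call.

```python
import time

def benchmark_generate(generate_fn, prompt, n_runs=3):
    """Measure time-to-first-token (TTFT) and decode throughput for a
    token-streaming generate function. generate_fn is a hypothetical
    callable that yields tokens one at a time."""
    results = []
    for _ in range(n_runs):
        start = time.perf_counter()
        first_token_time = None
        n_tokens = 0
        for _token in generate_fn(prompt):
            now = time.perf_counter()
            if first_token_time is None:
                # Latency of the prefill phase, up to the first emitted token.
                first_token_time = now - start
            n_tokens += 1
        total = time.perf_counter() - start
        decode_time = total - first_token_time
        # Throughput of the decode phase (tokens after the first one).
        tps = (n_tokens - 1) / decode_time if decode_time > 0 else float("inf")
        results.append({"ttft_s": first_token_time, "tokens_per_s": tps})
    return results

# Toy stand-in for a model's streaming generate call.
def dummy_generate(prompt):
    for tok in prompt.split():
        yield tok

stats = benchmark_generate(dummy_generate, "hello world from neuron")
```

Reporting TTFT and decode throughput separately matters because prefill and decode have different bottlenecks (compute-bound vs. memory-bound), which is why deployment configurations are tuned for each.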
Migrate from TNX to NxDI
Guide for migrating from Transformers NeuronX to NxD Inference with step-by-step instructions.
Model Reference
Reference for production-ready models supported by NxD Inference and their configuration options.
MoE Architecture Deep Dive
Deep dive into Mixture of Experts (MoE) architecture implementation in NxD Inference.
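The core of any MoE layer is top-k routing: a router scores the experts per token, the k best are selected, and their gate weights are renormalized. Below is a minimal, framework-free sketch of that gating step (an illustration of the general technique, not NxD Inference's implementation).

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_route(router_logits, k=2):
    """Select the top-k experts for one token and renormalize their
    gate weights so they sum to 1, as in standard MoE top-k gating."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in ranked)
    return [(i, probs[i] / total) for i in ranked]

# One token's router logits over 4 experts; experts 2 and 0 score highest.
assignment = top_k_route([1.0, 0.2, 2.0, -1.0], k=2)
```

In a real MoE layer the selected experts' outputs are combined with these renormalized weights, and the routing decision determines how tokens are dispatched across devices.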
NxD Examples Migration Guide
Guide for migrating example code to NxD Inference from other frameworks or from earlier versions.
Onboarding Models
Guide for onboarding new models to NxD Inference with detailed implementation steps.
Performance CLI Parameters
Guide for performance tuning using command-line interface parameters for optimal model execution.
vLLM User Guide (Legacy)
Legacy guide for using vLLM v0.x with NxD Inference for LLM inference and serving.
vLLM User Guide v1
Guide for using vLLM v1.x with NxD Inference for efficient LLM inference and serving.
Weights Sharding Guide
Guide for implementing weight sharding to distribute model parameters across multiple devices.
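To illustrate the idea behind sharding (not NxD Inference's actual sharding code), the sketch below splits a 2-D weight matrix column-wise across tensor-parallel ranks, the layout used by a column-parallel linear layer:

```python
def shard_columns(weight, tp_degree):
    """Split a 2-D weight matrix (list of rows) column-wise across
    tp_degree ranks; each rank receives one contiguous block of columns."""
    n_cols = len(weight[0])
    assert n_cols % tp_degree == 0, "columns must divide evenly across ranks"
    per_rank = n_cols // tp_degree
    shards = []
    for rank in range(tp_degree):
        lo, hi = rank * per_rank, (rank + 1) * per_rank
        shards.append([row[lo:hi] for row in weight])
    return shards

# A 2x4 weight split across 2 ranks yields two 2x2 shards.
w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
shards = shard_columns(w, tp_degree=2)
```

Pre-sharding weights this way lets each device load only its own slice at startup instead of materializing the full checkpoint, which is the main motivation for a weight-sharding step.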
Writing Tests
Guide for writing tests for NxD Inference models to ensure accuracy and performance.