This document is relevant for: Inf2, Trn1, Trn2
NxD Inference Tutorials
Welcome to the NeuronX Distributed (NxD) Inference tutorials collection. These step-by-step guides help you deploy and optimize large language models (LLMs) on AWS Neuron hardware. Learn how to run models such as Llama, Qwen, GPT-OSS, and Flux with optimization techniques including speculative decoding, tensor parallelism, and disaggregated inference.
Llama
Llama3.1 405B on Trn2
Learn how to deploy Llama3.1 405B on a single Trn2 instance using NxD Inference with vLLM and explore performance optimization techniques.
Llama3.1 405B with Speculative Decoding
Optimize Llama3.1 405B inference on Trn2 using vanilla and fused speculative decoding for improved performance.
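Conceptually, speculative decoding uses a small draft model to propose a short run of tokens that the large target model then verifies, accepting matches and correcting the first mismatch. The following is a minimal, framework-free sketch of that accept/verify loop; the `draft_next` and `target_next` functions are hypothetical toy stand-ins, not NxD Inference APIs:

```python
def draft_next(tokens):
    # Hypothetical cheap draft model: a deterministic toy next-token rule.
    return (tokens[-1] * 3 + 1) % 50

def target_next(tokens):
    # Hypothetical accurate target model: agrees with the draft most of the time.
    nxt = (tokens[-1] * 3 + 1) % 50
    return nxt if nxt % 7 else (nxt + 1) % 50

def speculative_decode(prompt, n_new, k=4):
    """Generate n_new tokens, drafting k at a time and verifying with the target."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < n_new:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Target model verifies the proposals; first mismatch is corrected
        #    and ends the accepted run.
        ctx = list(tokens)
        for t in draft:
            expected = target_next(ctx)
            if expected == t:
                tokens.append(t)
                ctx.append(t)
            else:
                tokens.append(expected)
                break
    return tokens[len(prompt):len(prompt) + n_new]
```

With greedy acceptance like this, the output is identical to decoding with the target model alone; the speedup in a real system comes from verifying all `k` draft tokens in a single target-model forward pass.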
Llama3.1 70B Instruct Accuracy Evaluation
Evaluate the accuracy of Llama3.1 70B Instruct model on Trn1 hardware and learn how to measure model performance.
Llama3.1 8B with Multi-LoRA
Learn how to serve multiple LoRA adapters with Llama3.1 8B on Trn2, sharing one base model across fine-tuned, domain-specific variants.
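The idea behind multi-LoRA serving is that the large base weights are loaded once, while each request selects a small low-rank adapter (a pair of matrices `B` and `A`) whose product is added to the base projection. A minimal NumPy sketch under toy dimensions; the adapter names and `forward` helper are hypothetical, not the NxD API:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden size and (small) LoRA rank

W = rng.normal(size=(d, d))  # shared base weight, loaded once

# Hypothetical per-domain adapters: each is just two small matrices.
adapters = {
    "medical": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
    "legal":   (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
}

def forward(x, adapter_id=None, alpha=1.0):
    """Base projection plus the selected adapter's low-rank update."""
    y = x @ W.T
    if adapter_id is not None:
        B, A = adapters[adapter_id]
        y = y + (x @ (B @ A).T) * (alpha / r)
    return y

x = rng.normal(size=(d,))
```

Because the update is additive, `forward(x, name)` is mathematically the same as multiplying by a merged weight `W + (alpha/r) * B @ A`; serving keeps the adapters unmerged so many domains can share one copy of `W`.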
Llama3.3 70B with Speculative Decoding
Deploy Llama3.3 70B on Trn2 instances and learn how to optimize performance with tensor parallelism and other NxD Inference features.
Llama3.3 70B with Data Parallelism
Explore data parallelism techniques for Llama3.3 70B on Trn2 to increase throughput for high-volume inference workloads.
Llama3.3 70B with APC
Deploy Llama3.3 70B on Trn2 with Automatic Prefix Caching (APC) to improve inference performance for repetitive patterns.
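Automatic Prefix Caching works by splitting the prompt into fixed-size blocks and reusing the KV-cache blocks of any prefix already seen, so requests sharing a system prompt skip most of the prefill. A toy sketch of the block-level cache lookup, with a string standing in for real KV tensors (the block size and `prefill` helper are illustrative assumptions):

```python
BLOCK = 4  # tokens per cache block (real systems use larger, fixed-size blocks)
cache = {}  # maps a block plus its prefix chain to its (mock) KV state

def prefill(tokens):
    """Return how many blocks were computed vs. served from the prefix cache."""
    computed = reused = 0
    prefix = ()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        block = tuple(tokens[i:i + BLOCK])
        key = (prefix, block)  # a block is reusable only under the same prefix
        if key in cache:
            reused += 1
        else:
            cache[key] = f"kv-for-{key}"  # stand-in for real KV tensors
            computed += 1
        prefix = key
    return computed, reused

system = list(range(16))  # shared 16-token system prompt
first = prefill(system + [101, 102, 103, 104])   # cold: computes all 5 blocks
second = prefill(system + [201, 202, 203, 204])  # reuses the 4 system-prompt blocks
```

This is why APC helps most on repetitive patterns: the second request recomputes only the block that actually differs.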
Llama3.3 70B FP8 on Trainium2
Deploy an FP8-quantized Llama3.3 70B model on Trainium2.
Llama4
Deploy and optimize Llama4 models on AWS Neuron hardware using NxD Inference with various performance tuning options.
Qwen
Qwen3 MoE 235B
Learn how to deploy Qwen/Qwen3-235B-A22B using NxD Inference with various performance tuning options.
Speculative Decoding (Qwen3-32B) on Trainium2
Implement speculative decoding techniques with Qwen3-32B on Trn2 instances to accelerate LLM inference with NxD Inference.
GPT
GPT-OSS 120B on Trainium3
Deploy the open-weight GPT-OSS 120B model on Trainium3 hardware using NxD Inference and explore Trn3-specific optimizations.
Flux
Flux Inference
Learn how to run the Flux text-to-image model efficiently with NxD, enabling dynamic batch processing and optimized resource utilization.
Flux Inpainting
Learn how to use the Flux-Fill model for efficient inference with NxD, enabling image inpainting/outpainting.