This document is relevant for: Inf2, Trn1, Trn2
NxD Inference Tutorials
Welcome to the NeuronX Distributed (NxD) Inference tutorials collection. These step-by-step guides help you deploy and optimize large language models (LLMs) on AWS Neuron hardware. Learn how to run models such as Llama, Qwen, GPT-OSS, and Flux with optimization techniques including speculative decoding, tensor parallelism, and disaggregated inference.
Llama
Llama3.1 405B on Trn2
Learn how to deploy Llama3.1 405B on a single Trn2 instance using NxD Inference with vLLM and explore performance optimization techniques.
Llama3.1 405B with Speculative Decoding
Optimize Llama3.1 405B inference on Trn2 using vanilla and fused speculative decoding for improved performance.
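Conceptually, speculative decoding uses a small draft model to propose a short run of tokens that the large target model then verifies, accepting matches and correcting the first mismatch. The following is a minimal, framework-free sketch of that accept/verify loop; the `draft_next` and `target_next` functions are hypothetical toy stand-ins, not NxD Inference APIs:

```python
def draft_next(tokens):
    # Hypothetical cheap draft model: a deterministic toy next-token rule.
    return (tokens[-1] * 3 + 1) % 50

def target_next(tokens):
    # Hypothetical accurate target model: agrees with the draft most of the time.
    nxt = (tokens[-1] * 3 + 1) % 50
    return nxt if nxt % 7 else (nxt + 1) % 50

def speculative_decode(prompt, n_new, k=4):
    """Generate n_new tokens, drafting k at a time and verifying with the target."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < n_new:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Target model verifies the proposals; first mismatch is corrected
        #    and ends the accepted run.
        ctx = list(tokens)
        for t in draft:
            expected = target_next(ctx)
            if expected == t:
                tokens.append(t)
                ctx.append(t)
            else:
                tokens.append(expected)
                break
    return tokens[len(prompt):len(prompt) + n_new]
```

With greedy acceptance like this, the output is identical to decoding with the target model alone; the speedup in a real system comes from verifying all `k` draft tokens in a single target-model forward pass.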
Llama3.1 70B Instruct Accuracy Evaluation
Evaluate the accuracy of Llama3.1 70B Instruct model on Trn1 hardware and learn how to measure model performance.
Llama3.1 8B with Multi-LoRA
Learn how to serve multiple LoRA adapters with Llama3.1 8B on Trn2, sharing one base model across fine-tuned, domain-specific variants.
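The idea behind multi-LoRA serving is that the large base weights are loaded once, while each request selects a small low-rank adapter (a pair of matrices `B` and `A`) whose product is added to the base projection. A minimal NumPy sketch under toy dimensions; the adapter names and `forward` helper are hypothetical, not the NxD API:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden size and (small) LoRA rank

W = rng.normal(size=(d, d))  # shared base weight, loaded once

# Hypothetical per-domain adapters: each is just two small matrices.
adapters = {
    "medical": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
    "legal":   (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
}

def forward(x, adapter_id=None, alpha=1.0):
    """Base projection plus the selected adapter's low-rank update."""
    y = x @ W.T
    if adapter_id is not None:
        B, A = adapters[adapter_id]
        y = y + (x @ (B @ A).T) * (alpha / r)
    return y

x = rng.normal(size=(d,))
```

Because the update is additive, `forward(x, name)` is mathematically the same as multiplying by a merged weight `W + (alpha/r) * B @ A`; serving keeps the adapters unmerged so many domains can share one copy of `W`.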
Llama3.3 70B with Speculative Decoding
Deploy Llama3.3 70B on Trn2 instances and learn how to optimize performance with tensor parallelism and other NxD Inference features.
Llama3.3 70B with Data Parallelism
Explore data parallelism techniques for Llama3.3 70B on Trn2 to increase throughput for high-volume inference workloads.
Llama3.3 70B with APC
Deploy Llama3.3 70B on Trn2 with Automatic Prefix Caching (APC) to improve inference performance for repetitive patterns.
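Automatic Prefix Caching works by splitting the prompt into fixed-size blocks and reusing the KV-cache blocks of any prefix already seen, so requests sharing a system prompt skip most of the prefill. A toy sketch of the block-level cache lookup, with a string standing in for real KV tensors (the block size and `prefill` helper are illustrative assumptions):

```python
BLOCK = 4  # tokens per cache block (real systems use larger, fixed-size blocks)
cache = {}  # maps a block plus its prefix chain to its (mock) KV state

def prefill(tokens):
    """Return how many blocks were computed vs. served from the prefix cache."""
    computed = reused = 0
    prefix = ()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        block = tuple(tokens[i:i + BLOCK])
        key = (prefix, block)  # a block is reusable only under the same prefix
        if key in cache:
            reused += 1
        else:
            cache[key] = f"kv-for-{key}"  # stand-in for real KV tensors
            computed += 1
        prefix = key
    return computed, reused

system = list(range(16))  # shared 16-token system prompt
first = prefill(system + [101, 102, 103, 104])   # cold: computes all 5 blocks
second = prefill(system + [201, 202, 203, 204])  # reuses the 4 system-prompt blocks
```

This is why APC helps most on repetitive patterns: the second request recomputes only the block that actually differs.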
Llama3.3 70B FP8 on Trainium2
Deploy an FP8-quantized Llama3.3 70B model on Trainium2.
Llama4
Deploy and optimize Llama4 models on AWS Neuron hardware using NxD Inference with various performance tuning options.
Qwen
Qwen3 MoE 235B
Learn how to deploy Qwen/Qwen3-235B-A22B using NxD Inference with various performance tuning options.
Speculative Decoding (Qwen3-32B) on Trainium2
Implement speculative decoding techniques with Qwen3-32B on Trn2 instances to accelerate LLM inference with NxD Inference.
GPT
GPT-OSS 120B on Trainium3
Deploy the open-weight GPT-OSS 120B model on Trainium3 hardware using NxD Inference and explore Trn3-specific optimizations.
Flux
Flux Inference
Learn how to run the Flux text-to-image model efficiently with NxD, enabling dynamic batch processing and optimized resource utilization.
Flux Inpainting
Learn how to use the Flux-Fill model for efficient inference with NxD, enabling image inpainting/outpainting.