This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3

Tutorials

Profiling Multi-Node Training Jobs

Learn how to analyze performance during multi-node training on AWS Trainium instances with SLURM job scheduling.

Profiling a vLLM Inference Workload

Learn how to capture and analyze device-level and system-level profiles for vLLM inference workloads on AWS Trainium.

Profiling a NKI Kernel

Learn how to profile a NKI kernel with Neuron Explorer.

Profiling PyTorch Neuron with TensorBoard

Learn how to use the Neuron plugin for TensorBoard to measure and visualize performance at the torch runtime level or the operator level.

Track System Resource Utilization during Training with Neuron Monitor

Learn how to monitor resource utilization using neuron-monitor, Prometheus and Grafana while running a multi-layer perceptron MNIST model on Trainium using PyTorch Neuron.

Track Training Progress in TensorBoard using PyTorch Neuron

Learn how to track training progress in TensorBoard while running a multi-layer perceptron MNIST model on Trainium using PyTorch Neuron.
