This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3
Tutorials
Profiling Multi-Node Training Jobs
Learn how to analyze performance during multi-node training on AWS Trainium instances with SLURM job scheduling.
Profiling a vLLM Inference Workload
Learn how to capture and analyze device-level and system-level profiles for vLLM inference workloads on AWS Trainium.
Profiling an NKI Kernel
Learn how to profile an NKI kernel with Neuron Explorer.
Profiling PyTorch Neuron with TensorBoard
Learn how to use the Neuron plugin for TensorBoard to measure and visualize performance at the torch runtime level or the operator level.
Track System Resource Utilization during Training with Neuron Monitor
Learn how to monitor resource utilization using neuron-monitor, Prometheus and Grafana while running a multi-layer perceptron MNIST model on Trainium using PyTorch Neuron.
Track Training Progress in TensorBoard using PyTorch Neuron
Learn how to track training progress in TensorBoard while running a multi-layer perceptron MNIST model on Trainium using PyTorch Neuron.