This document is relevant for: Inf2

Inferentia2 Architecture

At the heart of the Inf2 instance are up to 12 Inferentia2 devices, each containing two NeuronCore-v2 cores. Inferentia2 is AWS's second-generation, purpose-built machine learning inference accelerator. The Inferentia2 device architecture is depicted below:

Each Inferentia2 device consists of:

Compute:
- 2x NeuronCore-v2 cores, delivering 380 INT8 TOPS, 190 FP16/BF16/cFP8/TF32 TFLOPS, and 47.5 FP32 TFLOPS.
Device Memory:
- 32 GB of HBM device memory (for storing model state), with 820 GB/sec of bandwidth.
Data movement:
- 1 TB/sec of DMA bandwidth, with inline memory compression/decompression.
NeuronLink:
- NeuronLink-v2 device-to-device interconnect, which enables high-performance collective compute so that latency and throughput can be co-optimized.
Programmability:
- Inferentia2 supports dynamic shapes and control flow via ISA extensions of NeuronCore-v2, and custom operators via the deeply embedded GPSIMD engines.
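The per-device figures above can be scaled up to rough instance-level numbers. The following is a back-of-the-envelope sketch, assuming that peak figures scale linearly with the number of devices (real workloads will not reach these peaks). It also applies the standard roofline model, which this page does not itself describe, to estimate the arithmetic intensity at which a kernel would move from being HBM-bandwidth-bound to compute-bound. The names `DEVICE_SPEC`, `instance_totals`, and `ridge_point_flops_per_byte` are hypothetical helpers for illustration, not part of any Neuron API.

```python
# Peak figures for ONE Inferentia2 device, copied from the list above.
# (cFP8 and TF32 share the 190 TFLOPS figure with FP16/BF16.)
DEVICE_SPEC = {
    "neuroncores": 2,
    "int8_tops": 380.0,
    "bf16_tflops": 190.0,
    "fp32_tflops": 47.5,
    "hbm_gb": 32.0,
    "hbm_gb_per_s": 820.0,
}

# Inf2 instances have "up to 12" Inferentia2 devices.
MAX_DEVICES_PER_INSTANCE = 12


def instance_totals(num_devices: int) -> dict:
    """Scale per-device peaks linearly to num_devices.

    Assumption: linear scaling is an upper bound; it ignores interconnect
    and software overheads.
    """
    if not 1 <= num_devices <= MAX_DEVICES_PER_INSTANCE:
        raise ValueError(f"num_devices must be 1..{MAX_DEVICES_PER_INSTANCE}")
    return {name: value * num_devices for name, value in DEVICE_SPEC.items()}


def ridge_point_flops_per_byte(peak_tflops: float, bandwidth_gb_per_s: float) -> float:
    """Roofline ridge point: FLOPs per byte of HBM traffic needed to saturate compute.

    Kernels below this arithmetic intensity are (in the roofline model)
    limited by memory bandwidth rather than by compute.
    """
    return (peak_tflops * 1e12) / (bandwidth_gb_per_s * 1e9)


if __name__ == "__main__":
    totals = instance_totals(MAX_DEVICES_PER_INSTANCE)
    print(f"NeuronCore-v2 cores: {totals['neuroncores']}")  # 24
    print(f"HBM capacity: {totals['hbm_gb']:.0f} GB")  # 384 GB
    print(f"INT8 peak: {totals['int8_tops']:.0f} TOPS")
    bf16_ridge = ridge_point_flops_per_byte(DEVICE_SPEC["bf16_tflops"], DEVICE_SPEC["hbm_gb_per_s"])
    fp32_ridge = ridge_point_flops_per_byte(DEVICE_SPEC["fp32_tflops"], DEVICE_SPEC["hbm_gb_per_s"])
    print(f"BF16 ridge point: {bf16_ridge:.1f} FLOPs/byte")
    print(f"FP32 ridge point: {fp32_ridge:.1f} FLOPs/byte")
```

The ridge point is computed per device because both the compute and the bandwidth figures scale together, so the ratio is the same at instance level under the linear-scaling assumption.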
