Inferentia2 Architecture
This document is relevant for: Inf2
At the heart of the Inf2 instance are up to 12 Inferentia2 devices, each containing two NeuronCore-v2 cores. Inferentia2 is the second-generation, purpose-built machine learning inference accelerator from AWS. The Inferentia2 device architecture is depicted below:

Each Inferentia2 device consists of:
- Compute:
2x NeuronCore-v2 cores, delivering 380 INT8 TOPS, 190 FP16/BF16/cFP8/TF32 TFLOPS, and 47.5 FP32 TFLOPS.
- Device Memory:
32 GB of HBM device memory (for storing model state), with 820 GB/sec of bandwidth.
- Data movement:
1 TB/sec of DMA bandwidth, with inline memory compression/decompression.
- NeuronLink:
NeuronLink-v2 device-to-device interconnect, enabling high-performance collective compute for co-optimization of latency and throughput.
- Programmability:
Inferentia2 supports dynamic shapes and control flow via ISA extensions of NeuronCore-v2, and custom operators via the deeply embedded GPSIMD engines.
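
To put the per-device figures above in context, the following is a minimal sketch that scales them to a multi-device Inf2 instance and estimates the ratio of peak compute to HBM bandwidth. The names `Inferentia2Device`, `instance_totals`, and `flops_per_byte` are hypothetical helpers written for this illustration and are not part of the Neuron SDK. The sketch assumes peak numbers add linearly across devices (a theoretical upper bound, not a measured result) and that GB and GB/sec are decimal units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inferentia2Device:
    """Peak per-device figures copied from the list above (hypothetical container)."""
    neuron_cores: int = 2
    int8_tops: float = 380.0
    fp16_tflops: float = 190.0   # also quoted for BF16/cFP8/TF32
    fp32_tflops: float = 47.5
    hbm_gb: float = 32.0
    hbm_bandwidth_gb_s: float = 820.0


def instance_totals(num_devices: int, device: Inferentia2Device = Inferentia2Device()) -> dict:
    """Sum peak figures over num_devices, assuming ideal linear scaling."""
    if not 1 <= num_devices <= 12:
        raise ValueError("an Inf2 instance has between 1 and 12 Inferentia2 devices")
    return {
        "neuron_cores": num_devices * device.neuron_cores,
        "int8_tops": num_devices * device.int8_tops,
        "fp16_tflops": num_devices * device.fp16_tflops,
        "fp32_tflops": num_devices * device.fp32_tflops,
        "hbm_gb": num_devices * device.hbm_gb,
    }


def flops_per_byte(device: Inferentia2Device = Inferentia2Device()) -> float:
    """Peak FP16 FLOPs available per byte read from HBM (assumes decimal GB)."""
    return (device.fp16_tflops * 1e12) / (device.hbm_bandwidth_gb_s * 1e9)


if __name__ == "__main__":
    for name, value in instance_totals(12).items():
        print(f"{name}: {value}")
    print(f"FP16 FLOPs per HBM byte: {flops_per_byte():.1f}")
```

Under these assumptions, a 12-device instance would expose 24 NeuronCore-v2 cores and 384 GB of HBM in aggregate. The FLOPs-per-byte ratio is a rough indicator only: workloads that perform fewer operations per byte fetched from HBM than this ratio are more likely to be limited by memory bandwidth than by compute.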