This post collects notes on Intel Advanced Matrix Extensions (AMX).

Overview

To significantly improve CPU throughput for machine learning (ML) applications, Intel has integrated AMX, an on-chip matrix-multiplication accelerator, together with Instruction Set Architecture (ISA) support, starting with the 4th generation Xeon CPUs (Sapphire Rapids, SPR), released in 2023.

The figure above depicts the accelerator architecture, which consists of two core components: (1) a 2D array of registers (tiles) and (2) a tile matrix multiply unit (TMUL), both designed to support the INT8 and BF16 formats. The tiles store sub-arrays of matrices, and the TMUL, a 2D array of multiply-add units, operates on these tiles. The 2D structure of the TMUL delivers significantly more operations per cycle through large tile-based matrix computation, which runs with greater parallelism than 1D compute engines such as the AVX engines. The CPU dispatches AMX instructions, such as tile load/store and accelerator commands, to the multi-cycle AMX units. The accelerator’s memory accesses remain coherent with the CPU’s memory accesses.
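To make the tile/TMUL split concrete, here is a minimal scalar sketch in C of what one INT8 tile multiply-accumulate computes. It is a software model only and says nothing about how the hardware schedules the work. The names `tmul_int8_ref`, `TILE_ROWS`, and `TILE_COLSB` are made up for illustration, and the packing of B (four consecutive K values stored side by side in each 32-bit column slot) is an assumption modeled on the layout used by the AMX INT8 dot-product instructions.

```c
#include <stdint.h>

/* Assumed maximum tile shape: 16 rows x 64 bytes = 1 KB. */
enum { TILE_ROWS = 16, TILE_COLSB = 64 };

/*
 * Scalar reference model (illustrative, not Intel's implementation):
 *   C (16 x 16 int32) += A (16 x 64 int8) * B (64 x 16 int8)
 * where B is stored as 16 rows of 64 bytes, each group of 4 bytes
 * holding 4 consecutive K values for one output column.
 */
void tmul_int8_ref(int32_t c[TILE_ROWS][TILE_COLSB / 4],
                   int8_t a[TILE_ROWS][TILE_COLSB],
                   int8_t b[TILE_ROWS][TILE_COLSB])
{
    for (int m = 0; m < TILE_ROWS; m++)                 /* output row      */
        for (int k = 0; k < TILE_COLSB / 4; k++)        /* packed K group  */
            for (int n = 0; n < TILE_COLSB / 4; n++)    /* output column   */
                for (int i = 0; i < 4; i++)             /* 4 int8 per slot */
                    c[m][n] += (int32_t)a[m][4 * k + i] *
                               (int32_t)b[k][4 * n + i];
}
```

In this model a single call performs 16 x 16 x 64 = 16384 multiply-accumulates on data that fits in three tiles, which is the kind of work a 1D vector unit would need many instructions to cover.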

How

AMX accelerates matrix operations through two core components that work together:

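  • Tile: a set of eight 2D registers (TMM0-TMM7), each 1 KB in size, dedicated to holding larger blocks of data (matrix sub-blocks)
  • TMUL (Tile Matrix Multiply Unit): a dedicated matrix-multiplication engine connected directly to the tile registers, which performs the matrix multiplications that are central to AI computation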
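The sizes above can be checked with a little arithmetic. The sketch below assumes the commonly documented maximum tile shape of 16 rows by 64 bytes per row (the text above only states the 1 KB total), and all identifiers are hypothetical names chosen for this note.

```c
#include <stddef.h>

enum {
    AMX_NUM_TILES = 8,   /* TMM0-TMM7 */
    AMX_MAX_ROWS  = 16,  /* assumed rows per tile */
    AMX_MAX_COLSB = 64   /* assumed bytes per row */
};

/* Bytes held by one tile at its maximum shape. */
size_t amx_tile_bytes(void)
{
    return (size_t)AMX_MAX_ROWS * AMX_MAX_COLSB;
}

/* Bytes held by the whole tile register file. */
size_t amx_total_tile_bytes(void)
{
    return AMX_NUM_TILES * amx_tile_bytes();
}

/* Elements per row for an element size in bytes
 * (1 = INT8, 2 = BF16, 4 = INT32 or FP32 accumulators). */
size_t amx_elems_per_row(size_t elem_size)
{
    return AMX_MAX_COLSB / elem_size;
}
```

Under these assumptions a full tile holds a 16 x 64 INT8 block, a 16 x 32 BF16 block, or a 16 x 16 block of 32-bit accumulators, and the eight tiles together hold 8 KB.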

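AMX supports two data types, BF16 and INT8. BF16 improves compute efficiency while retaining sufficient precision, and suits both training and inference; INT8 is used mainly for inference, where it pushes speed and energy efficiency further.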
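BF16 keeps the sign bit and the 8 exponent bits of an IEEE float32 and shortens the mantissa to 7 bits, so a BF16 value is essentially the upper 16 bits of a float32. The following C sketch shows one way to convert between the two in software, using round-to-nearest-even. The function names are hypothetical, the NaN handling is simplified, and this is not a claim about how any particular AMX or AVX-512 conversion instruction rounds.

```c
#include <stdint.h>
#include <string.h>

/* float32 -> bfloat16 bit pattern, round-to-nearest-even (illustrative). */
uint16_t f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);               /* safe type punning */

    if ((bits & 0x7fffffffu) > 0x7f800000u)       /* NaN: keep it quiet */
        return (uint16_t)((bits >> 16) | 0x0040u);

    /* Add 0x7fff, plus 1 if the kept LSB is odd, so ties go to even. */
    uint32_t rounding = 0x7fffu + ((bits >> 16) & 1u);
    return (uint16_t)((bits + rounding) >> 16);
}

/* bfloat16 bit pattern -> float32 (exact: just pad the mantissa with zeros). */
float bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Because the exponent field is unchanged, BF16 covers the same dynamic range as float32 and only gives up mantissa precision, which is why it is usable for training as well as inference.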

XSAVE support

Tile state consists of two parts, TILECFG and TILEDATA, which XSAVE manages as state components 17 and 18:

// https://github.com/qemu/qemu/blob/master/target/i386/cpu.h
/* Ext. save area 17: AMX XTILECFG state */
typedef struct XSaveXTILECFG {
    uint8_t xtilecfg[64];
} XSaveXTILECFG;

/* Ext. save area 18: AMX XTILEDATA state */
typedef struct XSaveXTILEDATA {
    uint8_t xtiledata[8][1024];
} XSaveXTILEDATA;
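The QEMU structs treat both areas as opaque bytes. As a rough sketch of what those 64 TILECFG bytes contain, the C fragment below spells out the field layout used in Intel's public AMX sample code, together with a helper that tests the two XSAVE component bits in an XCR0 value. The struct name `TileConfig`, the macros `AMX_XCR0_TILECFG` and `AMX_XCR0_TILEDATA`, and both helper functions are hypothetical names introduced here; the field layout is stated as an assumption and should be checked against the SDM before relying on it.

```c
#include <stdint.h>
#include <string.h>

/* XSAVE state-component bits for AMX (components 17 and 18). */
#define AMX_XCR0_TILECFG  (1ull << 17)
#define AMX_XCR0_TILEDATA (1ull << 18)

/* Assumed layout of the 64-byte TILECFG block. */
typedef struct TileConfig {
    uint8_t  palette_id;    /* byte 0: 1 selects the 8-tile, 1 KB palette */
    uint8_t  start_row;     /* byte 1: resume point for tile load/store   */
    uint8_t  reserved[14];  /* bytes 2-15: must be zero                   */
    uint16_t colsb[16];     /* bytes 16-47: bytes per row, per tile       */
    uint8_t  rows[16];      /* bytes 48-63: number of rows, per tile      */
} TileConfig;

_Static_assert(sizeof(TileConfig) == 64, "TILECFG must be 64 bytes");

/* Returns 1 if both AMX components are enabled in the given XCR0 value. */
int xcr0_has_amx(uint64_t xcr0)
{
    uint64_t need = AMX_XCR0_TILECFG | AMX_XCR0_TILEDATA;
    return (xcr0 & need) == need;
}

/* Describe all eight tiles at the assumed maximum 16 x 64-byte shape. */
void tilecfg_init_max(TileConfig *cfg)
{
    memset(cfg, 0, sizeof *cfg);
    cfg->palette_id = 1;
    for (int t = 0; t < 8; t++) {
        cfg->colsb[t] = 64;
        cfg->rows[t]  = 16;
    }
}
```

The sketch only builds and inspects values in memory; it does not read XCR0 or load the configuration into the CPU, both of which need privileged or compiler-specific support that is outside the scope of these notes.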

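Summary

Intel AMX is dedicated matrix-compute acceleration hardware built into Intel's recent server CPUs. With its dedicated registers and execution units, it substantially improves the CPU's efficiency on AI training and inference tasks, matrix multiplication in particular. For organizations that want to run AI workloads efficiently on existing server infrastructure while keeping cost and complexity under control, AMX is a technology worth watching.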


References:

  1. Intel SDM, Vol. 1
  2. LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading (ISCA’25)
  3. DeepSeek prompt: "Briefly introduce Intel's AMX technology"