Notes about GPU vs NPU
These notes cover GPU vs NPU (Neural Processing Unit). Before reading, it is recommended to watch the introductory video 比GPU更快!NPU是如何实现AI加速的? (Faster than a GPU! How does an NPU accelerate AI?).
GPU vs NPU
The basic building blocks inside an NPU are still multiply-accumulate (MAC) units; what changes is how data flows between them: partial results are passed directly onward instead of being written back to a cache, which improves computational efficiency.
NPUs are specialized hardware accelerators that excel at performing neural network computations efficiently. In contrast to general-purpose processors, NPUs are optimized specifically for the requirements of neural networks. They offer massive parallelism, high-speed data processing, and fast matrix computation, all of which are essential for handling the intensive computations involved in deep learning algorithms.
To meet these requirements, NPUs adopt several specialized hardware structures. One key component of a modern NPU is the matrix calculation unit (MCU), such as a cube unit or a systolic array. With these matrix units, an NPU can execute a matrix operation such as a multiplication or a convolution as a single operation. Some NPUs also provide dedicated units for specialized operations such as sparse matrix computation and activation functions.
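To make this concrete, below is a minimal cycle-level sketch (a simulation, not any vendor's actual design) of an output-stationary systolic array computing C = A × B. Each PE is just a MAC unit; operands flow between neighboring PEs each cycle, and partial sums stay inside the PEs instead of being written back to a cache. The sizes M, K, N are arbitrary illustrative values.

```cuda
#include <cstdio>

// Output-stationary systolic array: PE(i,j) accumulates C[i][j].
// A-values flow rightward, B-values flow downward; row i of A and column j
// of B are injected with a skew of i and j cycles so that matching operands
// A[i][k] and B[k][j] meet in PE(i,j) on the same cycle.
constexpr int M = 4, K = 4, N = 4;

int main() {
    int A[M][K], B[K][N];
    for (int i = 0; i < M; ++i) for (int k = 0; k < K; ++k) A[i][k] = i + k;
    for (int k = 0; k < K; ++k) for (int j = 0; j < N; ++j) B[k][j] = k * j + 1;

    int a_reg[M][N] = {}, b_reg[M][N] = {}, acc[M][N] = {};

    // Run until the last product reaches PE(M-1, N-1).
    for (int t = 0; t < M + N + K - 2; ++t) {
        int a_next[M][N], b_next[M][N];
        for (int i = 0; i < M; ++i)
            for (int j = 0; j < N; ++j) {
                // Left edge injects A (skewed by row); inner PEs shift from the left.
                a_next[i][j] = (j == 0)
                    ? ((t - i >= 0 && t - i < K) ? A[i][t - i] : 0)
                    : a_reg[i][j - 1];
                // Top edge injects B (skewed by column); inner PEs shift from above.
                b_next[i][j] = (i == 0)
                    ? ((t - j >= 0 && t - j < K) ? B[t - j][j] : 0)
                    : b_reg[i - 1][j];
            }
        for (int i = 0; i < M; ++i)
            for (int j = 0; j < N; ++j) {
                a_reg[i][j] = a_next[i][j];
                b_reg[i][j] = b_next[i][j];
                acc[i][j] += a_reg[i][j] * b_reg[i][j];  // one MAC per PE per cycle
            }
    }

    // acc now equals the matrix product A * B.
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) printf("%5d", acc[i][j]);
        printf("\n");
    }
    return 0;
}
```

Note how the full M × N × K multiplication finishes in only M + N + K - 2 cycles, with every operand reused as it marches through the array rather than being refetched from memory.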
Besides the matrix unit, NPUs often adopt a near-data computing (NDC) architecture to minimize data-movement overhead. For example, neural-network weights are pre-stored in SRAM/scratchpad memory next to the matrix unit, allowing quick access during computation. This reduces latency and energy consumption by eliminating repeated weight fetches from main memory. NDC optimizes data flow and improves computational efficiency by minimizing memory-access bottlenecks.
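A toy model of this idea follows; the names (dram_weights, scratchpad) and sizes are illustrative, not taken from any real NPU. Weights cross the memory boundary exactly once, and every subsequent inference reads the nearby copy.

```cuda
#include <cstdio>
#include <cstring>

// Near-data computing sketch: weights are copied from "main memory" into a
// small scratchpad once, then reused by every inference, so per-inference
// traffic to main memory consists of inputs only, never weights.
constexpr int IN = 8, OUT = 4;

float dram_weights[OUT][IN];   // stand-in for off-chip main memory
float scratchpad[OUT][IN];     // stand-in for on-chip SRAM next to the MACs

void preload_weights() {
    // One-time DRAM -> SRAM transfer; later inferences never refetch weights.
    std::memcpy(scratchpad, dram_weights, sizeof(dram_weights));
}

void infer(const float *input, float *output) {
    for (int o = 0; o < OUT; ++o) {
        float acc = 0.0f;
        for (int i = 0; i < IN; ++i)
            acc += scratchpad[o][i] * input[i];  // weight reads stay on-chip
        output[o] = acc;
    }
}

int main() {
    for (int o = 0; o < OUT; ++o)
        for (int i = 0; i < IN; ++i)
            dram_weights[o][i] = 0.1f * (o + i);
    preload_weights();

    float input[IN] = {1, 1, 1, 1, 1, 1, 1, 1}, output[OUT];
    infer(input, output);      // only the input crosses to the compute unit
    for (int o = 0; o < OUT; ++o) printf("%f\n", output[o]);
    return 0;
}
```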
Furthermore, NPUs often use multi-core architectures connected by a Network-on-Chip (NoC) to further parallelize computation. The NoC allows direct data transfer among NPU cores without additional memory load/store instructions.
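Below is a deliberately simplified, single-threaded model of that idea; the core count, ring topology, and data are illustrative only. Each "core" reduces its own slice, and the running sum hops directly from core to core through a link register rather than taking a store/load round trip through a shared memory.

```cuda
#include <cstdio>

// Toy model of cores on a ring NoC: partial results travel over a direct
// point-to-point link (the mailbox) instead of through shared memory.
constexpr int CORES = 4, SLICE = 8;

int main() {
    float slice[CORES][SLICE];
    for (int c = 0; c < CORES; ++c)
        for (int s = 0; s < SLICE; ++s)
            slice[c][s] = 0.25f * (c * SLICE + s);

    float mailbox = 0.0f;  // models the point-to-point NoC link
    for (int c = 0; c < CORES; ++c) {
        float partial = 0.0f;
        for (int s = 0; s < SLICE; ++s)
            partial += slice[c][s];   // local reduction on core c
        mailbox += partial;           // forward the running sum to the next core
    }
    printf("total = %f\n", mailbox);  // the final result exits the last core
    return 0;
}
```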
CUDA Cores vs Tensor Cores
GPUs such as the A100/H100 contain not only general-purpose CUDA Cores but also integrate NPU-style cores, namely Tensor Cores.
CUDA cores handle general-purpose processing in GPUs, covering a wide range of instructions including integer operations, floating-point operations, and loads/stores. They execute scalar (or vector) instructions that operate on individual (or vector) data elements.
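A typical CUDA-core workload looks like the SAXPY kernel below, where each thread issues ordinary scalar instructions on individual elements (the kernel name and launch shape are illustrative).

```cuda
// Each thread runs scalar instructions (index math, a load, a fused
// multiply-add, a store) on CUDA cores, one vector element per thread.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```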
Tensor cores are specialized hardware units designed to accelerate matrix multiplication; on A100/H100 GPUs they deliver 16.0×/14.8× higher FLOPS than the CUDA cores. Moreover, tensor cores work at a coarse granularity, e.g., performing a matrix multiplication between two FP16 matrices of shape 16 × 16 and 16 × 8 with a single mma (matrix multiply and accumulate) instruction.
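To make the granularity difference concrete, here is a minimal sketch using CUDA's warp-level wmma API, the C++ interface that the compiler lowers to tensor-core mma instructions. One warp computes an entire 16 × 16 × 16 FP16 tile product in a handful of instructions; the kernel name is illustrative, and compute capability 7.0 or newer is required.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes D = A * B for a single 16x16x16 FP16 tile with FP32
// accumulation, entirely on tensor cores.
__global__ void wmma_gemm_tile(const half *A, const half *B, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);      // start from C = 0
    wmma::load_matrix_sync(a_frag, A, 16);    // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // tensor-core MMA
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}
```

Launched as wmma_gemm_tile<<<1, 32>>>(dA, dB, dD), a single warp covers the whole tile, whereas the SAXPY kernel above needs one thread per output element.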
References: