Notes about the NVMe protocol.

1. Motivation

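A traditional SATA drive supports only one queue, which can hold only 32 commands at a time, whereas NVMe storage supports up to 64K queues, each with up to 64K entries. To use a road analogy, SATA is like a road with a single lane that holds 32 cars, while NVMe is like a road with 64,000 lanes, each able to hold 64,000 cars.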

2. Basics

A storage protocol designed for non-volatile memory:

  • Defines the commands and data structures for communication between the host and the storage device
  • Can operate over PCIe or Fabrics


Here are some basic definitions in the NVMe protocol. NVMe defines two main types of commands: Admin commands and I/O commands. The host software places commands into a Submission Queue (SQ), and the controller on the SSD places the corresponding completion information into an associated Completion Queue (CQ). Admin and I/O commands use separate SQ/CQ pairs. The host maintains exactly one Admin SQ and its associated Admin CQ for storage management and command control, and it can maintain up to 64K I/O SQs and CQs. The Admin SQ and CQ can each hold at most 4K (4096) entries, while an I/O queue can hold up to 64K entries. SQs and CQs work in pairs: normally one SQ uses one CQ, or multiple SQs share the same CQ, which suits high-performance multithreaded I/O processing.

An SQ or CQ is a ring buffer in a memory area shared with the device, which the device accesses via Direct Memory Access (DMA). A doorbell is a register in the NVMe controller that records the head or tail pointer of a ring buffer: the host writes the new tail of an SQ after adding commands, and the new head of a CQ after consuming completions.
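The ring-buffer behaviour of an SQ or CQ can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class name `RingQueue` and its methods are made up for illustration, entries are plain Python objects rather than fixed-size command structures, and no real device or DMA is involved. It shows how the head and tail pointers wrap around, and why one slot stays unused so that a full queue can be told apart from an empty one.

```python
class RingQueue:
    """Toy model of an NVMe-style circular queue (hypothetical, not a driver API)."""

    def __init__(self, depth):
        if depth < 2:
            raise ValueError("depth must be at least 2")
        self.depth = depth
        self.slots = [None] * depth
        self.head = 0  # next slot the consumer will read
        self.tail = 0  # next slot the producer will write

    def is_empty(self):
        return self.head == self.tail

    def is_full(self):
        # One slot is kept unused so that "full" and "empty" look different.
        return (self.tail + 1) % self.depth == self.head

    def push(self, entry):
        if self.is_full():
            raise BufferError("queue full")
        self.slots[self.tail] = entry
        self.tail = (self.tail + 1) % self.depth
        # The new tail is the value a producer would write to the tail doorbell.
        return self.tail

    def pop(self):
        if self.is_empty():
            raise BufferError("queue empty")
        entry = self.slots[self.head]
        self.head = (self.head + 1) % self.depth
        return entry


# Usage: a queue of depth 4 holds at most 3 entries at once.
sq = RingQueue(depth=4)
sq.push({"opcode": "read", "lba": 0})
new_tail = sq.push({"opcode": "write", "lba": 8})
```

In this sketch, `new_tail` is what the host would write to the SQ tail doorbell to tell the device how far it may fetch.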

3. Sequences of NVMe over PCIe

A command in an NVMe I/O request carries the concrete read/write information and, if the request is a DMA operation, an address pointing to the DMA buffer. Once the command has been stored in an SQ, the host writes the SQ doorbell to notify (kick) the NVMe device, which then fetches the command. After the I/O request completes, the device writes its success or failure status into a CQ and then raises an interrupt to the host. After the host receives the interrupt and processes the completion entries, it writes the CQ doorbell to release them.
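The sequence can be walked through with a small Python simulation. Everything here is an assumption made for illustration: `ToyController`, `ToyHost`, the dictionary-shaped commands, and the rule that only "read" and "write" succeed are all hypothetical, and the device executes each command immediately when the doorbell is written, which a real SSD would not do.

```python
from collections import deque


class ToyController:
    """Hypothetical stand-in for the device side; not a real NVMe API."""

    def __init__(self, depth):
        self.depth = depth
        self.sq = [None] * depth   # submission queue in shared memory
        self.cq = [None] * depth   # completion queue in shared memory
        self.sq_head = 0           # device's fetch position in the SQ
        self.cq_tail = 0           # device's post position in the CQ
        self.cq_head_doorbell = 0  # last CQ head value written by the host
        self.interrupts = deque()  # interrupts not yet handled by the host

    def ring_sq_tail_doorbell(self, new_tail):
        # The device fetches every command between its head and the new tail.
        while self.sq_head != new_tail:
            cmd = self.sq[self.sq_head]
            self.sq_head = (self.sq_head + 1) % self.depth
            self._execute(cmd)

    def _execute(self, cmd):
        # Toy rule (assumption): only "read" and "write" succeed.
        status = "success" if cmd["opcode"] in ("read", "write") else "failure"
        self.cq[self.cq_tail] = {"cid": cmd["cid"], "status": status}
        self.cq_tail = (self.cq_tail + 1) % self.depth
        self.interrupts.append("irq")  # one interrupt per completion

    def ring_cq_head_doorbell(self, new_head):
        # The host reports how far it has consumed the CQ.
        self.cq_head_doorbell = new_head


class ToyHost:
    """Hypothetical host side that tracks the SQ tail and the CQ head."""

    def __init__(self, ctrl):
        self.ctrl = ctrl
        self.sq_tail = 0
        self.cq_head = 0

    def submit(self, commands):
        if len(commands) >= self.ctrl.depth:
            raise BufferError("batch does not fit in the queue")
        for cmd in commands:
            self.ctrl.sq[self.sq_tail] = cmd
            self.sq_tail = (self.sq_tail + 1) % self.ctrl.depth
        self.ctrl.ring_sq_tail_doorbell(self.sq_tail)  # kick the device

    def handle_interrupts(self):
        results = []
        while self.ctrl.interrupts:
            self.ctrl.interrupts.popleft()
            results.append(self.ctrl.cq[self.cq_head])
            self.cq_head = (self.cq_head + 1) % self.ctrl.depth
        self.ctrl.ring_cq_head_doorbell(self.cq_head)  # release the entries
        return results


ctrl = ToyController(depth=4)
host = ToyHost(ctrl)
host.submit([{"cid": 1, "opcode": "read"}, {"cid": 2, "opcode": "bogus"}])
done = host.handle_interrupts()
```

After this run, `done` holds one completion entry per submitted command, in submission order, and the controller has seen the CQ head advance past both entries.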

4. NVMe-oF

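NVMe over PCIe is limited to use as a host's local disk. Replacing PCIe with a fabric (such as RDMA or Fibre Channel) lets a host access NVMe SSDs outside its own node. NVMe-oF greatly improves flexibility and scalability, extending NVMe's low latency and high concurrency from the level of a single server to the level of the whole data center.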

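Compared with NVMe over PCIe, NVMe over RDMA adds very little software overhead, so the latency of access across the network can be regarded as approximately the same as that of local access.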

5. Details

The most authoritative source is, of course, the spec.
The summary on osdev is also good.


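References:

  1. MDev-NVMe: A NVMe Storage Virtualization Solution with Mediated Pass-Through, ATC '18
  2. NVMe-over-Fabrics Performance Characterization and the Path to Low-Overhead Flash Disaggregation, SYSTOR '17
  3. Linux开源存储全栈详解:从Ceph到容器存储 (Linux Open-Source Storage Full Stack Explained: From Ceph to Container Storage)
  4. 深入剖析NVMe Over Fabrics (An In-Depth Analysis of NVMe over Fabrics)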