This article takes a close look at the NVMe Asynchronous Event Request (AER) mechanism.

The AER mechanism

Asynchronous events are used to notify host software of status, error, and health information as these events occur. To enable asynchronous events to be reported by the controller, host software needs to submit one or more Asynchronous Event Request commands to the controller. The controller specifies an event to the host by completing an Asynchronous Event Request command. Host software should expect that the controller may not execute the command immediately; the command should be completed when there is an event to be reported.

The Asynchronous Event Request command is submitted by host software to enable the reporting of asynchronous events from the controller. This command has no timeout. The controller posts a completion queue entry for this command when there is an asynchronous event to report to the host.

Asynchronous events are grouped into event types. The event type is indicated in the Asynchronous Event Type field in Dword 0 of the completion queue entry for the Asynchronous Event Request command. When the controller posts a completion queue entry for an outstanding Asynchronous Event Request command and thus reports an asynchronous event, subsequent events of that event type are automatically masked by the controller until the host clears that event. An event is cleared by reading the log page associated with that event using the Get Log Page command.
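To make the Dword 0 layout concrete, here is a minimal sketch of how a host could unpack the completion of an AER command. It assumes the NVMe 1.4 field positions (Asynchronous Event Type in bits 2:0, Asynchronous Event Information in bits 15:8, Log Page Identifier in bits 23:16); the names `AerResult`, `decode_aer_dword0`, and `describe` are made up for this illustration and are not taken from any driver.

```python
from typing import NamedTuple

# Asynchronous Event Type values (Dword 0, bits 2:0), labelled as in this article.
AER_EVENT_TYPES = {
    0x0: "Error status",
    0x1: "SMART / Health status",
    0x2: "Notice",
    0x6: "NVM Command Set Specific status",
    0x7: "Vendor Specific",
}


class AerResult(NamedTuple):
    """Hypothetical container for the three fields carried in Dword 0."""
    event_type: int   # bits 2:0
    event_info: int   # bits 15:8
    log_page_id: int  # bits 23:16


def decode_aer_dword0(dw0: int) -> AerResult:
    """Split Dword 0 of an AER completion queue entry into its fields."""
    return AerResult(
        event_type=dw0 & 0x7,
        event_info=(dw0 >> 8) & 0xFF,
        log_page_id=(dw0 >> 16) & 0xFF,
    )


def describe(result: AerResult) -> str:
    """Render a decoded result as a short human-readable string."""
    name = AER_EVENT_TYPES.get(result.event_type, "Reserved")
    return (f"{name}: info=0x{result.event_info:02x}, "
            f"log page=0x{result.log_page_id:02x}")


# Example: a Notice event whose associated log page identifier is 0x04.
print(describe(decode_aer_dword0(0x00040002)))
```

The host would then read the log page named by `log_page_id` to obtain the details and to unmask further events of that type.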

The following event types are defined:

  • Error event
  • SMART / Health Status event
  • Notice event
    • Namespace Attribute Changed
    • Firmware Activation Starting
    • Telemetry Log Changed
    • Asymmetric Namespace Access Change
    • Predictable Latency Event Aggregate Log Change
    • LBA Status Information Alert
    • Endurance Group Event Aggregate Log Page Change
  • NVM Command Set Specific events
    • Reservation Log Page Available event
    • Sanitize Operation Completed event
    • Sanitize Operation Completed With Unexpected Deallocation event
  • Vendor Specific event


Relationship between the Log Page Identifier and the Get Log Page command:

For details on Error status, SMART / Health Status, Notice, and NVM Command Set Specific Status, refer to the NVMe spec.

Pre-posting AER SQEs
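As a rough illustration of that relationship, the sketch below takes a Log Page Identifier (as reported in the AER completion) and builds the Command Dword 10 and Dword 11 values for a Get Log Page command. It assumes the NVMe 1.4 layout (LID in CDW10 bits 7:0, Retain Asynchronous Event in bit 15, the lower 16 bits of the 0's-based dword count in bits 31:16, the upper 16 bits in CDW11 bits 15:0). The helper name `get_log_page_dwords` and the partial `LOG_PAGE_NAMES` table are illustrative only.

```python
# A few well-known log page identifiers; not an exhaustive list.
LOG_PAGE_NAMES = {
    0x01: "Error Information",
    0x02: "SMART / Health Information",
    0x03: "Firmware Slot Information",
    0x04: "Changed Namespace List",
    0x80: "Reservation Notification",
    0x81: "Sanitize Status",
}


def get_log_page_dwords(lid: int, num_bytes: int, rae: bool = False):
    """Return (cdw10, cdw11) for a Get Log Page command.

    rae=False asks the controller to clear the associated asynchronous
    event once the page has been read; rae=True retains (keeps masked) it.
    """
    if num_bytes <= 0 or num_bytes % 4:
        raise ValueError("transfer length must be a positive multiple of 4")
    numd = num_bytes // 4 - 1              # number of dwords, 0's based
    cdw10 = (lid & 0xFF) | (int(rae) << 15) | ((numd & 0xFFFF) << 16)
    cdw11 = (numd >> 16) & 0xFFFF          # upper half of the dword count
    return cdw10, cdw11


# Example: read the 512-byte SMART / Health log and clear the event.
cdw10, cdw11 = get_log_page_dwords(0x02, 512)
print(LOG_PAGE_NAMES[0x02], hex(cdw10), hex(cdw11))
```

Leaving `rae` cleared is what ties the two commands together: reading the page this way is how the host tells the controller it has handled the event.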

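  1. Because the driver cannot know when the NVMe controller will report an asynchronous event, it submits a few AER SQEs to the Admin SQ in advance (similar to how a NIC driver pre-posts receive buffers on its rx queue);
  2. When the NVMe controller needs to report an event to the driver, it consumes one of the outstanding AER SQEs and posts an AER CQE back to the driver.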
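The two steps above can be sketched with a toy model. This is not driver code: `ToyController` and its methods are hypothetical, events that arrive while no AER is outstanding are simply dropped here (a real controller would hold them), and the limit is taken from the 0's-based Asynchronous Event Request Limit (AERL) field of Identify Controller.

```python
from collections import deque


class ToyController:
    """Toy model of a controller holding pre-posted AER commands."""

    def __init__(self, aerl: int):
        self.max_outstanding = aerl + 1   # AERL is a 0's-based value
        self.outstanding = deque()        # command IDs of pre-posted AER SQEs
        self.masked_types = set()         # event types awaiting a log page read

    def submit_aer(self, cid: int) -> bool:
        """Host pre-posts one AER SQE; returns False if the limit is exceeded."""
        if len(self.outstanding) >= self.max_outstanding:
            return False
        self.outstanding.append(cid)
        return True

    def raise_event(self, event_type: int, log_page_id: int):
        """Controller reports an event by completing one outstanding AER.

        Returns (cid, dword0), or None if the event type is masked or no
        AER is outstanding (simplification: the event is dropped).
        """
        if event_type in self.masked_types or not self.outstanding:
            return None
        cid = self.outstanding.popleft()  # consume one pre-posted SQE
        self.masked_types.add(event_type)
        return cid, (log_page_id << 16) | event_type

    def read_log_page(self, event_type: int) -> None:
        """Host reads the associated log page, which unmasks the event type."""
        self.masked_types.discard(event_type)


ctrl = ToyController(aerl=3)
posted = [ctrl.submit_aer(cid) for cid in range(5)]   # fifth exceeds the limit
first = ctrl.raise_event(0x1, 0x02)                   # SMART / Health event
second = ctrl.raise_event(0x1, 0x02)                  # same type, now masked
ctrl.read_log_page(0x1)                               # host clears the event
rearmed = ctrl.submit_aer(4)                          # host posts a fresh AER
```

After handling a completion, a driver would typically submit a fresh AER so that the number of outstanding requests stays constant, which is what the last line models.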

Case

NVM Subsystem Hardware Error Event

NVMe resize


References:

  1. NVMe 1.4 spec
  2. NVMe™ SSD Management, Error Reporting and Logging Capabilities