These are notes on the ACPI ERST (Error Record Serialization Table).

Motivation

Linux uses the persistent storage filesystem, pstore, to record information (e.g. the tail of dmesg) upon panics and shutdowns. Pstore is independent of, and runs before, kdump. In certain scenarios (i.e. hosts/guests with root filesystems on NFS/iSCSI where networking software and/or hardware fails, and thus kdump fails), pstore may contain information available for post-mortem debugging.

Saving the hardware error log to flash via ERST before the kernel panics means the log can be retrieved from flash once the system boots successfully again, which is very useful in production.

Linux pstore

Pstore provides a filesystem interface that lets user space read and delete the records held by a backend such as ramoops.

It supports saving dmesg, console, ftrace and pmsg (the frontends) to backend devices such as ram, blk and mtd (memory technology devices). The pstore core provides the interfaces that connect frontends and backends.

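Linux pstore is a persistent storage mechanism designed to save kernel logs automatically when the system crashes or reboots, so that developers can analyze the cause of the crash after the system recovers. It is especially useful for remote devices that cannot be monitored in real time and for crashes that occur only rarely.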

ACPI ERST is one of the storage backends of the pstore filesystem.
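As a rough illustration of the filesystem interface described above, the following Python sketch lists the records found in a pstore directory. It assumes that pstore is mounted at /sys/fs/pstore (the usual location) and that record files are named `<frontend>-<backend>-<id>`, for example `dmesg-erst-<id>`; the helper name `list_pstore_records` is made up for this note.

```python
import os


def list_pstore_records(root="/sys/fs/pstore"):
    """Return (name, frontend, backend, size) tuples for record files in root.

    Assumes file names look like '<frontend>-<backend>-<id>'; names that do
    not follow this pattern get an empty backend field.
    """
    records = []
    if not os.path.isdir(root):
        # pstore is not mounted here (or the path does not exist).
        return records
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if not os.path.isfile(path):
            continue
        parts = name.split("-", 2)
        frontend = parts[0]
        backend = parts[1] if len(parts) > 1 else ""
        records.append((name, frontend, backend, os.path.getsize(path)))
    return records
```

On a real host, reading the directory usually requires root. Deleting a record file (for example with `os.remove`) is how user space asks the backend to erase that record.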

ACPI ERST DEVICE PCI Interface

The ERST device is a PCI device with two BARs, one for accessing the programming registers, and the other for accessing the record exchange buffer.

BAR0 contains the programming interface consisting of ACTION and VALUE 64-bit registers. All ERST actions/operations/side effects happen on the write to the ACTION, by design. Any data needed by the action must be placed into VALUE prior to writing ACTION. Reading the VALUE simply returns the register contents, which can be updated by a previous ACTION.

BAR1 contains the 8KiB record exchange buffer, which is the implemented maximum record size.
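The "write VALUE, then write ACTION" convention can be sketched with a small toy model in Python. This is not the device's real register encoding: the class `ToyErstRegisters`, the two action codes and their numeric values are all hypothetical and exist only to show that a VALUE write is a plain store, while every side effect happens on the ACTION write.

```python
# Hypothetical action codes, for illustration only.
ACTION_SET_RECORD_ID = 1     # consumes VALUE as the record identifier
ACTION_GET_RECORD_COUNT = 2  # places the record count into VALUE

MASK64 = (1 << 64) - 1


class ToyErstRegisters:
    """Toy model of the BAR0 ACTION/VALUE register pair."""

    def __init__(self, record_ids):
        self.record_ids = list(record_ids)
        self.value = 0
        self.current_id = None

    def write_value(self, v):
        # Plain 64-bit store: writing VALUE has no side effect.
        self.value = v & MASK64

    def read_value(self):
        # Returns the register contents, possibly updated by a prior ACTION.
        return self.value

    def write_action(self, action):
        # All side effects are triggered here, on the ACTION write.
        if action == ACTION_SET_RECORD_ID:
            self.current_id = self.value
        elif action == ACTION_GET_RECORD_COUNT:
            self.value = len(self.record_ids)
        else:
            raise ValueError(f"unknown toy action {action}")


regs = ToyErstRegisters(record_ids=[10, 20, 30])
regs.write_value(20)                    # stage the operand first
regs.write_action(ACTION_SET_RECORD_ID)  # then trigger the action
regs.write_action(ACTION_GET_RECORD_COUNT)
count = regs.read_value()               # result is read back from VALUE
```

The real action set is defined by the ACPI specification's ERST serialization actions; the toy codes above only mirror the ordering rule.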

Serialize and deserialize MCE error records

On x86, the kernel has supported serializing and deserializing MCE error records since commit 482908b49ebf (“ACPI, APEI, Use ERST for persistent storage of MCE”). The process involves two steps:

  • MCE producer: when a hardware error is detected, an MCE is raised and its handler writes the MCE error record to flash via ERST before the panic.
  • MCE consumer: after the system reboots, /sbin/mcelog runs and reads /dev/mcelog to check the flash, via ERST, for error records from the previous boot.
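The producer/consumer split above can be mimicked in user space with a small Python sketch. It is only an analogy: a JSON file stands in for the ERST-backed flash, and the function names and the record fields (`bank`, `status`, `addr`) are simplified assumptions made for this note, not the kernel's actual data layout or API.

```python
import json
import os


def _load_all(store_path):
    """Read every record currently held in the stand-in store."""
    if not os.path.exists(store_path):
        return []
    with open(store_path) as f:
        return json.load(f)


def save_error_record(store_path, record):
    """Producer: persist the record before the (simulated) panic."""
    records = _load_all(store_path)
    records.append(record)
    with open(store_path, "w") as f:
        json.dump(records, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the data reaches stable storage


def drain_error_records(store_path):
    """Consumer: after reboot, read the previous boot's records and clear them."""
    records = _load_all(store_path)
    if os.path.exists(store_path):
        os.remove(store_path)
    return records
```

A typical use would be to call `save_error_record(path, {"bank": 4, "status": 1, "addr": 4096})` on the error path and `drain_error_records(path)` early in the next boot; a second drain then returns an empty list because the records have been cleared.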

References:

  1. ACPI ERST DEVICE
  2. Use ERST for persistent storage of MCE and APEI errors
  3. ACPI, APEI support
  4. Qwen (千问)
  5. Do you know how ‘PSTORE’ works