This post collects notes on memory scrubbing.

What

Memory scrubbing is the process of detecting and correcting (‘scrubbing’) bits in memory that have been erroneously flipped by transient faults, such as those caused by physical phenomena. Scrubbing is considered a RAS (Reliability, Availability, and Serviceability) feature.

Why

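Memory scrubbing corrects an error and writes the corrected data back, which prevents single-bit errors on a memory module from gradually accumulating into uncorrectable multi-bit errors. The corrections also give software early visibility, so it can take preventive measures such as page off-lining based on the rate of error corrections.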
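As a rough illustration of rate-based page off-lining, the sketch below counts corrected errors per page inside a sliding time window and flags a page once the count reaches a threshold. The class name, the threshold, and the window length are all hypothetical choices made for this example; real platforms and operating systems apply their own policies.

```python
from collections import defaultdict, deque


class CorrectedErrorTracker:
    """Counts corrected errors per page inside a sliding time window.

    Hypothetical policy sketch: threshold and window are illustrative only.
    """

    def __init__(self, threshold=3, window_seconds=3600.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # page number -> event timestamps

    def record(self, page, timestamp):
        """Record one corrected error; return True if the page should be off-lined."""
        events = self._events[page]
        events.append(timestamp)
        # Drop events that have fallen out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold
```

A caller would feed `record()` with each corrected-error report and off-line the page the first time it returns `True`.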

When

  • Patrol scrubbing proactively searches the system memory (this is done by hardware, not software), repairing correctable errors. It prevents accumulation of single-bit errors.
  • Demand scrubbing is the ability to write corrected data back to the memory once a correctable error is detected on a read transaction.
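The difference between simply correcting a read and demand scrubbing can be sketched with a toy memory model. This is only an illustration: a three-copy majority vote stands in for real ECC, and `ToyMemory`, `flip_bit`, and the `demand_scrub` flag are names invented for this example.

```python
def majority_vote(copies):
    """Bitwise majority of three stored copies (a stand-in for real ECC decoding)."""
    a, b, c = copies
    return (a & b) | (a & c) | (b & c)


class ToyMemory:
    def __init__(self, size):
        # Each cell keeps three copies of its value.
        self.cells = [[0, 0, 0] for _ in range(size)]
        self.writebacks = 0

    def write(self, addr, value):
        self.cells[addr] = [value, value, value]

    def flip_bit(self, addr, copy, bit):
        """Inject a transient single-bit fault into one stored copy."""
        self.cells[addr][copy] ^= 1 << bit

    def read(self, addr, demand_scrub=True):
        """Return corrected data; with demand scrubbing, also repair the cell."""
        copies = self.cells[addr]
        value = majority_vote(copies)
        if demand_scrub and any(c != value for c in copies):
            self.write(addr, value)  # write the corrected data back
            self.writebacks += 1
        return value
```

Without the write-back, the reader still gets correct data, but the fault stays in the cell and a second fault in the same place could make it uncorrectable.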

How

Patrol scrubbing consists of reading memory, checking it against ECC for errors, and overwriting it with the corrected memory words when an error is discovered.
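The read, check, and overwrite loop can be sketched in software. The example below assumes an extended Hamming (8,4) code, which corrects single-bit errors and detects double-bit errors; real memory controllers use wider codes implemented in hardware, so the code choice and all function names here are illustrative assumptions.

```python
DATA_POSITIONS = (3, 5, 6, 7)  # codeword positions that carry data bits


def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    bits = [0] * 8  # index = bit position; position 0 is the overall parity bit
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    for p in (1, 2, 4):  # parity bit p covers every position with bit p set
        bits[p] = sum(bits[pos] for pos in range(1, 8) if pos & p) % 2
    bits[0] = sum(bits) % 2  # makes the total number of 1 bits even
    return bits


def decode(bits):
    """Return (status, bits) where status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos
    overall = sum(bits) % 2
    if syndrome == 0 and overall == 0:
        return "ok", bits
    if overall == 1:
        # Odd parity: treat as a single flipped bit; the syndrome is its position
        # (syndrome 0 means the overall parity bit itself flipped).
        fixed = list(bits)
        fixed[syndrome] ^= 1
        return "corrected", fixed
    return "uncorrectable", bits  # even parity but nonzero syndrome: two flips


def data_of(bits):
    """Extract the 4 data bits from a codeword."""
    return sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))


def patrol_scrub(memory):
    """One full pass: read each word, check ECC, overwrite corrected words."""
    corrected, uncorrectable = [], []
    for addr, word in enumerate(memory):
        status, fixed = decode(word)
        if status == "corrected":
            memory[addr] = fixed
            corrected.append(addr)
        elif status == "uncorrectable":
            uncorrectable.append(addr)
    return corrected, uncorrectable
```

For example, flipping one bit in a stored word and running `patrol_scrub` restores the original codeword, while flipping two bits in the same word leaves it untouched and lists its address as uncorrectable.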

Patrol scrubbing is performed by a hardware engine, located either on the platform or on the memory device, that issues requests to memory addresses on the device. The engine issues these requests at a predefined frequency, so given enough time it will eventually access every memory address. The frequency at which the patrol scrubber issues requests is low enough to have no noticeable impact on the memory device’s quality of service.
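To get a feel for the numbers, the arithmetic behind "given enough time" can be sketched as follows. The 64 GiB capacity, 64 bytes per request, and 24-hour target used in the usage note are assumptions picked for illustration, not values from any particular platform.

```python
def request_count(memory_bytes, bytes_per_request):
    """Number of patrol requests needed to touch every address once."""
    return -(-memory_bytes // bytes_per_request)  # ceiling division


def scrub_pass_seconds(memory_bytes, bytes_per_request, requests_per_second):
    """Time for one full patrol pass at a fixed request rate."""
    return request_count(memory_bytes, bytes_per_request) / requests_per_second


def required_request_rate(memory_bytes, bytes_per_request, target_seconds):
    """Request rate needed to finish one full pass within target_seconds."""
    return request_count(memory_bytes, bytes_per_request) / target_seconds
```

Under those assumptions, `required_request_rate(64 * 2**30, 64, 24 * 3600)` comes out to roughly 12.4 thousand requests per second, a very small load next to what a memory channel handles in normal operation.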

By generating read requests to memory addresses, the patrol scrubber gives the hardware an opportunity to run ECC on each address and correct any correctable errors before they can become uncorrectable errors. Optionally, if an uncorrectable error is discovered, the patrol scrubber can trigger a hardware interrupt and notify the software layer of the affected memory address.
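The engine behavior described above can be modeled in a few lines. In this toy sketch each address just stores how many of its bits are currently flipped (0 is clean, 1 is correctable, 2 or more is uncorrectable), and a plain Python callback stands in for the hardware interrupt. The class and parameter names are hypothetical.

```python
class PatrolScrubber:
    """Toy patrol engine: one read request per tick, wrapping around memory."""

    def __init__(self, flipped_bits, on_uncorrectable=None):
        self.flipped_bits = flipped_bits          # flipped-bit count per address
        self.on_uncorrectable = on_uncorrectable  # stand-in for an interrupt
        self.next_addr = 0

    def tick(self):
        """Issue one patrol read, then advance to the next address."""
        addr = self.next_addr
        errors = self.flipped_bits[addr]
        if errors == 1:
            # Correctable: write the corrected data back.
            self.flipped_bits[addr] = 0
        elif errors > 1 and self.on_uncorrectable is not None:
            # Uncorrectable: report the address to software.
            self.on_uncorrectable(addr)
        self.next_addr = (addr + 1) % len(self.flipped_bits)
        return addr
```

Calling `tick()` once per scrub interval walks the whole address range and then starts over, so a correctable error is repaired within one full pass of the moment it appears.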

