Notes about SDM MCA

This post records notes on the MCA (machine-check architecture) material in the Intel SDM.

1. Architecture

Intel processors implement a machine-check architecture (MCA) that provides a mechanism for detecting and reporting hardware (machine) errors, such as system bus errors, ECC errors, parity errors, cache errors, and TLB errors. It consists of a set of model-specific registers (MSRs) used to set up machine checking, plus additional banks of MSRs used to record the errors that are detected.
The processor can also report information on corrected machine-check errors and deliver a programmable interrupt so that software can respond to them; this is referred to as the corrected machine-check error interrupt (CMCI).
Intel 64 processors also support software recovery from certain uncorrected recoverable machine-check errors.

2. MSRs

2.1 Machine-Check Global Control MSRs

2.1.1 IA32_MCG_CAP MSR

The IA32_MCG_CAP MSR is a read-only register that provides information about the machine-check architecture of the processor.

Once you understand each of its fields in depth, you effectively have a global picture of the overall MCA architecture.
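
As a rough illustration of those fields, the sketch below splits a raw IA32_MCG_CAP value into the bit fields defined in the SDM. The helper name `decode_mcg_cap` and the dictionary keys are made up for these notes, and reading the MSR itself (which needs ring 0) is assumed to happen elsewhere.

```python
def decode_mcg_cap(value: int) -> dict:
    """Split a raw IA32_MCG_CAP value into named fields (hypothetical helper)."""
    def bit(n: int) -> bool:
        return bool((value >> n) & 1)

    return {
        "count": value & 0xFF,                 # bits 7:0, number of error-reporting banks
        "mcg_ctl_p": bit(8),                   # IA32_MCG_CTL is present
        "mcg_ext_p": bit(9),                   # extended machine-check state MSRs present
        "mcg_cmci_p": bit(10),                 # CMCI supported
        "mcg_tes_p": bit(11),                  # threshold-based error status supported
        "mcg_ext_cnt": (value >> 16) & 0xFF,   # bits 23:16, number of extended state MSRs
        "mcg_ser_p": bit(24),                  # software error recovery supported
        "mcg_emc_p": bit(25),                  # enhanced machine check capability
        "mcg_elog_p": bit(26),                 # extended error logging
        "mcg_lmce_p": bit(27),                 # local machine check exception supported
    }


fields = decode_mcg_cap(0x0F000C14)   # example value chosen for illustration
print(fields["count"], fields["mcg_cmci_p"], fields["mcg_ser_p"])
```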

2.1.2 IA32_MCG_STATUS MSR

The IA32_MCG_STATUS MSR describes the current state of the processor after a machine-check exception has occurred.
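
A minimal sketch of reading this register, assuming the raw value has already been fetched: it extracts the RIPV, EIPV, MCIP and LMCE_S flags (bits 0 to 3). The helper name `decode_mcg_status` is hypothetical.

```python
def decode_mcg_status(value: int) -> dict:
    """Extract the flag bits of a raw IA32_MCG_STATUS value (hypothetical helper)."""
    return {
        "ripv": bool(value & (1 << 0)),    # restart IP valid: execution can restart at the saved IP
        "eipv": bool(value & (1 << 1)),    # error IP valid: saved IP is tied to the error
        "mcip": bool(value & (1 << 2)),    # machine check in progress
        "lmce_s": bool(value & (1 << 3)),  # the current machine check was delivered locally
    }


print(decode_mcg_status(0x5))
```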

2.1.3 IA32_MCG_CTL MSR

IA32_MCG_CTL controls the reporting of machine-check exceptions.

2.1.4 IA32_MCG_EXT_CTL MSR

IA32_MCG_EXT_CTL.LMCE_EN (bit 0) allows the processor to signal some MCEs to only a single logical processor in the system.

2.1.5 Enabling Local Machine Check

Once system software has enabled LMCE, hardware determines whether a particular error can be delivered to only a single logical processor. Software should make no assumptions about the type of error that hardware can choose to deliver as LMCE.
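
The enabling check can be sketched as follows. This relies on two details from the SDM that are not spelled out in these notes: the capability is advertised in IA32_MCG_CAP[27], and platform firmware must have opted in through IA32_FEATURE_CONTROL[20]. The function names are hypothetical and the MSR values are passed in as plain integers.

```python
MCG_LMCE_P = 1 << 27            # IA32_MCG_CAP: LMCE capability present
FEATURE_CONTROL_LMCE = 1 << 20  # IA32_FEATURE_CONTROL: LMCE opted in by platform firmware
LMCE_EN = 1 << 0                # IA32_MCG_EXT_CTL: LMCE enable


def lmce_can_be_enabled(mcg_cap: int, feature_control: int) -> bool:
    """True when both the capability bit and the firmware opt-in bit are set."""
    return bool(mcg_cap & MCG_LMCE_P) and bool(feature_control & FEATURE_CONTROL_LMCE)


def ext_ctl_with_lmce(mcg_cap: int, feature_control: int, ext_ctl: int) -> int:
    """Return the value to write to IA32_MCG_EXT_CTL; unchanged if LMCE is unavailable."""
    if not lmce_can_be_enabled(mcg_cap, feature_control):
        return ext_ctl
    return ext_ctl | LMCE_EN
```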

2.2 Error-Reporting Register Banks

2.2.1 IA32_MCi_CTL MSRs

2.2.2 IA32_MCi_STATUS MSRs

2.2.3 IA32_MCi_ADDR MSRs

2.2.4 IA32_MCi_MISC MSRs

2.2.5 IA32_MCi_CTL2 MSRs
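
For orientation across the five per-bank registers listed above, here is a small sketch of how their MSR addresses are laid out: CTL, STATUS, ADDR and MISC sit in groups of four starting at 0x400, and the CTL2 registers start at 0x280. The helper name `bank_msrs` is made up for these notes.

```python
IA32_MC0_CTL = 0x400    # first bank's CTL; each bank occupies four consecutive MSRs
IA32_MC0_CTL2 = 0x280   # first bank's CTL2; one MSR per bank


def bank_msrs(i: int) -> dict:
    """Return the MSR addresses of error-reporting bank i (hypothetical helper)."""
    base = IA32_MC0_CTL + 4 * i
    return {
        "CTL": base,
        "STATUS": base + 1,
        "ADDR": base + 2,    # only meaningful when STATUS.ADDRV is set
        "MISC": base + 3,    # only meaningful when STATUS.MISCV is set
        "CTL2": IA32_MC0_CTL2 + i,
    }


print({name: hex(addr) for name, addr in bank_msrs(1).items()})
```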

3. Enhanced Cache Error reporting

In earlier Intel processors, cache status was based on the number of correction events that occurred in a cache.
In “threshold-based error status”, cache status is based on the number of lines (ECC blocks) in a cache that incur repeated corrections.
A processor that supports enhanced cache error reporting contains hardware that tracks the operating status of certain caches and provides an indicator of their “health”.
- The hardware reports a “green” status when the number of lines that incur repeated corrections is at or below a pre-defined threshold.
- The hardware reports a “yellow” status when the number of affected lines exceeds the threshold. Yellow status means that the cache reporting the event is operating correctly, but the system should be scheduled for servicing within a few weeks.
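
A minimal sketch of reading that indicator, assuming the SDM encoding in which the status is carried in IA32_MCi_STATUS bits 54:53 for corrected errors when IA32_MCG_CAP.MCG_TES_P is set. The helper name `cache_health` is hypothetical.

```python
def cache_health(status: int, tes_p: bool):
    """Return the threshold-based error status of a raw IA32_MCi_STATUS value.

    Returns None when the field does not apply: TES unsupported, the
    register is not valid, or the logged error is uncorrected (UC = 1).
    """
    valid = (status >> 63) & 1
    uc = (status >> 61) & 1
    if not tes_p or not valid or uc:
        return None
    names = {0b00: "no tracking", 0b01: "green", 0b10: "yellow"}
    return names.get((status >> 53) & 0b11, "reserved")


print(cache_health((1 << 63) | (0b10 << 53), tes_p=True))
```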

4. Corrected Machine Check Error Interrupt

To be covered in a separate post.

5. Recovery of Uncorrected Recoverable (UCR) Errors

Recovery of uncorrected recoverable machine-check errors is an enhancement to the machine-check architecture. It allows system software to perform recovery actions on certain classes of uncorrected errors and continue execution.

5.1 Detection of Software Error Recovery Support

The new class of architectural MCA errors from which system software can attempt recovery is called Uncorrected Recoverable (UCR) Errors. UCR errors are uncorrected errors that have been detected and signaled but have not corrupted the processor context. For certain UCR errors, this means that once system software has performed a certain recovery action, it is possible to continue execution on this processor. UCR error reporting provides an error containment mechanism for data poisoning. The machine check handler will use the error log information from the error reporting registers to analyze and implement specific error recovery actions for UCR errors.

5.2 UCR Error Reporting and Logging

The IA32_MCi_STATUS MSR is used for reporting UCR errors as well as the existing corrected and uncorrected errors.
When IA32_MCG_CAP[24] is set, a UCR error is indicated by the following bit settings in the IA32_MCi_STATUS register:

Valid (bit 63) = 1
UC (bit 61) = 1
PCC (bit 57) = 0

In addition, the IA32_MCi_STATUS register bit fields, bits 56:55, are defined (see Figure 16-6) to provide additional information to help system software to properly identify the necessary recovery action for the UCR error:

S (Signaling) flag, bit 56
AR (Action Required) flag, bit 55

5.3 UCR Error Classification

Uncorrected no action required (UCNA) - a UCR error that is not signaled via a machine-check exception and is instead reported to system software as a corrected machine-check error.
Software recoverable action optional (SRAO) - a UCR error that is signaled either via a machine-check exception or CMCI. System software recovery action is optional and is not required to continue execution after this machine-check exception.
Software recoverable action required (SRAR) - a UCR error that requires system software to take a recovery action on this processor before scheduling another stream of execution on it.
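
Putting the bit settings from 5.2 together with these three classes, a classifier might look like the sketch below. It assumes IA32_MCG_CAP[24] is set and uses the SDM mapping of the S and AR flags (UCNA: S=0, AR=0; SRAO: S=1, AR=0; SRAR: S=1, AR=1). The function name and returned strings are made up, and other bits such as EN and OVER are ignored for brevity.

```python
def classify_status(status: int) -> str:
    """Classify a raw IA32_MCi_STATUS value, assuming software error recovery is supported."""
    def bit(n: int) -> int:
        return (status >> n) & 1

    if not bit(63):              # VAL
        return "invalid"
    if not bit(61):              # UC
        return "corrected"
    if bit(57):                  # PCC: processor context corrupted, not a UCR error
        return "uncorrected"
    s, ar = bit(56), bit(55)
    if s and ar:
        return "SRAR"
    if s:
        return "SRAO"
    if ar:
        return "reserved"        # S=0, AR=1 is not one of the three classes
    return "UCNA"


VAL, UC, S, AR = 1 << 63, 1 << 61, 1 << 56, 1 << 55
print(classify_status(VAL | UC | S | AR))
```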

5.4 UCR Error Overwrite Rules

In general, the overwrite rules are as follows:

UCR errors will overwrite corrected errors.
Uncorrected (PCC=1) errors overwrite UCR (PCC=0) errors.
UCR errors are not written over previous UCR errors.
Corrected errors do not write over previous UCR errors.
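
The four rules can be written down as a lookup table. This is only a sketch of the rules as listed here: the names are hypothetical, and combinations the list does not mention return None rather than guessing.

```python
CORRECTED, UCR, UNCORRECTED = "corrected", "ucr", "uncorrected"

# (previously logged error, newly detected error) -> does the new one overwrite?
OVERWRITE_RULES = {
    (CORRECTED, UCR): True,      # UCR errors overwrite corrected errors
    (UCR, UNCORRECTED): True,    # uncorrected (PCC=1) errors overwrite UCR errors
    (UCR, UCR): False,           # UCR errors are not written over previous UCR errors
    (UCR, CORRECTED): False,     # corrected errors do not write over previous UCR errors
}


def overwrites(old: str, new: str):
    """Return True/False per the listed rules, or None if the pair is not covered."""
    return OVERWRITE_RULES.get((old, new))


print(overwrites(CORRECTED, UCR), overwrites(UCR, UCR))
```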

6. Interpreting the MCA Error Codes

When the processor detects a machine-check error condition, it writes a 16-bit error code to the MCA error code field of one of the IA32_MCi_STATUS registers and sets the VAL (valid) flag in that register. The processor may also write a 16-bit model-specific error code in the IA32_MCi_STATUS register depending on the implementation of the machine-check architecture of the processor.
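
As a small sketch of that layout: the MCA error code occupies bits 15:0 of IA32_MCi_STATUS, the model-specific error code occupies bits 31:16, and VAL is bit 63. The helper name `split_error_codes` is made up for these notes.

```python
def split_error_codes(status: int):
    """Return (mca_error_code, model_specific_code), or None if VAL is clear."""
    if not (status >> 63) & 1:
        return None
    return status & 0xFFFF, (status >> 16) & 0xFFFF


status = (1 << 63) | (0x0012 << 16) | 0x0135   # example value chosen for illustration
mca, model_specific = split_error_codes(status)
print(hex(mca), hex(model_specific))
```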

6.1 Simple Error Codes

Simple error codes indicate global error information.
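
For reference, a partial lookup of simple error codes as I recall them from the SDM encoding table; treat the table as a sketch to be checked against the manual rather than a complete list. The names `SIMPLE_ERROR_CODES` and `simple_error_name` are hypothetical.

```python
# Partial table: MCA error code (IA32_MCi_STATUS[15:0]) -> meaning
SIMPLE_ERROR_CODES = {
    0x0000: "no error",
    0x0001: "unclassified",
    0x0002: "microcode ROM parity error",
    0x0003: "external error",
    0x0004: "FRC error",
    0x0005: "internal parity error",
    0x0400: "internal timer error",
}


def simple_error_name(mca_code: int):
    """Return the name of a simple error code, or None if it is not in the partial table."""
    return SIMPLE_ERROR_CODES.get(mca_code & 0xFFFF)


print(simple_error_name(0x0005))
```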

6.2 Compound Error Codes

Compound error codes describe errors related to the TLBs, memory, caches, bus and interconnect logic, and internal timer. A set of sub-fields is common to all compound errors. These sub-fields describe the type of access, the level in the cache hierarchy, and the type of request.

Transaction Type (TT) Sub-Field
Level (LL) Sub-Field
Request (RRRR) Sub-Field
Bus and Interconnect Errors
Memory Controller and Extended Memory Errors
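
To make the TT, LL and RRRR sub-fields concrete, here is a sketch that decodes one compound form, the cache hierarchy error, assuming the SDM bit pattern 000F 0001 RRRR TTLL. The function name is hypothetical, and the other compound forms (TLB, bus and interconnect, memory controller) are not handled.

```python
TT = {0b00: "I", 0b01: "D", 0b10: "G"}                   # instruction, data, generic
LL = {0b00: "L0", 0b01: "L1", 0b10: "L2", 0b11: "LG"}    # cache level, LG = generic
RRRR = {
    0b0000: "ERR", 0b0001: "RD", 0b0010: "WR", 0b0011: "DRD", 0b0100: "DWR",
    0b0101: "IRD", 0b0110: "PREFETCH", 0b0111: "EVICT", 0b1000: "SNOOP",
}


def decode_cache_error(mca_code: int):
    """Decode a cache hierarchy error code; return None for any other form."""
    if (mca_code & 0xEF00) != 0x0100:      # bit 12 (F) is masked out of the comparison
        return None
    return {
        "request": RRRR.get((mca_code >> 4) & 0xF, "reserved"),
        "type": TT.get((mca_code >> 2) & 0b11, "reserved"),
        "level": LL[mca_code & 0b11],
        "f_bit": bool(mca_code & 0x1000),  # correction report filtering flag
    }


print(decode_cache_error(0x0134))
```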

6.3 Architecturally Defined UCR Errors

Architecturally Defined SRAO Errors
Architecturally Defined SRAR Errors
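
The architecturally defined codes I am aware of from the SDM are memory scrubbing (0xC0 to 0xCF) and L3 explicit writeback (0x17A) for SRAO, and data load (0x134) and instruction fetch (0x150) for SRAR. The sketch below matches an MCA error code against them; the function name is made up, and masking out bit 12 before comparing is a convention assumed here, not something stated in these notes.

```python
def architectural_ucr_code(mca_code: int):
    """Map an MCA error code to (class, description), or None if it is not one of the four."""
    code = mca_code & 0xEFFF                 # assumption: ignore bit 12 when comparing
    if (code & 0xFFF0) == 0x00C0:            # low nibble carries the channel number
        return ("SRAO", "memory scrubbing")
    if code == 0x017A:
        return ("SRAO", "L3 explicit writeback")
    if code == 0x0134:
        return ("SRAR", "data load")
    if code == 0x0150:
        return ("SRAR", "instruction fetch")
    return None


print(architectural_ucr_code(0x0134))
```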

6.4 Multiple MCA Errors

When multiple MCA errors are detected within a certain detection window, the processor may aggregate the reporting of these errors together as a single event, i.e., a single machine-check exception condition. If this occurs, system software may find multiple MCA errors logged in different MC banks on one logical processor, or find multiple MCA errors logged across different processors for a single machine-check broadcast event.
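
In practice this means a handler should walk every bank rather than stop at the first valid one. A minimal sketch, assuming the per-bank IA32_MCi_STATUS values for one logical processor have already been read into a list; the function name and the three-level severity ranking are simplifications made up for these notes.

```python
def logged_errors(bank_status: list) -> list:
    """Return (severity, bank, mca_code) for every valid bank, most severe first.

    Severity: 2 = uncorrected with PCC=1, 1 = uncorrected with PCC=0 (UCR),
    0 = corrected.
    """
    found = []
    for bank, status in enumerate(bank_status):
        if not (status >> 63) & 1:           # VAL clear: nothing logged in this bank
            continue
        uc = (status >> 61) & 1
        pcc = (status >> 57) & 1
        severity = 2 if uc and pcc else 1 if uc else 0
        found.append((severity, bank, status & 0xFFFF))
    return sorted(found, reverse=True)


VAL, UC, PCC = 1 << 63, 1 << 61, 1 << 57
banks = [0, VAL | 0x0135, VAL | UC | PCC | 0x0400, VAL | UC | 0x0134]
for severity, bank, code in logged_errors(banks):
    print(severity, bank, hex(code))
```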
