Dimm Sparing Function - Intel SE7520JR2 Technical Manual

Server board technical product specification

Functional Architecture
Intel® Server Board SE7520JR2
3.3.6.5 DIMM Sparing Function

To provide a more fault-tolerant system, the Intel E7520 MCH includes specialized hardware to
support fail-over to a spare DIMM device in the event that a primary DIMM in use exceeds a
specified threshold of runtime errors. One of the DIMMs installed per channel, at least as large
as every other DIMM installed, will not be used but kept in reserve. In the event of significant
failures in a particular DIMM, it and its corresponding partner in the other channel (if applicable)
will, over time, have their data copied to the spare DIMM(s) held in reserve. When all the
data has been copied, the reserve DIMM(s) will be put into service and the failing DIMM will be
removed from service. Only one sparing cycle is supported. If this feature is not enabled,
all DIMMs will be visible in normal address space.
Note: The DIMM Sparing feature requires that the spare DIMM be at least the size of the largest
primary DIMM in use.
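The size rule in the note above can be written as a simple check. The following is a minimal sketch only; the function names and the use of megabytes are assumptions made for illustration and do not correspond to any MCH register or BIOS interface.

```python
def spare_is_valid(spare_mb, primary_mb):
    """Check the size rule: spare >= largest primary DIMM (sizes in MB)."""
    if not primary_mb:
        raise ValueError("at least one primary DIMM is required")
    return spare_mb >= max(primary_mb)


def visible_capacity_mb(spare_mb, primary_mb, sparing_enabled):
    """Capacity visible in normal address space for one channel."""
    # With sparing enabled the reserve DIMM is held back; otherwise
    # every installed DIMM is visible.
    return sum(primary_mb) + (0 if sparing_enabled else spare_mb)
```

For example, a 1024 MB spare is acceptable alongside 512 MB and 1024 MB primaries, while a 512 MB spare is not.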
Hardware additions for this feature include a tracking register per DIMM to
maintain a history of error occurrence, and a programmable register to hold the fail-over error
threshold level. The operational model is straightforward: if the fail-over threshold register is set
to a non-zero value, the feature is enabled, and if the count of errors on any DIMM exceeds that
value, fail-over will commence. The tracking registers themselves are implemented as "leaky
buckets": they do not contain an absolute cumulative count of all errors since power-on;
rather, they contain an aggregate count of the number of errors received over a running time
period. The "drip rate" of the bucket is selectable by software, so it is possible to set the
threshold to a value that will never be reached by a "healthy" memory subsystem experiencing
the rate of errors expected for the size and type of memory devices in use.
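The leaky-bucket behavior described above can be modeled in software as follows. This is an illustrative sketch, not the register-level implementation: the class name, the abstract time units, and the one-count-per-interval drip are all assumptions.

```python
class LeakyBucketCounter:
    """Software model of a per-DIMM leaky-bucket error counter."""

    def __init__(self, threshold, drip_interval):
        self.threshold = threshold          # 0 means the feature is disabled
        self.drip_interval = drip_interval  # time units per one-count leak (assumed)
        self.count = 0
        self._last_drip = 0

    def _leak(self, now):
        # Remove one count for every full drip interval that has elapsed.
        drips = (now - self._last_drip) // self.drip_interval
        if drips > 0:
            self.count = max(0, self.count - drips)
            self._last_drip += drips * self.drip_interval

    def record_error(self, now):
        """Log one error at time `now`; return True if fail-over should start."""
        self._leak(now)
        self.count += 1
        return self.threshold != 0 and self.count > self.threshold
```

With a threshold of 3 and a drip interval of 50 time units, errors arriving 100 units apart never accumulate, whereas four errors in quick succession push the count above the threshold.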
The fail-over mechanism is slightly more complex. Once fail-over has been initiated, the MCH
must execute every write twice: once to the primary DIMM, and once to the spare. The MCH will
also begin tracking the progress of its built-in memory scrub engine. Once the scrub engine has
covered every location in the primary DIMM, the duplicate write function will have copied every
data location to the spare. At that point, the MCH can switch the spare into primary use, and
take the failing DIMM off-line.
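The copy sequence above can be sketched as a small simulation. It assumes that each scrub pass over a location results in a write-back, which the duplicate-write function then mirrors to the spare; the class and method names are hypothetical and the model ignores ECC correction and timing.

```python
class SparingCopy:
    """Toy model of fail-over: duplicate writes plus a scrub sweep."""

    def __init__(self, primary):
        if not primary:
            raise ValueError("primary DIMM model must not be empty")
        self.primary = list(primary)
        self.spare = [None] * len(primary)
        self.scrub_next = 0   # next address the scrub engine will visit
        self.done = False     # True once the spare has been switched in

    def write(self, addr, value):
        if self.done:
            self.spare[addr] = value   # spare is now the DIMM in service
            return
        self.primary[addr] = value     # first write: primary
        self.spare[addr] = value       # second write: spare

    def read(self, addr):
        return self.spare[addr] if self.done else self.primary[addr]

    def scrub_step(self):
        """Scrub one location; switch to the spare after the last one."""
        if self.done:
            return
        addr = self.scrub_next
        self.write(addr, self.primary[addr])  # write-back is duplicated
        self.scrub_next += 1
        if self.scrub_next == len(self.primary):
            self.done = True
```

Ordinary writes that arrive while the sweep is in progress land in both DIMMs, so the spare holds current data for every location by the time the sweep finishes.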
Note that this entire mechanism requires no software support once it has been programmed and
enabled, until the threshold detection has been triggered to request a data copy. Hardware will
detect the threshold initiating fail-over, and escalate the occurrence of that event as directed
(signal an SMI, generate an interrupt, or wait to be discovered via polling). Whatever software
routine responds to the threshold detection must select a victim DIMM (in case multiple DIMMs
have crossed the threshold prior to sparing invocation) and initiate the memory copy. Hardware
will automatically isolate the "failed" DIMM once the copy has completed. The data copy is
accomplished by address aliasing within the DDR control interface, so it does not require
reprogramming of the DRAM row boundary (DRB) registers, nor does it require notification to
the operating system that anything has occurred in memory.
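The victim-selection step left to software might look like the sketch below. The text does not specify a selection policy, so choosing the DIMM with the highest error count (ties broken by name) is an assumption, as are the function name and the DIMM labels used in the example.

```python
def select_victim(error_counts, threshold):
    """Pick one DIMM to spare out, or None if no DIMM is over threshold.

    error_counts maps a DIMM label to its current leaky-bucket count.
    Policy (assumed): highest count wins; ties go to the lowest label.
    """
    if threshold == 0:
        return None  # a zero threshold means sparing is disabled
    over = {dimm: n for dimm, n in error_counts.items() if n > threshold}
    if not over:
        return None
    # sorted() fixes the iteration order so max() breaks ties by label.
    return max(sorted(over), key=lambda dimm: over[dimm])
```

A handler invoked by SMI, interrupt, or polling would call such a routine once, since only one sparing cycle is supported, and then initiate the memory copy for the returned DIMM.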
The memory mirroring feature and DIMM sparing are mutually exclusive; only one may be
activated during initialization. The selected feature must remain enabled until the next power
cycle. There is no provision in hardware to switch from one feature to the other without
rebooting, nor is there a provision to "back out" of a feature once enabled without a full reboot.
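The exclusivity rule can be expressed as a small state model. This is a hypothetical sketch of how configuration software might guard the selection; the enum and class names are illustrative and are not taken from the BIOS or the MCH register set.

```python
from enum import Enum


class RasMode(Enum):
    NONE = "none"
    SPARING = "sparing"
    MIRRORING = "mirroring"


class MemoryRasConfig:
    """Tracks which memory RAS feature was selected at initialization."""

    def __init__(self):
        self.mode = RasMode.NONE

    def enable(self, mode):
        if mode is RasMode.NONE:
            raise ValueError("a feature cannot be backed out once enabled")
        if self.mode is not RasMode.NONE and mode is not self.mode:
            raise RuntimeError("cannot switch features without a reboot")
        self.mode = mode

    def power_cycle(self):
        # Only a power cycle returns the configuration to its initial state.
        self.mode = RasMode.NONE
```

Attempting to enable mirroring after sparing raises an error until `power_cycle()` is called, mirroring the hardware restriction described above.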
Revision 1.0
C78844-002
