Predictive Failure Analysis (Pfa); Disk Scrubbing; Disk Path Redundancy - IBM TotalStorage DS6000 Series Redbooks

Concepts and architecture
Hide thumbs Also See for TotalStorage DS6000 Series:
Table of Contents

Advertisement

DDM, then approximately half of the 146 GB DDM would be wasted since that space is not
needed. The problem here is that the failed 73 GB DDM will be replaced with a new 73 GB
DDM. So the DS6000 microcode will most likely migrate the data on the 146 GB DDM onto
the recently replaced 73 GB DDM. When this process completes, the 73 GB DDM will rejoin
the array and the 146 GB will become the spare again. Another example would be if we fail a
10k RPM DDM onto a 15k RPM DDM. While this means that the data has now moved to a
faster DDM, the replacement DDM will be the same as the failed DDM. This means the spare
will now be a 10k RPM DDM. This could result in a 15k RPM DDM being spared onto a 10k
RPM DDM. This is not desirable. Again a smart failback of the spare will be performed once a
suitable replacement DDM has been made available.
Hot plugable DDMs
Replacement of a failed drive does not affect the operation of the DS6000 because the drives
are fully hot plugable. Due to the fact that each disk plugs into a switch, there is no loop break
associated with the removal or replacement of a disk. In addition, there is no potentially
disruptive loop initialization process.

3.3.4 Predictive Failure Analysis (PFA)

The drives used in the DS6000 incorporate Predictive Failure Analysis (PFA) and can
anticipate certain forms of failures by keeping internal statistics of read and write errors. If the
error rates exceed predetermined threshold values, the drive will be nominated for
replacement. Because the drive has not yet failed, data can be copied directly to a spare
drive. This avoids using RAID-5 or RAID-10 recovery to reconstruct all of the data onto the
spare drive. The DS6000 will alert the user and can also use call home e-mail notification.

3.3.5 Disk scrubbing

The DS6000 will periodically read all sectors on a disk. This is designed to occur without any
interference to application performance. If ECC-correctable bad bits are identified, the bits are
corrected immediately by the DS6000. This reduces the possibility of multiple bad bits
accumulating in a sector beyond the ability of ECC to correct them. If a sector contains data
that is beyond ECC's ability to correct, then RAID is used to regenerate the data and write a
new copy onto a spare sector on the disk. The scrubbing process applies to both array
members and spare DDMs.

3.3.6 Disk path redundancy

Each DDM in the DS6000 is attached to two 22 port SAN switches. These switches are built
into the RAID or SBOD controller cards. Figure 3-5 on page 56 depicts the redundancy
features of the DS6000 switched disk architecture. Each disk has two separate connections
to the midplane. This allows it to be simultaneously attached to both switches. If either a RAID
or SBOD controller card is removed from an enclosure, the switch that is included in that
controller is also removed. However, the remaining controller retains the ability to
communicate with all the disks via the remaining switch.
Figure 3-5 also shows the connection paths to the expansion enclosures. To the left and right
you can see paths from the switches and Fibre Channel chipset that travel to the device
adapter ports at top left and top right. These ports are depicted in Figure 2-2 on page 24.
From each controller we have two paths to each expansion enclosure. This means that we
can easily survive the loss of a single path (which would mean the loss of one out of four
paths) due to the failure of, for instance, a cable or an optical port. We can also survive the
loss of an entire RAID controller or SBOD controller (which would remove two out of four
55
Chapter 3. RAS

Advertisement

Table of Contents
loading

Table of Contents