IBM Power Systems 775 Manual page 98

For AIX and Linux HPC Solution

Solid-state disks
GPFS Native RAID assumes several solid-state disks (SSDs) in each recovery group in order
to redundantly log changes to its internal configuration and fast-write data in non-volatile
memory, which is accessible from either the primary or backup GPFS Native RAID servers
after server failure. A typical GPFS Native RAID log VDisk might be configured as three-way
replication over a dedicated declustered array of four SSDs per recovery group.
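As a rough illustration of the arithmetic behind such a layout, the sketch below computes the usable capacity and fault tolerance of n-way replication over a small array. The function name and the 200 GB SSD size are assumptions made for the example, not values from GPFS Native RAID, and spare space and metadata overhead are ignored.

```python
def replication_profile(copies: int, disks: int, disk_capacity_gb: float) -> dict:
    """Summarize n-way replication spread over a declustered array (simplified)."""
    if copies < 1 or disks < copies:
        raise ValueError("need at least as many disks as copies")
    raw_gb = disks * disk_capacity_gb
    return {
        "raw_gb": raw_gb,
        # Every strip is stored `copies` times, so usable space shrinks by that factor.
        "usable_gb": raw_gb / copies,
        # Data survives as long as one copy of each strip remains.
        "tolerated_failures": copies - 1,
    }


# Three-way replication over four SSDs; the 200 GB size is a hypothetical value.
profile = replication_profile(copies=3, disks=4, disk_capacity_gb=200.0)
```

Under these assumptions, three-way replication keeps the log readable after two SSD failures, at the cost of storing each strip three times.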
Disk hospital
The disk hospital is a key feature of GPFS Native RAID that asynchronously diagnoses errors
and faults in the storage subsystem. GPFS Native RAID times out an individual pdisk I/O
operation after approximately 10 seconds, limiting the effect of a faulty pdisk on a client I/O
operation. When a pdisk I/O operation results in a timeout, an I/O error, or a checksum
mismatch, the suspect pdisk is immediately admitted into the disk hospital. When a pdisk is
first admitted, the hospital determines whether the error was caused by the pdisk or by the
paths to it. While the hospital diagnoses the error, GPFS Native RAID, if possible, uses
VDisk redundancy codes to reconstruct lost or erased strips for I/O operations that would
otherwise have used the suspect pdisk.
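The admission rule described above can be sketched as follows. This is a minimal model, not the GPFS Native RAID implementation: the `IoOutcome`, `classify`, and `Hospital` names are hypothetical, and the 10-second constant is the approximate timeout quoted in the text.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

PDISK_IO_TIMEOUT_S = 10.0  # "approximately 10 seconds" per the text


class IoOutcome(Enum):
    OK = auto()
    TIMEOUT = auto()
    IO_ERROR = auto()
    CHECKSUM_MISMATCH = auto()


def classify(elapsed_s: float, io_error: bool, checksum_ok: bool) -> IoOutcome:
    """Map the result of one pdisk I/O to an outcome (hypothetical helper)."""
    if elapsed_s >= PDISK_IO_TIMEOUT_S:
        return IoOutcome.TIMEOUT
    if io_error:
        return IoOutcome.IO_ERROR
    if not checksum_ok:
        return IoOutcome.CHECKSUM_MISMATCH
    return IoOutcome.OK


@dataclass
class Hospital:
    patients: set = field(default_factory=set)

    def report(self, pdisk: str, outcome: IoOutcome) -> bool:
        """Admit the pdisk on any failure.

        Returns True when the caller should serve the I/O by reconstructing
        the strip from VDisk redundancy instead of using this pdisk.
        """
        if outcome is IoOutcome.OK:
            return False
        self.patients.add(pdisk)
        return True


hospital = Hospital()
must_reconstruct = hospital.report("pdisk07", classify(12.5, False, True))
```

Diagnosis itself (pdisk fault versus path fault) is left out of the sketch; it only shows that admission happens immediately and that the client I/O is redirected to reconstruction in the meantime.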
Health metrics
The disk hospital maintains internal health assessment metrics for each pdisk: time badness,
which characterizes response times; and data badness, which characterizes media errors
(hard errors) and checksum errors. When a pdisk health metric exceeds its threshold, the pdisk
is marked for replacement according to the disk maintenance replacement policy for the
declustered array.
The disk hospital logs selected Self-Monitoring, Analysis, and Reporting Technology
(SMART) data, including the number of internal sector remapping events for each pdisk.
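The two metrics can be pictured as per-pdisk counters that are compared against thresholds. The sketch below is only an illustration: the text does not say how badness is scored, so the increments, the slow-I/O cutoff, and both limits are invented values, and `PdiskHealth` is a hypothetical name.

```python
from dataclasses import dataclass

# Hypothetical values; the source gives no numeric thresholds or scoring rules.
SLOW_IO_SECONDS = 1.0
TIME_BADNESS_LIMIT = 5.0
DATA_BADNESS_LIMIT = 3.0


@dataclass
class PdiskHealth:
    time_badness: float = 0.0  # characterizes response times
    data_badness: float = 0.0  # characterizes media and checksum errors

    def record_io(self, latency_s: float, media_error: bool = False,
                  checksum_error: bool = False) -> None:
        """Update both metrics from the result of one I/O operation."""
        if latency_s > SLOW_IO_SECONDS:
            self.time_badness += 1.0
        if media_error or checksum_error:
            self.data_badness += 1.0

    def needs_replacement(self) -> bool:
        """True once either metric exceeds its threshold."""
        return (self.time_badness > TIME_BADNESS_LIMIT
                or self.data_badness > DATA_BADNESS_LIMIT)


health = PdiskHealth()
for _ in range(4):
    health.record_io(latency_s=0.01, media_error=True)
```

Keeping the two metrics separate lets a slow but error-free disk and a fast but corrupting disk each trip the replacement policy independently.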
Pdisk discovery
GPFS Native RAID discovers all connected pdisks when it starts, and then regularly
schedules a process that rediscovers any pdisk that newly becomes accessible to the GPFS
Native RAID server. This rediscovery allows pdisks to be physically connected, or connection
problems to be repaired, without restarting the GPFS Native RAID server.
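One pass of such a rediscovery process amounts to comparing the pdisks that are currently visible with those already known. The sketch below assumes pdisks are identified by name and that some scan (not shown) produces the visible set; `rediscover` is a hypothetical helper, not a GPFS Native RAID function.

```python
from typing import Set, Tuple


def rediscover(known: Set[str], visible: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return the updated known set and the pdisks found in this pass."""
    newly_accessible = visible - known
    return known | newly_accessible, newly_accessible


# Startup discovery finds two pdisks; a later pass sees a third one appear.
known_pdisks = {"pdisk01", "pdisk02"}
known_pdisks, found = rediscover(known_pdisks, {"pdisk01", "pdisk02", "pdisk03"})
```

Running the pass on a schedule is what lets a newly cabled or repaired pdisk join without a server restart.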
Disk replacement
The disk hospital tracks disks that require replacement according to the disk replacement
policy of the declustered array. The disk hospital can be configured to report the need for
replacement in various ways. The hospital records and reports the FRU number and physical
hardware location of failed disks to help guide service personnel to the correct location with
replacement disks.
When multiple disks are mounted on a removable carrier, each of which is a member of a
different declustered array, disk replacement requires the hospital to temporarily suspend
other disks in the same carrier. To guard against human error, carriers are also not removable
until GPFS Native RAID actuates a solenoid-controlled latch. In response to administrative
commands, the hospital quiesces the appropriate disks, releases the carrier latch, and turns
on identify lights on the carrier next to the disks that require replacement.
After one or more disks are replaced and the carrier is re-inserted, in response to
administrative commands, the hospital verifies that the repair took place. The hospital also
automatically adds any new disks to the declustered array, which causes GPFS Native RAID
to rebalance the tracks and spare space across all the disks of the declustered array. If
service personnel fail to reinsert the carrier within a reasonable period, the hospital declares
the disks on the carrier as missing and starts rebuilding the affected data.
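The carrier replacement sequence described above can be modeled as a small state machine. This is a simplified sketch under stated assumptions: the `Carrier` class, its state strings, and the deadline value are hypothetical, and rebalancing after the repair is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Carrier:
    states: dict  # pdisk name -> "ok" | "failed" | "suspended" | "missing"
    latch_open: bool = False
    identify_lit: set = field(default_factory=set)

    def release(self) -> None:
        """Quiesce every disk on the carrier, light the failed ones, open the latch."""
        self.identify_lit = {d for d, s in self.states.items() if s == "failed"}
        for d in self.states:
            self.states[d] = "suspended"
        self.latch_open = True

    def reinsert(self, replaced: set) -> bool:
        """Verify the repair: every disk with a lit identify light must be swapped."""
        self.latch_open = False
        verified = self.identify_lit <= replaced
        if verified:
            for d in self.states:
                self.states[d] = "ok"
            self.identify_lit = set()
        return verified

    def expire(self, elapsed_s: float, deadline_s: float) -> bool:
        """Declare the disks missing if the carrier stays out past the deadline."""
        if self.latch_open and elapsed_s > deadline_s:
            for d in self.states:
                self.states[d] = "missing"
            return True
        return False


carrier = Carrier({"p1": "ok", "p2": "failed", "p3": "ok", "p4": "ok"})
carrier.release()
repaired = carrier.reinsert({"p2"})
```

If `expire` returns True in this model, the caller would start rebuilding the affected data, which mirrors the behavior the text describes when the carrier is not reinserted in time.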