Scheduled Outages - IBM z13s Technical Manual


Memory subsystem improvements
z13s servers use redundant array of independent memory (RAIM), a concept similar to what
the disk industry knows as redundant array of independent disks (RAID).
RAIM design detects and recovers from DRAM, socket, memory channel, or DIMM
failures. The RAIM design includes the addition of one memory channel that is dedicated
for RAS. The parity of the four data DIMMs is stored in the DIMMs that are attached to a
fifth memory channel. Any failure in a memory component can be detected and corrected
dynamically.
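The role of the extra memory channel can be illustrated with a minimal sketch. It assumes plain XOR parity over four data channels plus a fifth parity channel, which is a simplification: the actual z13s design uses a Reed-Solomon based ECC, and the function names and sample byte values here are hypothetical. XOR parity alone can only rebuild a channel once the failed channel is known, so locating the fault is assumed to happen elsewhere.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(data_channels: list[bytes]) -> bytes:
    """Compute the parity block stored on the extra (fifth) channel."""
    return reduce(xor_bytes, data_channels)


def rebuild_channel(channels: list[bytes], failed_index: int) -> bytes:
    """Rebuild one known-bad channel by XOR-ing all surviving channels."""
    survivors = [c for i, c in enumerate(channels) if i != failed_index]
    return reduce(xor_bytes, survivors)


# Four hypothetical data channels and the derived parity channel.
data = [b"\x11\x22", b"\x33\x44", b"\x55\x66", b"\x77\x88"]
channels = data + [make_parity(data)]

# Simulate losing channel 2 and recover its contents from the other four.
recovered = rebuild_channel(channels, failed_index=2)
assert recovered == data[2]
```

Because the XOR of all five channels is zero, any single channel, including the parity channel itself, can be rebuilt from the other four. This is the N+1 property in its simplest form.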
This design takes the RAS of the memory subsystem to another level, making it
essentially a fully fault-tolerant N+1 design. The memory system on z13s servers is
implemented with an enhanced version of the Reed-Solomon error correction code (ECC)
that is known as 90B/64B, and includes protection against memory channel and DIMM
failures.
A precise marking of faulty chips helps assure timely DRAM replacements. The key cache
on the z13s memory is completely mirrored. For a full description of the memory system
on z13s servers, see 2.4, "Memory" on page 53.
Improved thermal and condensation management
Soft-switch firmware
The capabilities of soft-switching firmware have been enhanced. Enhanced logic in this
function ensures that every affected circuit is powered off during the soft switching of
firmware components. For example, if you must upgrade the microcode of a Fibre Channel
connection (FICON) feature, enhancements have been implemented to avoid any
unwanted side effects that have been detected on previous servers.
Server Time Protocol (STP) recovery enhancement
When HCA3-O (12xIFB), HCA3-O Long Reach (LR) (1xIFB), or PCIe-based ICA SR
coupling links are used, an unambiguous "going away signal" is sent when the server on
which the HCA3 is running is about to enter a failed (check stopped) state.
When the "going away signal" sent by the Current Time Server (CTS) in an STP-only
Coordinated Timing Network (CTN) is received by the Backup Time Server (BTS), the
BTS can safely take over as the CTS without relying on the previous Offline Signal (OLS)
in a two-server CTN, or on the Arbiter in a CTN with three or more servers.
Enhanced Console Assisted Recovery (ECAR) is new with z13s and z13 GA2, and
provides better recovery algorithms when the Preferred Time Server (PTS) fails. It uses communication
over the HMC/SE network to speed up the process of BTS takeover. See "Enhanced
Console Assisted Recovery" on page 419 for more information.

9.3.2 Scheduled outages

Concurrent hardware upgrades, concurrent parts replacement, concurrent driver upgrade,
and concurrent firmware fixes are available with z13s servers, and address the elimination of
scheduled outages. Furthermore, the following indicators and functions that address
scheduled outages are included:
Double memory data bus lane sparing: This sparing reduces the number of repair actions for memory.
Single memory clock sparing
Double-dynamic random access memory (DRAM) IBM Chipkill tolerance
Field repair of the cache fabric bus
Power distribution N+2 design
Chapter 9. Reliability, availability, and serviceability