Processor Core/Cache Correctable Error Handling; Processor Instruction Retry And Other Try Again Techniques; Alternative Processor Recovery And Partition Availability Priority - IBM Power System E850C Technical Overview And Introduction


software layers. The software layers must then be responsible for determining how to

minimize the impact of faults.

The advanced RAS features that are built in to POWER8 processor-based systems handle

certain "uncorrectable" errors in ways that minimize the impact of the faults. These features

can even keep an entire system up and running after experiencing such a failure.

Depending on the fault, such recovery might use the virtualization capabilities of PowerVM in

such a way that the operating system or any applications that are running in the system are

not affected or required to participate in the recovery.

4.3.3 Processor Core/Cache correctable error handling

Level 2 (L2) and Level 3 (L3) caches and directories can correct single bit errors and detect

double bit errors (SEC/DED ECC). Soft errors that are detected in the Level 1 caches are also

correctable by a try again operation that is handled by the hardware. Internal and external

processor "fabric" buses have SEC/DED ECC protection as well.

SEC/DED capabilities are also included in other data arrays that are not directly visible to

customers.
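As an illustration of what SEC/DED means, the following minimal sketch implements an extended Hamming (8,4) code in Python. It is not the ECC that POWER8 hardware uses (real cache ECC words are much wider); the function names encode and decode and the 4-bit data width are assumptions chosen only to keep the example small.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity, 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the whole word even parity
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    parity_error = sum(code) % 2 == 1   # odd parity means an odd number of flips
    if syndrome == 0 and not parity_error:
        status = "ok"
    elif parity_error:
        # Single bit error: the syndrome is the failing position
        # (0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even parity: two bits flipped, detect only.
        return "uncorrectable", None
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble
```

Flipping any one bit of encode(n) makes decode report "corrected" and return n; flipping any two bits makes it report "uncorrectable", which is the single error correct, double error detect behavior described above.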

Beyond soft error correction, the intent of the POWER8 design is to manage a solid

correctable error in an L2 or L3 cache by using techniques to delete a cache line with a

persistent issue, or to repair a column of an L3 cache dynamically by using spare capability.

Information about column and row repair operations is stored persistently for processors. This

process allows more permanent repairs to be made during processor reinitialization (during

system reboot, or individual Core Power on Reset using the Power On Reset Engine).
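The cache line delete idea can be sketched as a small bookkeeping model. This is only a conceptual illustration: the class name CacheLineMonitor, the per-line counter, and the threshold of three correctable errors are all assumptions, because the text does not state how the hardware decides that a correctable error is solid.

```python
from collections import Counter


class CacheLineMonitor:
    """Toy model: take a cache line out of use after repeated correctable errors."""

    def __init__(self, threshold=3):
        self.threshold = threshold      # hypothetical value, not from the source
        self.ce_counts = Counter()      # correctable errors seen per line
        self.deleted = set()            # lines that are no longer used

    def record_correctable_error(self, line):
        if line in self.deleted:
            return "already-deleted"
        self.ce_counts[line] += 1
        if self.ce_counts[line] >= self.threshold:
            # The error keeps coming back, so treat it as solid and
            # stop using the line instead of correcting it forever.
            self.deleted.add(line)
            return "line-deleted"
        return "corrected"

    def usable(self, line):
        return line not in self.deleted
```

In this model a line that reports an occasional soft error stays in service, while a line that keeps failing is removed from use. Saving the deleted set across restarts would correspond to the persistent repair information that the text mentions.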

4.3.4 Processor Instruction Retry and other try again techniques

Within the processor core, soft error events might occur that interfere with the various

computation units. When such an event is detected before a failing instruction is completed,

the processor hardware might be able to try the operation again by using the advanced RAS

feature that is known as Processor Instruction Retry.

Processor Instruction Retry allows the system to recover from soft faults that would otherwise

result in an outage of applications or the entire server.

Try again techniques are used in other parts of the system as well. Faults that are detected on

the memory bus that connects processor memory controllers to DIMMs can be tried again. In

POWER8 systems, the memory controller is designed with a replay buffer that allows memory

transactions to be tried again after certain faults internal to the memory controller are

detected. This process complements the try again abilities of the memory buffer module.
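The replay buffer concept can be illustrated with a short Python sketch. It is a software analogy under stated assumptions, not a description of the POWER8 memory controller: the names ReplayingController, TransientFault, and max_replays are hypothetical, and the link argument stands in for whatever carries a transaction to memory.

```python
class TransientFault(Exception):
    """Raised by the link when a transaction is hit by a recoverable fault."""


class ReplayingController:
    """Toy controller that keeps each transaction until it completes."""

    def __init__(self, link, max_replays=2):
        self.link = link                # callable that performs one transaction
        self.max_replays = max_replays  # hypothetical limit, not from the source
        self.replay_buffer = {}         # tag -> transaction still in flight
        self.next_tag = 0
        self.faults_seen = 0

    def issue(self, txn):
        tag = self.next_tag
        self.next_tag += 1
        self.replay_buffer[tag] = txn   # hold a copy so it can be sent again
        try:
            for _ in range(self.max_replays + 1):
                try:
                    return self.link(self.replay_buffer[tag])
                except TransientFault:
                    self.faults_seen += 1   # fault detected, replay from buffer
            raise RuntimeError("fault persisted after replay")
        finally:
            del self.replay_buffer[tag]     # completed or given up
```

The point of the design is that the requester never sees the transient fault: the transaction is reissued from the buffered copy, and only a fault that persists through every replay is reported upward.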

4.3.5 Alternative processor recovery and Partition Availability Priority

If Processor Instruction Retry for a fault within a core occurs multiple times without success,

the fault is considered to be a solid failure. In some instances, PowerVM can work with the

processor hardware to migrate a workload running on the failing processor to a spare or

alternative processor. This migration is accomplished by migrating the pertinent processor

core state from one core to another with the new core taking over at the instruction that failed

on the faulty core. Successful migration keeps the application running during the migration

without needing to terminate the failing application.
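The escalation from retry to migration can be sketched as follows. This is a simplified model, not PowerVM behavior: the Core class, the run function, the dictionary that holds the core state, and the limit of three attempts are all assumptions made for illustration. Each instruction is a (register, value) pair that adds the value to the register.

```python
class CoreFault(Exception):
    """Raised when a core fails to complete an instruction."""


class Core:
    def __init__(self, name, faulty_pcs=()):
        self.name = name
        self.faulty_pcs = set(faulty_pcs)   # instruction indexes that always fail here

    def execute(self, state, instr):
        if state["pc"] in self.faulty_pcs:
            # Fault is detected before the instruction completes,
            # so the architected state is left untouched.
            raise CoreFault(f"{self.name} failed at pc={state['pc']}")
        reg, value = instr
        state["regs"][reg] = state["regs"].get(reg, 0) + value
        state["pc"] += 1


def run(program, primary, spare, max_retries=3):
    """Run program on primary; after max_retries failed attempts, move to spare."""
    state = {"pc": 0, "regs": {}}           # the state that gets migrated
    core, log = primary, []
    while state["pc"] < len(program):
        instr = program[state["pc"]]
        for _ in range(max_retries):
            try:
                core.execute(state, instr)
                break
            except CoreFault:
                log.append(("fault", core.name, state["pc"]))
        else:
            # Every attempt failed: treat the fault as solid.
            if core is spare:
                raise RuntimeError("no alternative core available")
            log.append(("migrate", core.name, spare.name, state["pc"]))
            core = spare                    # spare resumes at the failed instruction
    return state, log
```

For example, with program = [("r1", 1), ("r1", 2), ("r2", 5)], a primary core built as Core("core0", faulty_pcs={1}), and a healthy spare, the run logs three faults at instruction 1, one migration, and then finishes on the spare with the same register results that a fault-free run would produce.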


Chapter 4. Reliability, availability, and serviceability
