Special Uncorrectable Error Handling - IBM Power 720 Overview

L2 and L3 array protection

The L2 and L3 caches in the POWER7+ processor are protected with a single-bit-correct, double-bit-detect error correction code (ECC). Single-bit errors are corrected before the data is forwarded to the processor, and the corrected data is subsequently written back to the L2 and L3 cache.
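
The single-bit-correct, double-bit-detect behavior can be illustrated with a small sketch. It uses an extended Hamming (8,4) code that protects 4 data bits, which is an assumption chosen for brevity and is not the code used in the POWER7+ caches. The names encode, decode, and flip_bits are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (toy SEC-DED)."""
    code = [0] * 8  # index 0: overall parity; indexes 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the parity of all 8 bits even
    return code


def decode(code):
    """Return (nibble, status); nibble is None when the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity: exactly one bit flipped, and the syndrome names it
        # (a syndrome of 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even overall parity with a nonzero syndrome: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


def flip_bits(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    damaged = list(code)
    for pos in positions:
        damaged[pos] ^= 1
    return damaged


word = encode(0b1011)
print(decode(word))                   # clean read
print(decode(flip_bits(word, 5)))     # single-bit error, corrected
print(decode(flip_bits(word, 2, 6)))  # double-bit error, detected only
```

In this sketch, a corrected read still returns the original data, which mirrors the correction that takes place before data is forwarded to the processor. A double-bit error yields no data and must be handled by other means.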

In addition, the caches maintain a cache-line-delete capability. When the number of correctable errors detected on a cache line reaches a threshold, the data in the cache line can be purged and the cache line removed from further operation without requiring a reboot. An ECC uncorrectable error detected in the cache can also trigger a purge and delete of the cache line. This results in no loss of operation when an unmodified copy of the data is held in main memory, because the cache line can be reloaded from there. Modified data is handled through Special Uncorrectable Error handling.

L2 and L3 deleted cache lines are marked for persistent deconfiguration on subsequent system reboots until they can be replaced.
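
The cache-line-delete policy can be sketched as a small bookkeeping class. This is an illustrative model only: the class name CacheLineTracker, its methods, and the threshold value of 3 are assumptions, because the actual threshold and the firmware interfaces are not stated here.

```python
from collections import Counter

CE_THRESHOLD = 3  # hypothetical value; the real threshold is not documented here


class CacheLineTracker:
    """Toy model of per-line correctable-error counting and line deletion."""

    def __init__(self, threshold=CE_THRESHOLD):
        self.threshold = threshold
        self.ce_counts = Counter()  # correctable errors seen per cache line
        self.deleted = set()        # lines removed from operation; this set
                                    # stands in for persistent deconfiguration

    def record_correctable(self, line):
        """Count a corrected error; delete the line once the threshold is hit."""
        if line in self.deleted:
            return "already-deleted"
        self.ce_counts[line] += 1
        if self.ce_counts[line] >= self.threshold:
            return self._purge_and_delete(line)
        return "corrected"

    def record_uncorrectable(self, line):
        """An uncorrectable error can trigger purge and delete immediately."""
        if line in self.deleted:
            return "already-deleted"
        return self._purge_and_delete(line)

    def is_usable(self, line):
        """Deleted lines stay out of service, including after a reboot."""
        return line not in self.deleted

    def _purge_and_delete(self, line):
        self.ce_counts.pop(line, None)
        self.deleted.add(line)
        return "purged-and-deleted"


tracker = CacheLineTracker()
for _ in range(3):
    print(tracker.record_correctable(0x40))
print(tracker.is_usable(0x40))
```

Keeping the deleted set separate from the error counters reflects the distinction between the run-time purge and the deconfiguration that persists until the part is replaced.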

4.2.5 Special Uncorrectable Error handling

While it is rare, an uncorrectable data error can occur in memory or a cache. IBM POWER processor-based systems attempt to limit the impact of an uncorrectable error to the least possible disruption, using a well-defined strategy that first considers the data source.

Sometimes, an uncorrectable error is temporary in nature and occurs in data that can be recovered from another repository, as in the following examples:

- Data in the instruction L1 cache is never modified within the cache itself. Therefore, an uncorrectable error discovered in the cache is treated like an ordinary cache miss, and correct data is loaded from the L2 cache.

- The L2 and L3 cache of the POWER7+ processor-based systems can hold an unmodified copy of data in a portion of main memory. In this case, an uncorrectable error simply triggers a reload of a cache line from main memory.
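
The source-first strategy in these examples can be summarized as a small decision function. This is a simplified sketch: the function name recovery_action, the location labels, and the returned strings are hypothetical, and the hardware makes this decision without any such software interface.

```python
def recovery_action(location, modified):
    """Pick a recovery path for an uncorrectable error, based on the data source.

    location: "L1I" (instruction L1 cache), "L2", "L3", or "memory"
    modified: True if the only current copy of the data is the damaged one
    """
    if location == "L1I":
        # Instruction L1 data is never modified in the cache, so a good
        # copy always exists further down the hierarchy.
        return "treat as cache miss and reload from L2"
    if location in ("L2", "L3") and not modified:
        # An unmodified line still has a valid copy in main memory.
        return "reload cache line from main memory"
    # No other repository holds good data.
    return "Special Uncorrectable Error handling"


print(recovery_action("L1I", modified=False))
print(recovery_action("L3", modified=False))
print(recovery_action("L2", modified=True))
```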

In cases where the data cannot be recovered from another source, a technique called Special Uncorrectable Error (SUE) handling is used to prevent an uncorrectable error in memory or cache from immediately causing the system to terminate. Rather, the system tags the data and determines whether it will ever be used again:

- If the error is irrelevant, SUE will not force a checkstop.

- If the data is used, termination can be limited to the program, kernel, or hypervisor that owns the data. If the data is about to be transferred to an I/O device, the I/O adapters controlled by an I/O hub controller can be frozen instead.

When an uncorrectable error is detected, the system modifies the associated ECC word, thereby signaling to the rest of the system that the "standard" ECC is no longer valid. The service processor is then notified and takes appropriate actions. When the system is running AIX 5.2 or later, or Linux, and a process attempts to use the data, the operating system is informed of the error. The operating system might then terminate, or terminate only the specific process associated with the corrupt data, depending on the operating system and firmware level and on whether the data was associated with a kernel or non-kernel process.

Only when the corrupt data is used by the POWER Hypervisor must the entire system be rebooted, which preserves overall system integrity.
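
The tag-then-decide flow of SUE handling can be modeled in a few lines. This is a minimal sketch under stated assumptions: the names Consumer, DataWord, detect_uncorrectable, and consume are hypothetical, the sue_tagged flag stands in for the modified ECC word, and the mapping from consumer to outcome is simplified. The real outcome also depends on the operating system and firmware level.

```python
from dataclasses import dataclass
from enum import Enum


class Consumer(Enum):
    NONE = "data is never used again"
    USER_PROCESS = "non-kernel process"
    KERNEL = "operating system kernel"
    HYPERVISOR = "POWER Hypervisor"
    IO_TRANSFER = "transfer to an I/O device"


@dataclass
class DataWord:
    value: int
    sue_tagged: bool = False  # stands in for the specially modified ECC word


def detect_uncorrectable(word):
    """Tag the data instead of stopping the system at detection time."""
    word.sue_tagged = True
    return "service processor notified"


def consume(word, consumer):
    """Decide the impact only when (and if) the tagged data is used."""
    if not word.sue_tagged or consumer is Consumer.NONE:
        return "no action"
    outcomes = {
        Consumer.USER_PROCESS: "terminate the owning process",
        Consumer.KERNEL: "terminate the operating system",
        Consumer.HYPERVISOR: "reboot the entire system",
        Consumer.IO_TRANSFER: "freeze the I/O adapters under the I/O hub controller",
    }
    return outcomes[consumer]


word = DataWord(value=0x2A)
print(detect_uncorrectable(word))
print(consume(word, Consumer.NONE))
print(consume(word, Consumer.USER_PROCESS))
print(consume(word, Consumer.HYPERVISOR))
```

Deferring the decision to the point of use is what allows an error in data that is never referenced again to pass without a checkstop.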

IBM Power 720 and 740 Technical Overview and Introduction
