IBM System/370 Manual page 170

accomplished by programmed recovery to allow system operations to
continue whenever possible and by the recording of system status for
both transient (corrected) and permanent (uncorrected) hardware errors.
MACHINE CHECK HANDLER:
During IPL of a control program containing
Model 165 RMS routines, machine check mask bits are enabled, and control
register values are set to permit all machine check interrupts and
logouts to occur.
MCH receives control after the occurrence of both soft and hard
machine check interrupts.
When a soft machine check occurs (successful
CPU retry, single-bit processor storage error corrected, time of day
clock damage, or multiple-bit processor storage error during an I/O
operation), MCH formats a recovery report record to be written in the
system error recording data set SYS1.LOGREC.
This record contains
pertinent information about the error, including data from
the logout areas, an indication of the recovery that occurred,
identification of the job, job step, and program involved in the error,
the date, and the time of day.
The operator is informed of successful
CPU retries, single-bit processor storage corrections, and an error
in the time of day clock.
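The soft machine check path described above can be pictured with a small model. This is only an illustrative sketch in Python, not MCH code: the names SoftCheck, RecoveryReport, and handle_soft_check are hypothetical, a plain list stands in for SYS1.LOGREC and for the operator console, and the set of operator-visible conditions simply follows the three conditions the text names.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum, auto


class SoftCheck(Enum):
    """The four soft machine check conditions named in the text."""
    CPU_RETRY = auto()            # successful CPU retry
    SINGLE_BIT_ECC = auto()       # single-bit processor storage error corrected
    TOD_CLOCK_DAMAGE = auto()     # time of day clock damage
    MULTI_BIT_DURING_IO = auto()  # multiple-bit storage error during an I/O operation


# Conditions the text says the operator is informed of (assumed exhaustive).
OPERATOR_NOTIFIED = {
    SoftCheck.CPU_RETRY,
    SoftCheck.SINGLE_BIT_ECC,
    SoftCheck.TOD_CLOCK_DAMAGE,
}


@dataclass
class RecoveryReport:
    """Hypothetical layout holding the items the text lists for the record."""
    check: SoftCheck
    logout_data: bytes      # data taken from the logout areas
    recovery_action: str    # indication of the recovery that occurred
    job: str
    job_step: str
    program: str
    timestamp: datetime     # date and time of day


def handle_soft_check(check, logout_data, recovery_action,
                      job, job_step, program, now, logrec, console):
    """Format a report, append it to the log, and notify the operator if applicable."""
    report = RecoveryReport(check, logout_data, recovery_action,
                            job, job_step, program, now)
    logrec.append(report)  # stand-in for writing to SYS1.LOGREC
    if check in OPERATOR_NOTIFIED:
        console.append(f"{check.name}: {recovery_action}")
    return report
```

In this model every soft check produces a log record, while only three of the four conditions also produce a console message.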
MCH performs an additional function when a CPU retry was necessary
because of a buffer malfunction.
When an error occurs in the buffer,
as indicated in the extended logout area, MCH updates a programmed
buffer error counter.
After a certain number of buffer errors occur,
the entire high-speed buffer is disabled and MCH notifies the operator
of this fact.
The operator can allow the system to continue running
in degraded mode, if necessary.
All CPU fetches are then made directly
to processor storage, bypassing the buffer.
Alternatively, the operator
can terminate system operations and request CE diagnosis and repair
of the buffer.
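The buffer error counter can be sketched as a simple threshold counter. The following Python model is an assumption-laden illustration: the class name BufferMonitor and its threshold parameter are hypothetical, and the text gives no actual limit beyond "a certain number" of buffer errors.

```python
class BufferMonitor:
    """Toy model of the programmed buffer error counter described in the text."""

    def __init__(self, threshold):
        # Hypothetical value: the text does not state the real limit.
        self.threshold = threshold
        self.errors = 0
        self.buffer_enabled = True

    def record_retry(self, caused_by_buffer):
        """Count one CPU retry; return an operator message if the buffer was just disabled."""
        if not caused_by_buffer or not self.buffer_enabled:
            return None
        self.errors += 1
        if self.errors >= self.threshold:
            # Degraded mode: CPU fetches bypass the buffer from here on.
            self.buffer_enabled = False
            return "high-speed buffer disabled; fetches go directly to processor storage"
        return None
```

Only retries attributed to the buffer advance the counter. Once the buffer is disabled the model stays in degraded mode, since the text describes no automatic re-enabling.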
Prior to relinquishing CPU control, MCH determines whether or not
an automatic mode switch from recording mode to quiet mode should take
place if a CPU retry or an ECC correction recovery has just occurred.
The determination of whether to switch to nonrecording (quiet) mode
is made on the basis of the number of soft machine checks of a specific
type that occur during system operation.
Error count thresholds are
maintained separately for successful CPU retry and successful processor
storage single-bit error corrections.
The IBM-supplied threshold
values can be altered when the control program is generated.
MCH switches the system to quiet mode for either ECC corrections
only (the DIAGNOSE instruction is used to change the ECC mode bit from
full recording to quiet mode) or for both CPU retry and ECC corrections
(the System Recovery mask bit is disabled).
Mode switching occurs
if the number of soft machine checks that occur during system operation
exceeds the specified error count threshold for that type (or if
SYS1.LOGREC is full).
The operator is informed of the mode switch
and can switch back to recording mode at any time thereafter.
Mode switching is intended to keep SYS1.LOGREC
from being filled with recovery reports when a recurring correctable
error condition exists that would cause many reports to be generated.
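The threshold-driven switch from recording mode to quiet mode can be modeled roughly as follows. This Python sketch makes several assumptions the text does not confirm: that exceeding the ECC threshold selects quiet mode for ECC corrections only, that exceeding the CPU retry threshold or filling SYS1.LOGREC selects quiet mode for both types, and that an operator switch back to recording mode resets the counts. The names Mode and ModeSwitcher are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    RECORDING = "recording"
    QUIET_ECC = "quiet for ECC corrections only"
    QUIET_ALL = "quiet for CPU retry and ECC corrections"


class ModeSwitcher:
    """Toy model of the per-type error count thresholds."""

    def __init__(self, retry_threshold, ecc_threshold):
        # Separate thresholds per type; the real values are set at system generation.
        self.thresholds = {"retry": retry_threshold, "ecc": ecc_threshold}
        self.counts = {"retry": 0, "ecc": 0}
        self.mode = Mode.RECORDING

    def after_recovery(self, kind, logrec_full=False):
        """Call after a successful CPU retry ("retry") or ECC correction ("ecc")."""
        self.counts[kind] += 1
        exceeded = self.counts[kind] > self.thresholds[kind]
        if (kind == "retry" and exceeded) or logrec_full:
            # Assumed mapping: corresponds to disabling the System Recovery mask bit.
            self.mode = Mode.QUIET_ALL
        elif kind == "ecc" and exceeded and self.mode is Mode.RECORDING:
            # Assumed mapping: corresponds to changing the ECC mode bit.
            self.mode = Mode.QUIET_ECC
        return self.mode

    def operator_reset(self):
        """Operator switches back to recording mode (count reset is an assumption)."""
        self.counts = {"retry": 0, "ecc": 0}
        self.mode = Mode.RECORDING
```

The comparison uses "greater than" because the text says the count must exceed the threshold before the switch occurs.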
When a System Damage hard machine check occurs (uncorrectable or
unretryable CPU error, multiple-bit processor storage error, or a
storage protect key failure), MCH determines whether the error is one
that is correctable by programming.
A multiple-bit processor storage
error or a storage protect key failure associated with CPU processing
causes control to be given to the repair portion of the program damage
assessment and repair (PDAR) routine of MCH.
PDAR can repair damaged
control program storage areas by loading a new copy of the affected
module if the module is marked reentrant and refreshable (it has been