2.4.6.2 Write Errors
Write data operations involve a minimum of two nodes. The commander
issues the command and transmits the data. A memory node acknowledges
as the slave, provides the timing for the data transaction, and receives
the data. All other nodes check to see if their cache is sharing the
data and may assert TLSB_SHARED. Nodes that assert TLSB_SHARED
may also receive the data and check it for errors, or they may invalidate
the block in their cache.
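
The roles in a write transaction can be illustrated with a toy model. This is only a sketch: the `CacheNode` class, the `update_on_share` switch, and the `bus_write` function are hypothetical names invented for illustration, and the model ignores bus timing, arbitration, and error checking entirely.

```python
from dataclasses import dataclass, field


@dataclass
class CacheNode:
    """Hypothetical snooping node; not a model of real TLSB hardware."""
    name: str
    cache: dict = field(default_factory=dict)  # address -> cached data
    update_on_share: bool = True               # assumed policy switch

    def snoop_write(self, addr, data):
        """Return True if this node would assert TLSB_SHARED for addr."""
        if addr not in self.cache:
            return False
        if self.update_on_share:
            self.cache[addr] = data   # receive the data and keep the block
        else:
            del self.cache[addr]      # invalidate the block instead
        return True


def bus_write(memory, snoopers, addr, data):
    """Commander write: memory receives the data, other nodes snoop it."""
    memory[addr] = data  # the memory node acts as slave and stores the data
    return [node.name for node in snoopers if node.snoop_write(addr, data)]
```

In this sketch, a node that holds the block either takes the new data or drops its copy, and in both cases it counts as having asserted TLSB_SHARED, as the paragraph above describes.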
Uncorrectable write errors are usually fatal to a CPU and result in a crash.
By the time the CPU learns of the write error, it has lost the context of
where the data came from. When a CPU writes data, the data is written
into cache. Sometime later the data gets evicted from the cache because
the cache block is needed for another address, and only then is it written
to the bus, where the error can be detected.
Correctable write errors should cause no harm to the system. But leaving a
memory location written with a single-bit error may result in an unknown
number of correctable read errors depending on how many times the location
is read before it is written again. A CPU will most likely read and rewrite
this data location to correct the data in memory. If write errors are
corrected, read errors from memory can be treated as memory failures.
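
The read-and-rewrite correction can be sketched as follows. The `ToyEccMemory` class and the `scrub` function are hypothetical and exist only for this illustration; the class assumes that ECC corrects data on the way out of memory while leaving the stored copy untouched, so a bad location keeps reporting correctable errors until it is rewritten.

```python
class ToyEccMemory:
    """Hypothetical stand-in for ECC-protected memory (not TLSB hardware)."""

    def __init__(self):
        self._data = {}            # address -> stored value
        self._single_bit = set()   # addresses holding a single-bit error

    def write(self, addr, value, single_bit_error=False):
        # A clean write replaces the stored value and clears any latent error.
        self._data[addr] = value
        if single_bit_error:
            self._single_bit.add(addr)
        else:
            self._single_bit.discard(addr)

    def read(self, addr):
        # Returns (corrected value, whether a correctable error was seen).
        # Storage is left untouched, so the error repeats on every read.
        return self._data[addr], addr in self._single_bit


def scrub(memory, addr):
    """Read a location and write the corrected value back if needed."""
    value, corrected = memory.read(addr)
    if corrected:
        memory.write(addr, value)
    return corrected
```

Once a location has been scrubbed this way, a later correctable read error at the same address no longer traces back to the earlier write, which is why it can be treated as a memory failure.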
A commander does not always set error bits due to a write error. The
commander receives the TLSB_DATA_ERROR signal from one or more nodes
that received the data with errors. The assertion of TLSB_DATA_ERROR
tells the commander to set <DTDE> in its TLBER register, indicating that
it transmitted the data, and to take any other appropriate action to inform
the requester (for example, the CPU). The error registers in all nodes must
be examined to determine the extent of the error.
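
The register checks that this section lists for localizing a write error can be sketched as a small decision function. Everything below is an assumption for illustration: `NodeErrors` and `diagnose_write_error` are hypothetical names, TLBER contents are represented as a set of already-decoded flag names rather than real bit positions, and "no error bits set in the commander" is read as none of CWDE, UDE, or TCE being set (DTDE only records that the node transmitted the data).

```python
from dataclasses import dataclass, field

# Assumption: these are the flags that count as "error bits" in the commander.
DATA_ERROR_FLAGS = {"CWDE", "UDE", "TCE"}


@dataclass
class NodeErrors:
    """Decoded TLBER flag names for one node (no bit positions assumed)."""
    name: str
    tlber: set = field(default_factory=set)


def diagnose_write_error(commander, receivers):
    """Return a list of human-readable findings for one write error."""
    notes = []
    cmd = commander.tlber

    if cmd & {"CWDE", "UDE"}:
        # The DSn bits say which TLESRn registers deserve a closer look.
        ds_bits = sorted(flag for flag in cmd if flag.startswith("DS"))
        notes.append(f"{commander.name}: examine TLESRn selected by {ds_bits}")
        if "TCE" in cmd:
            notes.append(f"{commander.name}: failed while driving data onto the bus")
        else:
            notes.append(f"{commander.name}: data corrupted inside the commander")

    commander_clean = not (cmd & DATA_ERROR_FLAGS)
    if commander_clean:
        notes.append(f"{commander.name}: transmit checks passed; suspect bus or receiver")

    for node in receivers:
        if "CWDE" in node.tlber:
            notes.append(f"{node.name}: single-bit error received; data can be rewritten")
        if "UDE" in node.tlber:
            notes.append(f"{node.name}: multiple-bit error received")
        if commander_clean and node.tlber & {"CWDE", "UDE"}:
            notes.append(f"{node.name}: likely receiver problem")
    return notes
```

For example, calling `diagnose_write_error(NodeErrors("cpu0", {"DTDE"}), [NodeErrors("mem0", {"CWDE"})])` yields findings that point at the receiving node rather than the commander.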
1. If the commander has <CWDE> or <UDE> set in the TLBER register,
   analysis of the TLESRn registers is necessary to learn more. Which of
   the four TLESRn registers to look at can be determined by which DSn
   bits are set in the TLBER register. If <TCE> is set, the commander
   failed while writing the data to the bus. This is most likely a failure
   on the module, but could also be the result of another node driving
   data at the same time or a bus failure. If <TCE> is not set, the data
   corruption happened in the commander node.
2. If no error bits are set in the commander, the transmit checks passed
   on the data and check bits. This is a good indication that data
   corruption occurred somewhere on the bus or in a receiving node.
3. Each receiving node with <CWDE> set received the data with a
   single-bit error. A memory node wrote the data into storage and also
   latched the address in the TLFADRn registers. The data can be
   rewritten. If the commander has no error bits set, the receiving node
   most likely has receiver problems.
4. Each receiving node with <UDE> set received the data with multiple
   bit errors. A memory node wrote the data into storage and also
   latched the address in the TLFADRn registers. If the commander has
   no error bits set, the receiving node most likely has receiver problems.
Correctable write data error interrupts may be disabled. This is usually
done after the system has logged a number of these errors and could stop
logging them, but software prefers to continue collecting error information.
The system can continue to operate reliably while software polls for error
information because the data will be corrected and multiple bit errors will