Software Transactional Memory Acceleration; Summary; Implementation - IBM A2 User Manual


In this example, the mtdcr is reconfiguring the I/O device in a manner that would cause the preceding store
instruction to fail, were the mtdcr to change the device before the completion of the store. Because mtdcr is
not a storage access instruction, the use of mbar instead of msync does not guarantee that the store is
performed before letting the mtdcr reconfigure the device. It only guarantees that subsequent storage
accesses are not performed to memory or any device before the earlier store.
Now consider this next example:
stb X     Store data to an I/O device at address X, causing a status bit at
          address Y to be reset.
mbar      Guarantee preceding store is performed to the device before any
          subsequent storage accesses are performed.
lbz Y     Load status from the I/O device at address Y.
Here, mbar is appropriate instead of msync, because the only requirement is that the store to the I/O device
is performed before the load. There is no need to prevent other instructions that follow the mbar from
executing before the store.

2.15 Software Transactional Memory Acceleration

2.15.1 Summary

The A2 core is augmented with support for three new instructions: ldawx (load double-word and set watch
indexed), wchkall (watch check all), and wclr (watch clear). These instructions are used to control a
monitoring facility that detects writes by other threads to watched memory locations. For more information, see
Section 12.4 Software Transactional Memory Instructions on page 509.
A thread can execute a sequence of ldawx instructions, setting watches for multiple memory locations, with
one or more wchkall operations to detect whether any of its watched locations have potentially been written
by another thread. The set of watches can then be cleared with a wclr instruction. If the number of watches
exceeds the capacity of the watch facility, subsequent wchkall instructions will conservatively indicate that
one of the watched cache blocks has been written by another thread.
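
The usage pattern described above can be sketched as a small software model. This is only an illustration under stated assumptions, not the hardware behavior or an IBM-provided API: the class WatchModel, the helper read_snapshot, the capacity parameter, and the boolean return value of wchkall are all hypothetical names and simplifications introduced here. The model tracks individual addresses rather than cache blocks, and it treats wclr as clearing both the watches and the "possibly written" indication.

```python
class WatchModel:
    """Toy software model of the watch facility (illustrative, not the hardware)."""

    def __init__(self, capacity):
        self.capacity = capacity   # hypothetical limit on watches per thread
        self.watches = {}          # thread id -> set of watched addresses
        self.lost = {}             # thread id -> True once a watch may be lost

    def ldawx(self, thread, memory, addr):
        """Load the value at addr and set a watch on it for this thread."""
        watched = self.watches.setdefault(thread, set())
        if addr not in watched and len(watched) >= self.capacity:
            # Over capacity: later checks must conservatively report a write.
            self.lost[thread] = True
        else:
            watched.add(addr)
        return memory.get(addr, 0)

    def store(self, thread, memory, addr, value):
        """Ordinary store; flags every *other* thread that watches addr."""
        memory[addr] = value
        for other, watched in self.watches.items():
            if other != thread and addr in watched:
                self.lost[other] = True

    def wchkall(self, thread):
        """Return True if no watched location may have been written."""
        return not self.lost.get(thread, False)

    def wclr(self, thread):
        """Clear this thread's watches (assumed to reset the indication too)."""
        self.watches[thread] = set()
        self.lost[thread] = False


def read_snapshot(model, thread, memory, addrs, between=None, max_attempts=8):
    """Read several locations and retry until wchkall reports no interference.

    `between` is a test hook that simulates another thread running after the
    loads of the first attempt. Returns (values, attempts), or
    (None, max_attempts) if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        model.wclr(thread)
        values = [model.ldawx(thread, memory, a) for a in addrs]
        if between is not None and attempt == 1:
            between()
        if model.wchkall(thread):
            model.wclr(thread)
            return values, attempt
    return None, max_attempts
```

In this sketch, a reader whose watch set exceeds the capacity never passes the check, which mirrors the conservative behavior described above; real software would need a fallback path for that case.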

2.15.2 Implementation

Three user-level instructions interact with a set of watch bits associated with the L1 D-cache. One bit per
thread per cache block is added to the L1 D-cache to capture the common-case working set of watches.
The ldawx and wchkall instructions are performance critical. These instructions are fully pipelined, with
performance similar to conventional load instructions.
The wclr instruction is less performance critical. When performed with a nonzero EA, the wclr instruction
should be performed sequentially with respect to other memory operations to the same location. When
performed with an EA of 0, wclr must simply complete before any subsequent ldawx instruction is able to
complete (that is, gating ldawx instructions at dispatch pending a wclr instruction should be sufficient). When
wclr is executed with EA = 0, a signal is raised to the L1 D-cache indicating that all of the watch bits should
be flash cleared, and the watchlost sticky bit for the thread performing the wclr should be set to the L value
from the instruction.
When wchkall is executed, the watchlost sticky bit (part of the L1 D-cache, see Section 2.15.2.1)
corresponding to the executing thread is probed, and the CR is updated appropriately.
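
The interaction between the per-block watch bits and the per-thread watchlost sticky bit can be illustrated with a minimal bit-level model. It is a sketch under several assumptions, none of which are stated in this section: a 64-byte cache block (BLOCK_BYTES), four threads (NUM_THREADS), unlimited watch capacity, and a reading of "flash cleared" as clearing only the watch bits of the thread that executes wclr. The class L1WatchBits and its method names are hypothetical. The CR update performed by wchkall is not modeled; the method simply returns the probed bit. Only the EA = 0 form of wclr is modeled.

```python
BLOCK_BYTES = 64   # assumed cache block size for this sketch
NUM_THREADS = 4    # assumed number of hardware threads


class L1WatchBits:
    """Illustrative model: one watch bit per thread per cache block."""

    def __init__(self):
        self.watch = {}                          # block number -> per-thread bits
        self.watchlost = [False] * NUM_THREADS   # one sticky bit per thread

    def ldawx(self, thread, ea):
        """Set the executing thread's watch bit on the block containing ea."""
        bits = self.watch.setdefault(ea // BLOCK_BYTES, [False] * NUM_THREADS)
        bits[thread] = True

    def store(self, thread, ea):
        """A store sets watchlost for every other thread watching the block."""
        bits = self.watch.get(ea // BLOCK_BYTES)
        if bits is None:
            return
        for t in range(NUM_THREADS):
            if t != thread and bits[t]:
                self.watchlost[t] = True   # sticky: stays set until wclr

    def wclr_all(self, thread, l_value):
        """wclr with EA = 0: clear watch bits, load watchlost from L.

        Assumption: only the executing thread's watch bits are cleared.
        """
        for bits in self.watch.values():
            bits[thread] = False
        self.watchlost[thread] = bool(l_value)

    def wchkall(self, thread):
        """Probe the executing thread's watchlost sticky bit."""
        return self.watchlost[thread]
```

Because the watch is tracked per cache block in this model, a store by another thread to any byte of a watched block sets watchlost, even if the store does not touch the bytes that were actually loaded.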
Version 1.3
October 23, 2012
User's Manual
A2 Processor
CPU Programming Model
Page 125 of 864
