IBM POWER7 Optimization and Tuning Manual, page 48

Cache                       POWER7                                POWER7+
L3 cache:
  Capacity/associativity    On-Chip, 4 MB/core, 8-way             On-Chip, 10 MB/core, 8-way
  Bandwidth                 16 B reads and 16 B writes per cycle  16 B reads and 16 B writes per cycle
Optimizing for cache geometry
There are several ways to optimize for cache geometry, as described in this section.

Splitting structures into hot and cold elements

A technique for optimizing applications to take advantage of cache is to lay out data
structures so that fields that have a high rate of reference (that is, hot) are grouped, and fields
that have a relatively low rate of reference (that is, cold) are grouped.^19 The concept is to
place the hot elements into the same byte region of memory, so that when they are pulled into
the cache, they are co-located in the same cache line or lines. Additionally, because hot
elements are referenced often, they are likely to stay in the cache. Likewise, the cold
elements are in the same area of memory and end up in the same cache lines, so that
being written out to main storage and discarded causes less of a performance degradation.
This situation occurs because they have a much lower rate of access. Power Systems use
128-byte cache lines. Compared to Intel processors (64-byte cache lines), these larger
cache lines have the advantage of increasing the reach possible with the same size cache
directory, and of improving the efficiency of the cache by covering up to 128 bytes of hot data
in a single line. However, they also have the implication of potentially bringing more data into
the cache than needed for fine-grained accesses (that is, accesses of less than 64 bytes).
As described in Eliminate False Sharing, Stop your CPU power from invisibly going down the
drain,^20 it is also important to carefully assess the impact of this strategy, especially when it is
applied to systems where there are a high number of CPU cores and a phenomenon referred
to as false sharing can occur.^21 False sharing occurs when multiple data elements that can
otherwise be accessed independently are in the same cache line. For example, if two different
hardware threads want to update (store) two different words in the same cache line, only
one of them at a time can gain exclusive access to the cache line to complete the store. This
situation results in:

- Cache line transfers between the processors where those threads are running
- Stalls in other threads that are waiting for the cache line
- All but the most recent thread to update the line being left without a copy in their cache

This effect is compounded as the number of application threads that share the cache line
(that is, threads that are using different data in the cache line under contention) is scaled
upwards. The discussion about cache sharing in^22 also presents techniques for
analyzing false sharing and suggestions for addressing the phenomenon.
^19 Splitting Data Objects to Increase Cache Utilization (Preliminary Version, 9 October 1998), available at:
http://www.ics.uci.edu/%7Efranz/Site/pubs-pdf/ICS-TR-98-34.pdf
^20 Eliminate False Sharing, Stop your CPU power from invisibly going down the drain, available at:
http://drdobbs.com/goparallel/article/showArticle.jhtml?articleID=217500206
^21 Ibid.
^22 Ibid.
POWER7 and POWER7+ Optimization and Tuning Guide
