Locks
Parallelism Exposed by Compilers on AMD ccNUMA Multiprocessor Systems
Chapter 4 Conclusions
Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ ccNUMA Multiprocessor Systems
40555 Rev. 3.00 June 2006
A.2.1 What Resources Are Used When a Single Read-Only or Write-Only Thread Accesses Remote Data?
A.2.2...
List of Figures
Figure 1. Quartet Topology
Figure 2.
Revision History
Date         Revision   Description
June 2006    3.00       Initial release.
At the same time, the SMP architecture does not scale well into larger systems with a greater number of processors. The AMD ccNUMA architecture is designed to overcome these inherent SMP performance bottlenecks. It is a mature architecture that is designed to extract greater performance potential from multiprocessor systems.
bandwidth test, it exercises both of these modes of operation. The test serves as a latency-sensitive test case when the test threads perform read-only operations and as a bandwidth-sensitive test when the test threads carry out write-only operations.
[12] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dngenlib/html/msdn_heapmm.asp
[13] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/memory/base/low_fragmentation_heap.asp
[14] http://msdn2.microsoft.com/en-us/library/tt15eb9t.aspx
[15] https://www.pathscale.com/docs/UserGuide.pdf
[16] http://docs.sun.com/source/819-3688/parallel.html
Quartet is a common way of connecting and routing the processors on other supported 4P AMD platforms. We anticipate that these results hold on other systems that are connected in a similar manner, and we expect the recommendations to carry forward to current-generation Opteron systems.
Figure 1. Quartet Topology

The term hop is commonly used to describe access distances on NUMA systems. When a thread accesses memory on the same node as that on which it is running, it is a 0-hop access or local access.
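The hop terminology can be illustrated with a small sketch that counts the links crossed between the node running a thread and the node holding its memory. The link table below is an assumption about how a four-node Quartet square is wired (each node linked to two neighbours), and `hop_count` is a hypothetical helper name; neither comes from the original figure.

```python
from collections import deque

# Assumed link layout for a 4-node Quartet (a square): each node links to two
# neighbours and reaches the remaining node through one of them.
QUARTET_LINKS = {0: (1, 2), 1: (0, 3), 2: (0, 3), 3: (1, 2)}


def hop_count(links, src, dst):
    """Return the minimum number of links crossed going from src to dst."""
    distance = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return distance[node]
        for neighbour in links[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    raise ValueError(f"node {dst} is not reachable from node {src}")


# A thread running on node 0 touching memory on each node in turn.
for memory_node in range(4):
    hops = hop_count(QUARTET_LINKS, 0, memory_node)
    print(f"thread on node 0 -> memory on node {memory_node}: {hops}-hop access")
```

Under this assumed layout, only the node diagonally opposite the thread is two hops away; every other remote node is one hop away.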
[Figure: each link is labeled 4 GB/s per direction @ 2 GHz Data Rate. HT = HyperTransport™...]
resources approach saturation. The test has two modes: read-only and write-only. When the test threads are read-only, the throughput does not stress the capacity of the system resources and, thus, the test is more sensitive to latency.
characterization of the resource behavior in the system. These recommendations, coupled with these interesting cases, provide an understanding of the low-level behavior of the system, which is crucial to the analysis of larger real-world workloads.
2.3.2 Labels Used
Each of the bars on the graph is labeled with the hop information for the thread.
2.3.3 Y-Axis Display
For the one-thread test cases on the idle system, the graphs show the time taken by a single thread, normalized to the time taken by the fastest single-thread case, in this case the time it takes a read-...
• Core 1 on node 0, node 1, node 2 and node 3 in any order
The two cores on each node of the dual-core AMD Opteron™ processor share the Northbridge resources, which include the memory controller and the physical memory that is connected to that node.
In other words, schedule using core major order first followed by node major order. For example, when scheduling threads that share data on a dual-core Quartet system, AMD recommends using the following order: •...
distance. If they are indirectly connected to each other in a 4P configuration, it is considered a 2-hop access distance. The following example, extracted from mining the results of the synthetic test case, substantiates the recommendation to keep data local.
[Figure: Time for write. Bar values: 149%, 129%, 127%, 113%...]
A ccNUMA-aware OS keeps data local on the node where first-touch occurs as long as there is enough physical memory available on that node. If not enough physical memory is available on the node, then various advanced techniques are used to determine where to place the data, depending on the OS.
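A toy model may help make the first-touch rule concrete. The sketch below is not how any particular OS implements placement: the function name `place_pages`, the page and node identifiers, and especially the fallback (pick the node with the most free pages) are assumptions made purely for illustration, since the fallback technique depends on the OS.

```python
def place_pages(touches, free_pages_per_node):
    """Assign each page to a node using a simplified first-touch rule.

    touches: iterable of (page, node_of_touching_thread) in time order.
    free_pages_per_node: dict mapping node id -> number of free pages.
    """
    free = dict(free_pages_per_node)  # work on a copy
    placement = {}
    for page, node in touches:
        if page in placement:
            continue  # only the first touch decides where a page lives
        target = node
        if free[target] == 0:
            # Hypothetical fallback: the node with the most free pages.
            # Real operating systems use their own, more advanced policies.
            target = max(free, key=free.get)
            if free[target] == 0:
                raise MemoryError("no free pages on any node")
        placement[page] = target
        free[target] -= 1
    return placement


# Node 0 has room for two pages only; page "c" spills to another node.
touches = [("a", 0), ("b", 0), ("a", 1), ("c", 0)]
print(place_pages(touches, {0: 2, 1: 4}))
```

In this model the later touch of page "a" from node 1 does not move the page, which is why the thread that will use the data most should be the one that touches it first.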
afterwards no longer needs the data structure and if only one of the worker threads needs the data structure. In other words, the data structure is not truly shared between the worker threads.
Data can often be restructured so that such cache-line sharing does not occur. Cache lines on AMD Athlon™ 64 and AMD Opteron™ processors are currently 64 bytes, but a scheme that avoids this problem, regardless of cache-line size, makes for more performance-portable code. For example, a multithreaded application should avoid using statically defined shared arrays and variables that are potentially located in a single cache line and shared between threads.
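One way to keep per-thread data out of a shared cache line is to round each thread's slot up to a whole number of cache lines. The sketch below only computes the resulting offsets; the helper names are hypothetical, the line size is a parameter rather than a hard-coded 64, and it assumes the underlying buffer itself starts on a cache-line boundary.

```python
def padded_stride(item_size, cache_line_size=64):
    """Smallest multiple of cache_line_size that can hold item_size bytes."""
    if item_size <= 0 or cache_line_size <= 0:
        raise ValueError("sizes must be positive")
    return -(-item_size // cache_line_size) * cache_line_size  # ceiling division


def slot_offsets(num_threads, item_size, cache_line_size=64):
    """Byte offset of each thread's private slot in a line-aligned buffer."""
    stride = padded_stride(item_size, cache_line_size)
    return [thread * stride for thread in range(num_threads)]


def lines_touched(offset, size, cache_line_size=64):
    """Indices of the cache lines covered by [offset, offset + size)."""
    first = offset // cache_line_size
    last = (offset + size - 1) // cache_line_size
    return set(range(first, last + 1))


# Four 8-byte counters: packed back to back versus one per cache line.
packed = [thread * 8 for thread in range(4)]
padded = slot_offsets(4, 8)
print("packed lines:", [sorted(lines_touched(o, 8)) for o in packed])
print("padded lines:", [sorted(lines_touched(o, 8)) for o in padded])
```

Passing the line size as an argument, instead of assuming 64 bytes everywhere, keeps the layout correct if the code later runs on a processor with a different cache-line size.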
• Threads firing at each other (crossfire)
The first thread runs on node 0 and writes to memory on node 1 (1 hop). The second thread runs on node 1 and writes to memory on node 0 (1 hop).
Here the same two foreground threads as before were run through the same cases: local, crossfire and no crossfire. In addition, four background threads are left running on:
•...
[Figure: VERY HIGH: Total Time for both threads (write-write). Bar values: 195%, 186%, 158%. Bar labels: 1 Hop, 1 Hop, 1 Hop, 0 Hop, 1 Hop, 0 Hop...]
[Figure: VERY HIGH: Total Time for both threads (write-write). Bar values: 216%, 202%, 156%. Bar labels: 1 Hop, 1 Hop, 1 Hop, 0 Hop, 1 Hop, 0 Hop...]
This analogy clearly communicates the performance effects of queuing time versus latency. In a computer server, with many concurrent outstanding memory requests, we would gladly incur some...
However, as shown in Figure 11 on page 31, when both threads are write-only, the 0 hop-1 hop and 0 hop-2 hop cases are faster than the 0 hop-0 hop case.
In addition, three background threads are running on nodes 1, 2 and 3. Each of these background threads accesses data locally. The rate of memory demand by each of these threads is varied simultaneously from low to medium to high to very high, as shown in Table 1 on page 16.
[Figure: Medium: Total Time for both threads (write-write). Bar values: 146%, 139%, 129%, 129%. Bar labels: 0 Hop, 0 Hop, 0 Hop, 0 Hop, 0 Hop, 1 Hop...]
[Figure: Very High: Total Time for both threads (write-write). Bar values: 169%, 158%, 158%, 147%. Bar labels: 0 Hop, 0 Hop, 0 Hop, 0 Hop, 0 Hop, 1 Hop...]
Parallelism Exposed by Compilers on AMD ccNUMA Multiprocessor Systems
Several compilers for AMD multiprocessor systems provide additional hooks to allow automatic parallelization of otherwise serial programs. Several compilers also support the OpenMP API for parallel programming. For details about support for auto parallelization and OpenMP in various compilers, see the references [4], [14], [15] and [16].
When scheduling multiple threads that mostly share data with each other on an idle dual-core AMD multiprocessor system, schedule threads on both cores of an idle node first and then move on to the next idle node and so on. In other words, schedule using core major order first followed by node major order.
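The recommended ordering can be written down as a list of (node, core) slots. The following is a minimal sketch with hypothetical helper names. The first function follows the rule stated above for threads that share data. The second is an assumption inferred from the earlier list fragment (core 1 of every node comes last) for threads that do not share data, and the source allows the nodes within a pass to be taken in any order.

```python
def shared_data_order(num_nodes, cores_per_node=2):
    """Fill every core of a node before moving on to the next node."""
    return [(node, core)
            for node in range(num_nodes)
            for core in range(cores_per_node)]


def independent_order(num_nodes, cores_per_node=2):
    """Use one core on every node before using a second core anywhere."""
    return [(node, core)
            for core in range(cores_per_node)
            for node in range(num_nodes)]


# Dual-core Quartet: 4 nodes, 2 cores per node.
print("threads sharing data:", shared_data_order(4))
print("independent threads: ", independent_order(4))
```

A scheduler or launcher script could walk either list and hand the next slot to each new thread, keeping data-sharing threads behind the same memory controller for as long as possible.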
The buffer queues constitute one such resource. The lengths of these queues are configured by the BIOS within hardware-specific limits that are specified in the BIOS and Kernel Developer's Guide for the particular processor. Following AMD recommendations, the BIOS allocates these buffers on a link-by-link basis to optimize for the most common workloads.
Appendix A
The following sections provide additional explanatory information on topics discussed in the previous sections of this document.
Description of the Buffer Queues
Figure 16 shows the internal resources in each Quartet node.
Likewise, packets to be transmitted from the MCT to the XBar are queued in the “MCT-to-XBar” buffers. The buffers in the SRI, XBar and MCT can be viewed as staggered queues on the various units.
The buffer lengths are BIOS-configurable within hardware-specific limits that are specified in the appropriate BIOS and Kernel Developer's Guide for the processor under consideration. Following AMD recommendations, the BIOS allocates these buffers on a link-by-link basis to optimize for the most common workloads.
Why Is the No Crossfire Case Slower Than the Crossfire Case on a System under a Very High Background Load (Full Subscription)?
When the threads are firing at each other (crossfire) and all other free cores are running background...
Why Is 0 Hop-1 Hop Case Slower Than 0 Hop-0 Hop Case on a System under High Background Load (High Subscription) for Write-...
Tools and APIs for Thread/Process and Memory Placement (Affinity) for AMD64 ccNUMA Multiprocessor Systems
The following sections discuss tools and APIs available for assigning thread/process and memory affinity under various operating systems.
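As a rough illustration of process affinity, the sketch below uses Python's `os.sched_setaffinity`, which is available on Linux only; it is one convenient interface among several and is not a tool named in this guide. The helper name `pin_current_process` is hypothetical, and which CPU ids belong to which node is system-specific and must be looked up on the target machine.

```python
import os


def pin_current_process(cpu_ids):
    """Try to restrict the calling process to the given CPU ids.

    Returns the resulting CPU set, or None when the platform does not
    support the call, the request is empty, or the OS refuses it.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # not Linux, or the call is unavailable
    try:
        allowed = os.sched_getaffinity(0)   # CPUs we may currently run on
        wanted = set(cpu_ids) & allowed     # never ask for a forbidden CPU
        if not wanted:
            return None
        os.sched_setaffinity(0, wanted)     # 0 means the calling process
        return os.sched_getaffinity(0)
    except OSError:
        return None


# Demonstration that leaves the process free to use every CPU it already had.
print(pin_current_process(range(os.cpu_count() or 1)))
```

To bind a worker to a single node in practice, the list passed in would hold only the CPU ids of that node's cores.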
Controlling Memory Affinity
Both numactl and libnuma library functions can be used to set memory affinity [5]. Memory affinity set by tools like numactl applies to all the data accessed by the entire program (including child processes).
The function to set memory affinity for a thread is VirtualAlloc() [9]. This function gives the developer the choice to bind memory immediately on allocation or to defer binding until first touch.
A.8.4 Node Interleaving Configuration in the BIOS
AMD Opteron™ and Athlon™ 64 ccNUMA multiprocessor systems can be configured in the BIOS to interleave all memory across all nodes on a page basis (4 KB for regular pages and 2 MB for large pages).
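To see what page-granular interleaving implies for locality, the sketch below maps an address to a node. It assumes a plain round-robin assignment of consecutive pages to nodes, which is a simplification for illustration and not a description of the actual BIOS address map; the function names are hypothetical.

```python
def interleaved_node(address, num_nodes=4, page_size=4 * 1024):
    """Node that owns the page containing address, assuming round-robin pages."""
    return (address // page_size) % num_nodes


def local_fraction(num_pages, running_node, num_nodes=4, page_size=4 * 1024):
    """Share of a contiguous, page-aligned buffer that is local to one node."""
    owners = [interleaved_node(page * page_size, num_nodes, page_size)
              for page in range(num_pages)]
    return owners.count(running_node) / num_pages


# With four nodes, a thread on node 0 finds a quarter of a large buffer local.
print([interleaved_node(page * 4096) for page in range(8)])
print(local_fraction(1024, running_node=0))
```

Under this model every thread sees the same mix of local and remote pages regardless of where it runs, which spreads traffic evenly but gives up the benefit of keeping data local.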