AMD Athlon 64 Manual

Performance guidelines for multiprocessor systems

Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ ccNUMA Multiprocessor Systems

Application Note

Publication # 40555
Revision: 3.00
Issue Date: June 2006

Summary of Contents for AMD Athlon 64

Page 2 AMD’s product could create a situation where personal injury, death, or severe property or environmental damage may occur.
Page 3: Table Of Contents
Locks, 34
Parallelism Exposed by Compilers on AMD ccNUMA Multiprocessor Systems, 35
Chapter 4 Conclusions ...
Page 4
A.2.1 What Resources Are Used When a Single Read-Only or Write-Only Thread Accesses Remote Data?, 40
A.2.2...
Page 5
List of Figures
Figure 1. Quartet Topology, 14
Figure 2.
Page 7: Revision History
Date       Revision  Description
June 2006  3.00      Initial release.
Page 9: Chapter 1 Introduction
At the same time, the SMP architecture does not scale well into larger systems with a greater number of processors. The AMD ccNUMA architecture is designed to overcome these inherent SMP performance bottlenecks. It is a mature architecture that is designed to extract greater performance potential from multiprocessor systems.
Page 10: Related Documents
bandwidth test, it exercises both of these modes of operation. The test serves as a latency-sensitive test case when the test threads perform read-only operations and as a bandwidth-sensitive test when the test threads carry out write-only operations.
Page 11
[12] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dngenlib/html/msdn_heapmm.asp
[13] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/memory/base/low_fragmentation_heap.asp
[14] http://msdn2.microsoft.com/en-us/library/tt15eb9t.aspx
[15] https://www.pathscale.com/docs/UserGuide.pdf
[16] http://docs.sun.com/source/819-3688/parallel.html
Page 13: Experimental Setup
Quartet is a common way of connecting and routing the processors on other supported 4P AMD platforms. We anticipate that these results should hold on other systems that are connected in a similar manner and we expect the recommendations to carry forward on the current generation Opteron systems.
Page 14: Figure 1. Quartet Topology
Figure 1. Quartet Topology
The term hop is commonly used to describe access distances on NUMA systems. When a thread accesses memory on the same node as that on which it is running, it is a 0-hop access or local access.
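The hop terminology can be made concrete with a short sketch. The link layout below is an illustrative assumption of a four-node square, not a transcription of Figure 1, and the names QUARTET_LINKS and hop_distance are hypothetical.

```python
from collections import deque

# Assumed link layout for a four-node square system (illustrative only;
# the real wiring is defined by the platform, see Figure 1 of the note).
QUARTET_LINKS = {0: (1, 2), 1: (0, 3), 2: (0, 3), 3: (1, 2)}


def hop_distance(cpu_node, mem_node, links=QUARTET_LINKS):
    """Return the number of links crossed from cpu_node to mem_node.

    Uses a breadth-first search, so the result is the shortest path.
    0 means a local access, 1 a directly connected node, and so on.
    """
    seen = {cpu_node}
    queue = deque([(cpu_node, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == mem_node:
            return hops
        for neighbor in links[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    raise ValueError("nodes are not connected")
```

Under this assumed layout, a thread on node 0 sees node 0 as 0 hops, nodes 1 and 2 as 1 hop, and node 3 as 2 hops.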
Page 15: Synthetic Test
[Figure: each HyperTransport™ link is labeled 4 GB/s per direction @ 2 GHz Data Rate.]
HT = HyperTransport™...
Page 16
resources approach saturation. The test has two modes: read-only and write-only. When the test threads are read-only, the throughput does not stress the capacity of the system resources and, thus, the test is more sensitive to latency.
Page 17: Reading And Interpreting Test Graphs
characterization of the resource behavior in the system. These recommendations, coupled with these interesting cases, provide an understanding of the low-level behavior of the system, which is crucial to the analysis of larger real-world workloads.
Page 18: Labels Used
2.3.2 Labels Used
Each of the bars on the graph is labeled with the hop information for the thread.
2.3.3 Y-Axis Display
For the one-thread test cases on the idle system, the graphs show the time taken by a single thread, normalized to the time taken by the fastest single-thread case, in this case the time it takes a read-...
Page 19: Chapter 3 Analysis And Recommendations
Core 1 on node 0, node 1, node 2 and node 3 in any order
The two cores on each node of the dual-core AMD Opteron™ processor share the Northbridge resources, which include the memory controller and the physical memory that is connected to that node.
Page 20: Multiple Threads-Shared Data
In other words, schedule using core major order first followed by node major order. For example, when scheduling threads that share data on a dual-core Quartet system, AMD recommends using the following order: •...
Page 21: Away On An Idle System
distance. If they are indirectly connected to each other in a 4P configuration, it is considered a 2 hop access distance. The following example, extracted from mining the results of the synthetic test case, substantiates the recommendation to keep data local.
Page 22: Keeping Data Local By Virtue Of First Touch
[Figure: "Time for write" bar chart. Recoverable bar values: 149%, 129%, 127%, 113%.]
Page 23: Data Placement Techniques To Alleviate Unnecessary Data Sharing
A ccNUMA-aware OS keeps data local on the node where first-touch occurs as long as there is enough physical memory available on that node. If enough physical memory is not available on the node, then various advanced techniques are used to determine where to place the data, depending on the OS.
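A toy model may help show why first touch matters. The sketch below is not an OS implementation; the function first_touch_placement and its arguments are hypothetical, and pages whose first-touch node is full are simply reported as None because the fallback policy is OS-specific.

```python
def first_touch_placement(touches, free_pages):
    """Toy model: place each page on the node of the thread that touches it first.

    touches    -- iterable of (page, node) pairs in program order
    free_pages -- dict mapping node -> number of free pages on that node
    A page whose first-touch node has no free memory maps to None, since
    the fallback placement depends on the OS and is not modeled here.
    """
    free = dict(free_pages)
    placement = {}
    for page, node in touches:
        if page in placement:
            continue  # only the first touch decides where the page lives
        if free.get(node, 0) > 0:
            free[node] -= 1
            placement[page] = node
        else:
            placement[page] = None
    return placement


free = {node: 16 for node in range(4)}

# A main thread on node 0 initializes all eight pages; workers on
# nodes 0..3 only touch them afterwards.
serial_init = [(p, 0) for p in range(8)] + [(p, p % 4) for p in range(8)]
serial_placement = first_touch_placement(serial_init, free)

# Each worker initializes the pages it will later use.
parallel_init = [(p, p % 4) for p in range(8)]
parallel_placement = first_touch_placement(parallel_init, free)
```

In this model, serial initialization leaves every page on node 0, so workers on nodes 1 to 3 access remote memory, while initialization by the eventual user keeps each page local to its worker.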
Page 24
afterward no longer needs the data structure and if only one of the worker threads needs the data structure. In other words, the data structure is not truly shared between the worker threads.
Page 25: Avoid Cache Line Sharing
Data can often be restructured so that such cache-line sharing does not occur. Cache lines on AMD Athlon™ 64 and AMD Opteron™ processors are currently 64 bytes, but a scheme that avoids this problem, regardless of cache-line size, makes for more performance-portable code. For example, a multithreaded application should avoid using statically defined shared arrays and variables that are potentially located in a single cache line and shared between threads.
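The effect of data layout on cache-line sharing can be checked with simple arithmetic. This is a minimal sketch that assumes the per-thread array starts on a cache-line boundary; the helper names are hypothetical, and the 64-byte line size is the value the text gives for these processors.

```python
CACHE_LINE_BYTES = 64  # line size stated in the text for these processors


def cache_line_of(byte_offset, line_bytes=CACHE_LINE_BYTES):
    """Index of the cache line that contains byte_offset."""
    return byte_offset // line_bytes


def lines_touched(num_threads, stride_bytes, line_bytes=CACHE_LINE_BYTES):
    """Cache line used by each thread's slot when slots are stride_bytes apart.

    Assumes the first slot starts on a cache-line boundary.
    """
    return [cache_line_of(t * stride_bytes, line_bytes) for t in range(num_threads)]


packed = lines_touched(4, 8)    # four 8-byte counters stored back to back
padded = lines_touched(4, 64)   # one counter per 64-byte slot
```

With 8-byte slots all four counters land in line 0 and therefore share one cache line; with 64-byte slots each counter gets its own line. Passing a different line_bytes shows how a layout behaves for another line size, which is the portability concern raised above.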
Page 26: Figure 6. Crossfire 1 Hop-1 Hop Case Vs No Crossfire 1 Hop-1 Hop Case On An Idle System
• Threads firing at each other (crossfire): The first thread runs on node 0 and writes to memory on node 1 (1 hop). The second thread runs on node 1 and writes to memory on node 0 (1 hop).
Page 27: Figure 7. Crossfire 1 Hop-1 Hop Case Vs No Crossfire 1 Hop-1 Hop Case Under A
Here the same two foreground threads as before were run through the same cases as before: local, crossfire and no crossfire. In addition, four background threads are left running on: •...
Page 28: Figure 8. Crossfire 1 Hop-1 Hop Case Vs No Crossfire 1 Hop-1 Hop Case Under A Very High Background Load (High Subscription)
[Figure: "VERY HIGH: Total Time for both threads (write-write)". Recoverable bar values: 195%, 186%, 158%. Bars are labeled by hop combination (0 Hop, 1 Hop).]
Page 29: Myth: Greater Hop Distance Always Means Slower Time
[Figure: "VERY HIGH: Total Time for both threads (write-write)". Recoverable bar values: 216%, 202%, 156%. Bars are labeled by hop combination (0 Hop, 1 Hop).]
Page 30: Figure 10. Both Read-Only Threads Running On Node 0 (Different Cores) On An Idle System
This analogy clearly communicates the performance effects of queuing time versus latency. In a computer server, with many concurrent outstanding memory requests, we would gladly incur some...
Page 31: Figure 11. Both Write-Only Threads Running On Node 0 (Different Cores) On An Idle System
However, as shown in Figure 11 on page 31, when both threads are write-only, the 0 hop-1 hop and 0 hop-2 hop cases are faster than the 0 hop-0 hop case.
Page 32: Figure 12. Both Write-Only Threads Running On Node 0 (Different Cores) Under
In addition, three background threads are running on nodes 1, 2 and 3. Each of these background threads accesses data locally. The rate of memory demand by each of these threads is varied simultaneously from low to medium to high to very high as shown in Table 1 on page 16.
Page 33: Figure 13. Both Write-Only Threads Running On Node 0 (Different Cores) Under Medium
[Figure: "Medium: Total Time for both threads (write-write)". Recoverable bar values: 146%, 139%, 129%, 129%. Bars are labeled by hop combination (0 Hop, 1 Hop, ...).]
Page 34: Locks
[Figure: "Very High: Total Time for both threads (write-write)". Recoverable bar values: 169%, 158%, 158%, 147%. Bars are labeled by hop combination (0 Hop, 1 Hop, ...).]
Page 35: Parallelism Exposed By Compilers On Amd Ccnuma Multiprocessor Systems
Several compilers for AMD multiprocessor systems provide additional hooks to allow automatic parallelization of otherwise serial programs. Several compilers also support the OpenMP API for parallel programming. For details about support for auto-parallelization and OpenMP in various compilers, see references [4], [14], [15] and [16].
Page 37: Conclusions
When scheduling multiple threads that mostly share data with each other on an idle dual-core AMD multiprocessor system, schedule threads on both cores of an idle node first and then move on to the next idle node and so on. In other words, schedule using core major order first followed by node major order.
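The two placement orders can be enumerated explicitly. This is a minimal sketch: the function names and the (node, core) tuple representation are illustrative assumptions, and the node-major variant is shown only as the contrasting order in which one core on every node is used before any second core.

```python
def core_major_order(nodes, cores_per_node):
    """Fill every core of a node before moving to the next node.

    This is the order described above for threads that mostly share data.
    """
    return [(node, core) for node in range(nodes) for core in range(cores_per_node)]


def node_major_order(nodes, cores_per_node):
    """Use core 0 on every node first, then core 1 on every node, and so on.

    Shown for contrast: threads are spread across nodes before cores are doubled up.
    """
    return [(node, core) for core in range(cores_per_node) for node in range(nodes)]
```

For a dual-core four-node system, core_major_order(4, 2) begins (0, 0), (0, 1), (1, 0), while node_major_order(4, 2) begins (0, 0), (1, 0), (2, 0), (3, 0), (0, 1).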
Page 38
The buffer queues constitute one such resource. The lengths of these queues are configured by the BIOS with some hardware-specific limits that are specified in the BIOS and Kernel Developer's Guide for the particular processor. Following AMD recommendations, the BIOS allocates these buffers on a link-by-link basis to optimize for the most common workloads.
Page 39: Appendix A
The following sections provide additional explanatory information on topics discussed in the previous sections of this document.
Description of the Buffer Queues
Figure 16 shows the internal resources in each Quartet node.
Page 40: What Resources Are Used When A Single Read-Only Or Write-Only Thread Accesses Remote Data
Likewise, packets to be transmitted from the MCT to the XBar are queued in the “MCT-to-XBar” buffers. The buffers in the SRI, XBar and MCT can be viewed as staggered queues on the various units.
Page 41: What Role Do Buffers Play In The Throughput Observed
The buffer lengths are BIOS configurable within some hardware-specific limits that are specified in the appropriate BIOS and Kernel Developer's Guide for the processor under consideration. Following AMD recommendations, the BIOS allocates these buffers on a link-by-link basis to optimize for the most common workloads.
Page 42: Why Is The No Crossfire Case Slower Than The Crossfire Case On A System Under A Very High Background Load (Full Subscription)
When the threads are firing at each other (crossfire) and all other free cores are running background...
Page 43: Why Is 0 Hop-1 Hop Case Slower Than 0 Hop-0 Hop Case On A System Under High Background Load (High Subscription) For Write-Only Threads
Page 44: Multiprocessor Systems
Tools and APIs for Thread/Process and Memory Placement (Affinity) for AMD64 ccNUMA Multiprocessor Systems
The following sections discuss tools and APIs available for assigning thread/process and memory affinity under various operating systems.
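As one illustration of process affinity, the sketch below uses Python's standard-library wrappers around the Linux scheduler affinity calls. This is a swapped-in example rather than one of the tools the application note itself describes; it is Linux-only, the helper name pin_current_process is hypothetical, and which CPU ids belong to which node is system-specific and not determined here.

```python
import os


def pin_current_process(cpus):
    """Restrict the calling process to the given CPU ids (Linux only).

    Returns the affinity set reported by the OS after the change.
    The mapping of CPU ids to NUMA nodes is not handled here.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("os.sched_setaffinity is not available on this platform")
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return os.sched_getaffinity(0)
```

A caller would typically read the current set with os.sched_getaffinity(0) first, choose a subset of CPUs on the desired node, and restore the original set when done.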
Page 45: Support Under Solaris
Controlling Memory Affinity
Both numactl and libnuma library functions can be used to set memory affinity [5]. Memory affinity set by tools like numactl applies to all the data accessed by the entire program (including child processes).
Page 46: Tools And Apis For Node Interleaving In Various Oss For Amd64 Ccnuma
The function to set memory affinity for a thread is VirtualAlloc() [9]. This function gives the developer the choice to bind memory immediately on allocation or to defer binding until first touch.
Page 47: Node Interleaving Configuration In The Bios
A.8.4 Node Interleaving Configuration in the BIOS
AMD Opteron™ and Athlon™ 64 ccNUMA multiprocessor systems can be configured in the BIOS to interleave all memory across all nodes on a page basis (4 KB for regular pages and 2 MB for large pages).
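Page-granular interleaving can be pictured as a mapping from page number to node. The sketch below assumes a simple round-robin assignment starting at node 0, which is an illustrative assumption; the actual mapping is determined by the BIOS and hardware. The name interleaved_node is hypothetical.

```python
PAGE_BYTES = 4 * 1024               # regular page size named in the text
LARGE_PAGE_BYTES = 2 * 1024 * 1024  # large page size named in the text


def interleaved_node(phys_addr, num_nodes, page_bytes=PAGE_BYTES):
    """Node that owns phys_addr under an assumed round-robin page interleave."""
    return (phys_addr // page_bytes) % num_nodes
```

Under this assumption, on a four-node system consecutive 4 KB pages land on nodes 0, 1, 2, 3, 0, and so on, so a large sequential buffer is spread evenly across all memory controllers regardless of which thread touches it first.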