Intel XScale® Core Developer's Manual, Optimization Guide (January 2004)

A.4 Cache and Prefetch Optimizations

This section considers how to use the various cache memories in all their modes, then examines
when and how to use prefetch to improve execution efficiency.
A.4.1 Instruction Cache

The Intel XScale® core has separate instruction and data caches. Only fetched instructions are held
in the instruction cache, even though both data and instructions may reside within the same memory
space. Functionally, the instruction cache is either enabled or disabled, and there is no
performance benefit in not using it. The one exception is that code which locks other code into the
instruction cache must itself execute from non-cached memory.
A.4.1.1. Cache Miss Cost

The Intel XScale® core's performance is highly dependent on reducing the cache miss rate. Refer to
the Intel XScale® core implementation option section of the ASSP architecture specification for
more information on the cycle penalty associated with cache misses. Note that this cycle penalty
becomes significant when the core is running much faster than external memory. In that case,
executing non-cached instructions severely curtails the processor's performance, so it is very
important to do everything possible to minimize cache misses.
A.4.1.2. Round-Robin Replacement Cache Policy

Both the data and the instruction caches use a round-robin replacement policy to evict a cache line.
The simple consequence of this is that, in any non-trivial program, every line will eventually be
evicted. The less obvious consequence is that it is very difficult to predict when, and over which
cache lines, evictions take place. This information must be gained by experimentation using
performance profiling.
A.4.1.3. Code Placement to Reduce Cache Misses

Code placement can greatly affect cache misses. One way to view the cache is as 32 sets of 32 bytes,
which span an address range of 1024 bytes. When running, the code maps onto these 32 sets
modulo 1024 of cache space. Any sets that are overused will thrash the cache. The ideal situation
is for the software tools to distribute the code with temporal evenness over this space.
This is very difficult, if not impossible, for a compiler to do. Most of the input needed to best
estimate how to distribute the code will come from profiling, followed by compiler-based two-pass
optimizations.