AMD Athlon Processor x86 Optimization Manual page 108

X86 code optimization
Table of Contents

Advertisement

AMD Athlon™ Processor x86 Code Optimization
Step 3
Step 4
92
For the first time, the value in each k-bit field is small enough
that adding two k-bit fields results in a value that still fits in the
k-bit field. Thus the following computation is performed:
y = (x + (x >> 4)) & 0x0F0F0F0F
The result is four 8-bit fields whose lower half has the desired
sum and whose upper half contains "junk" that has to be
masked out. In a symbolic form:
x
= 0aaa0bbb0ccc0ddd0eee0fff0ggg0hhh
x >> 4 = 00000aaa0bbb0ccc0ddd0eee0fff0ggg
sum
= 0aaaWWWWiiiiXXXXjjjjYYYYkkkkZZZZ
The W W W W, X X X X , Y Y Y Y, a n d Z Z Z Z va l u e s a re t he
interesting sums with each at most 1000b, or 8 decimal.
The four 4-bit sums can now be rapidly accumulated by means
of a multiply with a "magic" multiplier. This can be derived
from looking at the following chart of partial products:
0p0q0r0s * 01010101 =
:0p0q0r0s
0p:0q0r0s
0p0q:0r0s
0p0q0r:0s
000pxxww:vvuutt0s
Here p, q, r, and s are the 4-bit sums from the previous step, and
vv is the final result in which we are interested. Thus, the final
result:
z = (y * 0x01010101) >> 24
Example:
unsigned int popcount(unsigned int v)
{
unsigned int retVal;
__asm {
MOV
EAX, [v]
MOV
EDX, EAX
SHR
EAX, 1
AND
EAX, 055555555h ;(v >> 1) & 0x55555555
SUB
EDX, EAX
MOV
EAX, EDX
SHR
EDX, 2
AND
EAX, 033333333h ;w & 0x33333333
AND
EDX, 033333333h ;(w >> 2)
;v
;v
;v >> 1
;w = v - ((v >> 1) & 0x55555555)
;w
;w >> 2
& 0x33333333
Efficient Implementation of Population Count Function
22007E/0—November 1999

Advertisement

Table of Contents
loading

Table of Contents