Brent Elliott is my brother-in-law and a friend from way back, and he works at Intel now.
He really helped me out with using some basic SSE instructions, particularly prefetch, under VC6.
Now of course, the "right way" to do this stuff is to actually get the VTune compiler
and use that, but that makes it hard for "everybody else" to compile your code.
It's rumored that Microsoft will release SSE support one of these days, but
as of now, there is none.
As a result, I got Brent to show me how to do this stuff manually. It's a lot of work,
but it's worth it for super-critical loops that you're writing in assembly. In my
imaging library, there are about 3-4 routines that use prefetch, for about a 30% overall
increase in speed on a P3. From what I can tell, this causes only a tiny slowdown on a P2,
so I leave it in for those machines too.
I have a tiled compositor routine that can read from main memory (not cache)
and do 38MPixels in software using P3 w/ prefetch. (Writing to L1, of course.)
That's just really fast!
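For readers who just want to see the shape of a prefetching loop, here is a minimal sketch in portable C. It is not the VC6 approach described on this page: it assumes a modern GCC or Clang and uses their `__builtin_prefetch` builtin (falling back to a no-op elsewhere) in place of hand-emitted bytes. The names `PREFETCH_NTA`, `CACHE_LINE`, `PREFETCH_AHEAD` and `sum_bytes_prefetch` are made up for illustration, and the 32-byte line size is the P3 value implied by "64 bytes = 2 cache lines" in the email.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE     32                 /* assumed P3 cache line size in bytes */
#define PREFETCH_AHEAD (2 * CACHE_LINE)   /* stay two cache lines ahead */

/* Assumption: GCC/Clang builtin; rw = 0 (read), locality = 0 (non-temporal). */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_NTA(p) __builtin_prefetch((p), 0, 0)
#else
#define PREFETCH_NTA(p) ((void)0)         /* no-op on other compilers */
#endif

/* Sum a byte buffer, hinting the data two cache lines ahead once per line. */
static uint32_t sum_bytes_prefetch(const uint8_t *src, size_t n)
{
    uint32_t total = 0;
    size_t i;

    assert(src != NULL || n == 0);
    for (i = 0; i < n; i++) {
        /* Issue one hint at the start of each cache line, staying in bounds. */
        if ((i % CACHE_LINE) == 0 && i + PREFETCH_AHEAD < n)
            PREFETCH_NTA(src + i + PREFETCH_AHEAD);
        total += src[i];
    }
    return total;
}
```

The prefetch is only a hint, so the function returns the same sum with or without it; whether it helps depends on the CPU and on how much work each iteration does.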
Anyway, here's the email from Brent that started it all:
From: "Elliott, Brent J"
To: "'Michael Herf'"
Subject: RE: prefetch macros...
Date: Tuesday, January 18, 2000 7:04 PM
As a quick summary: suppose you are interested only in NTA, with the memory
address in EAX, prefetching 2 cache lines ahead of the EAX value. This is
the case where EAX is your counter in a for loop that points to the memory
address you are calculating with in the current iteration. You will thus be
prefetching the cache line you will be operating on two iterations from now.
Your prefetch macro will be "0F 18 40 40",
which comes out to
#define prefetchNTA_EAX __asm _emit 0x0F __asm _emit 0x18 __asm _emit 0x40 __asm _emit 0x40
Then you can use the macro anytime you want to prefetch 2 cache lines ahead
of the address in EAX.
Now you can obviously do this with variables as well, in which case you will
be using the indirect addressing mode instead. Things get a little
complicated, and the compiler tends to still do the same 8 bit offset
type of operation, relative to [ebp].
I feel like I just broke an encryption algorithm. The instruction manuals
are a little vague, so I did some disassembly, converted to binary, etc.,
0F 18 XX YY
where XX (8 bits) is broken down into 2 bits, 3 bits, 3 bits. YY is the
offset (this may be either 8 bits or 32 bits long) from the address in the
register in offset addressing mode and is the address in Direct addressing mode.
Addressing mode (first 2 bits)
    8 bit offset     01
    32 bit offset    10
Hint type (next 3 bits)
    NTA              000
    T0               001
    T1               010
    T2               011
Register name (next 3 bits)
    EAX              000
    ECX              001
    EDX              010
    EBX              011  (don't forget EBX is out of order here)
    000 if using direct memory access mode
for example if you were doing
prefetchNTA 64[eax] = 0F 18 40 40 (where XX = 01 (8 bit offset) 000 (NTA) 000 (EAX))
prefetchT0 64[eax] = 0F 18 48 40 (where XX = 01 (8 bit offset) 001 (T0) 000 (EAX))
prefetchT0 64[ecx] = 0F 18 49 40 (where XX = 01 (8 bit offset) 001 (T0) 001 (ECX))
prefetchT0 0x1234 = 0F 18 C8 12 34 (where XX = 11 (Direct) 001 (T0) 000 (no register))
- this one I haven't checked but I don't think you would actually do
From: Michael Herf
Sent: Monday, January 17, 2000 8:23 PM
To: Elliott, Brent J
Subject: prefetch macros...
Since Microsoft doesn't seem to be forthcoming with VC7, do you think you
could show me how to make prefetch macros (using emit) under VC6?
This would really help our memory throughput for software rendering... I'd
love to run 3x faster on P3. :)
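To make the byte layout in Brent's email concrete, here is a minimal sketch in C that packs the XX byte from its three fields and builds the 4-byte "0F 18 XX YY" form for the 8 bit offset case. It only builds the bytes and does not execute them. The enum and function names (`MOD_DISP8`, `HINT_NTA`, `REG_EAX`, `prefetch_xx`, `encode_prefetch_disp8`) are hypothetical, chosen for illustration, and the field values are the ones that appear in the email's worked examples.

```c
#include <assert.h>
#include <stdint.h>

/* Field values taken from the email's examples (names are illustrative). */
enum { MOD_DISP8 = 1, MOD_DISP32 = 2 };            /* 01, 10 */
enum { HINT_NTA = 0, HINT_T0 = 1 };                /* 000, 001 */
enum { REG_EAX = 0, REG_ECX = 1, REG_EBX = 3 };    /* 000, 001, 011 */

/* Pack XX: 2 bits addressing mode, 3 bits hint type, 3 bits register. */
static uint8_t prefetch_xx(unsigned mode, unsigned hint, unsigned reg)
{
    assert(mode < 4 && hint < 8 && reg < 8);
    return (uint8_t)((mode << 6) | (hint << 3) | reg);
}

/* Fill out[4] with "0F 18 XX YY", where YY is an 8 bit offset. */
static void encode_prefetch_disp8(uint8_t out[4], unsigned hint,
                                  unsigned reg, int8_t offset)
{
    out[0] = 0x0F;                                 /* two-byte opcode prefix */
    out[1] = 0x18;                                 /* prefetch opcode */
    out[2] = prefetch_xx(MOD_DISP8, hint, reg);    /* XX */
    out[3] = (uint8_t)offset;                      /* YY */
}
```

Calling `encode_prefetch_disp8(b, HINT_NTA, REG_EAX, 64)` fills `b` with 0F 18 40 40, the same bytes as the prefetchNTA_EAX macro, and passing `HINT_T0` with `REG_ECX` gives 0F 18 49 40, matching the third example in the email.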