SSE from Brent


Brent Elliott's my brother-in-law & friend from way back, and he works at Intel now. He really helped me out with using some basic SSE functions, particularly prefetch, under VC6.

Now of course, the "right way" to do this stuff is to actually get the VTune compiler, and use that, but it makes it hard for "everybody else" to compile your code. It's rumored that Microsoft will release SSE support one of these days, but as of now, there is none.

As a result, I got Brent to show me how to do this stuff manually. It's a lot of work, but it's worth it for super-critical loops that you're writing in assembly. In my imaging library, there are about 3-4 routines that use prefetch, for about a 30% overall increase in speed on a P3. From what I can tell, this causes only a tiny slowdown on a P2, so I leave it in for those machines too.

I have a tiled compositor routine that can read from main memory (not cache) and do 38 MPixels/sec in software on a P3 with prefetch. (Writing to L1, of course.) That's just really fast!

Anyway, here's the email from Brent that started it all:

From: "Elliott, Brent J"
To: "'Michael Herf'"
Subject: RE: prefetch macros...
Date: Tuesday, January 18, 2000 7:04 PM


Simple Case

As a quick summary, take the case where you are interested only in NTA,
with the memory address in EAX, and want to prefetch 2 cache lines (64
bytes) ahead of the EAX value. This is the case where EAX is your counter
in a for loop and points to the memory address you are calculating with in
the current iteration. If each iteration handles one cache line, you will
thus be prefetching the line you will be operating on two iterations from
now.

Your prefetch macro will be "0F 18 40 40"

which comes out to
#define prefetchNTA_EAX	__asm __emit 0x0F __asm __emit 0x18 \
			__asm __emit 0x40 __asm __emit 0x40

Then you can use the macro anytime you want to prefetch 2 cache lines ahead
of the address in EAX.
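
To make the "prefetch ahead of the pointer you are working on" idea concrete, here is a minimal sketch in portable C. It does not use the emitted bytes above. It swaps in the GCC/Clang `__builtin_prefetch` builtin as a stand-in (an assumption, since VC6 has no such thing), and the routine name, the `PREFETCH_NTA` macro and the 32-byte line size are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC or Clang. __builtin_prefetch(addr, 0, 0) asks for a read
   prefetch with no temporal locality, the closest portable match to NTA.
   Other compilers get a no-op so the code still builds. */
#if defined(__GNUC__)
#define PREFETCH_NTA(addr) __builtin_prefetch((addr), 0, 0)
#else
#define PREFETCH_NTA(addr) ((void)0)
#endif

#define LINE_BYTES  32  /* assumed cache line size (P3-style) */
#define AHEAD_BYTES 64  /* same distance as the 0x40 offset in the macro */

/* Hypothetical routine: sums a byte buffer while hinting two lines ahead. */
uint32_t sum_bytes_with_prefetch(const uint8_t *src, size_t len)
{
    uint32_t total = 0;
    size_t i;

    assert(src != NULL || len == 0);
    for (i = 0; i < len; i++) {
        /* Once per line, hint the line 64 bytes ahead. The bounds check
           keeps the pointer inside the buffer near the end. */
        if ((i % LINE_BYTES) == 0 && i + AHEAD_BYTES < len)
            PREFETCH_NTA(src + i + AHEAD_BYTES);
        total += src[i];
    }
    return total;
}
```

The prefetch is only a hint, so the result is the same with or without it. Whether it helps, and what distance works best, has to be measured on the target machine.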

You can obviously do this with variables as well, in which case you would
expect to be using the Indirect addressing mode instead. Things get a
little complicated, though: the compiler tends to still emit the same 8
bit offset type of operation, relative to [ebp].


I feel like I just broke an encryption algorithm. The instruction manuals
are a little vague, so I did some disassembly, converted to binary, etc.,
and the encoding comes out as

0F 18 XX YY

where XX (8 bits) is broken down into 2 bits, 3 bits, 3 bits. YY is the
offset (either 8 bits or 32 bits long) from the address in the register in
the offset addressing modes, and is the address itself in Direct
addressing.

Addressing mode (first 2 bits)
Indirect      00 (also Direct, when the register bits are 101)
8  bit offset 01
32 bit offset 10
Register      11 (not a memory operand, so no use for prefetch)

Hint type (next 3 bits)
NTA 000
T0  001
T1  010
T2  011

Register name (next 3 bits)
EAX 000
ECX 001
EDX 010
EBX 011 (don't forget EBX is out of order here)
ESP 100 (special: this pattern means an extra SIB byte follows)
EBP 101 (special: with mode 00 this means a direct 32 bit address
         follows, not [ebp])
ESI 110
EDI 111

For example, if you were doing
prefetchNTA 64[eax]  = 0F 18 40 40
    (where XX = 01 (8 bit offset) 000 (NTA) 000 (EAX))
prefetchT0  64[eax]  = 0F 18 48 40
    (where XX = 01 (8 bit offset) 001 (T0)  000 (EAX))
prefetchT0  64[ecx]  = 0F 18 49 40
    (where XX = 01 (8 bit offset) 001 (T0)  001 (ECX))
prefetchT0  [0x1234] = 0F 18 0D 34 12 00 00
    (where XX = 00 001 (T0) 101, the pattern for a direct 32 bit
    address, which is stored low byte first)
	- you probably would not actually do this one anyway.
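
The 2/3/3 bit split described above can be written down as a small C helper. This is only a sketch for checking the byte values by hand or generating new macros. All of the names here (`prefetch_xx`, `encode_prefetch`, the enums) are made up for illustration. ESP and EBP are left out on purpose, because the 100 and 101 register patterns have special meanings in the x86 encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative names only; none of these come from a real header. */
enum prefetch_mode { MODE_INDIRECT = 0, MODE_DISP8 = 1, MODE_DISP32 = 2 };
enum prefetch_hint { HINT_NTA = 0, HINT_T0 = 1, HINT_T1 = 2, HINT_T2 = 3 };
enum prefetch_reg  { REG_EAX = 0, REG_ECX = 1, REG_EDX = 2, REG_EBX = 3,
                     REG_ESI = 6, REG_EDI = 7 };

/* Pack the XX byte: 2 bits of mode, 3 bits of hint, 3 bits of register. */
uint8_t prefetch_xx(enum prefetch_mode mode, enum prefetch_hint hint,
                    enum prefetch_reg reg)
{
    return (uint8_t)(((unsigned)mode << 6) | ((unsigned)hint << 3) |
                     (unsigned)reg);
}

/* Write the whole instruction into out (room for 7 bytes) and return
   how many bytes were written. */
size_t encode_prefetch(uint8_t *out, enum prefetch_mode mode,
                       enum prefetch_hint hint, enum prefetch_reg reg,
                       int32_t offset)
{
    size_t n = 0;

    assert(out != NULL);
    /* An 8 bit offset is signed, so it must fit in -128..127. */
    assert(mode != MODE_DISP8 || (offset >= -128 && offset <= 127));

    out[n++] = 0x0F;
    out[n++] = 0x18;
    out[n++] = prefetch_xx(mode, hint, reg);

    if (mode == MODE_DISP8) {
        out[n++] = (uint8_t)offset;
    } else if (mode == MODE_DISP32) {
        uint32_t u = (uint32_t)offset;   /* stored low byte first */
        out[n++] = (uint8_t)(u & 0xFF);
        out[n++] = (uint8_t)((u >> 8) & 0xFF);
        out[n++] = (uint8_t)((u >> 16) & 0xFF);
        out[n++] = (uint8_t)((u >> 24) & 0xFF);
    }
    return n;
}
```

Calling `encode_prefetch(buf, MODE_DISP8, HINT_NTA, REG_EAX, 64)` fills `buf` with 0F 18 40 40, matching the first example above.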

Brent Elliott (office)

-----Original Message-----
From: Michael Herf
Sent: Monday, January 17, 2000 8:23 PM
To: Elliott, Brent J
Subject: prefetch macros...

Since Microsoft doesn't seem to be forthcoming with VC7, do you think you
could show me how to make prefetch macros (using emit) under VC6?

This would really help our memory throughput for software rendering...  I'd
love to run 3x faster on P3.  :)