stereopsis : sse

SSE from Brent

Brent Elliott's my brother-in-law & friend from way back, and he works at Intel now. He really helped me out with using some basic SSE functions, particularly prefetch, under VC6.
Now of course, the "right way" to do this stuff is to actually get the VTune compiler, and use that, but it makes it hard for "everybody else" to compile your code. It's rumored that Microsoft will release SSE support one of these days, but as of now, there is none.
As a result, I got Brent to show me how to do this stuff manually. It's a lot of work, but it's worth it for super-critical loops that you're writing in assembly. In my imaging library, there are about 3-4 routines that use prefetch, for about a 30% overall increase in speed on a P3. From what I can tell, this causes only a tiny slowdown on a P2, so I leave it in for those machines too.
I have a tiled compositor routine that can read from main memory (not cache) and do 38MPixels in software using P3 w/ prefetch. (Writing to L1, of course.) That's just really fast!
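To give a feel for what "using prefetch" in a hot loop looks like, here's a rough sketch in plain C. It is not the compositor or any routine from the imaging library: `sum_pixels` is a made-up stand-in, and the GCC/Clang builtin `__builtin_prefetch` (with locality 0, a non-temporal hint) is swapped in for the hand-encoded instruction, since the manual encoding only makes sense under VC6. It assumes the P3's 32-byte cache line and asks for the line that will be needed two iterations from now.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the GCC/Clang builtin stands in for a hand-encoded prefetchnta.
   Locality 0 requests a non-temporal hint; other compilers get a no-op. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_NTA(p) __builtin_prefetch((p), 0, 0)
#else
#define PREFETCH_NTA(p) ((void)(p))
#endif

#define CACHE_LINE_BYTES 32u                                   /* P3 line size */
#define PIXELS_PER_LINE  (CACHE_LINE_BYTES / sizeof(uint32_t)) /* 8 pixels */
#define LINES_AHEAD      2u                                    /* prefetch distance */

/* Hypothetical hot loop: sums 32-bit pixels, one cache line per outer pass. */
uint64_t sum_pixels(const uint32_t *src, size_t n)
{
    uint64_t total = 0;
    size_t i = 0;
    const size_t ahead = LINES_AHEAD * PIXELS_PER_LINE;

    assert(src != NULL || n == 0);

    while (i < n) {
        size_t end = i + PIXELS_PER_LINE;
        if (end > n)
            end = n;

        /* Ask for the line needed two passes from now, staying in bounds. */
        if (i + ahead < n)
            PREFETCH_NTA(&src[i + ahead]);

        /* Do the real work on the current line while the fetch is in flight. */
        for (; i < end; ++i)
            total += src[i];
    }
    return total;
}
```

The prefetch is only a hint, so the result is the same with or without it; whether it helps at all depends on the machine and on how much work each line gets.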
Anyway, here's the email from Brent that started it all:
From: "Elliott, Brent J"
To: "'Michael Herf'"
Subject: RE: prefetch macros...
Date: Tuesday, January 18, 2000 7:04 PM

Mike,

Simple Case

As a quick summary, say you are interested only in NTA, with the memory address in EAX, and want to prefetch 2 cache lines ahead of the EAX value. This is the case where EAX is your counter in a for loop that points to the memory address you are calculating with in the current iteration. You will thus be prefetching the cache line you will be operating on two iterations from now.

Your prefetch macro will be "0F 18 40 40", which comes out to

    #define prefetchNTA_EAX __asm _emit 0x0F __asm _emit 0x18 __asm _emit 0x40 __asm _emit 0x40

Then you can use the macro anytime you want to prefetch 2 cache lines ahead of the address in EAX.

Now you can obviously do this with variables as well, in which case you will be using the Indirect addressing mode instead. Things get a little complicated, and the compiler tends to do things like still doing the same 8 bit offset type of operation relative to [ebp].

Details

I feel like I just broke an encryption algorithm. The instruction manuals are a little vague, so I did some disassembly, converted to binary, etc., etc.

Instruction 0F 18 XX YY, where XX (8 bits) is broken down into 2 bits, 3 bits, 3 bits. YY is the offset (this may be either 8 bits or 32 bits long) from the address in the register in offset addressing mode, and is the address in Direct addressing mode.

Addressing mode (first 2 bits)
    Indirect         00
    8 bit offset     01
    32 bit offset    10
    Direct           11

Hint type (next 3 bits)
    NTA    000
    T0     001
    T1     010
    T2     011

Register name (next 3 bits)
    EAX    000
    ECX    001
    EDX    010
    EBX    011    (don't forget EBX is out of order here)
    ESP    100
    EBP    101
    ESI    110
    EDI    111
    -or-   000    if using direct memory access mode

For example, if you were doing

    prefetchNTA 64[eax] = 0F 18 40 40     (where XX = 01 (8 bit offset) 000 (NTA) 000 (EAX))
    prefetchT0  64[eax] = 0F 18 48 40     (where XX = 01 (8 bit offset) 001 (T0) 000 (EAX))
    prefetchT0  64[ecx] = 0F 18 49 40     (where XX = 01 (8 bit offset) 001 (T0) 001 (ECX))
    prefetchT0  0x1234  = 0F 18 C8 12 34  (where XX = 11 (Direct) 001 (T0) 000 (no register))
        - this one I haven't checked but I don't think you would actually do it anyway.

Brent Elliott
xxx.xxx.xxxx (office)

-----Original Message-----
From: Michael Herf
Sent: Monday, January 17, 2000 8:23 PM
To: Elliott, Brent J
Subject: prefetch macros...

Since Microsoft doesn't seem to be forthcoming with VC7, do you think you could show me how to make prefetch macros (using emit) under VC6? This would really help our memory throughput for software rendering... I'd love to run 3x faster on P3. :)

thanks,
mike
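
The bit layout in Brent's email is mechanical enough to compute. Here's a minimal sketch of an encoder in C that builds the XX byte from the 2/3/3 split and appends the offset. The function and enum names (`encode_prefetch`, `pf_mode`, and so on) are made up for illustration. A few assumptions go beyond the email: a 32-bit offset is written low byte first, as x86 stores displacements, and ESP and EBP are left out because in the standard ModRM scheme those two register codes have special meanings in some modes. The unchecked "direct" case is left out as well: in the standard encoding a leading 11 selects a register operand, which prefetch does not take, and an absolute address is instead written with mode 00, register code 101, and a 32-bit address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the values are the bit patterns from the email. */
enum pf_mode { PF_INDIRECT = 0, PF_DISP8 = 1, PF_DISP32 = 2 };
enum pf_hint { PF_NTA = 0, PF_T0 = 1, PF_T1 = 2, PF_T2 = 3 };
enum pf_reg  { PF_EAX = 0, PF_ECX = 1, PF_EDX = 2, PF_EBX = 3,
               PF_ESI = 6, PF_EDI = 7 };  /* ESP/EBP omitted on purpose */

/* Writes the instruction bytes into out (up to 7) and returns the length,
   or 0 if the offset does not fit the chosen addressing mode. */
size_t encode_prefetch(enum pf_mode mode, enum pf_hint hint, enum pf_reg reg,
                       int32_t disp, uint8_t out[7])
{
    size_t n = 0;
    uint32_t d = (uint32_t)disp;

    assert(out != NULL);
    if (mode == PF_INDIRECT && disp != 0)
        return 0;
    if (mode == PF_DISP8 && (disp < -128 || disp > 127))
        return 0;

    out[n++] = 0x0F;  /* two-byte opcode for the prefetch family */
    out[n++] = 0x18;
    /* XX byte: 2 bits of mode, 3 bits of hint, 3 bits of register. */
    out[n++] = (uint8_t)(((unsigned)mode << 6) |
                         ((unsigned)hint << 3) |
                         (unsigned)reg);

    if (mode == PF_DISP8) {
        out[n++] = (uint8_t)d;              /* single offset byte */
    } else if (mode == PF_DISP32) {
        out[n++] = (uint8_t)(d);            /* low byte first */
        out[n++] = (uint8_t)(d >> 8);
        out[n++] = (uint8_t)(d >> 16);
        out[n++] = (uint8_t)(d >> 24);
    }
    return n;
}
```

Calling `encode_prefetch(PF_DISP8, PF_NTA, PF_EAX, 64, buf)` fills `buf` with 0F 18 40 40, the same four bytes as the `prefetchNTA_EAX` macro, and the other two register examples from the email come out as 0F 18 48 40 and 0F 18 49 40.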