Michael Herf
SreeD, Sree Kotay's rocking 3-D library, uses a memory-saving device to store diffuse + specular light maps. These are inspired by the effect of the Screen, Soft Light, and Hard Light apply modes in Photoshop. The idea is to have one map where a value of 0.5 keeps the destination the same, 0.0 drives it to zero, and 1.0 brightens it.
MetaStream 3 -- one surface texture, one environment map.
In real life, black objects can have specular highlights, and multiplies can't handle this: anything multiplied by black stays black.
There are, of course, many models for lighting. A reasonable one for realtime is the following:
    diffuse_lightmap * diffuse_texture * diffuse_color + specular_lightmap * gloss_map * specular_color

Now, this is a lot of multiplies for a software renderer, and it's hard to evaluate in hardware, since hardware does not support associativity (i.e. there's no evaluation stack, so deferring the add is hard.)
So, dropping a few terms, you come up with:
    diffuse_lightmap * diffuse_texture * diffuse_color + specular_lightmap

This means that you have to pre-render specular color into a map to get the right color, but doing it on the fly isn't possible using most hardware today.
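To make the cost difference concrete, here is a minimal per-channel sketch of the two formulas above. The `LightTerms` struct, `clamp01`, `shade_full`, and `shade_reduced` are hypothetical names invented for illustration, and treating every term as a float in 0..1 is an assumption; none of this is SreeD's code.

```c
#include <assert.h>

/* Hypothetical bundle of per-channel terms, each normalized to 0..1. */
typedef struct {
    float diffuse_lightmap, diffuse_texture, diffuse_color;
    float specular_lightmap, gloss_map, specular_color;
} LightTerms;

static float clamp01(float x) {
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}

/* Full model: four multiplies and one add per channel. */
float shade_full(const LightTerms *t) {
    assert(t != 0);
    float diffuse  = t->diffuse_lightmap * t->diffuse_texture * t->diffuse_color;
    float specular = t->specular_lightmap * t->gloss_map * t->specular_color;
    return clamp01(diffuse + specular);
}

/* Reduced model: gloss_map and specular_color are dropped, so the
   specular lightmap is assumed to be pre-rendered in the right color. */
float shade_reduced(const LightTerms *t) {
    assert(t != 0);
    float diffuse = t->diffuse_lightmap * t->diffuse_texture * t->diffuse_color;
    return clamp01(diffuse + t->specular_lightmap);
}
```

With a black `diffuse_texture` of 0 and a `specular_lightmap` of 0.25, `shade_reduced` still returns 0.25: the additive term is what lets a black surface carry a highlight.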
Sree noticed that it's rare to have highlights where there are shadows (in fact, systems that allow this, like shadow z-buffers, are generally wrong), and vice versa. So we collapse the lightmap into one texture, split by a conditional per pixel:
    lightmap *= 2;
    if (lightmap >= 1.0)
        result += (lightmap - 1.0);   // clamped add; already 0..1 after the doubling
    else
        result *= lightmap;           // standard lightmap

This does a modulate (like Quake) for values below 0.5, returns the original color at 0.5, and adds the texture map above 0.5.
This saves memory and only drops one bit (we still have 7). Also it's continuous -- you can make a visually-pleasing map, and the two piecewise functions meet at lightmap = 0.5, where you get x*1 and x+0. Also, for people just learning to create maps, it tends to give better-looking results, because we don't get specular highlights in places where there is no light. It's not quite as flexible, but is a very good simplification.
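The continuity claim is easy to check with a small floating-point sketch. The function name `proc_blend_f` is hypothetical, and the additive branch follows the integer reference version in this article, which adds the doubled lightmap minus one; treat it as an illustration under those assumptions rather than the library's code.

```c
#include <assert.h>

/* Blend one channel. base and lightmap are both in 0..1.
   Below 0.5 the lightmap modulates, above 0.5 it adds. */
float proc_blend_f(float base, float lightmap) {
    assert(lightmap >= 0.0f && lightmap <= 1.0f);
    float l = lightmap * 2.0f;                 /* now 0..2 */
    float out = (l >= 1.0f) ? base + (l - 1.0f)  /* additive half */
                            : base * l;          /* modulate half */
    return out > 1.0f ? 1.0f : out;            /* clamp the add */
}
```

At `lightmap = 0.5` both branches agree (`base * 1` and `base + 0`), so the result is just `base`. A black base with `lightmap = 0.75` comes out as 0.5, which a pure multiply could never produce.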
Unfortunately, graphics hardware can't evaluate this directly, so we have to split each of our textures into two to upload to hardware. It works in 3 "logical" passes on hardware right now (including a surface texture.) On newer shading hardware (including NVIDIA and ATI Radeon), we can evaluate everything at once, so it's a single pass (in terms of time), but it's still using twice as much memory as the software version does.
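The article does not say how the split for hardware is done, so the following is only one plausible sketch. It assumes an 8-bit combined map that is doubled to the 0..511 range used by the reference code, then separated into a modulate texture and an additive texture. The name `split_lightmap` and the buffer layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split n combined lightmap texels into two 8-bit textures:
   modulate[i] is used as a multiply (255 is treated as "leave unchanged"),
   additive[i] is used as a clamped add. */
void split_lightmap(const uint8_t *combined, uint8_t *modulate,
                    uint8_t *additive, size_t n) {
    assert(combined != 0 && modulate != 0 && additive != 0);
    for (size_t i = 0; i < n; ++i) {
        unsigned v = (unsigned)combined[i] * 2u;      /* 0..510 */
        if (v >= 256u) {
            modulate[i] = 255;                        /* no darkening */
            additive[i] = (uint8_t)(v - 256u);        /* 0..254 */
        } else {
            modulate[i] = (uint8_t)v;                 /* 0..254 */
            additive[i] = 0;                          /* no highlight */
        }
    }
}
```

Each texel ends up non-trivial in at most one of the two outputs, which is exactly why the combined form can store both in a single channel.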
I'm providing sample implementations here -- first is a reference version, implemented slowly (Sree's C implementation is much faster, but this is just to illustrate how the code works.) The inputs are a packed pixel p whose channels are each 0..255, plus one blend value per channel (r, g, b) in 0..511, so the procedural blend value is pre-scaled by the caller.
Then I did a fast MMX version of this operation. It runs in about 10 cycles/pixel.
You have to call MMXPreProcBlend once, before MMXProcBlend, but after any floating point code. You're also obligated to call __asm emms sometime to flush the MMX state.
    typedef long int32;
    typedef unsigned long uint32;

    // 34 cycles
    uint32 ProcBlend(uint32 p, uint32 r, uint32 g, uint32 b)
    {
        uint32 nr = p >> 16 & 0xFF;
        uint32 ng = p >> 8 & 0xFF;
        uint32 nb = p & 0xFF;

        if (r > 255) nr += (r - 256); else nr = nr * r >> 8;
        if (g > 255) ng += (g - 256); else ng = ng * g >> 8;
        if (b > 255) nb += (b - 256); else nb = nb * b >> 8;

        if (nr > 255) nr = 255;
        if (ng > 255) ng = 255;
        if (nb > 255) nb = 255;

        uint32 final = (nr << 16) + (ng << 8) + nb;
        return final;
    }

    UINT64 maxmask = 0x010001000100;
    UINT64 minmask = 0xFEFFFEFFFEFF;

    // call after fp code, before MMXProcBlend
    __forceinline void MMXPreProcBlend()
    {
        __asm {
            pxor mm7, mm7
            movq mm6, maxmask
            movq mm5, minmask
        }
    }

    // ~9.75 cycles
    __forceinline uint32 MMXProcBlend(uint32 p, uint32 r, uint32 g, uint32 b)
    {
        uint32 retval;
        __asm {
            movd mm0, p
            movd mm3, r
            movd mm2, g
            movd mm1, b

            punpcklwd mm1, mm3
            punpcklwd mm1, mm2

            // copy
            movq mm2, mm1

            // unpack sometime
            punpcklbw mm0, mm7

            // mult stuff [mmx compatible, though there is a faster SSE way]
            paddusw mm1, mm5    // clamp to 256 (not 255!)

            // additive stuff
            psubusw mm2, mm6    // mm2 <- (v > 256 ? v - 256 : 0)

            psubusw mm1, mm5    // finish clamping to 256

            // no unsigned hiword mul in MMX (just SSE), so we have to do this!
            pmullw mm0, mm1
            psrlw mm0, 8

            paddusw mm0, mm2
            packuswb mm0, mm7

            movd retval, mm0
        }
        return retval;
    }
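The saturating add/subtract trick in the MMX routine can be followed more easily one 16-bit lane at a time. The sketch below emulates that sequence in portable C and compares it against the per-channel logic of ProcBlend. The helper names (`addus16`, `subus16`, `lane_blend`, `lane_reference`, `lanes_agree`) are hypothetical, and this is a scalar illustration, not a replacement for the MMX code.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned saturating 16-bit add and subtract, like paddusw / psubusw on one lane. */
static uint16_t addus16(uint16_t a, uint16_t b) {
    uint32_t s = (uint32_t)a + (uint32_t)b;
    return (uint16_t)(s > 0xFFFFu ? 0xFFFFu : s);
}

static uint16_t subus16(uint16_t a, uint16_t b) {
    return (uint16_t)(a > b ? a - b : 0);
}

/* One lane of the MMX sequence: c is a 0..255 channel, v a 0..511 blend value. */
uint8_t lane_blend(uint8_t c, uint16_t v) {
    assert(v <= 511);
    uint16_t add  = subus16(v, 0x0100);                       /* max(v - 256, 0)      */
    uint16_t mul  = subus16(addus16(v, 0xFEFF), 0xFEFF);      /* min(v, 256)          */
    uint16_t prod = (uint16_t)(((uint32_t)c * mul) & 0xFFFFu); /* pmullw: low 16 bits  */
    uint16_t out  = addus16((uint16_t)(prod >> 8), add);      /* psrlw 8, paddusw     */
    return (uint8_t)(out > 255 ? 255 : out);                  /* packuswb saturation  */
}

/* Per-channel logic of the reference version. */
uint8_t lane_reference(uint8_t c, uint16_t v) {
    uint32_t n = c;
    if (v > 255) n += (uint32_t)v - 256u; else n = n * v >> 8;
    return (uint8_t)(n > 255 ? 255 : n);
}

/* Exhaustive comparison over every channel and blend value; returns 1 on agreement. */
int lanes_agree(void) {
    for (int c = 0; c < 256; ++c)
        for (int v = 0; v < 512; ++v)
            if (lane_blend((uint8_t)c, (uint16_t)v) !=
                lane_reference((uint8_t)c, (uint16_t)v))
                return 0;
    return 1;
}
```

The reason for clamping the multiplier to 256 rather than 255 shows up here: for any blend value of 256 or more, the multiply becomes `c * 256 >> 8`, which is exactly `c`, so the additive half starts from the unmodified color.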