Sree's procedural blend, my MMX version


Michael Herf

SreeD, Sree Kotay's rocking 3-D library, uses a memory-saving device to store diffuse + specular light maps. The technique is inspired by the effect of the Screen, Soft Light, and Hard Light apply modes in Photoshop. The idea is to have one map where a value of 0.5 keeps the destination the same, 0.0 drives it to zero, and 1.0 brightens it.

MetaStream 3 -- one surface texture, one environment map.

Light maps

"Light maps," as they're used in Quake, have a problem -- they can only darken a texture map. Some OpenGL vendors have started supporting 2X diffuse apply modes (i.e. lightmap * texture * 2, clamped), but this only solves part of the problem.

In real life, black objects can have specular highlights. Multiplies can't handle this, since anything times black is still black.
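To make the limitation concrete, here is a minimal sketch of the two multiplicative modes on one 8-bit channel. The helper names `modulate` and `modulate2x` are made up for illustration, and treating 255 as 1.0 is an assumption, not how any particular OpenGL driver rounds.

```c
#include <stdint.h>

/* Hypothetical helpers: one 8-bit channel, with 255 standing for 1.0. */

/* Quake-style light map: can only darken the texel. */
static uint8_t modulate(uint8_t texel, uint8_t light)
{
    return (uint8_t)((texel * light) / 255);
}

/* 2X mode: lightmap * texture * 2, clamped. Can brighten, but not black. */
static uint8_t modulate2x(uint8_t texel, uint8_t light)
{
    unsigned v = (2u * texel * light) / 255;
    return (uint8_t)(v > 255 ? 255 : v);
}
```

Either way a black texel (0) stays black no matter how bright the light map is, which is the part of the problem the 2X mode leaves unsolved.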

There are, of course, many models for lighting. A reasonable one for realtime is the following:

  diffuse_lightmap * diffuse_texture * diffuse_color + specular_lightmap * gloss_map * specular_color.
Now, this is a lot of multiplies for a software renderer, and it's hard to evaluate in hardware, since hardware can't hold intermediate results (i.e. there's no evaluation stack, so deferring the add is hard).

So, dropping a few terms, you come up with:

  diffuse_lightmap * diffuse_texture * diffuse_color + specular_lightmap.
This means that you have to pre-render specular color into a map to get the right color, but doing it on the fly isn't possible using most hardware today.
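The two models above can be sketched per channel in floating point. This is only an illustration: the function names are hypothetical, all inputs are assumed to be in [0, 1], and the final clamp is my addition so the result stays displayable.

```c
/* Hypothetical per-channel evaluation; every input is assumed to be in [0, 1]. */
static float clamp01(float v)
{
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}

/* Full model: five multiplies and a deferred add. */
static float light_full(float diffuse_lightmap, float diffuse_texture,
                        float diffuse_color, float specular_lightmap,
                        float gloss_map, float specular_color)
{
    return clamp01(diffuse_lightmap * diffuse_texture * diffuse_color
                   + specular_lightmap * gloss_map * specular_color);
}

/* Reduced model: the specular color is assumed to be baked into the map. */
static float light_reduced(float diffuse_lightmap, float diffuse_texture,
                           float diffuse_color, float specular_lightmap)
{
    return clamp01(diffuse_lightmap * diffuse_texture * diffuse_color
                   + specular_lightmap);
}
```

Note that the reduced form still lets a black texture pick up a highlight, because the specular term is added rather than multiplied.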

One map?

I initially thought this was crazy, throwing away lots of information. But now I think Sree's right.

Sree noticed that it's rare to have highlights where there are shadows (in fact, systems that allow this, like shadow z-buffers, are generally wrong), and vice versa. So we collapse the lightmap into one texture, split by a conditional per pixel:

  lightmap *= 2;
  if (lightmap >= 1.0) result += (lightmap - 1.0) * 2;    // clamped add
  else result *= lightmap;                                // standard lightmap
This does a modulate (like Quake) for values below 0.5, returns the original color at 0.5, and adds to the texture map above 0.5.
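Here is a small float sketch of the three regimes for a single channel. The name `proc_blend_f` is hypothetical. The highlight term adds lightmap*2 - 1, which matches the integer scaling of a 0..511 blend value; leaving out the pseudocode's extra factor of 2 on the add is an assumption of this sketch.

```c
/* Hypothetical float version of the one-map blend for a single channel.
   texel and lightmap are assumed to be in [0, 1]. */
static float proc_blend_f(float texel, float lightmap)
{
    float l = lightmap * 2.0f;                 /* 0.5 maps to 1.0 */
    float result = (l >= 1.0f)
        ? texel + (l - 1.0f)                   /* additive highlight */
        : texel * l;                           /* standard lightmap  */
    return result > 1.0f ? 1.0f : result;      /* clamped */
}
```

At lightmap = 0.5 both branches return the texel unchanged, so the two pieces join without a visible seam.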

This saves memory and only drops one bit (we still have 7). It's also continuous -- you can make a visually-pleasing map, because the two piecewise functions meet at lightmap = 0.5, where you get x*1 and x+0. And for people just learning to create maps, it tends to give better-looking results, because we don't get specular highlights in places where there is no light. It's not quite as flexible, but it's a very good simplification.

Unfortunately, graphics hardware can't evaluate this directly, so we have to split each of our textures into two to upload to hardware. It works in 3 "logical" passes on hardware right now (including a surface texture.) On newer shading hardware (including NVIDIA and ATI Radeon), we can evaluate everything at once, so it's a single pass (in terms of time), but it's still using twice as much memory as the software version does.
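One plausible way to do that split, sketched in C: the lower half of the combined map becomes a modulate texture and the upper half an additive texture. The function name, the 8-bit single-channel layout, and the exact rounding are all assumptions for illustration, not SreeD's actual upload code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: split a combined 8-bit map into a modulate map
   (for a multiply pass) and an additive map (for an add pass). */
static void split_lightmap(const uint8_t *combined, uint8_t *modulate,
                           uint8_t *additive, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        unsigned v = 2u * combined[i];                    /* 0..510 */
        modulate[i] = (uint8_t)(v > 255 ? 255 : v);       /* saturates at 0.5 and above */
        additive[i] = (uint8_t)(v > 256 ? v - 256 : 0);   /* zero at 0.5 and below */
    }
}
```

This is where the doubled memory comes from: every combined texel turns into two texels on the card.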

I'm providing sample implementations here -- first is a reference version, implemented slowly (Sree's C implementation is much faster, but this is just to illustrate how the code works). The inputs are a packed pixel p with channels in 0..255 and three blend values r, g, b in 0..511, so the procedural blend value is pre-scaled by the caller.
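As a sketch of the caller's side, assuming the light map is stored as a packed 0x00RRGGBB texel (the storage format and the helper name `prescale_lightmap` are my assumptions), the pre-scaling is just a doubling per channel:

```c
#include <stdint.h>

/* Hypothetical helper: expand one packed 0x00RRGGBB lightmap texel into
   three pre-scaled blend values. Doubling gives 0..510, with 128 landing
   on 256, the value that leaves the pixel unchanged. */
static void prescale_lightmap(uint32_t lightmap,
                              uint32_t *r, uint32_t *g, uint32_t *b)
{
    *r = ((lightmap >> 16) & 0xFF) * 2;
    *g = ((lightmap >>  8) & 0xFF) * 2;
    *b = ( lightmap        & 0xFF) * 2;
}
```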

Then I did a fast MMX version of this operation. It runs in about 10 cycles/pixel.

You have to call MMXPreProcBlend once, before MMXProcBlend, but after any floating point code. You're also obligated to call __asm emms when you're done, before any later floating point code, to clear the MMX state.

typedef long int32;
typedef unsigned long uint32;

// 34 cycles
// p is a packed 0x00RRGGBB pixel; r, g, b are blend values in 0..511
uint32 ProcBlend(uint32 p, uint32 r, uint32 g, uint32 b)
{
	// unpack the pixel's channels
	uint32 nr = (p >> 16) & 0xFF;
	uint32 ng = (p >>  8) & 0xFF;
	uint32 nb =  p        & 0xFF;

	// above 255: add the excess; otherwise: modulate
	if (r > 255) nr += (r - 256);
	else nr = (nr * r) >> 8;

	if (g > 255) ng += (g - 256);
	else ng = (ng * g) >> 8;

	if (b > 255) nb += (b - 256);
	else nb = (nb * b) >> 8;

	// clamp the additive results
	if (nr > 255) nr = 255;
	if (ng > 255) ng = 255;
	if (nb > 255) nb = 255;

	uint32 final = (nr << 16) + (ng << 8) + nb;
	return final;
}

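// per-word constants for the three color channels (UINT64 comes from <windows.h>)
UINT64 maxmask = 0x010001000100;	// 256 in each channel word
UINT64 minmask = 0xFEFFFEFFFEFF;	// 0xFFFF - 256 in each channel word

// call after fp code, before MMXProcBlend
__forceinline void MMXPreProcBlend()
{
	__asm {
		pxor mm7, mm7		// mm7 <- 0

		movq mm6, maxmask
		movq mm5, minmask
	}
}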

// ~9.75 cycles
__forceinline uint32 MMXProcBlend(uint32 p, uint32 r, uint32 g, uint32 b)
{
	uint32 retval;

	__asm {
		movd mm0, p

		movd mm3, r
		movd mm2, g
		movd mm1, b

		// interleave the blend values into words: mm1 <- (0, r, g, b)
		punpcklwd mm1, mm3
		punpcklwd mm1, mm2
		// copy
		movq mm2, mm1
		// unpack pixel bytes to words
		punpcklbw mm0, mm7

		// mult stuff [mmx compatible, though there is a faster SSE way]
		paddusw mm1, mm5	// clamp to 256 (not 255!)

		// additive stuff
		psubusw mm2, mm6	// mm2 <- (v > 256 ? v - 256 : 0)
		psubusw mm1, mm5	// finish clamping to 256

		// no unsigned hiword mul in MMX (just SSE), so we have to do this!
		pmullw mm0, mm1
		psrlw mm0, 8
		paddusw mm0, mm2

		// pack words back to bytes, saturating at 255
		packuswb mm0, mm7

		movd retval, mm0
	}

	return retval;
}
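The saturating add/subtract trick in the MMX code is easier to follow one lane at a time. Below is a plain C model of a single 16-bit lane. It is a sketch for reading the assembly, not a replacement for it, and the names `addus16`, `subus16`, and `blend_lane` are hypothetical.

```c
#include <stdint.h>

/* Model of paddusw on one 16-bit lane: unsigned add that saturates at 0xFFFF. */
static uint16_t addus16(uint16_t a, uint16_t b)
{
    uint32_t s = (uint32_t)a + b;
    return (uint16_t)(s > 0xFFFF ? 0xFFFF : s);
}

/* Model of psubusw on one 16-bit lane: unsigned subtract that saturates at 0. */
static uint16_t subus16(uint16_t a, uint16_t b)
{
    return (uint16_t)(a > b ? a - b : 0);
}

/* One channel of the branch-free blend: texel in 0..255, blend in 0..511. */
static uint8_t blend_lane(uint8_t texel, uint16_t blend)
{
    /* Adding 0xFEFF saturates anything above 256; subtracting it again
       leaves min(blend, 256). */
    uint16_t mult = subus16(addus16(blend, 0xFEFF), 0xFEFF);
    /* max(blend - 256, 0): the additive part. */
    uint16_t add = subus16(blend, 0x0100);
    /* pmullw + psrlw: the low 16 bits of the product, shifted down by 8. */
    uint16_t prod = (uint16_t)(texel * mult);
    uint16_t v = addus16((uint16_t)(prod >> 8), add);
    /* packuswb: saturate to a byte. */
    return (uint8_t)(v > 255 ? 255 : v);
}
```

Because the multiplier is clamped to exactly 256, the multiply becomes a no-op whenever the additive part is active, so both halves of the conditional can run unconditionally.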