The "double blend trick"

Michael Herf
May 2000

Or, how to scale 4 numbers, or lerp 8 numbers, in only 2 multiplies

Lots of the pixel-friendly people I've worked with seem to know (and use) this simple trick for fast packed blending. Still, though none of us (Sree or Ben or me) invented it, I don't think it's incredibly well-known in general. For instance, I think Crystal Space uses these ideas in its software rasterizer, but Gnome and Netscape 6 do not.

Of course, if you're writing for PCs, learn MMX, and be done with it. If you want to write portable code, read up!

If you're doing alpha compositing, you really need to read Alvy Ray Smith's article about compositing. Also, Jim Blinn's Dirty Pixels has a good discussion of the finer points of how to multiply, and another great discussion of premultiplied alpha.

What's here

First, I'll talk about fast scaling of pixels, which is something you'll want for a premultiplied alpha blending library, or for pixel darkening, multiply apply modes, etc. Then, I'll talk about fast lerping between pairs of pixels, which is something you'll want for unpremultiplied alpha, lerping, and for related things like bilinear & trilinear interpolation, etc.

Since this article is about a fast way to deal with packed pixels (this is when the technique is most useful), I'll talk about pixel formats first. A majority of my image-processing code uses the 8-8-8-8 ARGB format (BGRA in little-endian unsigned char order), because that's the native GWorld format on the Mac, and it's compatible with 32-bit bitblts on the PC.

In other words, you can get it to the screen efficiently, and there's enough precision to do simple imaging operations. I think this is the best-supported format if you're writing your own library, i.e. you'll get consistently good performance across multiple platforms. I know that After Effects and all the software at MetaCreations use ARGB. Also, Microsoft's new alpha blts in Win2000, the GDI+ specification, and DXSURFACE all use premultiplied ARGB, so ARGB is the right thing by monopoly vote.

By contrast, the more classic RGBA format is less widely supported, mostly due to the Mac. See blttest for sample code on Windows.

In fact, I've done the "fast thing" on both platforms by implementing a struct called Pixel32 that does optimal things for RISC and x86 machines:

typedef unsigned char uint8;
typedef unsigned long uint32;

struct Pixel32 {
    union {
        struct {
#if LittleEndian_
            uint8 b, g, r, a;
#else
            uint8 a: 8;
            uint8 r: 8;
            uint8 g: 8;
            uint8 b: 8;
#endif
        };
        uint32 u;
    };
};
Let me point out the non-obvious things about this class. First, the ordering of the struct means that the code will work identically on little and big endian machines. i.e. if you use this struct, you never have to worry about endian issues, period.

Second, for single component loads, PCs are still faster at loading 8 bits at a time than using 32-bit operations to do the same, because the shift and mask together have a 2-cycle latency. On more RISC-like machines, it's typically better to use 32-bit operations only, so we use a bitfield for that implementation, which translates to pretty ideal masking and shifting. (Really -- I've disassembled it.)

Scaling 8-bit packed numbers

This idea is similar to using SIMD (like MMX and AltiVec), except it works 32 bits at a time and has no overflow control (i.e. we have to do everything manually).

So, the first observation is that scaling an unsigned number in packed format is easy to do in 32 bits. In fact, four components can be processed in two multiplies:

static inline uint32 Scale32(uint32 scale, uint32 p) {

	uint32 ag = (p & 0xFF00FF00) >> 8;
	uint32 rb =  p & 0x00FF00FF;

	uint32 sag = scale * ag;
	uint32 srb = scale * rb;

	sag = sag & 0xFF00FF00;
	srb = srb >> 8 & 0x00FF00FF;

	return sag | srb;
}

This does a fixed-point scale (and "scale" is expected to be 0..256, not 0..255). What happened here is that we masked off the alpha/green and red/blue component pairs, and scaled each pair using operations that only touch 16 bits at a time. i.e. we computed:
scale*a  scale*g
scale*r  scale*b
and then reduced each to its 8 most significant bits.

This technique can be used to implement Jim Blinn & Alvy Ray Smith's accurate scaling method for 8-bit (0..255-valued) inputs. (Ch. 19 of "Dirty Pixels", "Three Wrongs Make a Right," Jim Blinn.)

#define rbmask 0xFF00FF

// Expects "a8" in 0..255, and "b32" holding two components in its 16-bit halves, each 0..255
static inline uint32 Mul8x2(uint32 a8, uint32 b32) {
	b32 &= rbmask;
	uint32 i = a8 * b32 + 0x800080;
	return (i + ((i >> 8) & rbmask)) >> 8 & rbmask;
}
To handle four components, we simply call the code like this:
static inline void Over(uint32 *dp, uint32 front, uint32 back) {

	Pixel32 *p = (Pixel32*)&front;
	uint32 alpha = 255 - p->a;

	uint32 rbd = Mul8x2(alpha, back);
	uint32 agd = Mul8x2(alpha, back >> 8) << 8;

	*dp = rbd + agd + front;
}
There's one other trick before we move on to lerping. Blinn's method is exact for all inputs, as he shows, but it's still a little expensive in some cases. The problem with the naive approach is twofold: first, we're multiplying two not-quite-8-bit numbers (0..255 values standing in for 0..1) as if they were plain 8-bit integers, and second, we're losing precision by shifting instead of rounding.

The simpler version of the problem is to satisfy what I'll call the "OpenGL criteria". OpenGL defines the blending operator such that (0.0 * 0.0 = 0.0) and (1.0 * 1.0 = 1.0). (In base-255, that means that 255*255 = 255.)

Obviously, if we take 255 * 255 >> 8, we get 254, not 255. Not right.

The cheapest solution I know to that problem is Sree's. Sree does this (pseudo-code):

uint32 Mul255(uint32 a255, uint32 c255) {
    return (a255 + 1) * c255 >> 8;
}
This satisfies the OpenGL criteria, without really costing much at all. So, add 1 to alpha, and you'll be happy. I use this in my code, since it really is cheaper than Blinn's stuff. Heck, if you need accuracy, use 16-bit components.

Lerping 8-bit packed numbers

This is the one I think fewer people know about.

Here we'll be illustrating with the non-associated alpha blend (or lerp), which you can easily extend to implement optimal bilinear or trilinear, which use 3 and 7 lerps, respectively (that's 6 or 14 muls).

However, I've been writing for a while, and I think I'm going to just drop some code and let you figure out the details.

// ================================================================================================
// 'a' is base-256 (0..256), not 255; source alpha is ignored
// ================================================================================================
static inline void PixelLerp(uint32 a, uint32 &d, const uint32 s)
{
	uint32 dstrb = d      & 0xFF00FF;
	uint32 dstag = d >> 8 & 0xFF00FF;

	uint32 srcrb = s      & 0xFF00FF;
	uint32 srcag = s >> 8 & 0xFF00FF;

	uint32 drb = srcrb - dstrb;
	uint32 dag = srcag - dstag;

	drb *= a;
	dag *= a;
	drb >>= 8;
	dag >>= 8;

	const uint32 rb  = (drb + dstrb)      & 0x00FF00FF;
	const uint32 ag  = (dag + dstag) << 8 & 0xFF00FF00;

	d = rb | ag;
}

The unsigned multiply actually does the right thing here, a non-obvious result. You can prove it with some two's-complement work.

If you only care about RGB, you can save one shift (this one uses source alpha):

static inline void PixelBlend(uint32 &d, const uint32 s)
{
	const uint32 a     = (s >> 24) + 1;

	const uint32 dstrb = d & 0xFF00FF;
	const uint32 dstg  = d & 0xFF00;

	const uint32 srcrb = s & 0xFF00FF;
	const uint32 srcg  = s & 0xFF00;

	uint32 drb = srcrb - dstrb;
	uint32 dg  =  srcg - dstg;

	drb *= a;
	dg  *= a;
	drb >>= 8;
	dg  >>= 8;

	uint32 rb = (drb + dstrb) & 0xFF00FF;
	uint32 g  = (dg  + dstg)  & 0xFF00;

	d = rb | g;
}

Finally, here's my unreadable RGB-only bilinear. This coding style is "help out the register allocator by reusing variables obsessively." It looks like it makes sense, but it's really completely unreadable.

Don't use this when you need alpha to be preserved:

#define rbmask 0xFF00FF
#define agmask 0xFF00FF00

static inline uint32 Bilerp32(uint32 a, uint32 b, uint32 c, uint32 d, uint32 xp, uint32 yp)
{
#define arb (a & rbmask)
#define brb (b & rbmask)
#define crb (c & rbmask)
#define drb (d & rbmask)

#define aag (a & agmask)
#define bag (b & agmask)
#define cag (c & agmask)
#define dag (d & agmask)

	uint32 agd1 = bag - aag;
	uint32 agd2 = dag - cag;
	uint32 rbd1 = brb - arb;
	agd1 >>= 8;
	uint32 rbd2 = drb - crb;
	agd2 >>= 8;

	rbd1 *= xp;  b = arb;
	agd1 *= xp;  rbd1 >>= 8;
	rbd2 *= xp;  d = crb;
	agd2 *= xp;  rbd2 >>= 8;

	b += rbd1;
	a += agd1;

	d += rbd2;
	c += agd2;

	b &= rbmask;
	a &= agmask;
	c &= agmask;
	d &= rbmask;

	// ---- d is now rb for the bottom
	//      c is     ag for the bottom
	//      b is     rb for the top
	//      a is     ag for the top

	c -= a;  // agd
	d -= b;  // rbd
	c >>= 8;

	d *= yp;
	c *= yp;
	d >>= 8;

	a += c;
	b += d;

	a &= agmask;
	b &= rbmask;

	return a | b;
}

#undef aag
#undef bag
#undef cag
#undef dag

#undef arb
#undef brb
#undef crb
#undef drb

Okay, I think that's a wrap. Have some fun, and don't hurt yourself with the bilinear.