Fast Crossfades

Michael Herf
October 2000

I've decided to stop apologizing for using MMX everywhere. Everyone has it, and it's just too much fun.

I thought I'd put up some code to do wicked-fast crossfades -- I got a lot of attention at MetaCreations for all the image transition code I did there -- ahem, related, but not the same code. This is, uh, quite a bit faster. :)

This code has benchmarked at 103fps on my P3/600, crossfading two 640x480 images into a third. Even with the BitBlt included, it still manages 60fps, which is fast enough.

Also, my brother-in-law Brent Elliott works at Intel, and he gave me some of the prefetch code on this page. Prefetch lets you tell the processor, "Hey dummy, go get this memory, I'm going to need it." It's worth maybe a 30% speedup here. Unfortunately, there's no compiler support, so you have to do it at the icky opcode level.
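For anyone reading this with a newer compiler: the same hint is exposed as the `_mm_prefetch` intrinsic in `<xmmintrin.h>`, so the raw opcode bytes aren't strictly required anymore. Here's a minimal sketch of the idea, not the code from this page. `PREFETCH_NTA` and `average_rows` are made-up names for illustration, and the 96-byte distance simply mirrors the 0x60 displacement encoded in the pfNTA macros.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
// Non-temporal prefetch hint. It is only a hint: it never changes results.
#define PREFETCH_NTA(p) _mm_prefetch(reinterpret_cast<const char *>(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)0)  // no-op on targets without SSE
#endif

// Hypothetical helper (not from the article): averages two byte rows while
// hinting 96 bytes ahead in both inputs.
void average_rows(const std::uint8_t *a, const std::uint8_t *b,
                  std::uint8_t *out, std::size_t n)
{
    const std::size_t kAhead = 96;  // assumed distance, same as the 0x60 displacement
    for (std::size_t i = 0; i < n; i++) {
        // One hint per 32 bytes, and only for addresses that lie inside the rows.
        if ((i % 32) == 0 && i + kAhead < n) {
            PREFETCH_NTA(a + i + kAhead);
            PREFETCH_NTA(b + i + kAhead);
        }
        out[i] = static_cast<std::uint8_t>((a[i] + b[i]) >> 1);
    }
}
```

Whether the hint helps, and how far ahead to reach, depends on the processor and the access pattern, so treat 96 as a starting point to benchmark rather than a rule.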

What I learned: saving multiplies isn't always worth it. This code does four multiplies (pmullw) per pair of pixels instead of two, and it's better for it.
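To make the trade-off concrete, here's a scalar sketch of the per-channel arithmetic. The first function is the two-weight form the MMX code computes (one multiply per source). The second is the usual difference form that saves a multiply. That this is the "two multiply" alternative being compared against is my assumption, and both function names are invented for illustration.

```cpp
#include <cstdint>

// Two-weight form: what the MMX loop computes for each 8-bit channel.
// `a` is the src weight in 0..255; dst gets the complementary weight 255 - a.
inline std::uint8_t blend_two_weights(std::uint8_t src, std::uint8_t dst, std::uint32_t a)
{
    return static_cast<std::uint8_t>((src * a + dst * (255u - a)) >> 8);
}

// Difference form (assumed alternative): one multiply, but the intermediate
// (src - dst) is signed, which costs extra care in 16-bit SIMD lanes.
inline std::uint8_t blend_difference(std::uint8_t src, std::uint8_t dst, std::int32_t a)
{
    return static_cast<std::uint8_t>(dst + (src - dst) * a / 256);
}
```

For example, `blend_two_weights(200, 100, 128)` gives 149 and `blend_difference(200, 100, 128)` gives 150. Because the two weights sum to 255 and the result is shifted right by 8, the two-weight form tops out at 254 rather than 255.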

UINT64 neg64 = 0x00FF00FF00FF00FF;

// raw opcodes for prefetchnta [ecx+96] and prefetchnta [edx+96];
// prefetching three cache lines ahead is fast here
#define pfNTA_ECX __asm __emit 0x0f __asm __emit 0x18 __asm __emit 0x41 __asm __emit 0x60
#define pfNTA_EDX __asm __emit 0x0f __asm __emit 0x18 __asm __emit 0x42 __asm __emit 0x60

// bitmap widths must be a multiple of 2 or you lose a pixel on each row
error CrossFade(uint32 opac, Bitmap &result, Bitmap &src, Bitmap &dst)
{
	Rect area = result.Size();

	uint32 h = area.Height();
	uint32 w = area.Width();

	if (opac > 0) opac --;

	__asm {
		pxor mm7, mm7			// mm7 = 0, used to unpack bytes into words
		movd mm6, opac
		punpcklwd mm6, mm6
		punpckldq mm6, mm6		// mm6 = opac in all four words (src weight)
		movq mm5, mm6
		pxor mm5, neg64			// mm5 = 255 - opac in all four words (dst weight)
	}

	// 2-pixel loop
	w /= 2;

	for (uint32 y = 0; y < h; y++) {
		uint32 *p0 = result.Pixel(0, y);
		uint32 *s0 = src.Pixel(0, y);
		uint32 *s1 = dst.Pixel(0, y);

		__asm {
			mov eax, w
			mov ebx, p0

			mov ecx, s0
			mov edx, s1

pixelloop:
			// prefetch hints for both source rows
			pfNTA_ECX
			pfNTA_EDX

			movq mm0, [edx]			// two dst pixels
			movq mm1, [ecx]			// two src pixels
			movq mm2, mm0
			movq mm3, mm1
			add edx, 8

			punpcklbw mm0, mm7		// low pixel: bytes -> words
			punpcklbw mm1, mm7

			add ecx, 8
			punpckhbw mm2, mm7		// high pixel: bytes -> words
			punpckhbw mm3, mm7

			pmullw mm0, mm5			// dst * (255 - opac)
			pmullw mm1, mm6			// src * opac
			pmullw mm2, mm5
			pmullw mm3, mm6

			paddw mm0, mm1
			psrlw mm0, 8

			paddw mm2, mm3
			psrlw mm2, 8

			packuswb mm0, mm2		// words -> bytes, two pixels again
			movq [ebx], mm0

			add ebx, 8

			dec eax
			jg pixelloop
		}
	}

	__asm emms

	return success;
}
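If you want something to check the MMX path against, a plain scalar version is handy. The sketch below follows the same arithmetic as the loop above, channel by channel. It assumes 32-bit pixels with four 8-bit channels and an `opac` in the range 0..256, and it takes raw pointers because the Bitmap class isn't shown here. `crossfade_reference` is a name I made up.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference (not from the article): blends `count` packed
// 4x8-bit pixels using the same weights and shift as the MMX loop.
void crossfade_reference(std::uint32_t opac, std::uint32_t *result,
                         const std::uint32_t *src, const std::uint32_t *dst,
                         std::size_t count)
{
    if (opac > 0) opac--;              // same adjustment as CrossFade
    const std::uint32_t a = opac;      // src weight, assumed to be in 0..255 here
    const std::uint32_t b = 255u - a;  // dst weight (what the pxor with neg64 produces)

    for (std::size_t i = 0; i < count; i++) {
        std::uint32_t out = 0;
        for (int shift = 0; shift < 32; shift += 8) {
            const std::uint32_t s = (src[i] >> shift) & 0xFFu;
            const std::uint32_t d = (dst[i] >> shift) & 0xFFu;
            out |= (((s * a + d * b) >> 8) & 0xFFu) << shift;
        }
        result[i] = out;
    }
}
```

Since it works one pixel at a time, it has no even-width requirement, which also makes it a reasonable fallback for a leftover last pixel on odd-width rows.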