Document Title

Polynomial approximations to gamma

Michael Herf
April 5, 2000
back

Gamma correction is an important element of high-quality imaging. I'll assume you know the reasons.

I started with a goal: incorporate into high-speed MMX imaging loops the ability to do fast gamma approximations. The problem is that the standard LUT-based method (lookup per component) is actually relatively slow, especially if you're doing image processing in MMX.

I wrote a little solver to minimize error for quadratic and cubic polynomial approximations to the standard gamma formula. The result is that I found minimum error approximations to gamma = 2.2 subject to the constraint that f(0) = 0 and f(1) = 1. For quadratics, this means you have to minimize over one variable, and for cubics it means two.

Least-error polynomials

My results are these:

Notice that the scalar term is always 0 to satisfy f(0) = 0, 
also a + b = 1 for the quadratic and a + b + c = 1 for the cubic:

quadratic: ax^2 + bx = 0,        where a = -0.9192, b = 1.9192
cubic:     ax^3 + bx^2 + cx = 0, where a = 1.49, b = -3.23, and c = 2.74.

I measured error as the area between the curves and the correct gamma term. The areas here are 0.06 and 0.03, respectively. At gamma 2.2, the equivalent "error" for doing no correction is 0.375. So, these approximations aren't perfect, but they help a lot.

Speed

The next step was verifying that these methods are indeed fast on MMX. I implemented a fast LUT (doing 32-bit writes only), and measured its throughput on my p3/600. Then I implemented both of the above solutions using MMX.

LUT: ~20Mpixels/sec
Quadratic: ~62Mpixels/sec
Cubic: ~37MPixels/sec

This is actually generous, a best-case comparison for the LUT, because if you do image processing in MMX, there is extra overhead to move the bits to the integer side of the chip.

Now, here's some code to show how to implement the polynomial relatively quickly. There's lots of room for speedup here -- processing two pixels at a time, better pairing. Also, 2.2 gamma isn't that useful -- most people will want to do something more in-between. But the proof of concept is that we could have gamma blts in software without too much slowdown!

Notes: this code depends on Microsoft's UINT64 type, which is in windows.h -- also uses their inline assembly format. Also, my hack to keep 0 at 0 and 255 at 255 is really bad, but it's fast.


UINT64 almost = 0x00FF00FF00FF00FF;

                                        // gamma 2.2 constants for quadratic
UINT64 quada  = 0xFF8BFF8BFF8BFF8B;     // a = -0.9192 * 7b
UINT64 quadb  = 0x00F500F500F500F5;     // b =  1.9192 * 7b


                                        // gamma 2.2 constants for cubic
UINT64 cubea  = 0x005F005F005F005F;     // a =  1.49 * 6b
UINT64 cubeb  = 0xFF32FF32FF32FF32;     // b = -3.23 * 6b
UINT64 cubec  = 0x00AF00AF00AF00AF;     // b =  2.74 * 6b


void GammaQuadArray22(uint32 *p, uint32 c)
{
	__asm {
		pxor mm7, mm7
		mov  eax, p
		mov  ebx, c

		movq mm2, almost
		movq mm3, quada
		movq mm4, quadb

looper:
		movd mm0, [eax]
		punpcklbw mm0, mm7

		movq mm1, mm0

		pmullw mm0, mm0
		paddw mm0, mm2	// preserves 0 and 1 (=255)
		psrlw mm0, 8

		// mm0 has x^2, mm1 has x
		pmullw mm1, mm4
		pmullw mm0, mm3

		paddw mm0, mm1
		psrlw mm0, 7

		packuswb mm0, mm7
		movd [eax], mm0

		add eax, 4

		dec ebx
		jnz looper
	}

	__asm emms
}


void GammaCubeArray22(uint32 *p, uint32 c)
{
	__asm {
		pxor mm7, mm7
		mov eax, p
		mov ebx, c

		movq mm3, almost
		movq mm4, cubea
		movq mm5, cubeb
		movq mm6, cubec

looper:
		movd mm0, [eax]

		punpcklbw mm0, mm7
		movq mm1, mm0
		movq mm2, mm0

		pmullw mm1, mm1
		paddw mm1, mm3	// preserves 0 and 1 (=255)
		psrlw mm1, 8

		pmullw mm0, mm1
		paddw mm0, mm3
		psrlw mm0, 8

		// mm0 has x^3, mm1 has x^2, mm2 has x
		pmullw mm0, mm4;
		pmullw mm1, mm5;
		pmullw mm2, mm6;

		paddw mm0, mm1
		paddw mm0, mm2

		psrlw mm0, 6

		packuswb mm0, mm7
		movd [eax], mm0

		add eax, 4
		dec ebx
		jnz looper
	}

	__asm emms
}