### Polynomial approximations to gamma

Michael Herf
April 5, 2000

back

Gamma correction is an important element of high-quality imaging. I'll assume you know the reasons.

I started with a goal: incorporate into high-speed MMX imaging loops the ability to do fast gamma approximations. The problem is that the standard LUT-based method (lookup per component) is actually relatively slow, especially if you're doing image processing in MMX.

I wrote a little solver to minimize error for quadratic and cubic polynomial approximations to the standard gamma formula. The result is that I found minimum error approximations to gamma = 2.2 subject to the constraint that f(0) = 0 and f(1) = 1. For quadratics, this means you have to minimize over one variable, and for cubics it means two.

Least-error polynomials

My results are these:

```Notice that the scalar term is always 0 to satisfy f(0) = 0,
also a + b = 1 for the quadratic and a + b + c = 1 for the cubic:

quadratic: ax^2 + bx = 0,        where a = -0.9192, b = 1.9192
cubic:     ax^3 + bx^2 + cx = 0, where a = 1.49, b = -3.23, and c = 2.74.
```
I measured error as the area between the curves and the correct gamma term. The areas here are 0.06 and 0.03, respectively. At gamma 2.2, the equivalent "error" for doing no correction is 0.375. So, these approximations aren't perfect, but they help a lot.

Speed

The next step was verifying that these methods are indeed fast on MMX. I implemented a fast LUT (doing 32-bit writes only), and measured its throughput on my p3/600. Then I implemented both of the above solutions using MMX.

1. LUT: ~20Mpixels/sec
3. Cubic: ~37MPixels/sec
This is actually generous, a best-case comparison for the LUT, because if you do image processing in MMX, there is extra overhead to move the bits to the integer side of the chip.

Now, here's some code to show how to implement the polynomial relatively quickly. There's lots of room for speedup here -- processing two pixels at a time, better pairing. Also, 2.2 gamma isn't that useful -- most people will want to do something more in-between. But the proof of concept is that we could have gamma blts in software without too much slowdown!

Notes: this code depends on Microsoft's UINT64 type, which is in windows.h -- also uses their inline assembly format. Also, my hack to keep 0 at 0 and 255 at 255 is really bad, but it's fast.

```
UINT64 almost = 0x00FF00FF00FF00FF;

// gamma 2.2 constants for quadratic
UINT64 quada  = 0xFF8BFF8BFF8BFF8B;     // a = -0.9192 * 7b
UINT64 quadb  = 0x00F500F500F500F5;     // b =  1.9192 * 7b

// gamma 2.2 constants for cubic
UINT64 cubea  = 0x005F005F005F005F;     // a =  1.49 * 6b
UINT64 cubeb  = 0xFF32FF32FF32FF32;     // b = -3.23 * 6b
UINT64 cubec  = 0x00AF00AF00AF00AF;     // b =  2.74 * 6b

{
__asm {
pxor mm7, mm7
mov  eax, p
mov  ebx, c

movq mm2, almost

looper:
movd mm0, [eax]
punpcklbw mm0, mm7

movq mm1, mm0

pmullw mm0, mm0
paddw mm0, mm2	// preserves 0 and 1 (=255)
psrlw mm0, 8

// mm0 has x^2, mm1 has x
pmullw mm1, mm4
pmullw mm0, mm3

psrlw mm0, 7

packuswb mm0, mm7
movd [eax], mm0

dec ebx
jnz looper
}

__asm emms
}

void GammaCubeArray22(uint32 *p, uint32 c)
{
__asm {
pxor mm7, mm7
mov eax, p
mov ebx, c

movq mm3, almost
movq mm4, cubea
movq mm5, cubeb
movq mm6, cubec

looper:
movd mm0, [eax]

punpcklbw mm0, mm7
movq mm1, mm0
movq mm2, mm0

pmullw mm1, mm1
paddw mm1, mm3	// preserves 0 and 1 (=255)
psrlw mm1, 8

pmullw mm0, mm1
psrlw mm0, 8

// mm0 has x^3, mm1 has x^2, mm2 has x
pmullw mm0, mm4;
pmullw mm1, mm5;
pmullw mm2, mm6;

psrlw mm0, 6

packuswb mm0, mm7
movd [eax], mm0