Gamma correction is an important element of high-quality imaging. I'll assume you know the reasons.
I started with a goal: incorporate into high-speed MMX imaging loops the ability to do fast gamma approximations. The problem is that the standard LUT-based method (lookup per component) is actually relatively slow, especially if you're doing image processing in MMX.
I wrote a little solver to minimize error for quadratic and cubic polynomial approximations to the standard gamma formula. The result is that I found minimum error approximations to gamma = 2.2 subject to the constraint that f(0) = 0 and f(1) = 1. For quadratics, this means you have to minimize over one variable, and for cubics it means two.
Least-error polynomials
My results are these:
Notice that the scalar term is always 0 to satisfy f(0) = 0, also a + b = 1 for the quadratic and a + b + c = 1 for the cubic: quadratic: ax^2 + bx = 0, where a = -0.9192, b = 1.9192 cubic: ax^3 + bx^2 + cx = 0, where a = 1.49, b = -3.23, and c = 2.74.I measured error as the area between the curves and the correct gamma term. The areas here are 0.06 and 0.03, respectively. At gamma 2.2, the equivalent "error" for doing no correction is 0.375. So, these approximations aren't perfect, but they help a lot.
Speed
The next step was verifying that these methods are indeed fast on MMX. I implemented a fast LUT (doing 32-bit writes only), and measured its throughput on my p3/600. Then I implemented both of the above solutions using MMX.
Now, here's some code to show how to implement the polynomial relatively quickly. There's lots of room for speedup here -- processing two pixels at a time, better pairing. Also, 2.2 gamma isn't that useful -- most people will want to do something more in-between. But the proof of concept is that we could have gamma blts in software without too much slowdown!
Notes: this code depends on Microsoft's UINT64 type, which is in windows.h -- also uses their inline assembly format. Also, my hack to keep 0 at 0 and 255 at 255 is really bad, but it's fast.
UINT64 almost = 0x00FF00FF00FF00FF; // gamma 2.2 constants for quadratic UINT64 quada = 0xFF8BFF8BFF8BFF8B; // a = -0.9192 * 7b UINT64 quadb = 0x00F500F500F500F5; // b = 1.9192 * 7b // gamma 2.2 constants for cubic UINT64 cubea = 0x005F005F005F005F; // a = 1.49 * 6b UINT64 cubeb = 0xFF32FF32FF32FF32; // b = -3.23 * 6b UINT64 cubec = 0x00AF00AF00AF00AF; // b = 2.74 * 6b void GammaQuadArray22(uint32 *p, uint32 c) { __asm { pxor mm7, mm7 mov eax, p mov ebx, c movq mm2, almost movq mm3, quada movq mm4, quadb looper: movd mm0, [eax] punpcklbw mm0, mm7 movq mm1, mm0 pmullw mm0, mm0 paddw mm0, mm2 // preserves 0 and 1 (=255) psrlw mm0, 8 // mm0 has x^2, mm1 has x pmullw mm1, mm4 pmullw mm0, mm3 paddw mm0, mm1 psrlw mm0, 7 packuswb mm0, mm7 movd [eax], mm0 add eax, 4 dec ebx jnz looper } __asm emms } void GammaCubeArray22(uint32 *p, uint32 c) { __asm { pxor mm7, mm7 mov eax, p mov ebx, c movq mm3, almost movq mm4, cubea movq mm5, cubeb movq mm6, cubec looper: movd mm0, [eax] punpcklbw mm0, mm7 movq mm1, mm0 movq mm2, mm0 pmullw mm1, mm1 paddw mm1, mm3 // preserves 0 and 1 (=255) psrlw mm1, 8 pmullw mm0, mm1 paddw mm0, mm3 psrlw mm0, 8 // mm0 has x^3, mm1 has x^2, mm2 has x pmullw mm0, mm4; pmullw mm1, mm5; pmullw mm2, mm6; paddw mm0, mm1 paddw mm0, mm2 psrlw mm0, 6 packuswb mm0, mm7 movd [eax], mm0 add eax, 4 dec ebx jnz looper } __asm emms }