I posted to Usenet a description of a depth-of-field implementation I had tried a couple of years ago. Here's the text.
[Repost from a few months ago below.]
I do know from my experiments that the "Dark Halos" Mark describes are generally avoidable by a simple normalization step (if you keep coverage information around).
My implementation draws discs front-to-back with a saturation alpha of 1/A (where A is the area of the circle-of-confusion disc), followed by a post-normalization step based on accumulated coverage. I believe the occlusion function I use works about as well as can be done from a pinhole camera. Obviously, highlights in the background will pop in and out, since their occlusion is determined pointwise.
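The scheme above might look roughly like this at a single pixel (a minimal sketch under my own assumptions; `Disc`, `composite_discs`, and the exact clamping are illustrative choices, not the original code):

```python
import math
from dataclasses import dataclass

@dataclass
class Disc:
    color: tuple      # (r, g, b)
    radius: float     # circle-of-confusion radius in pixels

def composite_discs(discs_front_to_back):
    """Accumulate discs front-to-back with per-disc alpha = 1/A,
    then normalize by accumulated coverage to avoid dark halos."""
    color = [0.0, 0.0, 0.0]
    coverage = 0.0  # total alpha accumulated at this pixel so far
    for d in discs_front_to_back:
        area = math.pi * d.radius ** 2
        a = 1.0 / max(area, 1.0)       # saturation alpha: clamps to 1 for sharp discs
        weight = a * (1.0 - coverage)  # only the still-unoccluded fraction contributes
        for c in range(3):
            color[c] += weight * d.color[c]
        coverage += weight
        if coverage >= 1.0:
            break  # fully occluded; nothing behind can show through
    # Post-normalization: rescale partially covered pixels so they
    # don't come out too dark (the "dark halo" fix).
    if 0.0 < coverage < 1.0:
        color = [c / coverage for c in color]
    return tuple(color), coverage
```

A sharp disc (area under one pixel) saturates to full alpha and occludes everything behind it; a blurry disc contributes only a small weight, and the final normalization keeps its color from darkening.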
I think that for stills, these techniques are extremely viable. For animation, or if you're really concerned about quality, you want something that computes true visibility from a non-pinhole camera.
I've played with this one a bit.
Potmesil published an article in the early 1980s that did something like your second suggestion. In my experience his general theory was sound, but there were one or two points in the paper that involved some voodoo. More on this in a second...
First of all, a pinhole camera cannot adequately capture the information needed for correct depth-of-field simulation. This is why the multi-sample lens methods are inherently better than any image-based solution (though much more expensive). Consider an object that can be seen only by the "edge" of the lens, not by the center, and you'll see the problem. However, there are better and worse ways to proceed with single image+Z methods.
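For contrast, a multi-sample lens method jitters each ray's origin across the aperture, so objects visible only from the lens edge do get sampled. Here's a minimal sketch of one such camera ray (the simplified lens-at-origin camera model, the name `lens_ray`, and its parameters are my assumptions for illustration, not any particular renderer's API):

```python
import math
import random

def lens_ray(pixel_point, lens_radius, focal_distance, rng=random):
    """Return (origin, direction) for one ray through a random point on
    a thin-lens aperture at z=0, refocused so that geometry at
    focal_distance stays sharp."""
    # Uniformly sample a point on the lens disc.
    r = lens_radius * math.sqrt(rng.random())
    theta = 2.0 * math.pi * rng.random()
    lx, ly = r * math.cos(theta), r * math.sin(theta)
    # Point on the focal plane that the corresponding pinhole ray would hit.
    fx = pixel_point[0] * focal_distance
    fy = pixel_point[1] * focal_distance
    origin = (lx, ly, 0.0)
    direction = (fx - lx, fy - ly, focal_distance)
    return origin, direction
```

With `lens_radius = 0` this degenerates to the pinhole camera; averaging many such rays per pixel gives the reference depth-of-field result that image-based methods approximate.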
The problem with blurring is that it is not correct, again because of occlusion. Consider the following scene: blurry foreground, sharp midground, blurry background.
We can say the following about the result when viewed through a camera: the blurred information in the foreground will partially occlude the sharp midground, but the blurry background will not occlude the midground. The problem with a simple (or better-than-simple) 2-D blur is that it doesn't respect these occlusion relationships. It will not allow the foreground to spill over the midground, for example. Basically, the blur assumption gets things backwards -- we are not dealing with a convolution. Rather, every pixel has its own circle of confusion, and (as a pixel) you should include another pixel in your filter only if that pixel's circle overlaps you.
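That overlap rule -- gather a neighbor only if *its* circle of confusion reaches you -- can be sketched like this (function names are illustrative, and the uniform averaging is a simplification; a real filter would also weight by 1/A and handle depth ordering):

```python
import math

def contributes(p, q, coc_radius_q):
    """True if source pixel q's circle of confusion reaches pixel p."""
    return math.hypot(p[0] - q[0], p[1] - q[1]) <= coc_radius_q

def blur_pixel(p, pixels, colors, coc_radii):
    """Gather loop with scatter semantics: average only those source
    pixels whose own circles of confusion overlap p -- NOT those that
    fall inside p's own circle."""
    total = [0.0, 0.0, 0.0]
    count = 0
    for q, col, r in zip(pixels, colors, coc_radii):
        if contributes(p, q, r):
            for c in range(3):
                total[c] += col[c]
            count += 1
    if count == 0:
        return (0.0, 0.0, 0.0)
    return tuple(t / count for t in total)
```

Note the asymmetry: a blurry neighbor spills onto a sharp pixel, but a sharp neighbor (radius near zero) never spills outward, which is exactly what a plain convolution gets wrong.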
What you really want to do is (as Potmesil recommends) render each pixel as a disc and then piece the image back together. The problem with his approach is that he uses a weighted average, rather than hard occlusion, to combine the results. For best output, he depends on the user to be able to segment the scene into separate objects that occlude one another. For many applications, this is hard to do (his sample scenes have several distinct objects, not continuous objects from foreground to background).
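The difference between the two combination rules can be shown at a single pixel, with each disc reduced to a (color, alpha) pair given front-to-back (a toy sketch; the function names and scalar color are mine, not Potmesil's formulation):

```python
def combine_weighted_average(discs):
    """Weighted-average combination (roughly Potmesil's approach):
    alpha-weighted mean, which ignores occlusion order."""
    wsum = sum(a for _, a in discs)
    if wsum == 0.0:
        return 0.0
    return sum(c * a for c, a in discs) / wsum

def combine_front_to_back(discs):
    """Hard occlusion: a nearer disc blocks whatever lies behind it."""
    color, transmittance = 0.0, 1.0
    for c, a in discs:
        color += transmittance * a * c
        transmittance *= (1.0 - a)
    return color
```

For an opaque white disc in front of an opaque black one, front-to-back compositing correctly returns pure white, while the weighted average lets the hidden black disc bleed through as gray.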
I'll challenge you to investigate finding an occlusion function that allows you to efficiently combine the discs front-to-back, such that the relationships I've described above are maintained.