Logfile merging

Michael Herf
November, 2003

We were having some trouble tracking our load-balanced logfiles, and I thought there was a simple way to fix that.

The problem arises when you host a website on several machines at once. You end up with n logfiles that overlap in time, and you want to merge them efficiently, that is, without running a general-purpose sort such as GNU sort, which is slow and can't parse logfile dates. In our case, we have many GB of logs a day, and it's very nice to be able to merge at about disk bandwidth with a low memory footprint.

Files this large also need 64-bit I/O, which means figuring out the right command-line flags. They turn out to be quite simple. To compile the dmerge program with g++, I do this:

g++ -O2 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -o dmerge dmerge.C

And then I seemed to have to use fopen64 instead of fopen, as well. There's a lot of silly code to parse dates, which I'm sure is in a standard library somewhere, but I didn't look.
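
To give a feel for what that date code has to do, here is a minimal sketch. It is not the code from dmerge: it assumes the logs carry Apache Common Log Format timestamps such as [10/Oct/2000:13:55:36 -0700], which this page doesn't actually state, and the names days_from_civil and parse_clf_time are made up for the example. It converts the timestamp to UTC seconds since 1970 by hand, so it doesn't depend on the machine's timezone settings.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Days since 1970-01-01 for a Gregorian date (month is 1..12).
// This is the well-known "days from civil" calculation.
long long days_from_civil(long long y, int m, int d) {
    y -= m <= 2;  // treat Jan/Feb as the end of the previous year
    const long long era = (y >= 0 ? y : y - 399) / 400;
    const long long yoe = y - era * 400;                                   // 0..399
    const long long doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;  // 0..365
    const long long doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;           // 0..146096
    return era * 146097 + doe - 719468;
}

// Hypothetical helper: find the first "[dd/Mon/yyyy:hh:mm:ss +zzzz]" field
// in `line` (ASSUMED Common Log Format) and store UTC seconds since the
// epoch in *out. Returns false if no such field can be parsed.
bool parse_clf_time(const char* line, long long* out) {
    assert(line != nullptr && out != nullptr);
    const char* p = std::strchr(line, '[');
    if (!p) return false;

    int day, year, hh, mm, ss, tzh, tzm;
    char mon[4] = {0};
    char sign = 0;
    if (std::sscanf(p, "[%2d/%3[A-Za-z]/%4d:%2d:%2d:%2d %c%2d%2d]",
                    &day, mon, &year, &hh, &mm, &ss, &sign, &tzh, &tzm) != 9)
        return false;
    if (sign != '+' && sign != '-') return false;

    // Look the month name up in a packed table of three-letter names.
    static const char names[] = "JanFebMarAprMayJunJulAugSepOctNovDec";
    const char* hit = std::strstr(names, mon);
    if (std::strlen(mon) != 3 || !hit || (hit - names) % 3 != 0) return false;
    const int month = static_cast<int>(hit - names) / 3 + 1;

    // Seconds as written in the log (local to the logged zone)...
    const long long local =
        days_from_civil(year, month, day) * 86400LL + hh * 3600 + mm * 60 + ss;
    // ...minus the zone offset gives UTC.
    long long offset = tzh * 3600 + tzm * 60;
    if (sign == '-') offset = -offset;
    *out = local - offset;
    return true;
}
```

A number like this is all a merge needs from each line: something cheap to compare across files.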

To use dmerge, you give it several files on the command line, each of which should already be sorted internally, and it writes the merged, sorted output to stdout. So you can do

dmerge logs* > merged

and it will do the right thing (if you don't run out of file handles).
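
If you just want the shape of the idea, here is a minimal sketch of an n-way merge. It is not the dmerge source: merge_sorted, merge_texts, Pending, Later and the key_of callback are all names invented for this illustration, and it reads from C++ streams rather than fopen64 handles. It keeps one pending line per input in a min-heap, so memory grows with the number of files rather than their size.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <ostream>
#include <queue>
#include <sstream>
#include <string>
#include <vector>

// One pending line from one input, tagged with its sort key.
struct Pending {
    long long key;       // e.g. seconds since the epoch, parsed from the line
    std::size_t source;  // index of the input the line came from
    std::string line;
};

// Heap ordering: the smallest key ends up on top. Ties go to the lower
// input index, so the output is deterministic.
struct Later {
    bool operator()(const Pending& a, const Pending& b) const {
        if (a.key != b.key) return a.key > b.key;
        return a.source > b.source;
    }
};

// Hypothetical helper: merge inputs that are each already sorted by key
// into `out`. `key_of` maps a line to its sort key.
void merge_sorted(const std::vector<std::istream*>& inputs,
                  std::ostream& out,
                  const std::function<long long(const std::string&)>& key_of) {
    std::priority_queue<Pending, std::vector<Pending>, Later> heap;

    // Prime the heap with the first line of every input.
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        assert(inputs[i] != nullptr);
        std::string line;
        if (std::getline(*inputs[i], line))
            heap.push(Pending{key_of(line), i, line});
    }

    // Repeatedly emit the earliest line, then refill from the same input.
    while (!heap.empty()) {
        Pending next = heap.top();
        heap.pop();
        out << next.line << '\n';
        std::string line;
        if (std::getline(*inputs[next.source], line))
            heap.push(Pending{key_of(line), next.source, line});
    }
}

// Demo wrapper (also hypothetical): merge in-memory texts whose lines
// start with an integer key, and return the merged text.
std::string merge_texts(const std::vector<std::string>& texts) {
    std::vector<std::istringstream> streams;
    for (const std::string& t : texts) streams.emplace_back(t);
    std::vector<std::istream*> inputs;
    for (std::istringstream& s : streams) inputs.push_back(&s);

    std::ostringstream out;
    merge_sorted(inputs, out,
                 [](const std::string& line) { return std::stoll(line); });
    return out.str();
}
```

With real logs, key_of would be whatever turns a line's timestamp into a number, and each input would be a std::ifstream opened on one logfile. Every input stays open for the whole run, which is where the file-handle limit comes from.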

Code is here: github/herf/dmerge