Wednesday, March 30, 2005

Perl/Apache: Parsing Apache HTTPD Logs with Perl Patterns

My current finite but unbounded project is a facility for merging Web server (and other) logs from independent servers in a "server farm". The servers in question need not be at the same location nor on the same network.

I'll be rolling this project out as I transition the Fourmilab server farm from primary/backup mode to "all servers are peers" sometime in April or May. At that point, incoming requests from the Web will be routed to whichever server is available and reports the lowest instantaneous load. Averaged over a day, each server will process about 1/n of the total requests where n is the total number of servers (assuming they all have the same capacity).

In this distributed processing environment each server keeps its own access log, so monitoring site-wide statistics requires consolidating or aggregating logs from all of the servers in the farm into a single log. This must, of course, cope with logs on individual servers being cycled (rotated), malformed entries in logs, and all the slings and arrows of outrageous packets that a production site on the Internet is prone to.
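
As a rough illustration of the consolidation step, the sketch below merges lines from two servers into one time-ordered list. It is a minimal example under stated assumptions, not the facility described here: the names log_time and merge_logs and the sample lines are invented for the purpose, everything is sorted in memory rather than streamed, and log rotation is not addressed. Timestamps are converted to UTC so that servers in different time zones interleave correctly, and lines with no usable timestamp are skipped.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my %month_number = (
    Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
    Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11,
);

# Hypothetical helper: return the time of a log line in seconds since
# the epoch (UTC), or undef if no well-formed
# "[dd/Mon/yyyy:hh:mm:ss +zzzz]" stamp is found.
sub log_time {
    my ($line) = @_;
    my ($day, $mon, $year, $hour, $min, $sec, $sign, $zh, $zm) = $line =~
        m!\[(\d{2})/([A-Z][a-z]{2})/(\d{4}):(\d{2}):(\d{2}):(\d{2})\s([+-])(\d{2})(\d{2})\]!
        or return;
    return unless exists $month_number{$mon};
    # timegm() dies on impossible dates; trap that and treat the line as malformed.
    my $utc = eval { timegm($sec, $min, $hour, $day, $month_number{$mon}, $year) };
    return unless defined $utc;
    my $offset = ($zh * 60 + $zm) * 60;
    # A "+0200" stamp is two hours ahead of UTC, so subtract the offset.
    return $sign eq '+' ? $utc - $offset : $utc + $offset;
}

# Hypothetical helper: take references to arrays of log lines (one array
# per server) and return all parseable lines ordered by time.
sub merge_logs {
    my @logs = @_;
    my @stamped;
    for my $log (@logs) {
        for my $line (@$log) {
            my $t = log_time($line);
            next unless defined $t;          # skip malformed entries
            push @stamped, [$t, $line];
        }
    }
    return map { $_->[1] } sort { $a->[0] <=> $b->[0] } @stamped;
}

# Invented sample data from two servers in different time zones.
my @server_a = (
    '192.0.2.1 - - [30/Mar/2005:00:16:01 +0200] "GET /a HTTP/1.0" 200 100',
    'garbage with no timestamp',
    '192.0.2.1 - - [30/Mar/2005:00:16:05 +0200] "GET /c HTTP/1.0" 200 300',
);
my @server_b = (
    '192.0.2.2 - - [29/Mar/2005:22:16:03 +0000] "GET /b HTTP/1.0" 200 200',
);

print "$_\n" for merge_logs(\@server_a, \@server_b);
```

For logs of production size a streaming merge, which reads one line at a time from each server's log, would avoid holding everything in memory; the in-memory sort is used here only to keep the sketch short.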

The foundation of any project like this is parsing the individual items in the log files. You may find anything in a log--the parser must be robust in the face of whatever it encounters. Here is my current cut of a parsing pattern for Apache HTTPD logs in "Combined" format; for "Common" format, delete the last two $pat_quoted_field items along with the \s+ which follows the length field, since a Common format entry ends at the length.

    my $pat_ip_address = qr/(\d{1,3} \.
        \d{1,3} \.
        \d{1,3} \.
        \d{1,3})/x;
    my $pat_quoted_field = qr/"((?:(?:(?:(?:    # It can be...
        [^"\\])* |  # ...zero or more characters not quote or backslash...
        (?:\\x[0-9a-fA-F][0-9a-fA-F])* | # ...a backslash quoted hexadecimal character...
        (?:\\.)*                         # ...or a backslash escape.
       ))*))"/x;
    my $parse_combined = qr/^       # Start at the beginning
         $pat_ip_address \s+        # IP address
         (\S+) \s+                  # Ident
         (\S+) \s+                  # Userid
         \[([^\]]*)\] \s+           # Date and time
         $pat_quoted_field \s+      # Request
         (\d+) \s+                  # Status
         (\-|[\d]+) \s+             # Length of reply or "-"
         $pat_quoted_field \s+      # Referer
         $pat_quoted_field          # User agent
         $                          # End at the end
       /x;

There are a number of subtleties in this pattern definition. The repeats (the * quantifiers) on the individual alternatives within the $pat_quoted_field pattern are essential: without them, a denial of service attack which your Web server has already fended off may still bring your log analyser to its knees by crashing Perl. Perl isn't supposed to crash--if you manage to run into a configuration or system resource limit, it should die cleanly without the possibility of creating a stack overflow exploit or denial of service attack. I've reported this problem to the perlbug list, and I shall not disclose it further here.

Still, I claim the patterns above are correct for processing all well-formed Apache HTTPD log items in "Combined" format. If you have counter-examples, the Feedback button is ready to put them onto my plate.
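
A quick way to exercise these patterns is to run them over a few sample lines and look at the captured fields. The following is a minimal sketch rather than part of the merging facility: the parse_line function, the field names, and the sample log lines (which use addresses reserved for documentation) are all invented for illustration. The backslash escape alternative is written here as (?:\\.)* so that each escape consumes exactly one character; treat that as an assumption about the intended form.

```perl
use strict;
use warnings;

my $pat_ip_address = qr/(\d{1,3} \.
    \d{1,3} \.
    \d{1,3} \.
    \d{1,3})/x;
my $pat_quoted_field = qr/"((?:(?:(?:(?:    # It can be...
    [^"\\])* |  # ...zero or more characters not quote or backslash...
    (?:\\x[0-9a-fA-F][0-9a-fA-F])* | # ...backslash quoted hexadecimal characters...
    (?:\\.)*                         # ...or backslash escapes (assumed form).
   ))*))"/x;
my $parse_combined = qr/^       # Start at the beginning
     $pat_ip_address \s+        # IP address
     (\S+) \s+                  # Ident
     (\S+) \s+                  # Userid
     \[([^\]]*)\] \s+           # Date and time
     $pat_quoted_field \s+      # Request
     (\d+) \s+                  # Status
     (\-|[\d]+) \s+             # Length of reply or "-"
     $pat_quoted_field \s+      # Referer
     $pat_quoted_field          # User agent
     $                          # End at the end
   /x;

# Hypothetical field names, in the order the pattern captures them.
my @field_names = qw(ip ident userid time request status length referer agent);

# Hypothetical helper: parse one log line and return a reference to a
# hash of fields, or undef if the line does not match the pattern.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    my @values = $line =~ $parse_combined or return;
    my %entry;
    @entry{@field_names} = @values;
    return \%entry;
}

# Invented sample lines: a plain entry, one with escaped quotes, and junk.
my @sample = (
    '192.0.2.10 - frank [30/Mar/2005:00:16:01 +0200] "GET /index.html HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"',
    '192.0.2.11 - - [30/Mar/2005:00:16:02 +0200] "GET /a\"b HTTP/1.0" 404 - "-" "Agent \x22quoted\x22"',
    'this is not a log entry',
);

for my $line (@sample) {
    my $entry = parse_line($line);
    if (defined $entry) {
        print "$entry->{ip} $entry->{status} $entry->{request}\n";
    }
    else {
        print "unparseable: $line\n";
    }
}
```

A line which fails to match comes back as undef, so a caller can count or set aside malformed entries rather than aborting the whole run.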

Update: I've corrected a typo in the $pat_quoted_field pattern definition. The component for backslash escaped hexadecimal digits erroneously allowed repeats of the last digit, not the entire sub-pattern as intended. (2005-03-30 22:25)
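
The difference is easy to see in isolation. In the minimal sketch below, $hex_typo is a presumed reconstruction of the erroneous component (the original text of the typo is not shown here, so this is an assumption) and $hex_fixed is the corrected component; both are anchored so that they must match the whole test string. The variable names and test strings are invented for illustration.

```perl
use strict;
use warnings;

# Presumed form of the typo: the "*" binds only to the second hex digit.
my $hex_typo  = qr/^ \\x[0-9a-fA-F][0-9a-fA-F]* $/x;

# Corrected form: the "*" repeats the entire backslash-x-digit-digit escape.
my $hex_fixed = qr/^ (?:\\x[0-9a-fA-F][0-9a-fA-F])* $/x;

# Single-quoted strings keep the backslashes literal, as they appear in a log.
for my $text ('\x41', '\x41\x42', '\x4', '\x41414') {
    printf "%-10s typo: %-3s fixed: %s\n", $text,
        ($text =~ $hex_typo  ? 'yes' : 'no'),
        ($text =~ $hex_fixed ? 'yes' : 'no');
}
```

The presumed typo accepts a single digit or a long run of digits after \x but rejects two escapes in a row, while the corrected form accepts any number of complete two-digit escapes and nothing else.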

Posted at March 30, 2005 00:16