Complete Source Code: Putting It All Together

This document discusses using regular expressions to analyze httpd log files and summarize the results. It shows how to use regexes to split log lines into fields, extract the URL, file type, hour of the request, and count requests by each of those fields. A report_section subroutine is defined to output the results in a consistent format with a header and sorted contents.


Putting it all together

Regular expressions have many practical uses. We'll look at an httpd log analyzer as an example. In our last article, one of the play-around items was to write a simple log analyzer. Now, let's make it a bit more interesting: a log analyzer that will break down your log results by file type and give you a list of total requests by hour. (Complete source code.) First, let's look at a sample line from an httpd log:
127.12.20.59 - - [01/Nov/2000:00:00:37 -0500] "GET /gfx2/page/home.gif HTTP/1.1" 200 2285

The first thing we want to do is split this into fields. Remember that the split() function takes a regular expression as its first argument. We'll use /\s/ to split the line at each whitespace character:
@fields = split(/\s/, $line);
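As a quick sketch of what this produces, here is the split applied to the sample line above (field indices are zero-based):

```perl
use strict;
use warnings;

my $line = '127.12.20.59 - - [01/Nov/2000:00:00:37 -0500] '
         . '"GET /gfx2/page/home.gif HTTP/1.1" 200 2285';
my @fields = split(/\s/, $line);

print scalar(@fields), "\n";   # 10
print $fields[3], "\n";        # [01/Nov/2000:00:00:37
print $fields[6], "\n";        # /gfx2/page/home.gif
print $fields[8], "\n";        # 200
print $fields[9], "\n";        # 2285
```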

This gives us 10 fields. The ones we're concerned with are the fourth field ($fields[3], the time and date of the request), the seventh ($fields[6], the URL), and the ninth and tenth ($fields[8] and $fields[9], the HTTP status code and the size in bytes of the server response). First, we'd like to make sure that we turn any request for a URL that ends in a slash (like /about/) into a request for the index page from that directory (/about/index.html). We'll need to escape out the slashes so that Perl doesn't mistake them for terminators in our s/// statement.
$fields[6] =~ s/\/$/\/index.html/;

This line is difficult to read, because anytime we come across a literal slash character we need to escape it out. This problem is so common, it has acquired a name: leaning-toothpick syndrome. Here's a useful trick for avoiding the leaning-toothpick syndrome: You can replace the slashes that mark regular expressions and s/// statements with any other matching pair of characters, like { and }. This allows us to write a more legible regex where we don't need to escape out the slashes:
$fields[6] =~ s{/$}{/index.html};

(If you want to use this syntax with a matching expression, you'll need to put an m in front of it: /foo/ would be rewritten as m{foo}.) Now, we'll assume that any URL request that returns a status code of 200 (request OK) is a request for the file type of the URL's extension (a request for /gfx/page/home.gif returns a GIF image). Any URL request without an extension returns a plain-text file. Remember that the period is a metacharacter, so we need to escape it out!
if ($fields[8] eq '200') {
    if ($fields[6] =~ /\.([a-z]+)$/i) {
        $type_requests{$1}++;
    } else {
        $type_requests{'txt'}++;
    }
}
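To see the file-type counting in action, here's a minimal self-contained sketch. The @requests status/URL pairs are made up for illustration; %type_requests plays the same role as above:

```perl
use strict;
use warnings;

my %type_requests;

# Hypothetical (status, URL) pairs for illustration
my @requests = (
    ['200', '/gfx2/page/home.gif'],
    ['200', '/about/index.html'],
    ['404', '/missing.png'],          # non-200: not counted
    ['200', '/cgi-bin/somescript'],   # no extension: counted as plain text
);

for my $r (@requests) {
    my ($status, $url) = @$r;
    if ($status eq '200') {
        if ($url =~ /\.([a-z]+)$/i) {
            $type_requests{$1}++;
        } else {
            $type_requests{'txt'}++;
        }
    }
}

print "$_: $type_requests{$_}\n" for sort keys %type_requests;
# gif: 1
# html: 1
# txt: 1
```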

Next, we want to retrieve the hour each request took place. The hour is the first string in $fields[3] that will be two digits surrounded by colons, so all we need to do is look for that. Remember that Perl will stop when it finds the first match in a string:
# Log the hour of this request
$fields[3] =~ /:(\d{2}):/;
$hour_requests{$1}++;
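Here's a quick sketch of that extraction on its own; $stamp is just the value $fields[3] holds for the sample line above:

```perl
use strict;
use warnings;

my %hour_requests;
my $stamp = '[01/Nov/2000:00:00:37';   # $fields[3] from the sample line

# Perl stops at the first match, so $1 is the hour, not the minutes
if ($stamp =~ /:(\d{2}):/) {
    $hour_requests{$1}++;
}

print join(', ', map { "$_ => $hour_requests{$_}" } sort keys %hour_requests), "\n";
# 00 => 1
```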

Finally, let's rewrite our original report() sub. We're doing the same thing over and over (printing a section header and the contents of that section), so we'll break that out into a new sub. We'll call the new sub report_section():
sub report {
    print "Total bytes requested: ", $bytes, "\n";
    print "\n";
    report_section("URL requests:", %url_requests);
    report_section("Status code results:", %status_requests);
    report_section("Requests by hour:", %hour_requests);
    report_section("Requests by file type:", %type_requests);
}

The new report_section() sub is very simple:


sub report_section {
    my ($header, %type) = @_;
    print $header, "\n";
    for $i (sort keys %type) {
        print $i, ": ", $type{$i}, "\n";
    }
    print "\n";
}

We use the keys function to return a list of the keys in the %type hash, and the sort function to put it in alphabetic order. We'll play with sort a bit more in the next article.
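Putting all of these pieces together, the whole analyzer might look something like the sketch below. The article doesn't show how %url_requests, %status_requests, and $bytes are accumulated, so those lines (and reading from an inlined @log array rather than a log file) are assumptions for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my (%url_requests, %status_requests, %hour_requests, %type_requests);
my $bytes = 0;

# In real use these lines would come from an httpd log file; they're
# inlined here so the sketch is self-contained.
my @log = (
    '127.12.20.59 - - [01/Nov/2000:00:00:37 -0500] "GET /gfx2/page/home.gif HTTP/1.1" 200 2285',
    '127.12.20.59 - - [01/Nov/2000:13:01:12 -0500] "GET /about/ HTTP/1.1" 200 1342',
);

for my $line (@log) {
    my @fields = split(/\s/, $line);

    # Turn a bare directory request into a request for its index page
    $fields[6] =~ s{/$}{/index.html};

    $url_requests{$fields[6]}++;
    $status_requests{$fields[8]}++;
    $bytes += $fields[9];

    # Count successful requests by file extension; no extension means text
    if ($fields[8] eq '200') {
        if ($fields[6] =~ /\.([a-z]+)$/i) {
            $type_requests{$1}++;
        } else {
            $type_requests{'txt'}++;
        }
    }

    # Log the hour of this request
    $hour_requests{$1}++ if $fields[3] =~ /:(\d{2}):/;
}

report();

sub report {
    print "Total bytes requested: ", $bytes, "\n";
    print "\n";
    report_section("URL requests:",          %url_requests);
    report_section("Status code results:",   %status_requests);
    report_section("Requests by hour:",      %hour_requests);
    report_section("Requests by file type:", %type_requests);
}

sub report_section {
    my ($header, %type) = @_;
    print $header, "\n";
    for my $i (sort keys %type) {
        print $i, ": ", $type{$i}, "\n";
    }
    print "\n";
}
```

(One small departure from the article's version: under `use strict`, the loop variable in report_section() must be declared with my.)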
