<table cellspacing="0" cellpadding="0" border="0" ><tr><td valign="top" style="font: inherit;">Hi all,<br><br>The task I'm trying to accomplish:<br><br>Given a log file containing several lines of white-space-delimited entries like this:<br><br>[Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist<br>[Sat Oct 24 08:12:37 2009] [error] GET /url2 HTTP/1.0]: Requested URI does not exist<br>[Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist<br>[Sat Oct 24 12:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist<br><br>filter the lines that contain the string " 08:", extract the 6th, 7th and 8th words (zero-indexed) from each matching line, group the lines that produce the same resulting string, count each group, and print the results sorted in descending order of count. So in the example above we'd end up with output like this:<br><br>("GET /url1 HTTP/1.1]:", 2)<br>("GET /url2 HTTP/1.0]:",
1)<br><br>Seems pretty straightforward, so I wrote a simple Perl script to achieve this task (see the bottom of this email).<br><br>The input file is 335 MB in size and contains about 2 million log-line entries. The Perl script does a pretty decent job and finishes in about 3 seconds.<br><br>Now the interesting part. I decided to implement this in Haskell (being my favorite language and all) and ended up with the following code:<br><br>--- begin haskell code ---<br><br>import Text.Regex.Posix ( (=~) )<br>import qualified Data.List as List <br>import qualified Data.Map as Map<br>import qualified Data.ByteString.Lazy.Char8 as LB<br><br>main = do<br> contents <- LB.readFile "log_file"<br> putStr . unlines . map ( show . (\(x, y) -> ((LB.unpack x), y)) ) .<br> -- create a Map grouping & counting matching tokens and sort based on the counts<br> List.sortBy (\(_, x) (_, y) -> y `compare`
x) . Map.toList . Map.fromListWith (+) . filtertokens .<br> LB.lines $ contents<br> where filtertokens = foldr (\x acc -> if (f x) then ((g x) : acc) else acc) []<br> -- keep lines containing " 08:"<br> where f = (=~ " 08:") . LB.unpack<br> -- extract tokens 6, 7 & 8 and create an association list like so ("GET /url2 HTTP/1.0]:", 1)<br> g line = flip (,) 1 . LB.unwords . map (xs !!) $ [6, 7, 8] where xs = LB.words line<br><br>--- end haskell code ---<br><br>This Haskell implementation takes a whopping 27 seconds to complete! About 9 times slower than the Perl version! I'm using GHC 6.10.4, compiling with -O2, and even went to the extent
of fusing an adjacent map and filter into a single foldr, like so: map f (filter g xs) => foldr (\x acc -> if g x then f x : acc else acc) [] xs, fusing adjacent maps, etc. Still the same result.<br><br>I really hope I'm missing some obvious optimization that's making it so slow compared to the Perl version, hence this email soliciting feedback.<br><br>Thanks in advance.<br><br>P.S. For reference, here's my corresponding Perl implementation:<br><br>--- start perl code ---<br><br>#!/usr/bin/perl<br>use strict;<br>use warnings FATAL => 'all';<br><br>my %urls;<br>open(FILE, '<', $ARGV[0]);<br>while(<FILE>) {<br> if (/ 08:/) {<br> my @words = split;<br> my $key = join(" ", ($words[6], $words[7], $words[8]));<br> if (exists $urls{$key}) { $urls{$key}++ }<br> else { $urls{$key} = 1
}<br> }<br>}<br>for (sort { $urls{$b} <=> $urls{$a} } keys %urls) { print "($_, $urls{$_})\n" }<br><br>--- end perl code ---<br><br></td></tr></table><br>
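For readers following along, here is a minimal, self-contained sketch of the pipeline the email describes, spelling out the map/filter-to-foldr fusion mentioned above. It is an illustration, not the author's program: it works on plain String rather than lazy ByteString, substitutes Data.List.isInfixOf for the POSIX regex match, and the names sampleLog, filterTokens and countMatches are mine.

```haskell
import Data.List (isInfixOf, sortBy)
import Data.Ord (comparing, Down(..))
import qualified Data.Map as Map

-- Sample input: the four log lines from the email.
sampleLog :: [String]
sampleLog =
  [ "[Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist"
  , "[Sat Oct 24 08:12:37 2009] [error] GET /url2 HTTP/1.0]: Requested URI does not exist"
  , "[Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist"
  , "[Sat Oct 24 12:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist"
  ]

-- The fusion described in the email, written out:
--   map g (filter f xs)  ==>  foldr (\x acc -> if f x then g x : acc else acc) [] xs
-- Keeps lines containing " 08:" and pairs tokens 6-8 with an initial count of 1.
filterTokens :: [String] -> [(String, Int)]
filterTokens = foldr step []
  where
    step line acc
      | " 08:" `isInfixOf` line = (key line, 1) : acc
      | otherwise               = acc
    key line = unwords (map (ws !!) [6, 7, 8])
      where ws = words line

-- Group equal keys, sum their counts, and sort in descending order of count,
-- mirroring the Map.fromListWith (+) / sortBy stage of the email's pipeline.
countMatches :: [String] -> [(String, Int)]
countMatches =
  sortBy (comparing (Down . snd)) . Map.toList . Map.fromListWith (+) . filterTokens

main :: IO ()
main = mapM_ print (countMatches sampleLog)
```

Running it over sampleLog groups the three " 08:" lines into the two keys shown in the expected output at the top of the email, with the url1 entry first.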