[Haskell-cafe] Google Summer of Code student application deadline approaches

Johan Tibell johan.tibell at gmail.com
Tue Mar 31 17:57:54 EDT 2009


Hi,

The Summer of Code student application deadline is April 3rd and we
need more applications! If you have an idea that you would like to
hack on this summer, hurry up and apply.

Here's an idea that could be implemented in a summer:

A Sawzall [1]-like library that processes large volumes of data using
monoids to compute aggregate statistics. Foldable makes this possible
for smaller data sets that can fit in memory, but many interesting data
sets are tens or hundreds of gigabytes in size. A simple API with a
high-performance implementation would make Haskell a nice data
analysis tool. Here's a strawman interface for such a library:

-- | Given a file of log records, compute aggregate statistics by
-- converting each record to a monoid @m@ and combining the resulting
-- monoids using 'mappend'.
fold :: (Record r, Monoid m) => (r -> m) -> FilePath -> IO m
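
To make the strawman concrete, here is a minimal single-threaded sketch.
Everything beyond the type signature is an assumption made for
illustration: the 'Record' class and its 'parseRecord' method, the
line-per-record file layout, the 'Bytes' example type and the
'totalBytes' helper are all hypothetical. The function is called
'foldFile' here only to avoid clashing with 'fold' from Data.Foldable.

```haskell
import qualified Data.ByteString.Char8 as S
import qualified Data.ByteString.Lazy.Char8 as L
import Data.List (foldl')
import Data.Monoid (Sum (..))

-- | Hypothetical class: how to parse one line of a log file into a record.
class Record r where
  parseRecord :: S.ByteString -> Maybe r

-- | Read the file lazily, parse it line by line and combine the results
-- with a strict left fold. Lines that fail to parse are skipped.
foldFile :: (Record r, Monoid m) => (r -> m) -> FilePath -> IO m
foldFile f path = do
  contents <- L.readFile path
  let step acc line = case parseRecord (L.toStrict line) of
        Just r  -> acc `mappend` f r
        Nothing -> acc
  -- Force the result here so the whole file is consumed before returning.
  return $! foldl' step mempty (L.lines contents)

-- | Example record (an assumption): each line starts with a byte count.
newtype Bytes = Bytes Int

instance Record Bytes where
  parseRecord = fmap (Bytes . fst) . S.readInt

-- | Sum the byte counts of all parseable lines.
totalBytes :: FilePath -> IO Int
totalBytes path = getSum <$> foldFile (\(Bytes n) -> Sum n) path
```

Since the accumulator is only forced to weak head normal form, a monoid
with lazy fields (a tuple, say) would still build up thunks; a real
implementation would probably want strict monoids.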

There are lots of interesting optimizations that could be done.
Starting with an efficient single-threaded implementation using
ByteString, you could add the ability to either process many files in
parallel or split one file into many chunks and process each chunk
in parallel. The Wide Finder 2 [2] challenge has a fast OCaml
implementation of the latter strategy. One could take the library
further by running the processing on multiple machines, as in the
Google Sawzall implementation.
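
The chunking strategy could look roughly like the sketch below. It is
not the Wide Finder implementation, just an illustration using forkIO
and MVars from base; the names 'chunksOnLines', 'foldChunks' and
'lineCount' are made up, and for brevity it reads the whole file into
memory, which a real implementation for very large files would avoid.
Associativity of 'mappend' is what allows each chunk to be folded
independently and the partial results to be combined afterwards.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import qualified Data.ByteString.Char8 as S
import Data.List (foldl')
import Data.Monoid (Sum (..))

-- | Split a buffer into at most @n@ pieces, cutting only just after a
-- newline so that no line straddles two chunks.
chunksOnLines :: Int -> S.ByteString -> [S.ByteString]
chunksOnLines n bs
  | S.null bs = []
  | n <= 1    = [bs]
  | otherwise =
      let target  = S.length bs `div` n
          -- Position just past the first newline at or after 'target'.
          lineEnd = maybe (S.length bs) (+ (target + 1))
                          (S.elemIndex '\n' (S.drop target bs))
          (chunk, rest) = S.splitAt lineEnd bs
      in chunk : chunksOnLines (n - 1) rest

-- | Fold every line of a file with @f@, one thread per chunk. Results
-- are combined in chunk order, so the monoid need not be commutative.
-- If a worker throws, the corresponding 'takeMVar' never succeeds; a
-- real implementation would need proper error handling.
foldChunks :: Monoid m => Int -> (S.ByteString -> m) -> FilePath -> IO m
foldChunks n f path = do
  contents <- S.readFile path
  vars     <- mapM spawn (chunksOnLines n contents)
  results  <- mapM takeMVar vars
  return (mconcat results)
  where
    step acc line = acc `mappend` f line
    spawn chunk = do
      var <- newEmptyMVar
      -- Force the partial result inside the worker thread.
      _ <- forkIO (putMVar var $! foldl' step mempty (S.lines chunk))
      return var

-- | Example: count the lines of a file using @n@ chunks.
lineCount :: Int -> FilePath -> IO Int
lineCount n path = getSum <$> foldChunks n (const (Sum 1)) path
```

To actually run the chunks on several cores, the program would have to
be compiled with -threaded and run with +RTS -N.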

1. http://research.google.com/archive/sawzall.html
2. http://www.tbray.org/ongoing/When/200x/2008/05/01/Wide-Finder-2

Cheers,

Johan

