[Haskell-cafe] Bloom Filter

Mon Apr 30 22:39:28 EDT 2007

G'day.

Quoting tom <tom at almostobsolete.net>:

> I'm pretty new to Haskell, I've been working on a Bloom filter[1]
> implementation as a learning exercise.

Excellent!  Sounds like a fun test.

> I'd really appreciate it if someone more experienced would comment on
> the code. I'm sure there's plenty of places where I'm doing things in
> silly or overly complex ways.

Sure.

All in all, very well done.  It works, and it looks pretty efficient.
My quibbles are mostly stylistic or syntactic in nature.  Please
understand that the relative triviality of my quibbles is a sign that
there are really no major problems.

This is not a criticism, but more an advertisement: What are you using
for source control here?  Darcs is nice, and as a bonus, it's trivially
browsable from a web browser, which saves downloading and unpacking.

General comments:

You overuse parentheses.  A lot.  Definitions like this:

    ary = (listArray (0, wordc-1) (repeat 0))

don't need parentheses around them, and just add to the general noise
level.

And (.&. ((size b)-1)) is much more cleanly expressed as (.&. (size b - 1)).

Rather than carrying around a hash function, it might be better to use
a type class:

    class BloomHash k where
        bloomHash :: k -> [Word8]

In wordsize:

You don't need to hard-code this.  You can use:

    wordsize = bitSize (undefined::Word32)  -- Or Int, of course!

bitSize is defined in Data.Bits.

In splitup:

I got a bit confused by the local binding names.  It's usual, especially
in generic code, to use "xs", "ys" etc for a list of "x" and "y".
Something like this might be more idiomatic:

splitup n xs = let (xs1, xs2) = splitAt n xs
               in xs1 : splitup n xs2

In indexes:

    (fromIntegral $ x `div` wordsize, fromIntegral $ x .&. (wordsize-1))

Seems intuitively wasteful.  Either use divMod or bit operations.

Similarly, (hashfunc b) key is the same as hashfunc b key.  But even
better is:

    split bytecount . hashfunc b $ key

That makes it obvious that it's a pipeline of functions applied to the key.

This looks cool:

    bytes2int = foldr ((. (256 *)) . (+)) 0 . (map toInteger)

but I'm not smart enough to parse it.  This is both more readable and
shorter:

    bytes2int = foldr (\x r -> r*256 + fromInteger x) 0

Integer log2's are probably better done using integers only, or at least
abstracted out into a separate function.

In bloom:

Function guards are your friends!  This:

    bloom hf sz hc = if condition
                     then b
                     else error "Badness"

is almost always better expressed as:

    bloom hf sz hc
      | condition = b
      | otherwise = error "Badness"

You can now inline b.  (I can see why you put it in a where clause; now
you don't have to.)

wordc, again, only needs integral arithmetic:

    wordc = ceiling ((fromIntegral a) / (fromIntegral b :: Double))

is more or less:

    wordc = (a+b-1) `div` b

And drop the parentheses around the definition of ary.

In add:

Try to use function names that are close to names in existing libraries,
like Data.Set.  "insert" sounds better here.

Also, rather than this:

    add :: Bloom a -> a -> Bloom a

a better argument order is this:

    insert :: a -> Bloom a -> Bloom a

That way, you can use it with foldr.

In test:

Again, probably misnamed.  Data.Set calls this "member".  And again,
arguably the wrong argument ordering.

Once again, well done.

Cheers,
Andrew Bromage