# [Haskell] Re: Fingerprints and hashing

apfelmus apfelmus at quantentunnel.de
Thu Oct 11 16:33:03 EDT 2007

```Simon Peyton-Jones wrote:

> For various applications (including identifying common sub-expressions,
> and version tracking in GHC), I'd like a Haskell library that supports
> simple fingerprint operations.

Note that for CSE and hash-consing, there is a no-collision yet very
efficient way of fingerprinting (with a drawback, of course :).

The starting point is the question "why does cryptographic hashing
work"? I mean, uniquely serializing (gödel numbering) a tree like

newtype Exp = Exp (ExpF Exp)
data ExpF a = Zero | One | Add a a | Mult a a
deriving (Functor, Foldable, Traversable)

takes an amount of bits proportional to its size (number of
constructors). So, how can we expect a simple 128-bit number to have
only few collisions for a collection of medium-sized trees? The answer
is that the collection is much too small, no computer will ever be able
to count (or construct said trees) from 1 to 2^128 in eons.

So, the idea is to use a "local gödel numbering" and uniquely number the
and only the trees that are actually constructed (no collisions, but few
in numbers). In other words, every new tree created gets the gödel
number  size collection + 1  and we can simply use

type GödelNumber = Word32

assuming that no more than 4G of trees will ever be constructed. For
fast hash-consing, the collection itself is a (generalized) trie

type Collection  = ExpF GödelNumber ~~> Maybe GödelNumber

mapping each new top of a tree (constructor + gödel numbers for already
known subtrees) to either Just its gödel number or Nothing when it's not
in the collection yet.

With this, CSE becomes a catamorphism that allocates new gödel numbers
if necessary

type StateC a = Collection -> (Collection, a) deriving (Monad)

cse :: Exp -> StateC GödelNumber
cse = cata hashCons

hashCons :: ExpF (StateC GödelNumber) -> StateC GödelNumber
hashCons x0 = \c0 -> let (c,x) = sequence x0 c0 in
case lookup x c of
Maybe g -> (c,g)
Nothing -> let g = size c + 1; c' = insert x g c in (c',g)
where
sequence :: ExpF (StateC GödelNumber) -> StateC (ExpF GödelNumber)
sequence = Data.Traversable.sequence

CSE in the sense that all common subexpressions now have the same gödel
number.

Of course, the drawback compared to "free-form" fingerprinting is that
the fingerprinting and the collection now depend on each other. But for
CSE, we have to carry the collection around anyway.

I don't know any references for that method since I came up with it
myself and haven't searched around yet. Any pointers?

Regards,
apfelmus

```