[Haskell-cafe] String Hashing

Thomas Conway drtomc at gmail.com
Mon Jun 18 18:52:39 EDT 2007


On 6/19/07, Jan-Willem Maessen <jmaessen at alum.mit.edu> wrote:
> This looks like a version of the Bob Jenkins hash function from
> burtleburtle.net.  I implemented the 32-bit version of this as follows:

Indeed. It's the 64-bit version. 32 bits is oh-so-last-century. ;-)

> mix :: Int32 -> Int32 -> Int32 -> (Int32 -> Int32 -> Int32 -> a) -> a
    [deletia]
> I mention this because your code writes the whole thing out
> longhand---which might be faster, or might not, but certainly misses
> the highest-level structural patterns in the original.  Your use of a
> data type to represent triples is probably better nowadays than my
> rather quirky use of CPS (in other words, this could have been a
> function Triple -> Triple instead of the rather odd type you see above).

Well, the main difference, is the CPS version just folds the uses of
(.) into the individual groups of arithmetic. Actually, without
knowing what GHC *actually* does,  it is conceivable that a compiler
could do better with the CPS version, precisely because there's one
less layer of abstraction to inline/fold away. I'll have to give it a
go if I get a chance (this is code for my Real Job (TM), and tuning
the life out of the code isn't necessary right now, but I thought I'd
float this, as much because I might learn something, as anything. The
thinkon flux in this list is pretty favourable).

> That said, I assume you instrumented your code and determined that
> hash collisions are actually a bottleneck for you, and that a hash
> table is the right structure to begin with?

I'm implementing a species of database (trade secrets, blah, blah,
blah), which needs to handle *large* data sets, and actually, an
external B-tree is probably better than an external hash table. I
decided to do a hash table first though to iron out some of the issues
to do with concurrent external structures. A linear hash table is
pretty simple compared to a B-tree. The Jenkins' hash function comes
into it because you really want to avoid overfull buckets.

It's also one of those cases, where you'd like the compiler to do a
good job. If the compiler can't do a good job of straight line
operations on [essentially] built in data types, then what hope have
we of convincing anyone, including ourselves, that Haskell is fit for
Real Programs (TM).

>  When last I
> checked the result was faster than Data.Map, but not by much, and
> using strings probably wipes out that advantage vs. tries.

<edna-e-mode-voice>
    No Strings, darling! No Strings.
</edna-e-mode-voice>

cheers,
T.
-- 
Dr Thomas Conway
drtomc at gmail.com

Silence is the perfectest herald of joy:
I were but little happy, if I could say how much.


More information about the Haskell-Cafe mailing list