enummapset merge into containers

Ben Gamari bgamari.foss at gmail.com
Tue Feb 21 21:58:51 CET 2012


On Tue, 21 Feb 2012 11:25:05 -0800, Johan Tibell <johan.tibell at gmail.com> wrote:
> On Tue, Feb 21, 2012 at 11:20 AM, Ben Gamari <bgamari.foss at gmail.com> wrote:
> > This is certainly an option. It's actually been suggested that I use
> > HashMaps instead of IntMaps to get more uniform use of the
> > key-space. I have been a little worried that the cost of hashing would
> > outweigh the benefit of using benefits of more shallow trees, but I
> > suppose memory accesses are expensive. That being said, I could
> > certainly just use the Enum instance.
> 
> In the case of Ints and newtypes thereof hashing is very cheap. A
> no-op or a multiplication with a large prime.
> 
Sure. In my application (machine learning) we were mapping from tokens
(short strings, generally less than 15 bytes) to sequential newtype'd
Ints (newtype Token = Token Int). However, the fact that these
identifiers were sequential meant that IntMap's trees would be close to
the worst-case log(N) depth. We would use the hash of the string if it
weren't for the fact that at times we'd want to index a map by pairs of
tokens. For this, I used an awful hack where I'd make a pair of Enums
an instance of Enum by bitwise operations,

  instance (Enum a, Enum b) => Enum (a,b) where
    fromEnum (a,b) = (fromEnum a `shiftL` 32) .|. fromEnum b
    ...
    
This never sat well with me but it worked and I didn't see any equally
efficient alternatives.

A lesser issue of this approach is the need to keep around a bijection
between Tokens and their associated strings for the purpose of adding
new data or presenting results.

In light of all of this, it was suggested by Edward Kmett that perhaps a
HashMap would make sense. In my understanding, he proposed that the cost
of hashing a short string could be dwarfed by the potential cache misses
of a deep tree lookup. I haven't tested this hypothesis, but it seems
plausible. Do you think this trade-off could make sense?

Using Hashable has the attendent advantage of easily accomodating pairs,
as well as not requiring the Token<->String mapping. It has the
disadvantage, however, potentially leading to collisions. Given the size
of our data sets, though, it seems unlikely that this would happen
(although the outcome could be rather unfortunate if it did).

> > Anyways, I would be fine with seeing HashMap merged in leiu of
> > enummapset. It would be nice to see at least one merged, however.
> 
> I'm hurrying slowly on purpose here. I'm almost finished with a new
> implementation of HashMap and it's easier to experiment while it's in
> a separate package (e.g. people expect less stability.) My goal is to
> eventually merge though.
> 
I'm very glad to hear this. 

Cheers,

- Ben



More information about the Libraries mailing list