enummapset merge into containers

Wed Feb 22 05:52:13 CET 2012

On 2/21/12 3:58 PM, Ben Gamari wrote:
> In light of all of this, it was suggested by Edward Kmett that perhaps a
> HashMap would make sense. In my understanding, he proposed that the cost
> of hashing a short string could be dwarfed by the potential cache misses
> of a deep tree lookup. I haven't tested this hypothesis, but it seems
> plausible. Do you think this trade-off could make sense?

I'm in a similar situation for a natural language processing project. In 
my case I keep around a (HashMap,IntMap) bijection between strings and 
their integerization, and then have something like EnumMap which is used 
elsewhere since the Ints are newtyped. This approach has been more than 
fast enough for my needs; i.e., the performance bottlenecks are 
elsewhere so far as I can tell. I too often have to deal with pairs, 
triples, etc, of IDs--- but I need to do so with a trie's API so, alas, 
I can't use a HashMap for that part.

I don't think that giving out keys sequentially is the worst-case for 
IntMap[1].

In any case, I've been meaning to post my project to Hackage for about a 
year now. I was derailed a bit after I gave a talk about my EDSL for 
smoothing probability distributions and Ed Kmett asked me to factor it 
out into a separate package so he could use it. I've since broken the 
project up into a number of smaller more-reusable packages--- which is 
good in the long run, but I haven't had the time to put the pieces back 
together in order to post it to Hackage. Other than the smoothing EDSL, 
one of the packages is just for interning and integerizing strings. 
Since you're doing the same thing, we should talk off-list about API 
issues so that we can combine our work into a single package.

[1] At any given point all the allocated keys will form a compact 
subspace of Int which all share the high-order bits. Since IntMap is a 
big-endian trie with node-fusion, that means you'll have a complete tree 
for the low order bits and only one comparison for the long spine of all 
the high-order bits. The number of comparisons you need to make is 
always O(B) where B is the number of bits that differ among the 
subspace, but since the subspace is compact that means you're making the 
most-efficient use of the differing bits and so B = O(log N) where N is 
the number of allocated keys. In fact, this is one of the best cases for 
IntMap. The worst case is if you hand out keys in an order which 
maximizes B (e.g., handing out keys in an order like 
00000,10000,01000,00100,00010,...), since that means B will be min(N,W) 
where W is the word-size.

Any time you ensure that B = O(log N) it will be a best-case. For the 
case of sequential keys this amounts to having a complete tree, but for 
other cases it amounts to having an essentially complete tree ---where 
by "essential" we mean that the incompleteness always comes in long 
shared spines, such as the spine above the complete tree for the 
sequentially ordered case. If you gave out keys in the order (map 
bitwiseReverse [0..]) then you'd get an essentially-complete tree at the 
top, with long spines coming off for each of the leaves. You could also 
come up with other crazy orderings where you use, say, the top three and 
bottom three bits but hold the middle bits fixed.

The greater the number of discontiguous regions of variance the worse 
the performance will be, because even though we can still make it B = 
O(log N) we're increasing the constant factor since we have to do a 
comparison for each of the non-varying spines. Thus, [0..] or (map 
bitwiseReverse [0..]) are the best since they have only one non-varying 
spine. As described in the paper, Okasaki chose to use a big-endian trie 
instead of a little-endian trie because the constant factors are better 
for the [0..] case often used in projects that need to hand out unique 
IDs like machine learning, natural language processing, and compilers.

-- 
Live well,
~wren