FAO Ketil Malde [Fwd: Re: Data.Map, Data.IntMap documentation]

Adrian Hey ahey at iee.org
Mon Aug 20 04:50:45 EDT 2007


Hello Folks,

Sorry to post this here but my ISP seems to have been blacklisted by
University of Bergen SpamAssassin or something and I didn't want to
appear to ignore Ketil.

-------- Original Message --------
Subject: Re: Data.Map, Data.IntMap documentation
Date: Mon, 20 Aug 2007 09:05:56 +0100
From: Adrian Hey <ahey at gotadsl.co.uk>
To: Ketil Malde <ketil at ii.uib.no>
References: <1186833468.46bda43c6b01b at fred.soi.city.ac.uk>	
<944143.93987.qm at web56411.mail.re3.yahoo.com>
<fa115t$r9f$1 at sea.gmane.org>	 <46C42C5C.9090003 at iee.org>
<1187337291.6553.128.camel at nmd9999>

Ketil Malde wrote:
> On Thu, 2007-08-16 at 11:52 +0100, Adrian Hey wrote:
> 
>> http://darcs.haskell.org/packages/collections-ghc6.6/Data/Map/AVL.hs
>> http://darcs.haskell.org/packages/collections-ghc6.6/Data.Trie.General/Data/Trie/General/Types.hs
> 
> I have some Map-intensive code which may be a useful benchmark for these
> (i.e. most of the time is spent building and then doing lookups in a
> large Map - 100s of Mb, typically).
> 
> Is AVL compatible enough that I would only need to change the imports to
> test this?

It should be, unless you're using indexing or something. Otherwise you
might get some deprecation warnings (but the functions that are
deprecated are still implemented).

Also, the Monoid instance is different.

Also, size is O(n), not O(1). (But the only reason this is O(1)
with Data.Map is that you pay extra to maintain this information in
every operation that built the Map in the first place.)

> What kind of performance would you expect compared to the
> regular Map?

It depends what you're doing. There are some aspects of the current
Data.Map that will cause sucky performance in some circumstances, but
if you're not falling foul of those then I would expect a modest but
not particularly dramatic performance increase.

Particular problems with Data.Map seem to be...
1 - Poor balancing means it can take over twice as long to grow a
tree by insertion in "pathological" cases (e.g. the keys are in fact
sorted, but you don't know in advance that they are).

2 - For union, intersection etc. the Hedge algorithm seems to require
something like 4 to 5 times as many comparisons as divide and conquer
for large maps. Of course what effect this actually has depends
on the key type and the cost of comparison.
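The "pathological" case in point 1 can be sketched with plain Data.Map:
building by repeated insertion from keys that happen to be sorted,
versus using fromDistinctAscList when sortedness is known in advance
(the names and sizes here are just illustration, not a benchmark).

```haskell
import qualified Data.Map as M
import Data.List (foldl')

-- The pathological case: keys arrive already sorted, but we build the
-- map by repeated insertion because we don't know that in advance, so
-- the balancing machinery does all the work.
bySorted :: Int -> M.Map Int ()
bySorted n = foldl' (\m k -> M.insert k () m) M.empty [1..n]

-- If the keys are KNOWN to be sorted and distinct, fromDistinctAscList
-- builds the same map in O(n) with no comparisons at all.
byAsc :: Int -> M.Map Int ()
byAsc n = M.fromDistinctAscList [ (k, ()) | k <- [1..n] ]
```

Both produce the same map; the difference is purely in how much
rebalancing and comparison work the insertion route has to do.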

The clone also gives you much better strictness control, which may
have some influence on performance, space behaviour etc. for better
or worse (but only if you modify your code to use it, of course :-).
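To show the kind of thing strictness control buys you (this uses plain
Data.Map and seq, not the clone's own API, which I haven't reproduced
here): with a value-lazy map, chains of updates accumulate unevaluated
thunks in the values; forcing each value before insertion keeps the
space behaviour flat.

```haskell
import qualified Data.Map as M

-- Value-strict insert: force the value to WHNF before it goes into the
-- map, so long update chains don't accumulate thunks.
insertStrict :: Ord k => k -> v -> M.Map k v -> M.Map k v
insertStrict k v m = v `seq` M.insert k v m

-- Occurrence counting: each running total is forced as it is inserted,
-- instead of building up a (1+1+...) thunk per key.
count :: Ord k => [k] -> M.Map k Int
count = foldl step M.empty
  where step m k = let n = maybe 1 (+1) (M.lookup k m)
                   in insertStrict k n m
```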

The only possible adverse factor with the clone is that there's an
extra level of indirection overhead in that the underlying
implementation is an AVL tree of boxed (key,value) pairs. Again how
significant this is depends on key type and comparison costs.
For lookup it may be significant for simple keys with cheap comparison.

There's also some advantage in reduced heap burn rate as a result
of this during insertion etc.. (smaller tree records), and of
course better balancing also helps keep paths shorter and heap burn
rate down.

Anyway, it'd be good to get some feedback re what effect it has
on your program, or even whether or not it works :-). It hasn't
been given any real thorough tests as such (but then again neither
has Data.Map if bug reports are anything to go by). But it's a
fairly simple wrapper around AVL (which has been thoroughly tested),
so hopefully there's not too much wrong with it.

If you try it you should checkout the darcs repository (the hackage
version is out of date and nothing like the version in the repository).

Regards
--
Adrian Hey







More information about the Libraries mailing list