[Haskell-cafe] Motivation for having indexed access in Data.Map?

Sat Jan 7 14:25:31 CET 2012

On 07/01/2012 12:17, Christoph Breitkopf wrote:
> Hello,
>
> I wonder why Data.Map provides the indexed access functions:
>
> http://hackage.haskell.org/packages/archive/containers/latest/doc/html/Data-Map.html#g:21
>
> These functions seem rather out-of-place to me in the map api. The 
> only use case I could think of so far would be to find the median, or 
> in general n-th smallest key, but that does not seem sufficient reason 
> (also, I think there are faster methods for that). Anything else?
>
I don't know the motivation in Data.Map, but here's some thoughts from a 
C++ home-rolled data structures perspective...

Somewhere around a decade ago, I started an in-memory C++ multiway tree 
library, initially an experiment seeing if I could improve sequential 
access performance. This half-worked, but I still use the data structure 
primarily because it's a bit safer in some cases than the STL 
containers, and also has some extra functionality that makes it more 
convenient.

Features include...

 1.   "cursor maintenance" (when I insert/delete, cursors/iterators are
    not invalidated except in the special case that the cursor
    references an item that is deleted. There are two tricks for this
    case - the cursor will at least know that the item is deleted, plus
    there are special cursors that can defer deletion (mainly for
    delete-the-current-item within loops).
 2. Searching based on custom "comparisons" - mainly searching based on
    a partial key (certain fields), so you can find the first/last item
    equal to a partial key, ignoring less significant fields.
 3. Finding the first key that is *not* in the container (for unsigned
    integer keys only).
 4. Subscripted access - finding a given index, determining the index to
    an item referenced by a cursor, stepping forward/backward by a given
    number of items.

The subscripted access isn't massively useful - it was implemented 
because I was curious how to handle it efficiently. However, cases do 
come up from time to time in strange places. For example, sometimes it's 
more convenient to store an index (into a container that won't change) 
than a cursor or a full key. And using an ordered container does tend to 
imply, after all, that you're interested somehow in the order (or else 
why not use a hash table?).

One case, I guess, relates to DSL-generated data structures. The point 
there is that when the generated code runs, the map instance is long 
dead. Within the generated code, ranges etc tend to be identified by 
subscript - so the DSL needs to be able to translate from key to 
subscript, and (maybe) back again. OTOH, don't forget that laziness 
thing - if the code generator was working from a sorted array it would 
know the subscripts anyway.

A particularly surprising side-effect - along with the map, multimap, 
set and multiset wrappers, I have a vector wrapper. When you have a huge 
array and do lots of inserts and deletes within that array, a multiway 
tree (with subscripted access) turns out to be a good trade-off. Some 
accesses are more awkward (because the items aren't all contiguous in 
memory), but the log-time inserts and deletes can be worth it.

The first-key-not-in-the-container stuff was mostly a side-effect of the 
data structure augmentation I did for subscripted access. That is very 
convenient, but with costs.

The no. 1 killer feature that keeps me using and maintaining this 
library is the partial-key search thing. This is so useful, I even added 
a feature to a DSL (used mainly for AST nodes and multiple dispatch - 
originally based on treecc) to make it more convenient to generate 
partial-key classes.

The cursor maintenance makes it a lot easier to write algorithms that 
update the container, but it's perhaps surprising how rare that's necessary.

The issue with all this is of course partly overhead, but also because I 
got lazy - keeping these things hanging around throughout whole program 
runs like cheap second-rate databases. They are quite convenient to work 
with, but for a long time I stopped even considering pulling all the 
data out into a flat array, processing it there, then rebuilding a new 
indexed data structure only if I really needed it, or keeping data 
mostly in an array and sorting it ready for binary searches just at the 
key point where that's needed.

Some programs I've written using them are maybe an order of magnitude 
slower than they should be, and in quite a few cases there's an 
asymptotic difference, not just a constant factor - a lot of algorithms 
are O(n log n) where without the convenience containers they could be O(n).

Very little of this would be relevant in a pure functional programming 
world, of course, but anyway - yes, subscripting can be (occasionally) 
useful. It's just hard to give specific examples, because they're buried 
in all the technicalities of quite large programs.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20120107/0e752562/attachment.htm>