slightly ot: data structure question

George Russell ger@tzi.de
Fri, 02 Aug 2002 18:35:55 +0200


Andrew Bromage wrote
> On Mon, Jul 29, 2002 at 02:50:55PM -0700, Hal Daume III wrote:
> 
> > I need a data structure which is a map from Ints to Doubles; the
> > distribution of the Ints is in the range say 0-20000 and a map will
> > contain somewhere around 100-200 elements.  I need to be able to query
> > *very* quickly, insert *quite* quickly, and iterate through the elements
> > very quickly.  I'm thinking hash tables, but I'm not sure.  Does anyone
> > have any thoughts on this?
> 
> How about this?
> 
>         http://citeseer.nj.nec.com/briggs93efficient.html
> 
> Another paper which might be useful:
> 
>         Aasa, A., Holmström, S. and Nilsson, C. (1988),
>            "An Efficiency Comparison of Some Representations of Purely
>            Functional Arrays". BIT 28(3):490--503.
You aren't doing sparse matrix computations, are you?  If so, there are
specialised storage techniques around which are probably better than any
general solution.  If you ask on sci.math.num-analysis you'll probably get
more references than you know what to do with.

The method in the first paper effectively has one array of length 20000, allowing
you to query *very* quickly, plus a smaller array listing the elements.
I think the main problem with this would be the array of 20000 elements.  Apart from
the cost of the memory (which is probably pretty insignificant), 20000 eight-byte entries
is 160kB, which is quite a lot to keep in cache at once.
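To make the shape of that representation concrete, here is a minimal Haskell sketch
of the "dense array plus member list" idea (it is not the paper's exact scheme, which
also avoids initialising the large array; the names and the Bool presence array are mine):

    import Data.Array.IO
    import Data.IORef

    -- Hypothetical names throughout; a sketch, not a tuned implementation.
    data DenseMap = DenseMap
      { dmVals    :: IOUArray Int Double  -- one slot per possible key, 0..19999
      , dmPresent :: IOUArray Int Bool    -- which slots currently hold an entry
      , dmMembers :: IORef [Int]          -- occupied keys, for cheap iteration
      }

    newDenseMap :: Int -> IO DenseMap
    newDenseMap n = do
      vs <- newArray (0, n - 1) 0
      ps <- newArray (0, n - 1) False
      ms <- newIORef []
      return (DenseMap vs ps ms)

    -- Query: a single unboxed array read plus the presence check.
    lookupDM :: DenseMap -> Int -> IO (Maybe Double)
    lookupDM m k = do
      p <- readArray (dmPresent m) k
      if p then fmap Just (readArray (dmVals m) k) else return Nothing

    -- Insert: write the slot and remember the key for later iteration.
    insertDM :: DenseMap -> Int -> Double -> IO ()
    insertDM m k v = do
      p <- readArray (dmPresent m) k
      writeArray (dmVals m) k v
      writeArray (dmPresent m) k True
      if p then return () else modifyIORef (dmMembers m) (k:)

Queries and inserts are O(1), and iteration only touches the 100-200 live keys, but the
two big arrays are what cost you the cache space mentioned above.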
If the elements are uniformly distributed, it seems to me you might get better performance
by using, say, a closed hash table of size 1024, indexing just by some selection of 10 bits
from the integers.  (Of course, if the integers are not uniformly distributed you will need
a better hash function.)  Then in 80-90% of cases you will get a definite positive or negative
answer to the query in just one lookup; if the slot holds a different key, you will (almost
certainly) only have to scan within the same cache line until you get the right answer.
You can also dispense with the first paper's auxiliary array this way, by initialising the
keys in the table to an out-of-range value (or by using signalling NaNs for the doubles)
to mark empty slots; iterating through the table and checking all the keys shouldn't take
too long.
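Here is a rough Haskell sketch of that closed hash table: size 1024, hashing by the low
10 bits, linear probing, and an out-of-range key (-1) marking empty slots.  The names, the
choice of -1 as the sentinel, and the absence of any resizing are my own; with only 100-200
entries in 1024 slots the table never comes close to filling, which the probing loops rely on.

    import Data.Array.IO
    import Data.Bits ((.&.))

    tableSize :: Int
    tableSize = 1024

    emptyKey :: Int
    emptyKey = -1                     -- out-of-range sentinel for unused slots

    data ClosedHash = ClosedHash
      { chKeys :: IOUArray Int Int    -- stored keys, emptyKey where unused
      , chVals :: IOUArray Int Double
      }

    newCH :: IO ClosedHash
    newCH = do
      ks <- newArray (0, tableSize - 1) emptyKey
      vs <- newArray (0, tableSize - 1) 0
      return (ClosedHash ks vs)

    -- "Hash" by selecting the low 10 bits of the key.
    hash :: Int -> Int
    hash k = k .&. (tableSize - 1)

    -- Linear probing: with the table mostly empty, nearly every lookup ends at
    -- the first slot, and a collision chain stays within a cache line or two.
    lookupCH :: ClosedHash -> Int -> IO (Maybe Double)
    lookupCH h k = go (hash k)
      where
        go i = do
          k' <- readArray (chKeys h) i
          if k' == emptyKey
            then return Nothing
            else if k' == k
              then fmap Just (readArray (chVals h) i)
              else go ((i + 1) .&. (tableSize - 1))

    insertCH :: ClosedHash -> Int -> Double -> IO ()
    insertCH h k v = go (hash k)
      where
        go i = do
          k' <- readArray (chKeys h) i
          if k' == emptyKey || k' == k
            then writeArray (chKeys h) i k >> writeArray (chVals h) i v
            else go ((i + 1) .&. (tableSize - 1))

    -- Iteration: scan all 1024 slots, keeping those whose key is not the sentinel.
    toListCH :: ClosedHash -> IO [(Int, Double)]
    toListCH h = do
      pairs <- mapM slot [0 .. tableSize - 1]
      return [p | Just p <- pairs]
      where
        slot i = do
          k <- readArray (chKeys h) i
          if k == emptyKey
            then return Nothing
            else do v <- readArray (chVals h) i
                    return (Just (k, v))

The whole table is 1024 keys plus 1024 doubles, about 12kB, so it fits comfortably in cache,
and iteration is a single scan over those slots rather than over the full 0-20000 key range.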