[Haskell-cafe] Re: Where is Data.Atom?

Tue Jul 4 19:28:22 EDT 2006

Brian Hulley wrote:
 > Iain Alexander wrote:
 >> Another suggestion:
 >>
 >> Put your strings in an ordered binary tree (other data structures
 >> might also work here).
 >>
 >> Make your Atom an encoding of the structure of the tree (resp. other
 >> structure).  This is logically a sequence of bits, 0 for left (less
 >> than), 1 for right (greater than) - if you terminate this
 >> variable-length sequence with 1, I think it all works out OK.  You
 >> then encode this sequence of bits in a Word32 (or some other
 >> convenient item), with implicit trailing zeros.
 >>
 >> root ~> 1[000...]
 >> root.left ~> 01[000...]
 >> root.right ~> 11[000...]
 >> ...
 >>
 >> There are clearly some questions such as how unbalanced is the tree
 >> likely to get (since you can't conveniently rebalance it), and what
 >> to do if your Atom "overflows", but this might give you something to
 >> work with.
 >
 > Thanks for the suggestion. However I think that just using a binary tree
 > would be equivalent to just using the representation of the original
 > string as a sequence of bits (except for "early out" eg representing
 > 0b00001111 as just 1111 etc), and any other systematic encoding of
 > strings that preserved all the info necessary for lexicographic ordering
 > (that still works when new atoms are created later) would also end up
 > using at least the same number of bit comparisons on average (since a
 > new information preserving representation is just a permutation of the
 > original at the bit level - if some strings are represented with fewer
 > bits others would need more bits so it would all balance out).

The main thing is that speeding up Ord is a "memo" problem, not a "clever 
encoding of atoms" problem.

Your fromString function seems like an IO action.  It cannot be a pure function, 
except for identity or a hash function.  (Though hashing to Int will
at least speed up equality testing.).  If you use unsafePerformIO to make it 
seem like a pure function then you need locking even if you think you have a 
single CPU and thread, since the runtime may assume it can evaluate expressions 
in weird ways.  This is the "unsafe at any speed" issue in Haskell.

You might turn strings s1 and s2 into atoms a1 and a2 cheaply by returning the 
next highest unused Word32.  Keep an array of Atom->String, (see 
http://haskell.org/haskellwiki/Library/ArrayRef#Using_dynamic_.28resizable.29_arrays 
  for dynamic arrays).  Call the # of atoms "N_A".  If you need fromString to 
return the same atom if given the same string then you have to keep a tree and 
search down through log(N_A) strings comparisons just to assign a new atom. 
Otherwise this is O(1).

To compare atoms a1 and a2 you have to compare the original strings s1 and s2. 
This original compare is O(min(length(s1),length(s2))).  If you cache this then 
you make future comparisons faster, either O(1) or O(log(N_A)).

Perhaps: (compare a1 a2) could look up the answer in an "immutable" 2D array 
with index (a1,a2) for which the lazy value will then be computed once (and then 
cached). The actual comparison would lookup up the original s1 and s2 via the 
global array (or map) and do the comparison once.   This memo table is the 
explicit space/time trade off.  The Atom->String array and/or the 2D memo-array 
could be replaced with something else, for the usual trade offs. And the only 
thread & locking issues come from generating the next atom without risking 
double-assignment.

-- 
Chris