[Haskell-beginners] Unexpected Space Leak (despite using external-sort)

Mon Jan 4 08:55:47 EST 2010

Am Montag 04 Januar 2010 02:35:04 schrieb Peter Green:
> I am using the external-sort package to sort my output in the program
> below. I made this choice because my output dataset [[Int]] can be
> large (e.g. >3M items, each [Int]).
>
> What my program does:
>
> (1) Reads in a file containing 'compressed lists' which look like so:
>
> 8+12,11,7+13,10
> 1+2+3,1+9,3+6,4
> .
> .
>
> One compressed list per line. These compressed lists are parsed to
> become [[[Int]]]
>
> [[[8,12],[11],[7,13],[10]],
>   [[1,2,3],[1,9],[3,6],[4]],
> .
> .
> ]
>
> Generally files of compressed lists have lengths of ~10,000 lines.
>
> (2) Compressed lists are exploded to [[int]] via concatMap Cartesian
> Product over [[[Int]]], so we end up with [[Int]]
>
> [[8, 11, 7, 10],
>   [8, 11, 13, 10],
>   [12, 11, 7, 10],
>   [12, 11, 13, 10],
>   [1, 1, 3, 4],
>   .
>   .
>   [3, 9, 6, 4]]

I have a different idea. You want to consume the data in order, don't you?
You can interleave sorting and exploding then:

1. read in data (~10,000 lists of lists. In your examples, each line contains a list of 
four short lists, the nested lists containing 1-3 elements. I hope that's not too far from 
reality.)

2. map (\l -> ([],l)) to get [([],[[8,12],[11],[7,13],[10]]), ([],[[1,2,3],[1,9],[3,6],
[4]]), ... ]

3. split first sublists of second component and append to the first component (empty list)

concatMap (\(acc,(toSplit:rest)) -> [(acc ++ [x],rest) | x <- toSplit])

[([],[[8,12],[11],[7,13],[10]]),
 ([],[[1,2,3],[1,9],[3,6],[4]]),
 .
 .
]
~>
[([8],[[11],[7,13],[10]]),
 ([12],[[11],[7,13],[10]]),
 ([1],[[1,9],[3,6],[4]]),
 ([2],[[1,9],[3,6],[4]]),
 ([3],[[1,9],[3,6],[4]]),
 .
 .
]

A list of ~30,000(?) ([Int],[[Int]]) pairs

4. sortBy (comparing fst)     {-# import Data.Ord (comparing) #}
5. groupBy ((==) `on` fst)    {-# import Data.Function (on) #-}

You now have something like
[
 [([1],[[1,9],[3,6],[4]]),
  ([1],[[?],[?],[?]]),
  .
  .
 ],
 [([2],[[1,9],[3,6],[4]]),
  ([2],[[?],[?],[?]]),
  .
  .
 ],
 .
 .
 [([131],[[?],[?],[?]])],
  ([131],[?]),
  .
  .
 ]
]

6. If necessary, repeat 3. to 5. on each of the groups to get

[
 [
  [([1,1],[[3,6],[4]]),
   ([1,1],[[?],[?]]),
   .
   .
  ],
  [([1,2],[[?],[?]]),
   ([1,2],[[?],[?]]),
   .
   .
  ],
  .
  .
 ]
]

concat to get

[
 [([1,1],...),
  .
  .
 ],
 [([1,2],...),
  .
  .
 ],
 .
 .
]

 , iterate until exploding in step 8. produces something manageable.

7. map (\g@((h,_):_) -> (h, map snd g))

You now have something of type [([Int],[[[Int]]])],
[([1],[ [[1,9],[3,6],[4]], [[?],[?],[?]],...]),
 ([2],[[[1,9],[3,6],[4]],[[?],[?],[?]],...]),
 .
 .
 ([131],[[[?],[?],[?]],...])
]

or

[([1,1],[ [[3,6],[4]], [[?],[?]],...]),
 ([1,2],[ [[?],[?]], [[?],[?]],...]),
 .
 .
]

8. explode second components in each of the pairs, sort, append to first component, 
consume.

Thanks to laziness, the [[[Int]]] of the second group will only be exploded after the 
first group has been consumed.

One caveat: if the lines do not always contain lists of equal length, you must not repeat 
steps 3.-6. more often than the shortest length or modify the function in 3. to handle 
that, e.g.

partialExplode (acc,toSplit:rest) = [(acc + [x],rest) | x <- toSplit]
partialExplode (acc,[]) = [(acc,[])]

> -- Cartesian Product over a List of Lists
> -- Http://www.cs.nott.ac.uk/~gmh/sudoku.lhs
> -- cp [[1,2],[3],[4,5,6]] --> [[1,3,4],[1,3,5],[1,3,6],[2,3,4],[2,3,5],
> [2,3,6]]
> cp          :: [[a]] -> [[a]]
> cp []       =  [[]]
> cp (xs:xss) =  [y:ys | y <- xs, ys <- cp xss]

change that to

cp (xs:xss) = [y:ys | ys <- cp xss, y <- xs]

to get something more memory friendly.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.haskell.org/pipermail/beginners/attachments/20100104/7aa5bca1/attachment-0001.html