# [Haskell-cafe] Space Efficiency When Sorting a List of Many Lists

Peter Green kinch1967 at me.com
Sat Jan 2 09:33:02 EST 2010

On 31/12/2009, at 6:38 PM, Luke Palmer wrote:

> On Wed, Dec 30, 2009 at 9:39 PM, Peter Green <kinch1967 at me.com> wrote:
>> I can guess that there might be be less laziness and more
>> instantiation when
>>  sorting is introduced,
>
> Yes, by a lot.  Sorting requires keeping the entire list in memory.
> And Haskell lists, unfortunately, are not that cheap in terms of space
> usage (I think [Int] uses 3 words per element).
>
>> but my questions are:
>>        (1) Am I doing anything terribly stupid/naive here?
>>        (2) If so, what can I do to improve space efficiency?
>>
>> TIA!
>>
>>
>> import Data.List (sort)
>> import Data.List.Split (splitOn)
>>
>> -- Cartesian Product over a List of Lists
>> -- cp [[1,2],[3],[4,5,6]] -->
>> -- [[1,3,4],[1,3,5],[1,3,6],[2,3,4],[2,3,5],[2,3,6]]
>> cp          :: [[a]] -> [[a]]
>> cp []       =  [[]]
>> cp (xs:xss) =  [y:ys | y <- xs, ys <- cp xss]
>
> This cartesian product varies in its tail faster than its head, so
> every head gets its own unique tail.  If you reverse the order of the
> bindings so that it varies in its head faster, then tails are shared.
> If my quick and dirty reasoning is correct, it improves the space
> usage by a factor of the number of sublists.
>
> cp' [] = [[]]
> cp' (xs:xss) = [y:ys | ys <- cp' xss, y <- xs]
>
> But if you're serious, you can probably do better than just generating
> them all and passing them to sort.  I get the impression that there is
> some structure here that can be taken advantage of.
>
>>
>> -- fromCSV ["8+12,11,7+13,10", "1+2+3,1+9,3+6,4"] -->
>> -- [[[8,12],[11],[7,13],[10]],[[1,2,3],[1,9],[3,6],[4]]]
>> fromCSV :: [String] -> [[[Int]]]
>> fromCSV = map parseOneLine
>>    where parseOneLine = map parseGroup . splitOn ","
>>              where parseGroup = map read . splitOn "+"
>>
>> -- explode [[[1,2],[3],[4,5,6]], [[1,2],[14,15],[16]]] -->
>> -- [[1,3,4],[1,3,5],[1,3,6],[2,3,4],[2,3,5],[2,3,6],
>> --  [1,14,16],[1,15,16],[2,14,16],[2,15,16]]
>> explode :: [[[a]]] -> [[a]]
>> explode =  concatMap cp
>>
>> -- toCSV [[8,11,7,10,12],[8,11,7,10,12],[8,11,7,10,12]] -->
>> -- ["8,11,7,10,12","8,11,7,10,12","8,11,7,10,12"]
>> toCSV :: (Show a) => [[a]] -> [String]
>> toCSV = map $ tail . init . show
>> --toCSV = map (intercalate "," . map show)
>>
>> main = interact (unlines . toCSV . sort . explode . fromCSV . lines)
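For what it's worth, a quick standalone check (my own sketch, not from Luke's mail) that his cp' enumerates exactly the same combinations as cp, just varying the head fastest so that tails can be shared:

```haskell
import Data.List (sort)

cp, cp' :: [[a]] -> [[a]]
cp  []       = [[]]
cp  (xs:xss) = [y:ys | y <- xs, ys <- cp xss]   -- tail varies fastest
cp' []       = [[]]
cp' (xs:xss) = [y:ys | ys <- cp' xss, y <- xs]  -- head varies fastest, tails shared

main :: IO ()
main = do
  let xss = [[1,2],[3],[4,5,6]] :: [[Int]]
  print (cp  xss)
  print (cp' xss)
  print (sort (cp xss) == sort (cp' xss))  -- True: same set, different order
```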

Thank you everyone for the very helpful suggestions so far!

I think I should re-state the problem and provide more background
info. In answer to the poster who suggested using a Trie, there *is* a
real world application for what I'm attempting to do. I'll also hint
at that below:

I have text files containing 'compressed lists'. Compressed lists
look like this:

8+12,11,7+13,10
1+2+3,1+9,3+6,4
.
.

Sublists are comma-delimited, and sublist elements are separated by
'+' character. Sublists contain integers in the range 1..20.

I parse these to look like so:

[[[8,12],[11],[7,13],[10]],
[[1,2,3],[1,9],[3,6],[4]],
.
.
]

I need to explode these and produce a lex-sorted list of exploded lists:

[[1,1,3,4],[1,1,6,4],[1,9,3,4],[1,9,6,4],[2,1,3,4],[2,1,6,4],[2,9,3,4],
[2,9,6,4],[3,1,3,4],[3,1,6,4],[3,9,3,4],[3,9,6,4],
[8,11,7,10],[8,11,13,10],[12,11,7,10],[12,11,13,10]]

I then output this data as comma-delimited lists:

1,1,3,4
1,1,6,4
.
.
12,11,13,10

It is a property of the underlying problem structure that none of
these comma-delimited lists in the exploded data is repeated, i.e. we
can never see two instances of (say) 1,1,3,4.
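The whole pipeline can be checked end-to-end on the two sample lines above. This is a minimal self-contained sketch: splitOn1 is a stand-in for Data.List.Split.splitOn specialised to a one-character separator (so the snippet needs nothing outside base), and cp is the cartesian product from earlier in the thread:

```haskell
import Data.List (sort)

-- Stand-in for Data.List.Split.splitOn, for a single-character separator.
splitOn1 :: Char -> String -> [String]
splitOn1 c s = case break (== c) s of
  (pre, [])     -> [pre]
  (pre, _:rest) -> pre : splitOn1 c rest

-- Cartesian product over a list of lists, as in the original post.
cp :: [[a]] -> [[a]]
cp []       = [[]]
cp (xs:xss) = [y:ys | y <- xs, ys <- cp xss]

main :: IO ()
main = do
  let parsed  = map (map (map read . splitOn1 '+') . splitOn1 ',')
                    ["8+12,11,7+13,10", "1+2+3,1+9,3+6,4"] :: [[[Int]]]
      singles = sort (concatMap cp parsed)
  print (length singles)  -- 16: 2*1*2*1 = 4 from line 1, 3*2*2*1 = 12 from line 2
  print (head singles)    -- [1,1,3,4], as in the lex-sorted output above
```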

{ Begin Aside on Why I Would Want to do Such a Silly Thing:

'Compressed lists' are in fact compressed wager combinations for multi-
leg exotic horse racing events. 'Exploded lists' are the single wager
combinations covered by the grouped combinations.

Why use compressed combinations? Well, it's a lot easier to submit
5,000 compressed tickets (whether physically or electronically) than
1M single tickets for the individual combinations we have somehow
managed to represent by the 5,000 compressed tickets.

However, single combinations are the currency of the realm. These are
the underlying wagers, and compression is merely a convenience/
necessity to enable single wagers to be placed.

It's not uncommon to generate large numbers of single combinations:
Imagine 6 races with 15 runners in each race, and a wager requiring
one to select first place getters in all 6 legs. The number of
potential outcomes is 15^6 = 11,390,625. One might decide (somehow)
that it makes sense in a positive expectation way to wager (say) 500K
of these potential outcomes.

Getting from theoretically desirable single combinations to optimally
compressed wagers is actually NP-Complete and another story for
another day. Suffice it to say that one might use some nasty hacks and
heuristics to arrive at compressed wagers - but in the process doing
violence to one's precise coverage of the theoretically desirable
single combinations. That's one reason why I need to recover (by
'exploding') the *actual* wagered single combinations from the
compressed wagers.

I need to explode these compressed wagers back to single combinations
because I need to (eventually) do set comparisons on collections of
single combinations. i.e. given File A and File B of compressed
wagers, I need to be able to answer questions like:

(1) How many single combinations are common to File A and File B?
(2) How many single combinations in File A are missing from File B?
.
.
etc., i.e. a few of the common set comparison operations.

And the dumbest, quickest, dirtiest, and most idiot-proof way of
doing this as a baseline approach, before I start using maps, tries,
etc., is to explode Files A and B into lex-sorted single combinations
and then use diff -y with judicious grepping for '>' and '<'.

End Aside }

I'm probably going to go with an external sort for starters to keep
memory usage down. For most use-cases, I can get away with my current
in-memory sort - just have to beware edge cases.

However, later on, I'm keen to do everything with set theoretic
operations and avoid writing large files of single combinations to disk.

So, I guess the things I'd like to get a feel for are:

(1) Is it realistic to expect to do in-memory set comparisons on sets
with ~1M elements where each element is a (however-encoded) list of
(say) 6 integers? I mean realistic in terms of execution time and
space usage, of course.
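As a rough feel for what those in-memory comparisons look like, here is a minimal sketch using Data.Set from the containers package. The fileA/fileB contents are made-up toy data standing in for the exploded combinations of two real files; intersection and difference answer questions (1) and (2) directly:

```haskell
import qualified Data.Set as Set

-- Hypothetical stand-ins for Files A and B after exploding.
fileA, fileB :: [[Int]]
fileA = [[1,1,3,4],[1,1,6,4],[8,11,7,10]]
fileB = [[1,1,3,4],[2,9,6,4]]

main :: IO ()
main = do
  let a = Set.fromList fileA
      b = Set.fromList fileB
  -- (1) single combinations common to A and B
  print (Set.size (Set.intersection a b))  -- 1
  -- (2) single combinations in A that are missing from B
  print (Set.size (Set.difference a b))    -- 2
```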

(2) What would be a good space-efficient encoding for these single
combinations? I know that there will never be more than 8 integers in
a combination, and none of the values will be < 1 or > 20. So perhaps
I should map them to strict ByteStrings of Word8s? Presumably sorting
a big list of ByteStrings is going to be faster than sorting a big
list of lists of Int?
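The Word8 encoding I have in mind would be something like the following sketch. Because every value fits in a single byte, the lexicographic Ord instance on ByteString agrees with lexicographic Ord on the original [Int], so sorting or set operations on the encoded form give the same order:

```haskell
import qualified Data.ByteString as B

-- Encode one combination (values 1..20, length <= 8) as one byte per value.
encode :: [Int] -> B.ByteString
encode = B.pack . map fromIntegral

decode :: B.ByteString -> [Int]
decode = map fromIntegral . B.unpack

main :: IO ()
main = do
  let a = encode [1,1,3,4]
      b = encode [1,1,6,4]
  print (a < b)     -- True: byte-wise order matches list order
  print (decode a)  -- [1,1,3,4]: round-trips losslessly
```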
