<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div>Heinrich, thanks for some great help and food for thought!</div><div><br></div><div>(More on performance of your solution below)</div><div><br><blockquote type="cite"><span class="Apple-style-span" style="border-collapse: separate; color: rgb(0, 0, 0); font-family: Helvetica; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; -webkit-text-decorations-in-effect: none; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; "><div style="font-family: 'Sans Serif'; font-size: 10pt; font-weight: 400; font-style: normal; ">Thanks for describing the background of this problem in detail! I was<br>mainly asking because I'm always looking for interesting Haskell topics<br>that can be turned into a tutorial of sorts, and this problem makes a<br>great example.<br></div></span></blockquote><div><br></div>Another interesting problem is starting from a file of single wagers and trying to compress them &nbsp;(i.e. inverse of 'explosion'. I believe this problem is analogous to Set Cover and therefore NP-Complete. Heuristics are the order of the day here.<br><div><br></div><div>(OT, but might also make an interesting multi-part tutorial question/topic:</div><div>A simpler (fairly unrelated problem) is doing exact k-means clustering of a list of n items (i.e. single dimension data) via the k-segmentations of a (sorted by values one is clustering on) list of n items. The number of k-segmentations of an n-list is (n-1) Choose (k-1). One generates all the k-segmentations and each set of k-segments constitutes a candidate clustering solution. One selects the candidate solution which minimises the 'goodness of fit criterion' used in k-means clustering (easily found in the literature). Needless to say, this is only feasible for smallish n and for values of k closer to the LHS and RHS of Pascal's triangle. K-means clustering in more than one dimension is NP-Complete.)</div><br><blockquote type="cite"><span class="Apple-style-span" style="border-collapse: separate; color: rgb(0, 0, 0); font-family: Helvetica; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; -webkit-text-decorations-in-effect: none; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; "><div style="font-family: 'Sans Serif'; font-size: 10pt; font-weight: 400; font-style: normal; ">Concerning optimization, the background also reveals some additional<br>informations, namely<br><br>* The single wager combinations are short lists<br>* and each list element is small, i.e. `elem` [1..20]<br>* No duplicate single wagers<br>* You are free to choose another ordering than lexicographic<br><br><blockquote type="cite">I need to explode these compressed wagers back to single combinations<br></blockquote><blockquote type="cite">because, I need to (eventually) do set comparisons on collections &nbsp;of<br></blockquote><blockquote type="cite">single combinations. i.e. given File A and File B of compressed wagers,<br></blockquote><blockquote type="cite">I need to be able to &nbsp;answer questions like:<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">(1) How many single combinations are common to File A and File B<br></blockquote><blockquote type="cite">(2) How many single combinations in File A are missing from File B<br></blockquote><blockquote type="cite">..<br></blockquote><blockquote type="cite">..<br></blockquote><blockquote type="cite">etc. i.e a few of the common set comparison operations.<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">And the dumbest, quickest and dirtiest and most idiot-proof way of doing<br></blockquote><blockquote type="cite">this as a baseline approach before I start using maps, tries, etc... is<br></blockquote><blockquote type="cite">to explode Files A and B into lex sorted single combinations order and<br></blockquote><blockquote type="cite">then use diff -y with judicious grepping for '&gt;' and &lt;''<br></blockquote><br>Looks good to me, sorted lists are a fine data structure for set operations.<br><br><blockquote type="cite">I'm probably going to go with an external sort for starters to keep<br></blockquote><blockquote type="cite">memory usage down. For most use-cases, I can get away with my current<br></blockquote><blockquote type="cite">in-memory sort - just have to beware edge cases.<br></blockquote><br>Using an out-of-the box external sort library is probably the path of<br>least resistance, provided the library works as it should.<br></div></span></blockquote><div><br></div><div>Seems to be some issue with the external-sort package. As it currently exists, it blows up on my [[Int]] data and exhibits more memory usage than vanilla sort! I'm keen to look into fixing this as might be one small way I can give back, but still taking baby steps with Haskell.</div><br><blockquote type="cite"><span class="Apple-style-span" style="border-collapse: separate; color: rgb(0, 0, 0); font-family: Helvetica; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; -webkit-text-decorations-in-effect: none; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; "><div style="font-family: 'Sans Serif'; font-size: 10pt; font-weight: 400; font-style: normal; ">However, your problem has a very special structure; you can interleave<br>explode &nbsp;and &nbsp;sort &nbsp;and turn the algorithm into one that is partially<br>on-line, which will save you a lot of memory. I've tried to show how to<br>do this in my previous post; the resulting code will look something like<br>this<br><br>&nbsp;&nbsp;&nbsp;{-# LANGUAGE NoMonomorphismRestriction #-}<br>&nbsp;&nbsp;&nbsp;import qualified Data.Map<br>&nbsp;&nbsp;&nbsp;import Control.Arrow (second)<br><br>&nbsp;&nbsp;&nbsp;newtype Map a b = Map { unMap :: [(a,b)] } deriving (Eq, Show)<br><br>&nbsp;&nbsp;&nbsp;instance Functor (Map a) where<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;fmap f = Map . (map . second) f . unMap<br><br>&nbsp;&nbsp;&nbsp;hylo :: (Map a b -&gt; b) -&gt; (c -&gt; Map a c) -&gt; (c -&gt; b)<br>&nbsp;&nbsp;&nbsp;hylo f g = f . fmap (hylo f g) . g<br><br>&nbsp;&nbsp;&nbsp;type Row a = [a]<br><br>&nbsp;&nbsp;&nbsp;headsOut :: Map a [Row a] -&gt; [Row a]<br>&nbsp;&nbsp;&nbsp;headsOut (Map []) = [[]]<br>&nbsp;&nbsp;&nbsp;headsOut m &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;= [x:row | (x,rows) &lt;- unMap m, row &lt;- rows]<br><br>&nbsp;&nbsp;&nbsp;explode1 :: [Row [a]] -&gt; Map a [Row [a]]<br>&nbsp;&nbsp;&nbsp;explode1 rows = Map [(x,[row]) | (xs:row) &lt;- rows, x &lt;- xs]<br><br>&nbsp;&nbsp;&nbsp;sort1 :: Ord a =&gt; Map a [b] -&gt; Map a [b]<br>&nbsp;&nbsp;&nbsp;sort1 = Map . Data.Map.toList . Data.Map.fromListWith (++) . unMap<br><br>&nbsp;&nbsp;&nbsp;sortExplode = hylo headsOut (sort1 . explode1)<br><br>&nbsp;&nbsp;&nbsp;example = [[[8,12],[11],[7,13],[10]], [[1,2,3],[1,9],[3,6],[4]]]<br>&nbsp;&nbsp;&nbsp;test &nbsp;&nbsp;&nbsp;= sortExplode example<br><br>&nbsp;&nbsp;&nbsp;main = interact (unlines . toCSV . sortExplode . fromCSV . lines)<br><br>In other words, &nbsp;sortExplode &nbsp;will "stream" the sorted single wagers on<br>demand, unlike the monolithic &nbsp;sort . explode &nbsp;which has to produce all<br>single wagers before sorting them.<br></div></span></blockquote><div><br></div>Thank you *very* much for this code! I will try to get my head around it. I understand the broad outline you posted elsewhere, but will take me a while to fully grasp the above as I'm only up to ~p200 in Real World Haskell :).</div><div><br></div><div>As for performance of your code above on my file of compressed wagers which expands to 3.7M single wagers:&nbsp;<br><div><br></div><div>(My original version posted here and using vanilla sort)</div><div>541 MB total memory in use (5 MB lost due to fragmentation)</div><div><div>INIT &nbsp;time &nbsp; &nbsp;0.00s &nbsp;( &nbsp;0.01s elapsed)</div><div>&nbsp;&nbsp;MUT &nbsp; time &nbsp; 13.98s &nbsp;( 15.82s elapsed)</div><div>&nbsp;&nbsp;GC &nbsp; &nbsp;time &nbsp; &nbsp;8.69s &nbsp;( &nbsp;9.64s elapsed)</div><div>&nbsp;&nbsp;EXIT &nbsp;time &nbsp; &nbsp;0.00s &nbsp;( &nbsp;0.00s elapsed)</div><div>&nbsp;&nbsp;Total time &nbsp; 22.67s &nbsp;( 25.47s elapsed)</div><div><br></div><div>(Your much improved version)</div><div>10 MB total memory in use (1 MB lost due to fragmentation)</div><div><div>&nbsp;&nbsp;INIT &nbsp;time &nbsp; &nbsp;0.00s &nbsp;( &nbsp;0.00s elapsed)</div><div>&nbsp;&nbsp;MUT &nbsp; time &nbsp; &nbsp;7.61s &nbsp;( &nbsp;9.38s elapsed)</div><div>&nbsp;&nbsp;GC &nbsp; &nbsp;time &nbsp; &nbsp;3.48s &nbsp;( &nbsp;3.58s elapsed)</div><div>&nbsp;&nbsp;EXIT &nbsp;time &nbsp; &nbsp;0.00s &nbsp;( &nbsp;0.00s elapsed)</div><div>&nbsp;&nbsp;Total time &nbsp; 11.08s &nbsp;( 12.95s elapsed)</div><div><br></div></div><div>Very impressive and thanks again!</div><div><br></div></div><blockquote type="cite"><div style="font-family: 'Sans Serif'; font-size: 10pt; font-weight: 400; font-style: normal; "><blockquote type="cite">However, later on, I'm keen to do everything with set theoretic<br></blockquote><blockquote type="cite">operations and avoid writing large files of single combinations to disk.<br></blockquote><br>Note that there's also the possibility of not expanding the compressed<br>wagers at all, and perform set operations directly. For instance, it is<br>straightforward to intersect two such sets of size n in O(n^2) time.<br>Since n ~ 5000 , n^2 is about the same ballpark as the exploded single<br>wager combinations.<br></div></blockquote><div><br></div>Not sure I quite understand you here. In my mind, set elements *are* single combinations. It is possible for two quite different-looking files of compressed wagers to contain exactly the same single wager elements. So I'm not sure how to compare without some notion of explosion to single combinations.<br><div><br></div><blockquote type="cite"><div style="font-family: 'Sans Serif'; font-size: 10pt; font-weight: 400; font-style: normal; "><blockquote type="cite">So, I guess the things I'd like to get a feel for are:<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">(1) Is it realistic to expect to do in-memory set comparisons on sets<br></blockquote><blockquote type="cite">with ~1M elements where each element is a (however-encoded) list of<br></blockquote><blockquote type="cite">(say) 6 integers? I mean realistic in terms of execution time and space<br></blockquote><blockquote type="cite">usage, of course.<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">(2) What would be a good space-efficient encoding for these single<br></blockquote><blockquote type="cite">combinations? I know that there will never be more than 8 integers in a<br></blockquote><blockquote type="cite">combination, and none of these values will be &lt; 1 or &gt; 20. So perhaps I<br></blockquote><blockquote type="cite">should map them to ByteString library Word8 strings? Presumably sorting<br></blockquote><blockquote type="cite">a big list of ByteStrings is going to be faster than sorting a big list<br></blockquote><blockquote type="cite">of lists of int?<br></blockquote><br>The overhead for lists and sets are quite high, it's not uncommon for 1M<br>elements to occupy 10M-100M of memory.<br><br>Not to mention that your elements are, in fact, small lists, introducing<br>another factor of 10-100. But this can indeed be ameliorated<br>considerably by representing your elements in a packed format like<br>ByteStrings or a Word64.<font class="Apple-style-span" color="#000000" face="Helvetica"><span class="Apple-style-span" style="font-size: medium;"><font class="Apple-style-span" color="#144FAE" face="'Sans Serif'" size="3"><span class="Apple-style-span" style="font-size: 13px;"><br></span></font></span></font></div></blockquote><br></div><div>I haven't tried packing ByteStrings yet in this context, but *did* get some improvement in external-sort performance when using ByteStrings instead of [[Int]]. Presumably because it somewhat ameliorated the effects of a bug in external-sort where blocks are fully compared rather than just the heads of blocks.</div><div><br></div><div>I'll be interested to play around with your version more and see if it benefits much from ByteString. It's already very good now in terms of space usage, but perhaps can speed up the I/O (output dominates, of course).</div><div><br></div><div>I'll be learning from your post for a long time. So thanks again!</div></body></html>