<div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div><div class="h5">I suspect the following might be faster:<br>
<br>
data BinomForest2 a = Empty<br>
| NonEmpty a [BinomTree2 a]<br>
<br>
data BinomTree2 a = BinomTree2 a [BinomTree2 a]<br>
<br>
This eliminates the "Skip" constructor, which contributes only to the<br>
nested type guarantee.<br></div></div></blockquote><div><br></div><div>Ehehehe. This is something I'm pretty proud of inventing, because your implementation is actually significantly *slower*. The reason is essentially that I can do a lot more constructor unpacking when I have access to that much type information about the structure of the tree at each level.</div>
<div><br></div><div>You didn't quite implement it correctly, because among other things, we *need* to track tree rank, at least for the roots, if it's not encoded in the type. Here are your data types, with everything unpacked as much as possible.</div>
<div><br></div><div><div>data BinomQueue a = Nil | Cons {-# UNPACK #-} !Int {-# UNPACK #-} !(BinHeap a) (BinomQueue a)</div><div>-- equivalent to [(Int, BinHeap a)], but unrolled<br>-- we only need to track ranks in the roots</div>
<div>data BinomSubtree a = Nil' | Cons' {-# UNPACK #-} !(BinHeap a) (BinomSubtree a)</div><div>-- equivalent to [BinHeap a], but unrolled</div><div>data BinHeap a = Bin a (BinomSubtree a)</div><div><br></div><div>
I tested, and this implementation actually performs better if the spine is maintained lazily, so we'll test that version. The full implementation (that is, the basic methods: insert, extract, fromList, toAscList) of your approach was attached to the ticket <a href="http://hackage.haskell.org/trac/ghc/attachment/ticket/3909/QuickBinom.hs">here</a>. Feel free to ghc-core it, or tinker with the implementation, but I've done a fair bit of tinkering, and my results haven't changed.</div>
<div><br></div><div>Running a benchpress test on heapsorting 25000 Ints, calling performGC after every run of each method. SBinomial stands for "sparse binomial heap," which is your approach.</div><div><br></div>
<div>With -O2, +RTS -H128m:</div><div><div><div> min mean +/-sd median max</div><div>Binomial: 0.000 3.440 2.204 4.000 8.001</div><div>SBinomial: 24.001 28.562 5.600 28.001 48.003 (ratio: 8.3x slower average)</div>
</div></div><div><div>With -O2:</div><div><div><div> min mean +/-sd median max</div><div>Binomial: 4.000 8.801 2.606 8.001 24.002</div><div>SBinomial: 32.002 41.763 8.007 42.003 64.004 (ratio: 4.7x slower average)</div>
</div></div><div>Without optimization, with +RTS -H128m:</div><div><div><div> min mean +/-sd median max</div><div>Binomial: 4.001 10.041 3.140 8.001 20.002</div><div>SBinomial: 64.004 76.805 8.790 76.005 100.006 (ratio: 7.6x slower average)</div>
</div><div>Without optimization:</div><div><div><div>Binomial: 12.000 19.761 5.328 16.001 40.002</div><div>SBinomial: 72.004 90.126 11.906 88.006 120.008 (ratio: 4.6x slower average)</div></div></div>
</div><div><br></div><div>These times are measured in milliseconds. Conclusion: my implementation is <i>massively faster</i>, by a factor of at least 4.5x. (I spent the last half hour trying to optimize SBinomial -- it really is that big a difference, and it's not going to change.)</div>
<div><br></div><div>Here's why I think that's the case: even though we might have the Skip constructor, how many Skip constructors are there, total? On average, half the forest is made up of Skips, so there are 1/2 log n Skip values there.</div>
<div><br></div><div>But the thing is that the sparse binomial tree representation can't do anywhere near as much unpacking; in particular, the set of children of each node is not a single-constructor type. That's an O(n) increase in allocation, all for an O(log n) shortening of the spine. That's why it's a bad plan. Most of the work is being done in allocations of nodes in the *trees,* rather than along the spine among the roots. In this area, the crazy, type-system-exploiting approach actually does much less allocation, because it can do a lot more constructor unpacking. Let's test this hypothesis:</div>
<div><br></div><div>My type-magical implementation, -O2, +RTS -sstderr (no -H128m this time):</div><div><br></div><div><div> 3,050,272,052 bytes allocated in the heap</div><div> 240,340,552 bytes copied during GC </div>
<div> 1,087,992 bytes maximum residency (201 sample(s))</div><div> 53,136 bytes maximum slop </div><div> 3 MB total memory in use (0 MB lost due to fragmentation)</div><div>
<br></div><div> Generation 0: 5703 collections, 0 parallel, 0.48s, 0.53s elapsed</div><div> Generation 1: 201 collections, 0 parallel, 0.22s, 0.24s elapsed</div><div><br></div><div> INIT time 0.00s ( 0.00s elapsed)</div>
<div> MUT time 4.52s ( 4.74s elapsed)</div><div> GC time 0.70s ( 0.77s elapsed)</div><div> EXIT time 0.00s ( 0.00s elapsed)</div><div> Total time 5.23s ( 5.51s elapsed)</div><div><br></div><div>
%GC time 13.5% (13.9% elapsed)</div><div><br></div><div> Alloc rate 674,200,547 bytes per MUT second</div><div><br></div><div> Productivity 86.5% of total user, 82.2% of total elapsed</div><div><br></div><div>
The sparse binomial forest representation, same options:</div><div><br></div><div><div> 5,612,965,672 bytes allocated in the heap</div><div> 563,671,500 bytes copied during GC </div><div> 1,967,576 bytes maximum residency (202 sample(s))</div>
<div> 107,212 bytes maximum slop </div><div> 5 MB total memory in use (1 MB lost due to fragmentation)</div><div><br></div><div> Generation 0: 10602 collections, 0 parallel, 1.60s, 1.67s elapsed</div>
<div> Generation 1: 202 collections, 0 parallel, 0.28s, 0.30s elapsed</div><div><br></div><div> INIT time 0.00s ( 0.00s elapsed)</div><div> MUT time 6.52s ( 6.51s elapsed)</div><div> GC time 1.87s ( 1.97s elapsed)</div>
<div> EXIT time 0.00s ( 0.00s elapsed)</div><div> Total time 8.40s ( 8.48s elapsed)</div><div><br></div><div> %GC time 22.3% (23.2% elapsed)</div><div><br></div><div> Alloc rate 860,302,018 bytes per MUT second</div>
<div><br></div><div> Productivity 77.7% of total user, 76.9% of total elapsed</div><div><br></div><div>Furthermore, if we actually profile the sparse forest representation, we confirm that a <i>huge</i> amount of allocation is being done during tree melding.</div>
<div><br></div><div>Conclusion: the type-magical binomial heap implementation does a lot less allocating, and is faster even ignoring GC time. I encourage you to test it out for yourself, but I think you'll find similar results. =D</div>
<div><br></div><div>Louis</div></div></div></div></div></div>