[Haskell-cafe] Re: FASTER primes

Mon Jan 4 07:25:47 EST 2010

Daniel Fischer <daniel.is.fischer <at> web.de> writes:

> 
> 
> Am Sonntag 03 Januar 2010 09:54:37 schrieb Will Ness:
> 
> > Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> > 
> > > But there's a lot of list constructuion and deconstruction necessary for
> > > the Euler sieve.
> >
> > yes. Unless, of course, s smart compiler recognizes there's only one
> > consumer for the values each multiples-list is producing, and keeps it
> > stripped down to a generator function, and its current value. I keep
> > thinkig a smart compiler could eliminate all these "span" calls and replace
> > them with just some pointers manipulating...
> >
> 
> Of course I'm no smart compiler, but I don't see how it could be even
> possible to replace the span calls with pointer manipulation when dealing
> with lazily generated (infinite, if we're really mean) lists. Even when
> you're dealing only with strict finite lists, it's not trivial to do
> efficiently.

I keep thinking that storage duplication with span, filter etc. is not really 
necessary, and can be replaced with some pointer chasing - especially when 
there's only one consumer present for the generated values. 

What I mean is thinking of lists in terms of produce/consumer paradigm, as 
objects supporting the { pull, peek } interface, keeping the generator inside 
that would produce the next value on 'pull' request and keep it ready for 
any 'peek's.

Euler's sieve is

 sieve (p:ps) xs = h ++ sieve ps (t `minus` map (p*) [p,p+2..])
                      where (h,t) = span (< p*p) xs

Everything lives only through access, so (sieve (tail primes) [5,7]) would 
create an object with the generator which has the 'span' logic inlined:

 sieve ps xs = make producer such that
     p := pull ps -- alter ps as well (actually pull a value from it)
     q := p*p
     peek = x := peek xs
            if x < q then x else peek (remake self)
     pull = x := peek xs
            if x < q then pull xs else pull (remake self)
     remake = ys := minus xs (intsFromBy q (2*p))
              self := sieve ps ys

Here the only thing that gets created are the 'minus' nodes which essentially 
maintain pointers into the two streams that they consume. 'intsFromBy' only has 
to maintain two integers inside it (currentVal and step) as there's no need for 
it to maintain any storage for its results, as they are immediately consumed. A 
persistent list would be represented by a different kind of producer which 
would be given a storage to operate on, upon creation (as would the top level 
variable like 'primes').

The real difference here is between those producers whose values will only be 
consumed once, by one specific consumer, and those which values may be needed 
more than once, so need really to be maintained in some storage. If not - span, 
filter, map, whatever - they all are just little modifiers on top of the real 
producers, which may or may not also have an actual storage maintained by them.

> > again, what I mean is, not _where_ I start crossing them off in a PQ, but
> > _when_. The article's code starts crossing them off _at_ p^2 - by adding
> > p^2+2p into the PQ - _as_ _soon_ as p itself is reached. It won't surface
> > until p^2 will be considered for a prime; it'll lay dormant deep inside the
> > queue's guts. When reaching 7919, the thousand (i.e. pi(7919) ) entries
> > will hang out inside the PQ - instead of just 24. A memory blowup. (this is
> > of course fixed in Melissa's ZIP package). Of course due to the nature of
> > PQ it might actually not hurt the performance for a while, depending on
> > partcular PQ implementation. Complexity _and_ constant factors.
> >
> 
> It will start to have an impact pretty soon. Assuming at least one of the
> relevant PQ operations to be Theta(log size), each composite between ~400
> and ~400000 (rough estimates) will take something like twice as long to
> handle. It will start to hurt really badly only a while later, though, as
> a guesstimate, with more than a million primes in the PQ, memory will have
> a hard time.

Exactly!

> > > If, on the other hand, you start crossung off at 2*p, when the main sieve
> > > is at 10^7, the size of the PQ is > 650000, at 10^8, the size is more
> > > than 5.5 million. That starts to become a memory problem rather soon.
> >
> > here you don't have a choice or when to add it - you have to add it at p
> > itself - so the problem is clear. But even when you cross at p^2, the
> > question remains, of when you add the p^2 entry into the PQ. That was my
> > point.
> >
> > Postponed Filters code makes this clear, and thus hard to err on.
> > Unfortunately, it wasn't present  _in_  the article.

> > > I think that remark was meant to apply to composite removal, not Turner's
> > > sieve.
> >
> > It is right there on page 2, right when the Turner's sieve is presented and
> > discussed. The only explanation that I see is that she thought of it in
> > regards to the imperative code, just as her analysis concentrates only on
> > calculation aspects of the imperative code itself.
> >
> 
> To be fair, she writes:
> 
> "Let us ﬁrst describe the original “by hand” sieve algorithm as practiced by
> Eratosthenes. 
> ...
> The starting point of p^2 is a pleasing but minor optimization, which can 
> be made
> because lower multiples will have already been crossed oﬀ when we found
> the primes prior to p. 
> ... (This optimization does not aﬀect the time complexity of the sieve,
> however, so its absence from the code in Section 1 
> *is not our cause for worry*.)"

A-HA! 

But its absense from _*that*_ _*code*_ WAS the *major* cause for worry, as 
dealing with it worked wonders on its complexity and constant factors.

This remark only makes sense in the imperative, mutable-storage setting, as 
we've already estalished.

> So it's in context of the imperative code (although rather numbers on paper
> than bits in RAM).
> For imperative (be it array or paper), it is indeed minor; for her PQ sieve,
> it can be considered minor from a theoretical point of view (doesn't change
> time complexity), but practically, anything that changes performance by a
> factor of two or more is only 'minor' for hopelessly inadequate algorithms
> (whether a computation would take ten or twenty billion years is indeed a
> minor difference; one hour or two is a major difference).

++

> However, as you say, the important point is not whether p's multiples get
> crossed off starting from 2*p or from p^2, but whether p's multiples get
> inserted into the queue *when* you look at p or at p^2.

Exactly! Not _where_, but _when_ (in a lifetime of a computational process).

And this (I would say, important) observation was missing from the article 
precisely because of her focusing on the analysis of the _imperative_ algorithm.

But if we focus on functional, it is a logical thing to realise, and when 
implemented shows us the road to the next improvement by a gradual stepwise 
development. A matter of personal preference. :)

> > > ... the cost for
> > > identifying p_k is pi(sqrt p_k) [trial division by all primes less than
> > > sqrt p_k].
> > > p_k is approximately k*log k, so pi(sqrt p_k) is approximately
> > > (sqrt $ k* log k)/log (sqrt $ k*log k) ~ 2*sqrt (k/log k).
> > > Summing that, the cost for identifying the first n primes is
> > > Theta(n^1.5/log n).
> >
> > that would correspond to what I've seen much better.
> >
> 
> It should. 

:)

> > The quesion of a memory blowup with the treefolding merge still remains.
> > For some reason using its second copy for a feeder doesn't reduce the
> > memory (as reported by standalone compiled program, GHCi reported values
> > are useless) - it causes it to increase twice.....
> >
> 
> I have a partial solution. The big problem is that the feeder holds on to 
> the beginning of comps while the main runner holds on to a later part. 
> Thus the entire segment between these two points must be in memory. 
> So have two lists of composites (yeah, you know that, but it didn't 
> work so far).

I did. I duplicated everything. The reported memory was twice bigger. :|

> But you have to force the compiler not to share them: enter -fno-cse.
> The attached code does that (I've also expanded the wheel), it reduces the
> memory requirements much (a small part is due to the larger wheel, a factor
> of ~5 due to the non-sharing).
> It still uses much more memory than the PQ, and while the PQ's memory
> requirements grow very slowly, the tree-fold merge's still grow rather 
> fast (don't go much beyond the 10,000,000th prime), I'm not sure why.
> 

Great! I will look into the code. Thanks! :) :)

> 
> 
> 
> Attachment (V13Primes.hs): text/x-haskell, 3621 bytes
> 
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe <at> haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>