# [Haskell-beginners] Performance of Prime Generator

Sun Feb 5 03:30:49 CET 2012

Sorry for the late reply; I haven't checked my mail for a while.

On Tuesday 24 January 2012, 13:50:32, Ertugrul Söylemez wrote:
> Daniel Fischer <daniel.is.fischer at googlemail.com> wrote:
> > > > Well, thanks, so far I have tried wheel only and Sieve of
> > > > Eratosthenes only approaches. They both failed. When the numbers
> > > > are between 999900000 and 1000000000, they can take more than an
> > > > hour on my laptop.
> >
> > They shouldn't. But you have to use the right data structures.
> >
> > For a prime sieve, you need unboxed mutable arrays if you want it to
> > be fast. You can use STUArrays from Data.Array.ST or mutable unboxed
> > vectors from Data.Vector.Unboxed.Mutable.
>
> That's what I've used.  You find the code on hpaste [1].  It's a
> carefully optimized Sieve of Eratosthenes and needs around 20 seconds
> for the primes up to 10^9.  See the refinement in the annotation, which
> I've added just now.  Before that it took around 35 seconds.
>
> I considered that to be "too slow".

Right you are. Unless you need the primes to perform some time-consuming
calculations with them after sieving, that is too slow.

But let us compare it with a similar C implementation.

>
> [1] http://hpaste.org/56866

The interesting parts are:
```haskell
{-# LANGUAGE ScopedTypeVariables #-}
import Control.Monad (when)
import Control.Monad.ST (ST)
import Data.Array.ST (STUArray, newArray, readArray, writeArray, runSTUArray)
import Data.Array.Unboxed (UArray, elems)

soeST :: forall s. Int -> ST s (STUArray s Int Bool)
soeST n = do
    arr <- newArray (0, n) True
    mapM_ (\i -> writeArray arr i False) [0, 1]
    let n2 = floor (sqrt (fromIntegral n) :: Double)

    let loop :: Int -> ST s ()
        loop i
            | i > n2    = return ()
            | otherwise = do
                b <- readArray arr i
                let reset :: Int -> ST s ()
                    reset j
                        | j > n     = return ()
                        | otherwise = writeArray arr j False >> reset (j + i)
                when b (reset (i * i))
                loop (succ i)
    loop 2
    return arr

-- not shown in the excerpt; presumably a runSTUArray wrapper:
soeA :: Int -> UArray Int Bool
soeA n = runSTUArray (soeST n)

soeCount :: Int -> Int
soeCount = length . filter id . elems . soeA
```

With ghc-7.2.2 and -O2, that took 16.8 seconds here to count the 50847534
primes up to 10^9. That's in the same ballpark as your time, maybe a bit
faster, depending on what 'around 20 seconds' means exactly.

Let's be unfriendly first:

Arracc.h:
```c
#include <stdint.h>

inline int readArray(uint64_t *arr, int n, int i);
inline void writeArray(uint64_t *arr, int n, int i, int b);
```

Arracc.c:
```c
#include <stdio.h>
#include <stdlib.h>
#include "Arracc.h"

inline int readArray(uint64_t *arr, int n, int i){
    if (i < 0 || i > n){
        perror("Error in array index\n");
        exit(EXIT_FAILURE);
    }
    return (arr[i >> 6] >> (i & 63)) & 1;
}

inline void writeArray(uint64_t *arr, int n, int i, int b){
    if (i < 0 || i > n){
        perror("Error in array index\n");
        exit(EXIT_FAILURE);
    }
    if (b) {
        arr[i >> 6] |=  (1ull << (i & 63));
    } else {
        arr[i >> 6] &= ~(1ull << (i & 63));
    }
}
```

main.c:
```c
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <math.h>
#include "Arracc.h"

uint64_t *soeA(int n);
int pCount(uint64_t *arr, int n);

int main(int argc, char *argv[]){
    int limit = (argc > 1) ? strtoul(argv[1], NULL, 0) : 100000000;
    uint64_t *sieve = soeA(limit);
    printf("%d\n", pCount(sieve, limit));
    free(sieve);
    return EXIT_SUCCESS;
}

uint64_t *soeA(int n){
    int s = (n >> 6) + 1;
    uint64_t *arr = malloc(s * sizeof *arr);
    if (arr == NULL){
        perror("Allocation failed\n");
        exit(EXIT_FAILURE);
    }
    int i, j, n2 = (int)sqrt(n);
    for(i = 0; i < s; ++i){
        arr[i] = -1;
    }
    writeArray(arr, n, 0, 0);
    writeArray(arr, n, 1, 0);
    for(i = 2; i <= n2; ++i){
        if (readArray(arr, n, i)){
            for(j = i*i; j <= n; j += i){
                writeArray(arr, n, j, 0);
            }
        }
    }
    return arr;
}

int pCount(uint64_t *arr, int n){
    int i, count = 0;
    for(i = 0; i <= n; ++i){
        if (readArray(arr, n, i)){
            ++count;
        }
    }
    return count;
}
```

$ gcc -c -O3 Arracc.c
$ gcc -c -O3 main.c
$ gcc -O3 main.o Arracc.o -lm

Takes 18.3 seconds. Oops, slower than the Haskell programme.

Okay, ghc does some cross-module inlining and that enables further
optimisations which we denied gcc here, so recompile everything with -flto
additionally. That brings the C version down to 12.8 seconds.

Much better, but still not overwhelming. gcc can do a little better with
everything in one file, 12.72 seconds.

We can squeeze out a little more by omitting the bounds checks for array
indexing, 12.63 seconds.

But if we do the same for our Haskell programme, replacing `readArray` and
`writeArray` with the unchecked `unsafeRead` and `unsafeWrite` from
Data.Array.Base, the running time drops to 14.3 seconds.

I have to admit that I don't know why the bounds-checking costs very little
in C and a substantial amount in Haskell, but hey, who refuses free
optimisations.

Lesson 1: Omit pointless array bounds checks.
An array bounds check is pointless if you have just checked the validity of
the index on the line above.

Where are we?

13% slower than C (gcc-4.5.1).

Not a bad result for ghc, but the algorithm is less than optimal.

What's the deal?

We cross off multiples of primes 2549489372 times, and we need (10^9+1)/8
bytes of memory for the sieve.

Both numbers are higher than is good for us. Since the sieve is so large
and we cross off the multiples of each prime sequentially until we reach
the sieve limit, we have lots of cache misses. Bad for performance.
And each crossing-off takes a couple of clock cycles:

```c
arr[i >> 6] &= ~(1ull << (i & 63));
```

shift index, fetch value, (index & 63), shift 1, complement, AND with
value, store value.

I'm no CPU expert, so I don't know how many cycles that takes, but even
though some of the operations can be done in parallel on modern CPUs, it's
still several cycles.

Lesson 2: Avoid duplicate work and redundant data.
It's easy to separate even and odd numbers, there aren't many even primes,
and it's unnecessary to mark even numbers as multiples of odd primes.

Separating the crossing-off of even and odd numbers, crossing off only odd
multiples of odd primes, reduces the number of crossings-off by almost 40%
and the running time for the C programme to 8.63 seconds (I haven't done
that step for the Haskell version).

If we go the small step further and eliminate the even numbers from the
sieve completely, we reduce the required memory by half and the crossings-
off by almost 60% (the fraction of eliminated crossings-off slowly
decreases to 50% for higher limits, but as far as today's RAM takes us, it
remains close to 60%).

At the cost of very slightly more complicated code, we can thus reduce the
running time to 6.06 seconds for C and 6.49 seconds for Haskell (about 7%
slower than C, not bad at all).

That's a pretty good start, for some purposes it's already good enough.

If we need more, the next step is to eliminate multiples of 3 from the
sieve. That reduces the memory requirements by a further factor of 2/3, and
the number of crossings-off by a further (mumble, didn't do an exact
calculation, educated guess) roughly 40% (that fraction decreases to 1/3
for higher limits, but again very slowly).

The code becomes a bit more complicated again - more than in the previous
step, but still fairly straightforward.

The running time reduces by another factor of 0.65-0.7 or so. We'd be in
the region of 4-5 seconds then.

One can eliminate the multiples of further small primes, for each prime the
additional code complexity increases and the reduction in work/running time
decreases. At some point, the slowdown from the more complex code
annihilates the gain from the reduced work. I stopped after eliminating
multiples of 5, 7 might be worth it, but I'm not convinced.

Using a segmented sieve provides better locality at the cost of more
complex code. If the sieve is small enough to fit in the L2 cache and large
enough that some significant work can be done per segment, it's a net gain.

tl;dr: The naive algorithm we started from is too slow for higher limits;
to get goodish performance, it has to be tweaked heavily. But Haskell plays
in the same league as C (at least the C I've written).

>
> > > I haven't tried it, but an equivalent C implementation can easily
> > > compute the first 10^9 prime numbers within seconds.
> >
> > You mean the primes up to 10^9 here?
>
> Yes, sorry.  And I was referring to the Sieve of Atkin there, but you
> are right.  That one is only slightly faster.

Well, D.J. Bernstein's primegen is much faster. With the sieve size adapted
to my L1 cache, it counts the primes to 10^9 in 0.68 seconds.
But it's only heavily optimised for primes to 2^32. Above that, my sieve
catches up (and overtakes it around 5*10^11, yay).

>
>
> Greets,
> Ertugrul

Cheers,
Daniel
