Zhi-Qiang Lei zhiqiang.lei at gmail.com
Sun Feb 5 06:35:52 CET 2012

```Hi Guys,

I just find the STUArray you are using is strange for me. Is there any tutorial for that? Google didn't help. Thanks.

On Feb 5, 2012, at 10:30 AM, Daniel Fischer wrote:

> Sorry for the late reply, haven't checked my mail for a while.
>
> On Tuesday 24 January 2012, 13:50:32, Ertugrul Söylemez wrote:
>> Daniel Fischer <daniel.is.fischer at googlemail.com> wrote:
>>>>> Well, thanks, so far I have tried wheel only and Sieve of
>>>>> Eratosthenes only approaches. They both failed. When the numbers
>>>>> is between 999900000 and 1000000000, they can take more than a
>>>>> hour on my laptop.
>>>
>>> They shouldn't. But you have to use the right data structures.
>>>
>>> For a prime sieve, you need unboxed mutable arrays if you want it to
>>> be fast. You can use STUArrays from Data.Array.ST or mutable unboxed
>>> vectors from Data.Vector.Mutable.Unboxed.
>>
>> That's what I've used.  You find the code on hpaste [1].  It's a
>> carefully optimized Sieve of Eratosthenes and needs around 20 seconds
>> for the primes up to 10^9.  See the refinement in the annotation, which
>> I've added just now.  Before that it took around 35 seconds.
>>
>> I considered that to be "too slow".
>
> Right you are. Unless you need the primes to perform some time-consuming
> calculations with them after sieving, that is too slow.
>
> But let us compare it with a similar C implementation.
>
>>
>> [1] http://hpaste.org/56866
>
> The interesting parts are:
> ========================
> soeST :: forall s. Int -> ST s (STUArray s Int Bool)
> soeST n = do
>    arr <- newArray (0, n) True
>    mapM_ (\i -> writeArray arr i False) [0, 1]
>    let n2 = floor (sqrt (fromIntegral n))
>
>    let loop :: Int -> ST s ()
>        loop i | i > n2 = return ()
>        loop i = do
>            b <- readArray arr i
>
>            let reset :: Int -> ST s ()
>                reset j | j > n = return ()
>                reset j = writeArray arr j False >> reset (j + i)
>
>            when b (reset (i*i))
>
>            loop (succ i)
>
>    loop 2
>    return arr
>
> soeCount :: Int -> Int
> soeCount = length . filter id . elems . soeA
> ========================
>
> With ghc-7.2.2 and -O2, that took 16.8 seconds here to count the 50847534
> primes up to 10^9. That's in the same ballpark as your time, maybe a bit
> faster, depends on what 'around 20 seconds' means exactly.
>
> Let's be unfriendly first:
>
> Arracc.h:
> ============
> #include <stdint.h>
>
> inline int readArray(uint64_t *arr, int n, int i);
> inline void writeArray(uint64_t *arr, int n, int i, int b);
> ============
>
> Arracc.c:
> ============
> #include <stdlib.h>
> #include "Arracc.h"
>
> inline int readArray(uint64_t *arr, int n, int i){
>    if (i < 0 || i > n){
>        perror("Error in array index\n");
>        exit(EXIT_FAILURE);
>    }
>    return (arr[i >> 6] >> (i & 63)) & 1;
> }
>
> inline void writeArray(uint64_t *arr, int n, int i, int b){
>    if (i < 0 || i > n){
>        perror("Error in array index\n");
>        exit(EXIT_FAILURE);
>    }
>    if (b) {
>        arr[i >> 6] |= (1ull << (i&63));
>    } else {
>        arr[i >> 6] &= ~(1ull << (i&63));
>    }
> }
> ============
>
> main.c:
> ============
> #include <stdlib.h>
> #include <stdio.h>
> #include <stdint.h>
> #include <math.h>
> #include "Arracc.h"
>
> uint64_t *soeA(int n);
> int pCount(uint64_t *arr, int n);
>
> int main(int argc, char *argv[]){
>    int limit = (argc > 1) ? strtoul(argv[1],NULL,0) : 100000000;
>    uint64_t *sieve = soeA(limit);
>    printf("%d\n",pCount(sieve,limit));
>    free(sieve);
>    return EXIT_SUCCESS;
> }
>
> uint64_t *soeA(int n){
>    int s = (n >> 6)+1;
>    uint64_t *arr = malloc(s*sizeof *arr);
>    if (arr == NULL){
>        perror("Allocation failed\n");
>        exit(EXIT_FAILURE);
>    }
>    int i, j, n2 = (int)sqrt(n);
>    for(i = 0; i < s; ++i){
>        arr[i] = -1;
>    }
>    writeArray(arr,n,0,0);
>    writeArray(arr,n,1,0);
>    for(i = 2; i <= n2; ++i){
>            for(j = i*i; j <= n; j += i){
>                writeArray(arr,n,j,0);
>            }
>        }
>    }
>    return arr;
> }
>
> int pCount(uint64_t *arr, int n){
>    int i, count = 0;
>    for(i = 0; i <= n; ++i){
>            ++count;
>        }
>    }
>    return count;
> }
> ============
>
>
> \$ gcc -c -O3 Arracc.c
> \$ gcc -c -O3 main.c
> \$ gcc -O3 main.o Arracc.o -lm
>
> Takes 18.3 seconds. Oops, slower than the Haskell programme.
>
> Okay, ghc does some cross-module inlining and that enables further
> optimisations which we denied gcc here, so recompile everything with -flto
> additionally. That brings the C version down to 12.8 seconds.
>
> Much better, but still not overwhelming. gcc can do a little better with
> everything in one file, 12.72 seconds.
>
> We can squeeze out a little more by omitting the bounds checks for array
> indexing, 12.63 seconds.
>
> But if we do the same for our Haskell programme, replace
> 14.3 seconds.
>
> I have to admit that I don't know why the bounds-checking costs very little
> in C and a substantial amount in Haskell, but hey, who refuses free
> optimisations.
>
> Lesson 1: Omit pointless array bounds checks.
> An array bounds check is pointless if you have just checked the validity of
> the index on the line above.
>
> Where are we?
>
> 13% slower than C (gcc-4.5.1).
>
> Not a bad result for ghc, but the algorithm is less than optimal.
>
> What's the deal?
>
> We cross off multiples of primes 2549489372 times, and we need (10^9+1)/8
> bytes of memory for the sieve.
>
> Both numbers are higher than is good for us. Since the sieve is so large
> and we cross off the multiples of each prime sequentially until we reach
> the sieve limit, we have lots of cache misses. Bad for performance.
> And each crossing-off takes a couple of clock cycles,
>
> arr[i >> 6] &= ~(1ull << (i&63));
>
> shift index, fetch value, (index & 63), shift 1, complement, and with
> value, store value.
>
> I'm no CPU expert, so I don't know how many cycles that takes, but even
> though some of the operations can be done in parallel on modern CPUs, it's
> still several cycles.
>
> Lesson 2: Avoid duplicate work and redundant data.
> It's easy to separate even and odd numbers, there aren't many even primes,
> and it's unnecessary to mark even numbers as multiples of odd primes.
>
> Separating the crossing-off of even and odd numbers, crossing off only odd
> multiples of odd primes, reduces the number of crossings-off by almost 40%
> and the running time for the C programme to 8.63 seconds (I haven't done
> that for the Haskell programme).
>
> If we go the small step further and eliminate the even numbers from the
> sieve completely, we reduce the required memory by half and the crossings-
> off by almost 60% (the fraction of eliminated crossings-off slowly
> decreases to 50% for higher limits, but as far as today's RAM takes us, it
> remains close to 60%).
>
> At the cost of very slightly more complicated code, we can thus reduce the
> running time to 6.06 seconds (C) resp. 6.49 seconds (Haskell) (about 7%
> slower than C, not bad at all).
>
> That's a pretty good start, for some purposes it's already good enough.
>
> If we need more, the next step is to eliminate multiples of 3 from the
> sieve. That reduces the memory requirements by a further factor of 2/3, and
> the number of crossings-off by a further (mumble, didn't do an exact
> calculation, educated guess) roughly 40% (that fraction decreases to 1/3
> for higher limits, but again very slowly).
>
> The code becomes a bit more complicated again - more than in the previous
> step, but still fairly straightforward.
>
> The running time reduces by another factor of 0.65-0.7 or so. We'd be in
> the region of 4-5 seconds then.
>
> One can eliminate the multiples of further small primes, for each prime the
> additional code complexity increases and the reduction in work/running time
> decreases. At some point, the slowdown from the more complex code
> annihilates the gain from the reduced work. I stopped after eliminating
> multiples of 5, 7 might be worth it, but I'm not convinced.
>
> Using a segmented sieve provides better locality at the cost of more
> complex code. If the sieve is small enough to fit in the L2 cache and large
> enough that some significant work can be done per segment, it's a net gain.
>
> tl,dr: The naive algorithm we started from is too slow for higher limits,
> to get goodish performance, it has to be tweaked heavily. But Haskell plays
> in the same league as C (at least the C I've written).
>
>>
>>>> I haven't tried it, but an equivalent C implementation can easily
>>>> compute the first 10^9 prime numbers within seconds.
>>>
>>> You mean the primes up to 10^9 here?
>>
>> Yes, sorry.  And I was referring to the Sieve of Atkin there, but you
>> are right.  That one is only slightly faster.
>
> Well, D.J. Bernstein's primegen is much faster. With the sieve size adapted
> to my L1 cache, it counts the primes to 10^9 in 0.68 seconds.
> But it's only heavily optimised for primes to 2^32. Above that, my sieve
> catches up (and overtakes it around 5*10^11, yay).
>
>>
>>
>> Greets,
>> Ertugrul
>
> Cheers,
> Daniel
>
> _______________________________________________
> Beginners mailing list