[Haskell-beginners] Performance of Idiomatic lazy Haskell

Sun Jan 31 08:39:56 EST 2010

Am Sonntag 31 Januar 2010 13:23:33 schrieb Stephen Tetley:
> Hi Markus
>
> Whoops, I hadn't read your email properly and wasn't accounting for
> the epsilon. Here's a version that does, although it is perhaps a
> little slow...
>

Better use the previous and calculate how many terms you need:

============================================
module Main (main) where

import Data.List (unfoldr)

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let mx = floor (4/eps)
        !k = (mx+1) `quot` 2
    putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k)

leibniz n = (4 *) $ sum $ take n step

step :: [Double]
step = unfoldr phi (True,1) where
   phi (sig,d) | sig         = Just (1/d, (False,d+2))
               | otherwise   = Just (negate (1/d), (True,d+2))

giving
$ echo '0.00000001' | ./unfoldPi +RTS -sstderr -RTS
./unfoldPi +RTS -sstderr                                                                      
EPS:                                                                                        
PI mit EPS 1.0e-8 = 3.141592648589476                                                       
  27,305,969,616 bytes allocated in the heap                                                
       2,523,788 bytes copied during GC                                                     
          61,660 bytes maximum residency (1 sample(s))                                      
          38,864 bytes maximum slop                                                         
               1 MB total memory in use (0 MB lost due to fragmentation)                    

  Generation 0: 52083 collections,     0 parallel,  0.29s,  0.44s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   23.10s  ( 23.14s elapsed)
  GC    time    0.29s  (  0.44s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   23.39s  ( 23.57s elapsed)

  %GC time       1.2%  (1.9% elapsed)

  Alloc rate    1,182,002,769 bytes per MUT second

  Productivity  98.8% of total user, 98.0% of total elapsed
===============================================

a little better if we do the unfolding ourselves, not creating any 
intermediate pairs:

leibniz n = (4*) . sum $ unf n True 1

unf :: Int -> Bool -> Double -> [Double]
unf 0 _ _ = []
unf k True n = 1/n : unf (k-1) False (n+2)
unf k False n = negate (1/n) : unf (k-1) True (n+2)

which gives
$ echo '0.00000001' | ./unfPi +RTS -sstderr -RTS
./unfPi +RTS -sstderr                                                                      
EPS:                                                                                        
PI mit EPS 1.0e-8 = 3.141592648589476                                                       
  15,250,801,168 bytes allocated in the heap                                                
       1,284,632 bytes copied during GC                                                     
          61,592 bytes maximum residency (1 sample(s))                                      
          38,864 bytes maximum slop                                                         
               1 MB total memory in use (0 MB lost due to fragmentation)                    

  Generation 0: 29089 collections,     0 parallel,  0.22s,  0.32s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   18.05s  ( 18.05s elapsed)
  GC    time    0.22s  (  0.32s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   18.28s  ( 18.37s elapsed)

  %GC time       1.2%  (1.7% elapsed)

  Alloc rate    844,773,242 bytes per MUT second

  Productivity  98.8% of total user, 98.2% of total elapsed
=============================================

Still not competitive with the loop version
$ echo '0.00000001' | ./loopPi +RTS -sstderr -RTS
./loopPi +RTS -sstderr                                                                      
EPS:                                                                                        
PI mit EPS 1.0e-8 = 3.1415926526069526                                                      
         136,248 bytes allocated in the heap                                                
           2,024 bytes copied during GC                                                     
          51,720 bytes maximum residency (1 sample(s))                                      
          38,392 bytes maximum slop                                                         
               1 MB total memory in use (0 MB lost due to fragmentation)                    

  Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    6.64s  (  6.64s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    6.64s  (  6.64s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    20,517 bytes per MUT second

  Productivity 100.0% of total user, 100.0% of total elapsed

which can also be improved:

{-# LANGUAGE BangPatterns #-}
module Main (main) where

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let mx = floor (4/eps)
        k = (mx-1) `quot` 2
        !pi14 = pisum (even k) (fromInteger (2*k+1))
    putStrLn $ "PI mit EPS "++(show eps)++" = "++ show(4*pi14)

-- sum from small numbers to large, to reduce cancellation
-- although, in this particular case, for eps = 1e-8, the result is
-- farther off than summing the other way
pisum :: Bool -> Double -> Double
pisum bl start = go bl start 0
      where
        go _ n s | n < 1 = s
        go True n !s = go False (n-2) (s+recip n)
        go False n !s = go True (n-2) (s-recip n)

$ echo '0.00000001' | ./mloopPi +RTS -sstderr -RTS
./mloopPi +RTS -sstderr                                                                      
EPS:                                                                                         
PI mit EPS 1.0e-8 = 3.141592648589793                                                        
         136,416 bytes allocated in the heap
           2,024 bytes copied during GC
          51,720 bytes maximum residency (1 sample(s))
          38,392 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    4.55s  (  4.56s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    4.55s  (  4.56s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    29,966 bytes per MUT second

  Productivity  99.9% of total user, 99.8% of total elapsed

Most of the gain is due to the much simpler loop-break test, a small bit 
may be due to bang-patterns vs. seq.

Bottom line:
GHC isn't very good at fusing away intermediate lists in strict algorithms 
(it's better at it for lazy algorithms). Allocating so many cons-cells just 
to be immediately garbage-collected costs a lot of time.

You can code the loop directly, which gives reasonable results, or, as 
Felipe suggested, use the fusion framework created by experts:

module Main (main) where

import qualified Data.List.Stream as S

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let mx = floor (4/eps)
        !k = (mx+1) `quot` 2
    putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k)

leibniz n = (4 *) $ S.sum $ S.take n step

step :: [Double]
step = S.unfoldr phi (True,1) where
   phi (sig,d) | sig         = Just (1/d, (False,d+2))
               | otherwise   = Just (negate (1/d), (True,d+2))

which beats the hand-coded loop:
$ echo '0.00000001' | ./sunfPi +RTS -sstderr -RTS
./sunfPi +RTS -sstderr                                                                      
EPS:                                                                                        
PI mit EPS 1.0e-8 = 3.1415926445727678                                                      
         136,560 bytes allocated in the heap
           2,024 bytes copied during GC
          51,720 bytes maximum residency (1 sample(s))
          38,392 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    4.27s  (  4.27s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    4.27s  (  4.27s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    31,994 bytes per MUT second

  Productivity 100.0% of total user, 100.0% of total elapsed

> > import Data.List (unfoldr)
> >
> >
> > leibniz eps = converge eps ser
> >
> >
> > ser :: [Double]
> > ser = unfoldr phi (True,1) where
> >   phi (sig,d) | sig == True = Just (1/d, (False,d+2))
> >
> >               | otherwise   = Just (negate (1/d), (True,d+2))
> >
> > converge :: Double -> [Double] -> Double
> > converge eps xs = step 0 0 xs where
> >   step a b (x:xs) = let a' = a + (4*x) in
> >                     if abs (a'-b) < eps then a' else step a' a xs
> >
> > demo1 =  leibniz 0.00000000025
>
> Best wishes
>
> Stephen