Difference between revisions of "Wc"

From HaskellWiki
Jump to navigation Jump to search
(+ short and fast entry using 'count')
m (→‎Use the FFI: drop use of STRICT macro and use bang patterns)
 
(13 intermediate revisions by 5 users not shown)
Line 1: Line 1:
[[Category:Idioms]]
+
[[Category:Code]]
  +
[[Category:Tutorials]]
  +
[[Category:Performance]]
   
 
Some implementations of the 'wc -l' program in Haskell, with an eye to C-like
 
Some implementations of the 'wc -l' program in Haskell, with an eye to C-like
Line 9: Line 11:
 
The baseline is the C program 'wc'
 
The baseline is the C program 'wc'
   
$ du -hs /usr/share/dict/words
+
$ du -hsL /usr/share/dict/words
892K /usr/share/dict/words
+
912K /usr/share/dict/words
 
 
$ time wc -l /usr/share/dict/words
+
$ time wc -l < /usr/share/dict/words
  +
98326
96030 /usr/share/dict/words
 
wc -l /usr/share/dict/words 0.00s user 0.00s system 33% cpu 0.018 total
+
wc -l < /usr/share/dict/words 0.00s user 0.00s system 27% cpu 0.015 total
   
So the best we can probably hope to get is around 0.018s
+
So the best we can probably hope to get is around 0.015s
   
 
== Standard [Char] ==
 
== Standard [Char] ==
   
 
<haskell>
 
<haskell>
  +
main :: IO ()
 
main = print . length . lines =<< getContents
 
main = print . length . lines =<< getContents
 
</haskell>
 
</haskell>
   
 
$ ghc -O wc.hs
 
$ ghc -O wc.hs
$ time ./a.out < /usr/share/dict/words
+
$ time ./wc < /usr/share/dict/words
  +
98326
96030
 
./a.out < /usr/share/dict/words 0.10s user 0.01s system 89% cpu 0.118 total
+
./wc < /usr/share/dict/words 0.10s user 0.00s system 94% cpu 0.106 total
   
Ok. 0.118s. About 10x C, as to be expected with a list representation.
+
Ok. About 10x C, as to be expected with a list representation.
   
 
== Faster [Char] ==
 
== Faster [Char] ==
Line 36: Line 39:
   
 
<haskell>
 
<haskell>
  +
main :: IO ()
 
main = interact (count 0)
 
main = interact (count 0)
 
where count i [] = show i
 
where count i [] = show i
Line 43: Line 47:
   
 
$ ghc -O wc.hs
 
$ ghc -O wc.hs
$ time ./a.out < /usr/share/dict/words
+
$ time ./wc < /usr/share/dict/words
  +
98326./wc < /usr/share/dict/words 0.06s user 0.00s system 87% cpu 0.073 total
96030.
 
./a.out < /usr/share/dict/words 0.03s user 0.01s system 76% cpu
 
0.047 total
 
   
  +
Ok. Not too bad.
0.047, rather good!
 
   
== Data.PackedString ==
+
== Data.ByteString ==
   
  +
Try to improve performance by using the
Ok, lets try the old Data.PackedString library.
 
  +
[http://www.cse.unsw.edu.au/~dons/fps.html Data.ByteString] library. This uses packed byte arrays instead of heap-allocated [Char] to represent strings.
 
My first attempt to directly use hGet failed, as hGet has a stack
 
overflow for files > ~500k.
 
   
 
<haskell>
 
<haskell>
import Data.PackedString
+
import qualified Data.ByteString.Char8 as B
import System.IO
 
   
  +
main :: IO ()
main = print . length . linesPS =<< getit "/usr/share/dict/words"
 
  +
main = B.getContents >>= print . B.count '\n'
where
 
getit f = do
 
h <- openFile f ReadMode
 
s <- hGetContents h
 
length s `seq` return ()
 
hClose h
 
return $! packString s
 
 
</haskell>
 
</haskell>
   
$ time ./a.out
+
$ time ./wc < /usr/share/dict/words
  +
98326
96030
 
./a.out 0.12s user 0.03s system 90% cpu 0.167 total
+
./wc < /usr/share/dict/words 0.00s user 0.00s system 25% cpu 0.016 total
  +
  +
Much better, it is now becoming competitive with C. This (and the Data.ByteString.Lazy example below) is as fast as we'll get.
   
  +
== Data.ByteString.Lazy ==
Hmm. Worse than [Char]. Unfortunately, this is not uncommon with Data.PackedString.
 
   
  +
Or we could use the new lazy bytestring type, a lazy list of strict,
== Data.ByteString ==
 
  +
L1-cache-sized chunks of bytes. This example due to Chad Scherrer:
 
Try to improve performance a bit by using the new
 
[http://www.cse.unsw.edu.au/~dons/fps.html Data.ByteString] library, a
 
replacement for Data.PackedString. This uses packed byte arrays instead
 
of heap-allocated [Char] to represent strings.
 
   
 
<haskell>
 
<haskell>
import qualified Data.ByteString as B
+
import qualified Data.ByteString.Lazy.Char8 as L
   
  +
main :: IO ()
main = print . length . B.lines =<< B.getContents
 
  +
main = L.getContents >>= print . L.count '\n'
 
</haskell>
 
</haskell>
   
  +
$ time ./a < /usr/share/dict/words
$ ghc -O wc.hs -package fps
 
  +
98326
$ time ./a.out < /usr/share/dict/words
 
  +
./a < /usr/share/dict/british-english 0.00s user 0.00s system 25% cpu 0.016 total
96030
 
./a.out < /usr/share/dict/words 0.01s user 0.00s system 42% cpu 0.025 total
 
   
  +
== Line-by-line processing ==
0.025s, a bit less than 2x slower than C. Not too bad, and probably a
 
satisfactory place to stop optimising in normal circumstances. Its also nice
 
that the code is similarly concise.
 
   
  +
We can ask the bytestring library to hand us a string line at a time.
An alternative is to use the 'count' function to avoid creating the
 
intermediate list:
 
   
 
<haskell>
 
<haskell>
import qualified Data.ByteString as B
+
import System.IO
  +
import Data.ByteString (hGetLines)
 
main = print . B.count 10 =<< B.getContents
+
main = hGetLines stdin >>= print . length
 
</haskell>
 
</haskell>
   
$ time ./a.out < /usr/share/dict/words
+
$ time ./b < /usr/share/dict/british-english
  +
98326
96030
 
./a.out < /usr/share/dict/words 0.00s user 0.01s system 47% cpu 0.021 total
+
./b < /usr/share/dict/british-english 0.04s user 0.01s system 94% cpu 0.055 total
  +
 
  +
Though this is a bit slower, since it needs to hang on to the lines for
Ah, much faster and shorter!
 
  +
longer.
   
 
== Ptr hacking ==
 
== Ptr hacking ==
Line 142: Line 131:
   
 
$ ghc -O -package fps -fglasgow-exts -cpp wc.hs
 
$ ghc -O -package fps -fglasgow-exts -cpp wc.hs
$ time ./a.out /usr/share/dict/words
+
$ time ./wc /usr/share/dict/words
  +
98326
96030
 
./a.out /usr/share/dict/words 0.01s user 0.00s system 47% cpu 0.021 total
+
./wc /usr/share/dict/words 0.00s user 0.01s system 67% cpu 0.018 total
   
  +
Ok, slower than using length . lines. Lets try some other things.
A little faster perhaps.
 
   
 
== Use the FFI ==
 
== Use the FFI ==
Line 154: Line 143:
   
 
<haskell>
 
<haskell>
  +
{-# LANGUAGE BangPatterns #-}
 
import Foreign
 
import Foreign
 
import Foreign.ForeignPtr
 
import Foreign.ForeignPtr
Line 160: Line 150:
 
import System.Environment
 
import System.Environment
 
import qualified Data.ByteString as B
 
import qualified Data.ByteString as B
 
#define STRICT4(f) f a b c d | a `seq` b `seq` c `seq` d `seq` False = undefined
 
   
 
main = do
 
main = do
Line 169: Line 157:
 
where
 
where
 
go :: Ptr Word8 -> Int -> Int -> Int -> IO ()
 
go :: Ptr Word8 -> Int -> Int -> Int -> IO ()
STRICT4(go)
+
go !p !l !n !i
go p l n i
 
 
| n >= l = print i
 
| n >= l = print i
 
| otherwise = do
 
| otherwise = do
Line 184: Line 171:
 
</haskell>
 
</haskell>
   
  +
$ time ./wc /usr/share/dict/words
$ ghc -O -package fps -cpp -ffi wc.hs
 
  +
98326
$ time ./a.out /usr/share/dict/words
 
  +
./wc /usr/share/dict/words 0.00s user 0.00s system 47% cpu 0.017 total
96030
 
./a.out /usr/share/dict/words 0.00s user 0.01s system 70% cpu 0.020 total
 
   
 
Slowly inching forwards.
 
Slowly inching forwards.
Line 218: Line 204:
 
== Avoid some code ==
 
== Avoid some code ==
   
The guard that checks the length is uneeded, since memchr takes a length argument anyway.
+
The guard that checks the length is unneeded, since memchr takes a length
It also calculates the next pointer for us, so avoid recalculating it.
+
argument anyway. It also calculates the next pointer for us, so avoid
  +
recalculating it. (Note that this is equivalent to using the 'count'
  +
function, which has the same implementation).
   
 
<haskell>
 
<haskell>
Line 233: Line 221:
 
main = do
 
main = do
 
f <- head `fmap` getArgs
 
f <- head `fmap` getArgs
B.readFile f >>= \(B.PS x _ l) -> withForeignPtr x $ \p -> go p (fromIntegral l) 0
+
B.readFile f >>= \(B.PS x s l) -> withForeignPtr x $ \p ->
  +
go (p `plusPtr` s) (fromIntegral l) 0
 
 
where
 
where
 
go :: Ptr Word8 -> CSize -> Int -> IO ()
 
go :: Ptr Word8 -> CSize -> Int -> IO ()
 
STRICT3(go)
 
STRICT3(go)
go p l i
+
go p l i = do
| otherwise = do
+
let q = memchr p 0x0a l
let q = memchr p 0x0a l
+
if q == nullPtr
if q == nullPtr
+
then print i
then print i
+
else do let k = fromIntegral $ q `minusPtr` p
else do let k = fromIntegral $ q `minusPtr` p
+
go (q `plusPtr` 1) (l-k) (i+1)
go (q `plusPtr` 1) (l - k) (i+1)
 
   
 
foreign import ccall unsafe "string.h memchr" memchr
 
foreign import ccall unsafe "string.h memchr" memchr
Line 260: Line 247:
   
 
$ ghc -O -package fps -cpp -ffi wc.hs
 
$ ghc -O -package fps -cpp -ffi wc.hs
$ time ./a.out /usr/share/dict/words
+
$ time ./wc /usr/share/dict/words
  +
98326
96030
 
./a.out /usr/share/dict/words 0.00s user 0.00s system 59% cpu 0.020 total
+
./wc /usr/share/dict/words 0.00s user 0.01s system 70% cpu 0.017 total
   
 
But we can't seem to squeeze any more out, at least on data this size.
 
But we can't seem to squeeze any more out, at least on data this size.
 
== Using mmap ==
 
 
The same program as above, but use mmap(2) instead of readFile.
 
 
<haskell>
 
import Foreign
 
import Foreign.ForeignPtr
 
import Foreign.C.Types
 
 
import System.Environment
 
import qualified Data.ByteString as B
 
 
#define STRICT3(f) f a b c | a `seq` b `seq` c `seq` False = undefined
 
 
main = do
 
f <- head `fmap` getArgs
 
B.mmapFile f >>= \(B.PS x _ l) -> withForeignPtr x $ \p -> go p (fromIntegral l) 0
 
 
where
 
go :: Ptr Word8 -> CSize -> Int -> IO ()
 
STRICT3(go)
 
go p l i
 
| otherwise = do
 
let q = memchr p 0x0a l
 
if q == nullPtr
 
then print i
 
else do let k = fromIntegral $ q `minusPtr` p
 
go (q `plusPtr` 1) (l - k) (i+1)
 
 
foreign import ccall unsafe "string.h memchr" memchr
 
:: Ptr Word8 -> CInt -> CSize -> Ptr Word8
 
</haskell>
 
 
$ time ./a.out /usr/share/dict/words
 
96030
 
./a.out /usr/share/dict/words 0.00s user 0.00s system 36% cpu 0.019 total
 
 
A little faster again.
 
   
 
== Going via C ==
 
== Going via C ==
Line 320: Line 268:
 
main = do
 
main = do
 
f <- head `fmap` getArgs
 
f <- head `fmap` getArgs
B.mmapFile f >>= \(B.PS x _ l) -> withForeignPtr x $ \p -> print (c_wc p l)
+
B.readFile f >>= \(B.PS x _ l) -> withForeignPtr x $ \p -> print (c_wc p l)
   
 
foreign import ccall unsafe "wc.h wc" c_wc :: Ptr Word8 -> Int -> Int
 
foreign import ccall unsafe "wc.h wc" c_wc :: Ptr Word8 -> Int -> Int
   
 
-- wc_c.c
 
-- wc_c.c
#include <sys/types.h>
 
#include <unistd.h>
 
 
 
int wc(char *p, int len) {
 
int wc(char *p, int len) {
 
int c;
 
int c;
Line 340: Line 285:
 
</haskell>
 
</haskell>
   
  +
$ gcc -O3 -c wc_c.c
$ time ./a.out /usr/share/dict/words
 
  +
$ ghc -O -package fps wc.hs -o wc -fglasgow-exts wc_c.o
96030
 
./a.out /usr/share/dict/words 0.00s user 0.00s system 51% cpu 0.017 total
+
$ time ./wc /usr/share/dict/words
  +
98326
  +
./wc /usr/share/dict/words 0.00s user 0.00s system 25% cpu 0.016 total
   
And we are done. Note that the tight C loop didn't give us much in the end over
+
And we are done. Note that the tight C loop didn't give us anything in the end over
the naive ByteString code, which is a satisfying result.
+
the naive ByteString code, which is a very satisfying result.

Latest revision as of 10:40, 9 April 2013


Some implementations of the 'wc -l' program in Haskell, with an eye to C-like performance. This illustrates the balance to be made between performance and elegance, over several increasingly fast (and more complex) examples.

Baseline

The baseline is the C program 'wc'

$ du -hsL /usr/share/dict/words
912K    /usr/share/dict/words

$ time wc -l < /usr/share/dict/words 
98326
wc -l < /usr/share/dict/words  0.00s user 0.00s system 27% cpu 0.015 total

So the best we can probably hope to get is around 0.015s

Standard [Char]

main :: IO ()
main = print . length . lines =<< getContents
$ ghc -O wc.hs
$ time ./wc < /usr/share/dict/words
98326
./wc < /usr/share/dict/words  0.10s user 0.00s system 94% cpu 0.106 total

Ok. About 10x C, as to be expected with a list representation.

Faster [Char]

Perhaps writing our loop, rather than the duplication involved in length . lines, will improve things:

main :: IO ()
main = interact (count 0)
    where count i []        = show i
          count i ('\n':xs) = count (i+1) xs
          count i (_:xs)    = count i     xs
$ ghc -O wc.hs
$ time ./wc < /usr/share/dict/words
98326./wc < /usr/share/dict/words  0.06s user 0.00s system 87% cpu 0.073 total

Ok. Not too bad.

Data.ByteString

Try to improve performance by using the Data.ByteString library. This uses packed byte arrays instead of heap-allocated [Char] to represent strings.

import qualified Data.ByteString.Char8 as B

main :: IO ()
main = B.getContents >>= print . B.count '\n'
$ time ./wc < /usr/share/dict/words
98326
./wc < /usr/share/dict/words  0.00s user 0.00s system 25% cpu 0.016 total
 

Much better, it is now becoming competitive with C. This (and the Data.ByteString.Lazy example below) is as fast as we'll get.

Data.ByteString.Lazy

Or we could use the new lazy bytestring type, a lazy list of strict, L1-cache-sized chunks of bytes. This example due to Chad Scherrer:

import qualified Data.ByteString.Lazy.Char8 as L

main :: IO ()
main = L.getContents >>= print . L.count '\n'
$ time ./a < /usr/share/dict/words
98326
./a < /usr/share/dict/british-english  0.00s user 0.00s system 25% cpu 0.016 total

Line-by-line processing

We can ask the bytestring library to hand us a string line at a time.

import System.IO
import Data.ByteString (hGetLines)
main = hGetLines stdin >>= print . length
$ time ./b < /usr/share/dict/british-english
98326
./b < /usr/share/dict/british-english  0.04s user 0.01s system 94% cpu 0.055 total

Though this is a bit slower, since it needs to hang on to the lines for longer.

Ptr hacking

ByteStrings give you access to the underlying pointers to bytes in memory, which can be used to optimise some particular code. So when the ByteString api doesn't provide what you want, you can step inside the ForeignPtr and go nuts.

This example also makes use of a cpp macro to force strictness on a function, via a seq guard case.

import Foreign
import Foreign.ForeignPtr
import System.Environment
import qualified Data.ByteString as B

#define STRICT4(f) f a b c d | a `seq` b `seq` c `seq` d `seq` False = undefined

main = head `fmap` getArgs >>= B.readFile >>= \(B.PS x _ l) ->
    withForeignPtr x $ \p -> go p l 0 0

    where go :: Ptr Word8 -> Int -> Int -> Int -> IO ()
          STRICT4(go)
          go p l n i | n >= l    = print i
                     | otherwise = do (w::Word8) <- peek (p `plusPtr` n)
                                      go p l (n+1) $ if w == 0x0a then (i+1) else i
$ ghc -O -package fps -fglasgow-exts -cpp wc.hs
$ time ./wc /usr/share/dict/words                                                                   
98326       
./wc /usr/share/dict/words  0.00s user 0.01s system 67% cpu 0.018 total

Ok, slower than using length . lines. Lets try some other things.

Use the FFI

Try and step around the inefficent need to inspect each character in Haskell, by using memchr(3), the C function to find each newline for us.

{-# LANGUAGE BangPatterns #-}
import Foreign
import Foreign.ForeignPtr
import Foreign.C.Types

import System.Environment
import qualified Data.ByteString as B

main = do
    f <- head `fmap` getArgs
    B.readFile f >>= \(B.PS x _ l) -> withForeignPtr x $ \p -> go p l 0 0

    where
        go :: Ptr Word8 -> Int -> Int -> Int -> IO ()
        go !p !l !n !i
           | n >= l    = print i
           | otherwise = do
                let p' = p `plusPtr` n
                    q  = memchr p' 0x0a (fromIntegral (l-n))
                if q == nullPtr
                    then print i
                    else do let k = q `minusPtr` p'
                            go p l (n+k+1) (i+1)

foreign import ccall unsafe "string.h memchr" memchr
    :: Ptr Word8 -> CInt -> CSize -> Ptr Word8
$ time ./wc /usr/share/dict/words
98326                            
./wc /usr/share/dict/words  0.00s user 0.00s system 47% cpu 0.017 total

Slowly inching forwards.

Read the Core

While we're here, we can check whether the strictness on the 'go' function makes any difference, by reading the GHC Core:

$ ghc -O -package fps -cpp -ffi wc.hs -ddump-simpl | less

Search for the 'go' function:

Main.$wgo :: GHC.Prim.Addr#
            -> GHC.Prim.Int#
            -> GHC.Prim.Int#
            -> GHC.Prim.Int#
            -> GHC.IOBase.IO ()

And without the strictness:

Main.$wgo :: GHC.Ptr.Ptr GHC.Word.Word8
            -> GHC.Prim.Int#
            -> GHC.Prim.Int#
            -> GHC.Base.Int
            -> GHC.IOBase.IO ()

So GHC is helpfully unboxing the Ptr Word8 into a raw machine Addr#.

Avoid some code

The guard that checks the length is unneeded, since memchr takes a length argument anyway. It also calculates the next pointer for us, so avoid recalculating it. (Note that this is equivalent to using the 'count' function, which has the same implementation).

import Foreign
import Foreign.ForeignPtr
import Foreign.C.Types

import System.Environment
import qualified Data.ByteString as B

#define STRICT3(f) f a b c | a `seq` b `seq` c `seq` False = undefined

main = do
    f <- head `fmap` getArgs
    B.readFile f >>= \(B.PS x s l) -> withForeignPtr x $ \p -> 
        go (p `plusPtr` s) (fromIntegral l) 0
    where
        go :: Ptr Word8 -> CSize -> Int -> IO ()
        STRICT3(go)
        go p l i = do
            let q  = memchr p 0x0a l
            if q == nullPtr
                then print i
                else do let k = fromIntegral $ q `minusPtr` p
                        go (q `plusPtr` 1) (l-k) (i+1)

foreign import ccall unsafe "string.h memchr" memchr
    :: Ptr Word8 -> CInt -> CSize -> Ptr Word8

Checking the Core, 'go' is now:

Main.$wgo :: GHC.Prim.Addr#
             -> GHC.Prim.Word#
             -> GHC.Prim.Int#
             -> GHC.IOBase.IO ()

The code is certainly a bit simpler, at least.

$ ghc -O -package fps -cpp -ffi wc.hs
$ time ./wc /usr/share/dict/words
98326
./wc /usr/share/dict/words  0.00s user 0.01s system 70% cpu 0.017 total

But we can't seem to squeeze any more out, at least on data this size.

Going via C

We reach a point where I can't think of any more tricks, so we can always code up a little C and call into that, for this tight loop. Sometimes we just have to do this, and that's what the ffi is for, after all.

-- wc.hs

import Foreign
import System.Environment
import qualified Data.ByteString as B

main = do
    f <- head `fmap` getArgs
    B.readFile f >>= \(B.PS x _ l) -> withForeignPtr x $ \p -> print (c_wc p l)

foreign import ccall unsafe "wc.h wc" c_wc :: Ptr Word8 -> Int -> Int

-- wc_c.c
int wc(char *p, int len) {
    int c;
    for (c = 0; len--; ++p)
        if (*p == '\n')
            ++c;
    return c;
}

-- wc.h
int wc(char *p, int len);
$ gcc -O3 -c wc_c.c
$ ghc -O -package fps wc.hs -o wc -fglasgow-exts wc_c.o
$ time ./wc /usr/share/dict/words
98326
./wc /usr/share/dict/words  0.00s user 0.00s system 25% cpu 0.016 total

And we are done. Note that the tight C loop didn't give us anything in the end over the naive ByteString code, which is a very satisfying result.