Difference between revisions of "Wc"

Revision as of 06:18, 3 May 2006

Some implementations of the 'wc -l' program in Haskell, with an eye to C-like performance. This illustrates the balance to be made between performance and elegance, over several increasingly fast (and more complex) examples.

Baseline

The baseline is the C program 'wc'

$ du -hs /usr/share/dict/words
892K    /usr/share/dict/words

$ time wc -l /usr/share/dict/words
96030 /usr/share/dict/words
wc -l /usr/share/dict/words  0.00s user 0.00s system 33% cpu 0.017 total

So the best we can probably hope to get is around 0.017s

Standard [Char]

main = print . length . lines =<< getContents

$ ghc -O wc.hs
$ time ./a.out < /usr/share/dict/words
96030
./a.out < /usr/share/dict/words  0.10s user 0.01s system 89% cpu 0.118 total

Ok. 0.118s. About 10x C, as to be expected with a list representation.

Faster [Char]

Perhaps writing our loop, rather than the duplication involved in length . lines, will improve things:

main = interact (count 0)
    where count i []        = show i
          count i ('\n':xs) = count (i+1) xs
          count i (_:xs)    = count i     xs

$ ghc -O wc.hs
$ time ./a.out < /usr/share/dict/words 
96030.
./a.out < /usr/share/dict/words  0.03s user 0.01s system 76% cpu
0.047 total

0.047, rather good!

Data.PackedString

Ok, lets try the old Data.PackedString library.

My first attempt to directly use hGet failed, as hGet has a stack overflow for files > ~500k.

import Data.PackedString
import System.IO

main = print . length . linesPS =<< getit "/usr/share/dict/words"
    where
        getit f = do
            h  <- openFile f ReadMode
            s  <- hGetContents h
            length s `seq` return ()
            hClose h
            return $! packString s

$ time ./a.out
96030
./a.out  0.12s user 0.03s system 90% cpu 0.167 total

Hmm. Worse than [Char]. Unfortunately, this is not uncommon with Data.PackedString.

Data.ByteString 1

Try to improve performance a bit by using the new Data.ByteString library, a replacement for Data.PackedString. This uses packed byte arrays instead of heap-allocated [Char] to represent strings.

import qualified Data.ByteString as B 

main = print . length . B.lines =<< B.getContents

$ ghc -O wc.hs -package fps
$ time ./a.out < /usr/share/dict/words
96030
./a.out < /usr/share/dict/words  0.01s user 0.00s system 42% cpu 0.025 total

0.025s, a bit less than 2x slower than C. Not too bad, and probably a satisfactory place to stop optimising in normal circumstances. Its also nice that the code is similarly concise.

Data.ByteString 2

An alternative is to use the 'count' function to avoid creating the intermediate list:

import qualified Data.ByteString as B

main = print . B.count 10 =<< B.getContents

$ time ./a.out < /usr/share/dict/words
96030
./a.out < /usr/share/dict/words  0.00s user 0.00s system 49% cpu 0.016 total

Ah, much faster and shorter! This is in fact as fast as we'll get.

Ptr hacking

ByteStrings give you access to the underlying pointers to bytes in memory, which can be used to optimise some particular code. So when the ByteString api doesn't provide what you want, you can step inside the ForeignPtr and go nuts.

This example also makes use of a cpp macro to force strictness on a function, via a seq guard case.

import Foreign
import Foreign.ForeignPtr
import System.Environment
import qualified Data.ByteString as B

#define STRICT4(f) f a b c d | a `seq` b `seq` c `seq` d `seq` False = undefined

main = head `fmap` getArgs >>= B.readFile >>= \(B.PS x _ l) ->
    withForeignPtr x $ \p -> go p l 0 0

    where go :: Ptr Word8 -> Int -> Int -> Int -> IO ()
          STRICT4(go)
          go p l n i | n >= l    = print i
                     | otherwise = do (w::Word8) <- peek (p `plusPtr` n)
                                      go p l (n+1) $ if w == 0x0a then (i+1) else i

$ ghc -O -package fps -fglasgow-exts -cpp wc.hs
$ time ./a.out /usr/share/dict/words
96030
./a.out /usr/share/dict/words  0.01s user 0.00s system 47% cpu 0.021 total

A little faster perhaps.

Use the FFI

Try and step around the inefficent need to inspect each character in Haskell, by using memchr(3), the C function to find each newline for us.

import Foreign
import Foreign.ForeignPtr
import Foreign.C.Types

import System.Environment
import qualified Data.ByteString as B

#define STRICT4(f) f a b c d | a `seq` b `seq` c `seq` d `seq` False = undefined

main = do
    f <- head `fmap` getArgs
    B.readFile f >>= \(B.PS x _ l) -> withForeignPtr x $ \p -> go p l 0 0

    where
        go :: Ptr Word8 -> Int -> Int -> Int -> IO ()
        STRICT4(go)
        go p l n i
           | n >= l    = print i
           | otherwise = do
                let p' = p `plusPtr` n
                    q  = memchr p' 0x0a (fromIntegral (l-n))
                if q == nullPtr
                    then print i
                    else do let k = q `minusPtr` p'
                            go p l (n+k+1) (i+1)

foreign import ccall unsafe "string.h memchr" memchr
    :: Ptr Word8 -> CInt -> CSize -> Ptr Word8

$ ghc -O -package fps -cpp -ffi wc.hs
$ time ./a.out /usr/share/dict/words
96030
./a.out /usr/share/dict/words  0.00s user 0.01s system 70% cpu 0.020 total

Slowly inching forwards.

Read the Core

While we're here, we can check whether the strictness on the 'go' function makes any difference, by reading the GHC Core:

$ ghc -O -package fps -cpp -ffi wc.hs -ddump-simpl | less

Search for the 'go' function:

Main.$wgo :: GHC.Prim.Addr#
            -> GHC.Prim.Int#
            -> GHC.Prim.Int#
            -> GHC.Prim.Int#
            -> GHC.IOBase.IO ()

And without the strictness:

Main.$wgo :: GHC.Ptr.Ptr GHC.Word.Word8
            -> GHC.Prim.Int#
            -> GHC.Prim.Int#
            -> GHC.Base.Int
            -> GHC.IOBase.IO ()

So GHC is helpfully unboxing the Ptr Word8 into a raw machine Addr#.

Avoid some code

The guard that checks the length is uneeded, since memchr takes a length argument anyway. It also calculates the next pointer for us, so avoid recalculating it. (Note that this is equivalent to using the 'count' function, which has the same implementation).

import Foreign
import Foreign.ForeignPtr
import Foreign.C.Types

import System.Environment
import qualified Data.ByteString as B

#define STRICT3(f) f a b c | a `seq` b `seq` c `seq` False = undefined

main = do
    f <- head `fmap` getArgs
    B.readFile f >>= \(B.PS x s l) -> withForeignPtr x $ \p -> 
        go (p `plusPtr` s) (fromIntegral l) 0
    where
        go :: Ptr Word8 -> CSize -> Int -> IO ()
        STRICT3(go)
        go p l i = do
            let q  = memchr p 0x0a l
            if q == nullPtr
                then print i
                else do let k = fromIntegral $ q `minusPtr` p
                        go (q `plusPtr` 1) (l-k) (i+1)

foreign import ccall unsafe "string.h memchr" memchr
    :: Ptr Word8 -> CInt -> CSize -> Ptr Word8

Checking the Core, 'go' is now:

Main.$wgo :: GHC.Prim.Addr#
             -> GHC.Prim.Word#
             -> GHC.Prim.Int#
             -> GHC.IOBase.IO ()

The code is certainly a bit simpler, at least.

$ ghc -O -package fps -cpp -ffi wc.hs
$ time ./a.out /usr/share/dict/words
96030
./a.out /usr/share/dict/words  0.00s user 0.00s system 59% cpu 0.020 total

But we can't seem to squeeze any more out, at least on data this size.

Using mmap

The same program as above, but use mmap(2) instead of readFile.

import Foreign
import Foreign.ForeignPtr
import Foreign.C.Types

import System.Environment
import qualified Data.ByteString as B

#define STRICT3(f) f a b c | a `seq` b `seq` c `seq` False = undefined

main = do
    f <- head `fmap` getArgs
    B.mmapFile f >>= \(B.PS x _ l) -> withForeignPtr x $ \p -> go p (fromIntegral l) 0

    where
        go :: Ptr Word8 -> CSize -> Int -> IO ()
        STRICT3(go)
        go p l i
           | otherwise = do
                let q  = memchr p 0x0a l
                if q == nullPtr
                    then print i
                    else do let k = fromIntegral $ q `minusPtr` p
                            go (q `plusPtr` 1) (l - k) (i+1)

foreign import ccall unsafe "string.h memchr" memchr
    :: Ptr Word8 -> CInt -> CSize -> Ptr Word8

$ time ./a.out /usr/share/dict/words
96030
./a.out /usr/share/dict/words  0.00s user 0.00s system 36% cpu 0.019 total

A little faster again.

Going via C

We reach a point where I can't think of any more tricks, so we can always code up a little C and call into that, for this tight loop. Sometimes we just have to do this, and that's what the ffi is for, after all.

-- wc.hs

import Foreign
import System.Environment
import qualified Data.ByteString as B

main = do
    f <- head `fmap` getArgs
    B.mmapFile f >>= \(B.PS x _ l) -> withForeignPtr x $ \p -> print (c_wc p l)

foreign import ccall unsafe "wc.h wc" c_wc :: Ptr Word8 -> Int -> Int

-- wc_c.c
#include <sys/types.h>
#include <unistd.h>

int wc(char *p, int len) {
    int c;
    for (c = 0; len--; ++p)
        if (*p == '\n')
            ++c;
    return c;
}

-- wc.h
int wc(char *p, int len);

$ time ./a.out /usr/share/dict/words
96030
./a.out /usr/share/dict/words  0.00s user 0.00s system 51% cpu 0.017 total

And we are done. Note that the tight C loop didn't give us anything in the end over the naive ByteString code, which is a satisfying result.

@@ Line 14: / Line 14: @@
  $ time wc -l /usr/share/dict/words
 /usr/share/dict/words
- wc -l /usr/share/dict/words  0.00s user 0.00s system 33% cpu 0.018 total
+ wc -l /usr/share/dict/words  0.00s user 0.00s system 33% cpu 0.017 total
-So the best we can probably hope to get is around 0.018s
+So the best we can probably hope to get is around 0.017s
 == Standard [Char] ==
@@ Line 77: / Line 77: @@
 Hmm. Worse than [Char]. Unfortunately, this is not uncommon with Data.PackedString.
-== Data.ByteString ==
+== Data.ByteString 1 ==
 Try to improve performance a bit by using the new
@@ Line 98: / Line 98: @@
 satisfactory place to stop optimising in normal circumstances. Its also nice
 that the code is similarly concise.
+== Data.ByteString 2 ==
 An alternative is to use the 'count' function to avoid creating the
@@ Line 110: / Line 112: @@
  $ time ./a.out < /usr/share/dict/words
- ./a.out < /usr/share/dict/words  0.00s user 0.01s system 47% cpu 0.021 total
+ ./a.out < /usr/share/dict/words  0.00s user 0.00s system 49% cpu 0.016 total
-Ah, much faster and shorter!
+Ah, much faster and shorter! This is in fact as fast as we'll get.
 == Ptr hacking ==
@@ Line 345: / Line 347: @@
  ./a.out /usr/share/dict/words  0.00s user 0.00s system 51% cpu 0.017 total
-And we are done. Note that the tight C loop didn't give us much in the end over
+And we are done. Note that the tight C loop didn't give us anything in the end over
 the naive ByteString code, which is a satisfying result.