[Haskell-cafe] Data.ByteStream.Char8.words performance

Dino Morelli dino at ui3.info
Fri Mar 30 15:38:27 EDT 2007


I noticed something about ByteStream performance that I don't
understand.

I have a test text document:

    $ ls -sh test-text-file
    956K test-text-file


Running this program, using the Prelude's IO functions:

> module Main where
> 
> main = do
>    content <- readFile "test-text-file"
>    let l = length . words $ content
>    print l

I get:

    $ time ./a.out
    174372

    real    0m0.805s
    user    0m0.720s
    sys     0m0.008s


Running a version of the same thing using Data.ByteStream.Char8:

> module Main where
> 
> import qualified Data.ByteString.Char8 as B
> 
> main = do
>    content <- B.readFile "test-text-file"
>    let l = length . B.words $ content
>    print l

I see a time that is quite a bit slower:

    $ time ./a.out
    174372

    real    0m1.864s
    user    0m1.596s
    sys     0m0.012s


Changing it to incorporate similar code to the implementation of
B.words:

> module Main where
> 
> import qualified Data.ByteString.Char8 as B
> import Data.Char (isSpace)
> 
> main = do
>    content <- B.readFile "test-text-file"
>    let l = length $ filter (not . B.null) $ B.splitWith isSpace
>    content
>    print l

I see a similar time as with B.words:

    $ time ./a.out
    174372

    real    0m1.835s
    user    0m1.628s
    sys     0m0.012s


And then if I change this to use B.split ' ' instead of isSpace:

> module Main where
> 
> import qualified Data.ByteString.Char8 as B
> 
> main = do
>    content <- B.readFile "test-text-file"
>    let l = length $ filter (not . B.null) $ B.split ' ' content
>    print l

I get a time that's much more reasonable-looking, compared to the
original Prelude.words version:

    $ time ./a.out
    174313

    real    0m0.389s
    user    0m0.312s
    sys     0m0.004s


It seems like the B.splitWith isSpace code is really slow for some
reason. Anybody have any idea what's going on? The actual implementation
is using isSpaceWord8 which is a case statement looking for a pile of
different whitespace characters.


-- 
  .~.    Dino Morelli
  /V\    email: dino at ui3.info
/( )\   irc: dino-
^^-^^   preferred distro: Debian GNU/Linux  http://www.debian.org


More information about the Haskell-Cafe mailing list