Strings are slow

Simon Marlow simonmar@microsoft.com
Tue, 19 Nov 2002 14:23:34 -0000


> module Main where
>=20
> import Text.Regex
> import System.IO
> import System.Environment
> import Control.Monad
> import Data.Maybe
>=20
> main =3D do [re] <- getArgs
> 	  let rx =3D mkRegex re
> 	  let loop =3D do line <- getLine
> 			when (isJust (matchRegex rx line))=20
> (putStrLn line)
> 			eof <- isEOF
> 			unless eof loop
> 	  loop
>=20
>=20
> It turned out that this is remarkably slow. The first problem was with
> inlining. If this is compiled with ghc-5.04 -O -ddump-simpl, I get:
>=20
> 	      case GHC.IOBase.unsafePerformIO
> 		     @ (Data.Maybe.Maybe
> 			    (GHC.Base.String,
> 			     GHC.Base.String,
> 			     GHC.Base.String,
> 			     [GHC.Base.String]))
> 		     (Text.Regex.Posix.regexec=20
> (Text.Regex.mkRegex re) a731)
> 	      of wild4 {

This is indeed an optimiser bug, but it's the result of a design
decision: GHC is a bit laid back about inlining things inside the state
lambda in the IO monad, because it often enables important
optimisations.  However, we're experimenting with modifying this
"optimisation" so that it will be less likely to kill performance in the
way it did in your example.

In the meantime, you can add rx as an argument to loop, that will be
enough to fool GHC into not inlining rx.

> Ie. the regex is compiled anew every time a string is matched. A bug?
>=20
> Anyway, without optimization the code produced is reasonable,=20
> but still
> horrendously slow. Testing with a simple word as a pattern=20
> from a 7.3MB,
> 800kline file, the running time was 37.5 seconds. For=20
> comparison, a similar
> program in mzscheme (interpreted!) took 7.3 seconds while the system
> grep, of course, took 0.4 seconds.
>=20
> I did some profiling by creating new top-level bindings for matchRegex
> and getLine (is there a better way?):
>=20
>=20
>         total time  =3D       53.34 secs   (2667 ticks @ 20 ms)
>         total alloc =3D 1,172,482,496 bytes  (excludes=20
> profiling overheads)
>=20
> COST CENTRE                    MODULE               %time %alloc
>=20
> match                          Main                  69.7   56.8
> getl                           Main                  23.9   40.2
> main                           Main                   6.3    3.0
>=20
>=20
> So it seems like all the time is spent just converting ByteArrays to
> char lists to C arrays. This makes me wonder how sensible it really is
> to represent strings by char lists. Yes, it's nice and=20
> uniform and lazy, but...

String processing in Haskell is very slow, due to the list-of-characters
representation.  A more complete PackedString library with better
integration with other libraries (like Text.Regex) would help a lot for
these kind of examples.

Cheers,
	Simon