[Haskell-cafe] Re: optimization help

jeff p mutjida at gmail.com
Tue Oct 17 23:54:53 EDT 2006


Hello,

> Good, writeCSV writes out every row immediately after it got it. I
> eliminated (++ [nl]) in the hope of reducing the constant factor
> slightly. Using difference lists for that is nicer but here you go.
>
I'm not sure how you'd use difference lists here.
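If you mean something like this, rendering a row as a function
[ByteString] -> [ByteString] so that the trailing newline is composed
on rather than appended, I don't see where the win would come from,
since there is only one append per row anyway (writeRowD is just my
sketch, assuming the same imports as the attached code):

writeRowD :: Handle -> [B.ByteString] -> IO ()
writeRowD h row = mapM_ (B.hPut h) (chunks [])
  where
    -- a difference list: appending is composition, forced once at the end
    chunks = (intersperse (B.singleton ',') row ++) . (B.singleton '\n' :)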

Also, for some reason GHC runs slightly slower (compiled with -O)
using explicit sequencing instead of ++: the version with >>
consistently (by a very unrigorous standard of consistency) takes
between 5 and 8 seconds longer than the version with ++ (when
filtering some columns out of every row and writing the new file). Is
there some hidden cost in switching from a single mapM_ to a mapM_
followed by an explicit sequence?
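For concreteness, the two versions of writeRow I timed were
essentially these (assuming the same imports as the attached code):

-- with (++): one mapM_ over the whole chunk list, newline included
writeRowApp h row = mapM_ (B.hPut h) (intersperse comma row ++ [nl])
  where comma = B.singleton ','
        nl    = B.singleton '\n'

-- with (>>): write the fields, then sequence the newline explicitly;
-- this is the version that ran 5 to 8 seconds slower on the full file
writeRowSeq h row = mapM_ (B.hPut h) (intersperse comma row) >> B.hPut h nl
  where nl = B.singleton '\n'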

> Concerning select, one myFilter can be fused away and there is the
> "transpose trick" for filtering out the columns: columns get filtered
> once and for all and (map (`elem` tags)) only needs to be computed once.
> I don't know why the MonadReader is necessary, so I removed it.
>
You're right, no need for myFilter on the rows. I used the Reader
because at first I thought it would make a nicer interface (I have
since changed my mind about that) and it didn't seem to hurt
performance.

I think using transpose requires more work (and is slower) than
filtering each row, particularly if the mask (map (`elem` tags)) is
only computed once (shouldn't the compiler do that automatically,
since the expression is constant?). Transposing requires at least one
pass over the entire structure (possibly more, depending on how clever
the compiler is), so that is a minimum of two complete passes just to
transpose and untranspose. Filtering each row is a single pass over
the structure, doing very little work per row if the mask is computed
once. My (still very unrigorous) performance tests seem to bear this
out: the transposed version consistently takes a few seconds longer.
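For reference, the transposed version I was timing was essentially
this (my reconstruction of the trick; selectT is my name for it, and
it needs transpose from Data.List):

-- filter whole columns once, keyed on the header cell, then transpose
-- back; that is two full passes over the structure just for the transposes
selectT targs test (cols : rows) =
    transpose [ c | c@(tag : _) <- transpose (cols : filter (test cols) rows)
                  , tag `elem` targs ]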

> I hope so... Though the 13 seconds are disproportionally high (only
> 22000 rows to be written) compared to 90 seconds (750000 rows to be
> written).
>
Yes, that does seem suspicious. Perhaps the cost of testing the
filter on each row accounts for this?

I've attached my current fastest code in case you, or anybody else,
would like to take a shot at tightening it further (or are just
interested in lightweight CSV-like processing).

thanks,
  Jeff

----------

import qualified Data.ByteString.Char8 as B
import Data.Char (isSpace)
import Data.List (intersperse)
import System.IO

-- Parse a file into rows of comma-separated fields.
readCSV file = do
  v <- B.readFile file
  return $ map (mySplit ',') $ B.lines v

writeCSV file tbl = do
    h <- openFile file WriteMode
    let writeRow = mapM_ (B.hPut h) . (++ [nl]) . intersperse comma
    mapM_ writeRow tbl
    hClose h
  where
    comma = B.singleton ','
    nl = B.singleton '\n'

-- Keep the header row and the rows passing 'test', then narrow each
-- row to the columns whose header is in 'targs'; the mask is computed
-- once from the header row.
select targs test (cols : rows) = map narrow (cols : filter (test cols) rows)
  where
    narrow = map snd . filter fst . zip mask
    mask = map (`elem` targs) cols

-- Look up a cell by column header (partial: assumes 'x' occurs in 'cols').
col x cols row = row !! i
  where Just i = lookup x $ zip cols [0..]
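
-- Example usage (the file names, tags, and predicate below are made up):
--
--   main = do tbl <- readCSV "in.csv"
--             let keep = map B.pack ["name", "score"]
--                 nonzero cols row = col (B.pack "score") cols row /= B.pack "0"
--             writeCSV "out.csv" (select keep nonzero tbl)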

{--
    A slightly smarter ByteString split which deals with quotes and
    removes surrounding whitespace. This could be optimized somewhat
    (e.g. by folding killSpace into mySplit), but the current
    performance seems to be as good as ByteString.split -- i.e. using
    ByteString.split versus mySplit doesn't affect running time.
--}
mySplit c bs = go False 0 bs where
    go isQuote i xs
        | i >= B.length xs       = [killSpace xs]
        | isQuote && x0 == '\\'  = go True (i+2) xs           -- skip escaped char
        | x0 == '"'              = go (not isQuote) (i+1) xs  -- toggle quote state
        | not isQuote && x0 == c = killSpace x' : go False 0 (B.tail xs')
        | otherwise              = go isQuote (i+1) xs
      where
        x0 = B.index xs i
        (x', xs') = B.splitAt i xs

-- Trim whitespace from both ends of a field.
killSpace = B.dropWhile isSpace . dropEndWhile isSpace

dropEndWhile test x
    | B.null x        = x
    | test (B.last x) = dropEndWhile test (B.init x)
    | otherwise       = x
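
-- A quick check of the quote handling:
--   mySplit ',' (B.pack "a, \"b,c\" ,d") == map B.pack ["a", "\"b,c\"", "d"]
-- (quotes are kept in the output; only surrounding whitespace is stripped)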

