[Haskell-cafe] Re: optimization help

Tue Oct 17 05:38:31 EDT 2006

jeff p wrote:
> Hello,
> 
>> Yet, I'm a bit astonished. I thought that when compiling with -O2,
>> cosmetic changes should become negligible. Perhaps the strict foldl' has
>> an effect?
>>
> Perhaps... but I doubt that is the main reason. At the moment I have
> no idea why there is such a discrepancy between the heap usages...
> 
> A big part of why the solutions you crafted work so efficiently is
> that they take advantage of the fact that the rows will be written out
> exactly as they are read in. I wanted to see if a more general code
> could maintain the same efficiency. Here is some code to read in a
> file, write out a file, and do selections-- the idea is that CSV files
> are internally represented and manipulated as [[ByteString]].
> 
> readCSV file = do
>  v <- B.readFile file
>  return $ map (B.split ',') $ B.lines v

Good, writeCSV writes out every row immediately after it got it. I
eliminated (++ [nl]) in the hope of reducing the constant factor
slightly. Using difference lists for that is nicer but here you go.

> writeCSV file tbl = do
>    h <- openFile file WriteMode
>    mapM_ (writeRow h) tbl
>    hClose h
>  where
>    comma       = B.singleton ','
>    nl          = B.singleton '\n'
>    whriteRow h row =
>        mapM_ (B.hPut h) (intersperse comma row) >> B.hPut h nl

Concerning select, one myFilter can be fused away and there is the
"transpose trick" for filtering out the columns: columns get filtered
once and for all and (map (`elem` tags)) only needs to be computed once.
I don't know why the MonadReader is necessary, so I removed it

> select targs test (cols : rows) = cols : filterCols (filterRows rows)
>  where
>    filterRows = filter (test cols)
>    myFilter   = map snd . filter fst
>    filterCols = transpose . myFilter . zip colflags . transpose
>    colflags   = map (`elem` tags) cols

Concerning col, one should share the index i across different rows. The
compiler is likely not to do a full laziness transformation as this
bears the danger of introducing space leaks (out of the coder's control,
that is).

> col x cols = \row -> row !! i
>     where
>     Just i = lookup (B.pack x) $ zip cols [0..]

A possible test is then something like

> test = (== B.pack "test") . col "COL"

> This code runs reasonably fast-- around 13 seconds to read in a 120MB
> file (~750000 rows), select half the columns of around 22000 rows
> randomly distributed throughout the input table, and write a new CSV
> file. It takes around 90 seconds to just remove some columns from
> every row in the table and write a new file. So the slow part of the
> program is probably the writeCSV function. Do you think these times
> can be improved upon?

I hope so... Though the 13 seconds are disproportionally high (only
22000 rows to be written) compared to 90 seconds (750000 rows to be
written).

Regards,
apfelmus