[Haskell-cafe] Re: optimization help

apfelmus at quantentunnel.de apfelmus at quantentunnel.de
Tue Oct 17 05:38:31 EDT 2006


jeff p wrote:
> Hello,
> 
>> Yet, I'm a bit astonished. I thought that when compiling with -O2,
>> cosmetic changes should become negligible. Perhaps the strict foldl' has
>> an effect?
>>
> Perhaps... but I doubt that is the main reason. At the moment I have
> no idea why there is such a discrepancy between the heap usages...
> 
> A big part of why the solutions you crafted work so efficiently is
> that they take advantage of the fact that the rows will be written out
> exactly as they are read in. I wanted to see if a more general code
> could maintain the same efficiency. Here is some code to read in a
> file, write out a file, and do selections-- the idea is that CSV files
> are internally represented and manipulated as [[ByteString]].
> 
> readCSV file = do
>  v <- B.readFile file
>  return $ map (B.split ',') $ B.lines v

Good, writeCSV writes out every row immediately after it got it. I
eliminated (++ [nl]) in the hope of reducing the constant factor
slightly. Using difference lists for that is nicer but here you go.

> writeCSV file tbl = do
>    h <- openFile file WriteMode
>    mapM_ (writeRow h) tbl
>    hClose h
>  where
>    comma       = B.singleton ','
>    nl          = B.singleton '\n'
>    whriteRow h row =
>        mapM_ (B.hPut h) (intersperse comma row) >> B.hPut h nl

Concerning select, one myFilter can be fused away and there is the
"transpose trick" for filtering out the columns: columns get filtered
once and for all and (map (`elem` tags)) only needs to be computed once.
I don't know why the MonadReader is necessary, so I removed it

> select targs test (cols : rows) = cols : filterCols (filterRows rows)
>  where
>    filterRows = filter (test cols)
>    myFilter   = map snd . filter fst
>    filterCols = transpose . myFilter . zip colflags . transpose
>    colflags   = map (`elem` tags) cols

Concerning col, one should share the index i across different rows. The
compiler is likely not to do a full laziness transformation as this
bears the danger of introducing space leaks (out of the coder's control,
that is).

> col x cols = \row -> row !! i
>     where
>     Just i = lookup (B.pack x) $ zip cols [0..]

A possible test is then something like

> test = (== B.pack "test") . col "COL"


> This code runs reasonably fast-- around 13 seconds to read in a 120MB
> file (~750000 rows), select half the columns of around 22000 rows
> randomly distributed throughout the input table, and write a new CSV
> file. It takes around 90 seconds to just remove some columns from
> every row in the table and write a new file. So the slow part of the
> program is probably the writeCSV function. Do you think these times
> can be improved upon?

I hope so... Though the 13 seconds are disproportionally high (only
22000 rows to be written) compared to 90 seconds (750000 rows to be
written).

Regards,
apfelmus



More information about the Haskell-Cafe mailing list