[Haskell-cafe] Sneaking haskell in the workplace -- cleaning csv files

Fri Jun 15 17:07:01 EDT 2007

On 15/06/07, Jim Burton <jim at sdf-eu.org> wrote:
>
>
> I need to remove newlines from csv files (within columns, not at the end
> of
> entire lines). This is prior to importing into a database and was being
> done
> at my workplace by a java class for quite a while until the files
> processed
> got bigger and it proved to be too slow. (The files are up to ~250MB at
> the
> moment) It was rewritten in PL/SQL, to run after the import, which was an
> improvement, but it still has our creaky db server thrashing away. (You
> may
> have lots of helpful suggestions in mind, but we can't clean the data at
> source and AFAIK we can't do it incrementally because there is no
> timestamp
> or anything on the last change to a row from the legacy db.)
>
> We don't need a general solution - if a line ends with a delimiter we can
> be
> sure it's the end of the entire line because that's the way the csv files
> are generated.
>
> I had a quick go with ByteString (with no attempt at robustness etc) and
> although I haven't compared it properly it seems faster than what we have
> now. But you can easily make it faster, surely! Hints for improvement
> please
> (e.g. can I unbox anything, make anything strict, or is that handled by
> ByteString, is there a more efficient library function to replace the
> fold...?).
>
> module Main
>     where
> import System.Environment (getArgs)
> import qualified Data.ByteString.Char8 as B
>
> --remove newlines in the middle of 'columns'
> clean :: Char -> [B.ByteString] -> [B.ByteString]
> clean d = foldr (\x ys -> if B.null x || B.last x == d then x:ys else
> (B.append x $ head ys):(tail ys)) []
>
> main = do args <- getArgs
>           if length args < 2
>            then putStrLn "Usage: crunchFile INFILE OUTFILE [DELIM]"
>            else do bs <- B.readFile (args!!0)
>                    let d = if length args == 3 then head (args!!2) else
> '"'
>                    B.writeFile (args!!1) $ (B.unlines . clean d . B.lines)
> bs
>
>
Hi,
I haven't compiled this, but you get the general idea:

import qualified Data.ByteString.Lazy.Char8 as B
-- takes a bytestring representing the file, concats the lines
-- then splits it up into "real" lines using the delimiter
clean :: Char -> B.ByteString -> [B.ByteString]
clean' d = B.split d . B.concat . B.lines


-- 
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.haskell.org/pipermail/haskell-cafe/attachments/20070615/341a05f6/attachment.htm