[Haskell-beginners] Processing a list of files the Haskell way

edgar klerks edgar.klerks at gmail.com
Sat Mar 10 13:32:25 CET 2012


Hi Michael,

Your code has a very C-like feel to it. I would first separate
reading the directory structure and the files from walking over the
tree. Something like this:

data DirTree = FileNode FilePath | DirNode FilePath [DirTree]

walkDirTree :: (FilePath -> a) -> DirTree -> [a]
walkDirTree f (FileNode fp)   = [f fp]
walkDirTree f (DirNode fp fs) = f fp : concatMap (walkDirTree f) fs
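
As a small usage sketch, here is the walk applied to a made-up tree. The
tree and its paths are only illustrative assumptions. Note that the
function is applied to directory nodes as well as file nodes:

```haskell
data DirTree = FileNode FilePath | DirNode FilePath [DirTree]

-- Apply f to the path of every node, directories included, in pre-order.
walkDirTree :: (FilePath -> a) -> DirTree -> [a]
walkDirTree f (FileNode fp)   = [f fp]
walkDirTree f (DirNode fp fs) = f fp : concatMap (walkDirTree f) fs

-- Hypothetical example tree, not taken from the original question.
exampleTree :: DirTree
exampleTree =
  DirNode "src"
    [ FileNode "src/A.hs"
    , DirNode "src/B" [FileNode "src/B/C.hs"]
    ]

main :: IO ()
main = mapM_ putStrLn (walkDirTree id exampleTree)
-- prints: src, src/A.hs, src/B, src/B/C.hs (one per line)
```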


I know this isn't exactly what you need (I hadn't read your solution
properly when I wrote it), but it is a useful hint: separating the
pure part of your program from the IO part is important.
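
One way this separation might look, as a rough sketch: a pure function
collects the file paths, and a single IO function runs an action over
them. The names filePaths and processFiles are made up for this
illustration and do not come from any library.

```haskell
import Control.Monad (forM)

data DirTree = FileNode FilePath | DirNode FilePath [DirTree]

-- Pure part: collect the paths of all files, skipping directory nodes.
filePaths :: DirTree -> [FilePath]
filePaths (FileNode fp)  = [fp]
filePaths (DirNode _ fs) = concatMap filePaths fs

-- IO part: run an action on every file, one after the other, and pair
-- each result with the path it came from.
processFiles :: (FilePath -> IO a) -> DirTree -> IO [(FilePath, a)]
processFiles act tree = forM (filePaths tree) $ \fp -> do
  r <- act fp
  return (fp, r)

main :: IO ()
main = do
  -- Hypothetical tree; nothing is read from disk in this example.
  let tree = DirNode "src" [FileNode "src/A.hs", DirNode "src/B" [FileNode "src/B/C.hs"]]
  results <- processFiles (return . length) tree
  mapM_ print results
```

The pure half can be tested without touching the file system, and the
IO half stays a single, short loop.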

The open files are a separate problem. You are reading with lazy
bytestrings, and a lazily read file can keep its file descriptor open
until all of its bytes have been consumed. Because each md5 result is
an unevaluated thunk, no file is read to the end until the digests are
demanded, so the open descriptors pile up. I suspect you need to add
some strictness to your program: either switch to strict bytestrings,
or use seq to force each md5 thunk right after the file is read.
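
A minimal sketch of the forcing approach, under some assumptions: the
digest function is passed in as a parameter, so that the example only
needs base and bytestring, and BL.length stands in for md5 in main. It
also assumes that forcing the digest to weak head normal form consumes
the whole input, which is what lets the lazily read file be closed. The
name checksumsWith is made up for this illustration.

```haskell
import Control.Exception (evaluate)
import qualified Data.ByteString.Lazy as BL
import System.Environment (getArgs)

-- Apply a digest function to each file and force the result before the
-- next file is opened. `digest` is a stand-in for md5.
checksumsWith :: (BL.ByteString -> d) -> [FilePath] -> IO [(FilePath, d)]
checksumsWith digest = mapM one
  where
    one fp = do
      bytes <- BL.readFile fp
      -- evaluate forces the digest to weak head normal form right here,
      -- instead of leaving a thunk that still refers to the open file.
      d <- evaluate (digest bytes)
      return (fp, d)

main :: IO ()
main = do
  paths <- getArgs
  -- BL.length is only a placeholder digest for this sketch.
  sizes <- checksumsWith BL.length paths
  mapM_ print sizes
```

Writing `let d = digest bytes` followed by `d `seq` return (fp, d)` would
have the same effect as evaluate here.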

Greets,

Edgar

On 3/10/12, Michael Schober <Micha-Schober at web.de> wrote:
> Hi everyone,
>
> I'm currently trying to solve a problem in which I have to process a
> long list of files, more specifically I want to compute MD5 checksums
> for all files.
>
> I have code which lists me all the files and holds it in the following
> data structure:
>
> data DirTree = FileNode FilePath | DirNode FilePath [DirTree]
>
> I tried the following:
>
> -- calculates MD5 sums for all files in a dirtree
> addChecksums :: DirTree -> IO [(DirTree,MD5Digest)]
> addChecksums dir = addChecksums' [dir]
>    where
>      addChecksums' :: [DirTree] -> IO [(DirTree,MD5Digest)]
>      addChecksums' [] = return []
>      addChecksums' (f@(FileNode fp):re) = do
>        bytes <- BL.readFile fp
>        rest <- addChecksums' re
>        return ((f,md5 bytes):rest)
>      addChecksums' ((DirNode fp filelist):re) = do
>        efiles <- addChecksums' filelist
>        rest <- addChecksums' re
>        return $ efiles ++ rest
>
>
> This works fine, but only for a small number of files. If I try it on a
> big directory tree, the memory gets junked up and it aborts with an
> error message telling me that there are too many open files.
>
> So I guess, I have to sequentialize the code a little bit more. But at
> the same time, I want to keep it as functional as possible and I don't
> want to write C-like code.
>
> What would be the Haskell way to do something like this?
>
> Thanks for all the input,
> Michael


