[Haskell-beginners] Processing a list of files the Haskell way

Thu Mar 15 12:14:59 CET 2012

On 03/14/2012 01:04 PM, Chaddaï Fouché wrote:
> In my own tests it wasn't very fast : 10min to check a 25GB hierarchy of
> music files, using 13MB of memory maximum. Though I must admit that I
> didn't try to find similar tools to compare so I'm not too certain of
> normal performance times.
I finally got it running after a Haskell-platform update from source to 
resolve some dependencies, but now it works like a charm. Runtime isn't 
an issue for me so far and memory consumption seems to be in acceptable 
levels, so it's a very good start - thanks again.

I took the liberty to modify the output a little bit to my needs - maybe 
a future reader will find it helpful, too. It's attached below.

Best,
Michael

-- 

module Main where

import Data.Conduit.Filesystem (traverse)
import qualified Data.Conduit.List as CL
import Data.Conduit

import Data.Digest.Pure.MD5 (MD5Digest)
import Crypto.Conduit (hashFile)

import qualified Data.Map as M
import qualified Filesystem.Path.CurrentOS as FP
import System.Environment

duplicates :: FilePath -> IO [(MD5Digest,[FilePath])]
duplicates dir = runResourceT $ do
    md5s <- traverse False (FP.decodeString dir) $$ CL.mapM process =$ 
CL.fold buildMap M.empty
    return . M.assocs . M.filter ((>1).length) $ md5s
      where
        process :: FP.FilePath -> IO (MD5Digest, FilePath)
        process fp = do
          let strFp = FP.encodeString fp
          md5 <- hashFile strFp
          return (md5,strFp)
        buildMap m (md5,fp) = M.insertWith' (flip (++)) md5 [fp] m

main = do
   [dir] <- getArgs
   putStrLn . unlines . map (\(md5,paths) -> (show md5) ++ "-->\n" ++ 
(unlines paths)) =<< duplicates dir