Hello Justin,<br><br>I tried, and what I saw was a constant increase in memory usage.<br>Is there a particular profiling option that you would use?<br><br>I do remember that with one particular option the leak would disappear (for the same amount of work), and that is why I stopped using the profiler.<br>
<br>Thanks,<br><br>Arnoldo<br><br><br><div class="gmail_quote">On Wed, Mar 10, 2010 at 10:20 PM, Justin Bailey <span dir="ltr"><<a href="mailto:jgbailey@gmail.com">jgbailey@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Have you used the profiling tools available with GHC?<br>
<br>
<a href="http://haskell.org/ghc/docs/latest/html/users_guide/profiling.html" target="_blank">http://haskell.org/ghc/docs/latest/html/users_guide/profiling.html</a><br>
<div><div></div><div class="h5"><br>
<br>
On Wed, Mar 10, 2010 at 12:45 PM, Arnoldo Muller<br>
<<a href="mailto:arnoldomuller@gmail.com">arnoldomuller@gmail.com</a>> wrote:<br>
> Hello,<br>
><br>
> I am learning Haskell and I found a space leak that I find difficult to<br>
> solve. I've been asking at #haskell but we could not solve<br>
> the issue.<br>
><br>
> I want to lazily read a set of 22 files of about 200MB each, filter them, and<br>
> then write the result to a single output file.<br>
> If I modify the main function to work only with one input file, the program<br>
> runs without issues. I will call this version A.<br>
> Version B uses a mapM_ to iterate over a list of filenames and uses<br>
> appendFile to output the result of filtering each file.<br>
> In this case the memory usage grows sharply and quickly (profiles show<br>
> constant memory growth). In less than a minute, memory<br>
> occupation will make my system hang with swapping.<br>
><br>
> This is version B:<br>
><br>
> ------------------------------- Program B<br>
> --------------------------------------------------------------------------------------------------------------------<br>
> import Data.List<br>
> import System.Environment<br>
> import System.Directory<br>
> import Control.Monad<br>
><br>
><br>
> -- different types of chromosomes<br>
> data Chromosome = C1<br>
> | C2<br>
> | C3<br>
> | C4<br>
> | C5<br>
> | C6<br>
> | C7<br>
> | C8<br>
> | C9<br>
> | C10<br>
> | C11<br>
> | C12<br>
> | C13<br>
> | C14<br>
> | C15<br>
> | C16<br>
> | C17<br>
> | C18<br>
> | C19<br>
> | CX<br>
> | CY<br>
> | CMT<br>
> deriving (Show)<br>
> -- define a window<br>
> type Sequence = [Char]<br>
> -- Window data<br>
> data Window = Window { sequen :: Sequence,<br>
> chrom :: Chromosome,<br>
> pos :: Int<br>
> }<br>
> -- print a window<br>
> instance Show Window where<br>
> show w = (sequen w) ++ "\t" ++ show (chrom w) ++ "\t" ++ show (pos w)<br>
><br>
> -- Reading fasta files with haskell<br>
><br>
> -- Initialize the<br>
> main = do<br>
> -- get the arguments (input is<br>
> [input, output, windowSize] <- getArgs<br>
> -- get directory contents (only names)<br>
> names <- getDirectoryContents input<br>
> -- prepend directory<br>
> let fullNames = filter isFastaFile $ map (\x -> input ++ "/" ++ x) names<br>
> let wSize = (read windowSize)::Int<br>
> -- process the directories<br>
> mapM_ (genomeExecute output wSize filterWindow) fullNames<br>
><br>
><br>
> -- read the files one by one and write them to the output file<br>
> genomeExecute :: String -> Int -> (Window -> Bool) -> String -> IO ()<br>
> genomeExecute outputFile windowSize f inputFile = do<br>
> fileData <- readFile inputFile<br>
> appendFile outputFile $ fastaExtractor fileData windowSize f<br>
><br>
> --<br>
> isFastaFile :: String -> Bool<br>
> isFastaFile fileName = isSuffixOf ".fa" fileName<br>
><br>
><br>
> -- fasta extractor (receives a Fasta String and returns a windowed string ready to be sorted)<br>
> -- an example on how to compose several functions to parse a fasta file<br>
> fastaExtractor :: String -> Int -> (Window -> Bool) -> String<br>
> fastaExtractor input wSize f = printWindowList $ filter f $ readFasta wSize input<br>
><br>
> -- MAIN FILTER that removes N elements from the strings!<br>
> filterWindow :: Window -> Bool<br>
> filterWindow w = not (elem 'N' (sequen w))<br>
><br>
> -- print a window list (the printing makes it ready for output as raw data)<br>
> printWindowList :: [Window] -> String<br>
> printWindowList l = unlines $ map show l<br>
><br>
> -- read fasta, remove stuff that is not useful from it<br>
> -- removes the<br>
> readFasta :: Int -> [Char] -> [Window]<br>
> readFasta windowSize sequence =<br>
> -- get the header<br>
> let (header:rest) = lines sequence<br>
> chr = parseChromosome header<br>
> in<br>
><br>
> -- We now do the following:<br>
> -- take window create counter remove newlines<br>
> map (\(i, w) -> Window w chr i) $ zip [0..] $ slideWindow windowSize $ filter ( '\n' /= ) $ unlines rest<br>
><br>
><br>
> slideWindow :: Int -> [Char] -> [[Char]]<br>
> slideWindow _ [] = []<br>
> slideWindow windowSize l@(_:xs) = take windowSize l : slideWindow windowSize xs<br>
><br>
><br>
><br>
> -- Parse the chromosome from a fasta comment<br>
> -- produce a more compact chromosome representation<br>
> parseChromosome :: [Char] -> Chromosome<br>
> parseChromosome line<br>
> | isInfixOf "chromosome 1," line = C1<br>
> | isInfixOf "chromosome 2," line = C2<br>
> | isInfixOf "chromosome 3," line = C3<br>
> | isInfixOf "chromosome 4," line = C4<br>
> | isInfixOf "chromosome 5," line = C5<br>
> | isInfixOf "chromosome 6," line = C6<br>
> | isInfixOf "chromosome 7," line = C7<br>
> | isInfixOf "chromosome 8," line = C8<br>
> | isInfixOf "chromosome 9," line = C9<br>
> | isInfixOf "chromosome 10," line = C10<br>
> | isInfixOf "chromosome 11," line = C11<br>
> | isInfixOf "chromosome 12," line = C12<br>
> | isInfixOf "chromosome 13," line = C13<br>
> | isInfixOf "chromosome 14," line = C14<br>
> | isInfixOf "chromosome 15," line = C15<br>
> | isInfixOf "chromosome 16," line = C16<br>
> | isInfixOf "chromosome 17" line = C17<br>
> | isInfixOf "chromosome 18" line = C18<br>
> | isInfixOf "chromosome 19" line = C19<br>
> | isInfixOf "chromosome X" line = CX<br>
> | isInfixOf "chromosome Y" line = CY<br>
> | isInfixOf "mitochondrion" line = CMT<br>
> | otherwise = error "BAD header"<br>
><br>
><br>
> -------------------------------- End of program B<br>
> ------------------------------------------------------------------------------------------------<br>
><br>
> -------------------------------- Program A<br>
> ---------------------------------------------------------------------------------------------------------<br>
> If instead of the main function given above I use the following main<br>
> function to process only one input file, things work OK for even<br>
> the largest files. Memory usage remains constant in this case.<br>
><br>
> main = do<br>
> -- get the arguments<br>
> [input, output, windowSize] <- getArgs<br>
> -- keep the input stream<br>
> inpStr <- readFile input<br>
> let wSize = (read windowSize)::Int<br>
> writeFile output $ fastaExtractor inpStr wSize filterWindow<br>
><br>
><br>
> It is not easy for me to see why Haskell is keeping data in memory. Do you<br>
> have any idea why program B is<br>
> not working?<br>
><br>
> Thank you for your help!<br>
><br>
> Arnoldo Muller<br>
><br>
</div></div>> _______________________________________________<br>
> Haskell-Cafe mailing list<br>
> <a href="mailto:Haskell-Cafe@haskell.org">Haskell-Cafe@haskell.org</a><br>
> <a href="http://www.haskell.org/mailman/listinfo/haskell-cafe" target="_blank">http://www.haskell.org/mailman/listinfo/haskell-cafe</a><br>
><br>
><br>
</blockquote></div><br>
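For readers following the quoted program, a minimal sketch of what its sliding-window step produces on a tiny input may help. slideWindow is copied from program B above; indexedWindows is a hypothetical helper name (not in the original) that mirrors the zip [0..] step inside readFasta. This only illustrates the shape of the data and makes no claim about the cause of the leak.<br>

```haskell
-- slideWindow exactly as defined in the quoted program B.
slideWindow :: Int -> [Char] -> [[Char]]
slideWindow _ [] = []
slideWindow windowSize l@(_:xs) = take windowSize l : slideWindow windowSize xs

-- Hypothetical helper (an assumption, not part of the original program):
-- pairs each window with its starting offset, as readFasta does with zip [0..].
indexedWindows :: Int -> [Char] -> [(Int, [Char])]
indexedWindows windowSize = zip [0..] . slideWindow windowSize
```

In GHCi, indexedWindows 3 "ACGT" evaluates to [(0,"ACG"),(1,"CGT"),(2,"GT"),(3,"T")]. Note that the final windows are shorter than the requested size, because take returns whatever is left of the list; each input character therefore starts one window.<br>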