[Haskell-beginners] Space leak while reading from a file?

Bob Ippolito bob at redivi.com
Mon May 12 18:32:54 UTC 2014


I don't know why I ended up looking at an old version of Data.Text but the
copy function has been around for a year, since 0.11.3.0.
http://hackage.haskell.org/package/text-1.1.1.2/docs/Data-Text.html#copy –
you'll want to use it for this use case.


On Mon, May 12, 2014 at 9:54 AM, Bob Ippolito <bob at redivi.com> wrote:

> I haven't looked closely but I suspect if you use foldM to build your Map
> rather than untilM it might consume less memory since you won't be
> allocating this big list.
>
> BUT… the real reason why the memory usage is so much higher than you
> expect is because slicing a Data.Text is an O(1) operation that shares the
> underlying buffer between the original and the slice. Calling T.words will
> ensure that the full original line stays around. You can see this in the
> implementation:
>
> http://hackage.haskell.org/package/text-0.11.2.0/docs/src/Data-Text.html#words
>
> This behavior is documented somewhat near there in the docs, but I think
> it should really be a top-level thing:
> http://hackage.haskell.org/package/text-0.11.2.0/docs/Data-Text.html#g:18
>
> I don't know what function to use to force the array to be copied,
> hopefully there is one! Erlang's binaries work similarly and there is a
> copy function for them for exactly this purpose:
> http://www.erlang.org/doc/man/binary.html
>
> -bob
>
>
>
> On Sun, May 11, 2014 at 4:24 AM, Jan Snajder <jan.snajder at fer.hr> wrote:
>
>> Dear all,
>>
>> I'm trying to implement a simple file-based database. I apparently have
>> a space leak, but I have no clue where it comes from.
>>
>> Here's the file-based database implementation:
>> http://pastebin.com/QqiqcXFw
>>
>> The idea to have a database table in a single textual file. One line
>> equals one table row. The fields within a row are whitespace separated.
>> The first field is the key. Because I'd like to work with large files, I
>> don't want to load the whole file into memory. Instead, I'd like to be
>> able to fetch the rows on demand, by keys. Thus I first create an index
>> that links keys to file seeks. I use the readerT to add the index to the
>> IO monad.
>>
>> For testing, I use a dummy table produced as follows:
>>
>> import System.IO
>> import Text.Printf
>> import Control.Monad
>>
>> row = unwords [printf "field%03d" (i::Int) | i <- [1..999]]
>>
>> main = do
>>   forM_ [1..250000] $ \i ->
>>      putStrLn $ printf "row%06d %s" (i::Int) row
>>
>> This generates a 2.1G textual file, which I store on my disk.
>>
>> The testing code:
>>
>> import FileDB
>> import qualified Data.Text as T
>> import Text.Printf
>> import Control.Applicative
>> import Control.Monad
>> import Control.Monad.Trans
>> import System.IO
>> import System.Environment
>>
>> main = do
>>   (f:_) <- getArgs
>>   t <- openTable f
>>   runDB t $ do
>>     ks <- getKeys
>>     liftIO $ do
>>       putStrLn . printf "%d keys read" $ length ks
>>       putStrLn "Press any key to continue..."
>>       getChar
>>     forM_ ks $ \k -> do
>>       Just r <- getRow k
>>       liftIO . putStrLn $ printf "Row \"%s\" has %d fields"
>>         (T.unpack k) (length r)
>>
>> When I run the test on the 2.1GB file, the whole program consumes 10GB.
>>
>> 6GB seem to be allocated after the index is built (just before entering
>> the forM_ function). The remaining 4GB are allocated while fetching all
>> the rows.
>>
>> I find both things difficult to explain.
>>
>> 6GB seems too much for the index. Each key is 9 characters (stored as
>> Data.Text), and I have 250K such keys in a Data.Map. Should this really
>> add up to 6GB?
>>
>> Also, I have no idea why fetching all the rows, one by one, should
>> consume any additional memory. Each row is fetched and its length is
>> computed and printed out. I see no reason for the rows to be retained in
>> the memory.
>>
>> Here's the memory allocation summary:
>>
>> > 1,093,931,338,632 bytes allocated in the heap
>> >    2,225,144,704 bytes copied during GC
>> >    4,533,898,000 bytes maximum residency (26 sample(s))
>> >    3,080,926,336 bytes maximum slop
>> >            10004 MB total memory in use (0 MB lost due to fragmentation)
>> >
>> >                                     Tot time (elapsed)  Avg pause  Max
>> pause
>> >   Gen  0     2171739 colls,     0 par   45.29s   45.26s     0.0000s
>>  0.0030s
>> >   Gen  1        26 colls,     0 par    1.50s    1.53s     0.0589s
>> 0.7087s
>> >
>> >   INIT    time    0.00s  (  0.00s elapsed)
>> >   MUT     time  279.92s  (284.85s elapsed)
>> >   GC      time   46.80s  ( 46.79s elapsed)
>> >   EXIT    time    0.68s  (  0.71s elapsed)
>> >   Total   time  327.40s  (332.35s elapsed)
>> >
>> >   %GC     time      14.3%  (14.1% elapsed)
>> >
>> >   Alloc rate    3,908,073,170 bytes per MUT second
>> >
>> >   Productivity  85.7% of total user, 84.4% of total elapsed
>>
>>
>> Btw., I don't get the "bytes allocated in the heap" figure, which is
>> approx. 1000 GB (?).
>>
>> I'm obviously doing something wrong here. I'd be thankful for any help.
>>
>> Best,
>> Jan
>> _______________________________________________
>> Beginners mailing list
>> Beginners at haskell.org
>> http://www.haskell.org/mailman/listinfo/beginners
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/beginners/attachments/20140512/4c0bb747/attachment.html>


More information about the Beginners mailing list