Thanks for the responses everyone, I'll try them out and see what happens :)<br>Andrew<br><br><div class="gmail_quote">On Fri, Jun 8, 2012 at 4:40 PM, Johan Tibell <span dir="ltr"><<a href="mailto:johan.tibell@gmail.com" target="_blank">johan.tibell@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Andrew,<br>
<div class="im"><br>
On Thu, Jun 7, 2012 at 5:39 PM, Andrew Myers <<a href="mailto:asm198@gmail.com">asm198@gmail.com</a>> wrote:<br>
> Hi Cafe,<br>
> I'm working on inspecting some data that I'm trying to represent as records<br>
> in Haskell and seeing about twice the memory footprint I was<br>
> expecting. I've got roughly 1.4 million records in a CSV file (400M on<br>
> disk) that I parse in using bytestring-csv. bytestring-csv returns a<br>
> [[ByteString]] (wrapped in `type`s) which I then convert into a list of<br>
> records that have the following structure:<br>
><br>
>> 3 Int<br>
>> 1 Text Length 3<br>
>> 1 Text Length 11<br>
>> 12 Float<br>
>> 1 UTCTime<br>
><br>
> All fields are marked strict and have {-# UNPACK #-} pragmas (I'm guessing<br>
> that doesn't do anything for non-primitives). (Side note, is there a way to<br>
> check if things are actually being unpacked?)<br>
<br>
</div>GHC used to complain when you use UNPACK with something that can't be<br>
unpacked, but that warning seems to have been (accidentally) removed<br>
in 7.4.1.<br>
<br>
The rule for unpacking is:<br>
<br>
* all product types (i.e. types with only one constructor) can be<br>
unpacked. This includes Int, Char, Double, etc. and tuples or records<br>
thereof.<br>
* sum types (i.e. data types with more than one constructor) and<br>
polymorphic fields can't be unpacked.<br>
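<br>
As a rough illustration of the rule above, here is a sketch with made-up<br>
types (Point, Segment, Labelled and Tagged are hypothetical, not Andrew's<br>
actual record). The comments describe the rule as stated in this mail;<br>
other GHC versions may behave differently. One way to check what really<br>
happened is probably to read the Core from -ddump-simpl and look for<br>
unboxed types such as Int# in the constructor's fields.<br>

```haskell
-- Hypothetical example types illustrating the unpacking rule.

-- Int and Double are single-constructor types, so both fields can be
-- stored inline in the Point constructor instead of behind pointers.
data Point = Point
    {-# UNPACK #-} !Int
    {-# UNPACK #-} !Double

-- A record made of unpackable fields is itself a product type,
-- so it can be unpacked into an enclosing constructor.
data Segment = Segment
    {-# UNPACK #-} !Point
    {-# UNPACK #-} !Point

-- Maybe has two constructors (a sum type), so by the rule above this
-- field stays a pointer; the bang only makes it strict.
data Labelled = Labelled
    {-# UNPACK #-} !Int
    !(Maybe Int)

-- A polymorphic field cannot be unpacked either, because its
-- representation is not known when Tagged is compiled.
data Tagged a = Tagged !a {-# UNPACK #-} !Int

-- Small accessor so the types above are actually used.
segmentDx :: Segment -> Int
segmentDx (Segment (Point x0 _) (Point x1 _)) = x1 - x0
```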
<div class="im"><br>
> My back of the napkin memory estimates based on the assumption that nothing<br>
> is being unpacked (and my very spotty understanding of Haskell data<br>
> structures):<br>
><br>
> Platform: 64 Bit Linux<br>
> # Type (Sizeof type (occasionally a guess))<br>
><br>
> 3 * Int (8)<br>
> 14 * Char (4) -- Text is some kind of bytestring but I'm guessing it can't<br>
> be worse than the same number of Char?<br>
> 12 * Float (4)<br>
> 18 * sizeOf (ptr) (8)<br>
> UTC: -- From what I can gather through :info in ghci<br>
> 4 * (ptr) (8)<br>
> 2 * Integer (16) -- Shouldn't be overly large, times are within 2012<br>
<br>
</div>All fields in a constructor are word aligned. This means that all<br>
primitive types take 8 bytes on a 64-bit platform, including Char and<br>
Float. You might find the following blog posts by me useful in<br>
computing the size of data structures:<br>
<br>
<a href="http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html" target="_blank">http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html</a><br>
<a href="http://blog.johantibell.com/2011/06/computing-size-of-hashmap.html" target="_blank">http://blog.johantibell.com/2011/06/computing-size-of-hashmap.html</a><br>
<a href="http://blog.johantibell.com/2011/11/slides-from-my-guest-lecture-at.html" target="_blank">http://blog.johantibell.com/2011/11/slides-from-my-guest-lecture-at.html</a><br>
<br>
Here's some more on the topic:<br>
<br>
<a href="http://stackoverflow.com/questions/3254758/memory-footprint-of-haskell-data-types" target="_blank">http://stackoverflow.com/questions/3254758/memory-footprint-of-haskell-data-types</a><br>
<a href="http://stackoverflow.com/questions/6574444/how-to-find-out-ghcs-memory-representations-of-data-types" target="_blank">http://stackoverflow.com/questions/6574444/how-to-find-out-ghcs-memory-representations-of-data-types</a><br>
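<br>
To make the word-alignment point concrete, here is a back-of-the-envelope<br>
sketch for an 18-field record like Andrew's. It assumes the usual layout<br>
of one header word per constructor plus one word per field, counts only<br>
the record itself and its 15 numeric fields, and ignores the Text and<br>
UTCTime payloads and the list cells. The names are made up for the example.<br>

```haskell
-- Bytes per machine word on a 64-bit platform.
wordSize :: Int
wordSize = 8

-- Shallow size of one constructor: one header word plus one word per field.
-- (Assumes a non-profiling build; profiling adds extra header words.)
constructorBytes :: Int -> Int
constructorBytes fields = (1 + fields) * wordSize

-- A boxed Int, Float or Char: header word plus one payload word.
boxedPrimBytes :: Int
boxedPrimBytes = constructorBytes 1

-- 3 Int + 2 Text + 12 Float + 1 UTCTime.
recordFields :: Int
recordFields = 3 + 2 + 12 + 1

-- Numeric fields unpacked: each one occupies a single word in the record.
unpackedBytes :: Int
unpackedBytes = constructorBytes recordFields

-- Nothing unpacked: each of the 15 numeric fields is a pointer to a box.
boxedBytes :: Int
boxedBytes = constructorBytes recordFields + 15 * boxedPrimBytes
```

Under these assumptions the record costs 152 bytes with the numeric fields<br>
unpacked and 392 bytes with every numeric field boxed, before counting the<br>
Text and UTCTime contents.<br>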
<div class="im"><br>
> I've written a small driver test program that just parses the CSV, finds the<br>
> minimum value for a couple of the Float fields, and exits. In the process<br>
> monitor the memory usage is 6.9G before the program exits. I've tried<br>
> profiling with +RTS -hc but it ran for >3 hours without finishing; it<br>
> normally finishes within 4 minutes. Anyone have any ideas for me? Things<br>
> to try?<br>
> Thanks,<br>
> Andrew<br>
<br>
</div>You could try a 32-bit GHC, which would use about half the<br>
memory because words and pointers are 4 bytes instead of 8. You're at<br>
the limit of the size of data that you can<br>
comfortably fit in memory on a normal desktop machine, so it might be<br>
time to consider a streaming approach.<br>
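<br>
A minimal sketch of what a streaming version of the "minimum of one<br>
column" test could look like, using only lazy ByteStrings from the<br>
bytestring package rather than bytestring-csv. The helper names, the<br>
column index and the file name are assumptions for the example; the<br>
naive comma split ignores quoted fields and assumes there is no header<br>
row.<br>

```haskell
import qualified Data.ByteString.Lazy.Char8 as BL
import Data.List (foldl')

-- Hypothetical helper: parse the n-th comma-separated column of one line
-- as a Double. Goes through String and 'read' for brevity, which is slow;
-- a real program would use a proper numeric parser.
column :: Int -> BL.ByteString -> Double
column n line = read (BL.unpack (BL.split ',' line !! n))

-- Strict left fold over the lines. Only the running minimum is retained,
-- so lines that have been consumed can be garbage collected.
minColumn :: Int -> BL.ByteString -> Double
minColumn n = foldl' min (1 / 0) . map (column n) . BL.lines

-- Possible usage (file name and column index are made up):
--   BL.readFile "records.csv" >>= print . minColumn 6
```

Because the file is read lazily and no list of records is ever built, the<br>
resident memory should stay roughly constant instead of growing with the<br>
number of rows.<br>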
<span class="HOEnZb"><font color="#888888"><br>
-- Johan<br>
</font></span></blockquote></div><br>