Thanks for your comments, John. <div>I appreciate your work. I think pandoc is fantastic! </div><div><br></div><div>I'm interested in solving this problem, but time is also an issue.</div><div>I'll try to toy around with it.</div>
<div><br></div><div>Thanks,</div><div><br></div><div>Pieter</div><div><br></div><div><br></div><div><br><div class="gmail_quote">On Tue, Aug 10, 2010 at 7:06 PM, John MacFarlane <span dir="ltr"><<a href="mailto:jgm@berkeley.edu">jgm@berkeley.edu</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">Hi all,<br>
<br>
I'm the author of zip-archive. I wrote it for a fairly special purpose --<br>
I wanted to create and read ODT files in pandoc -- and I know it could be<br>
improved.<br>
<br>
The main problem is that the parsing algorithm is kind of stupid; it just<br>
reads the whole archive in sequence, storing the files as it comes to them.<br>
So a file listing will take almost as much time as a full extract.<br>
<br>
There's a better way: The zip archive ends with an "end of central directory<br>
record", which contains (among other things) the offset of the central<br>
directory from the beginning of the file. So, one could use something like the<br>
following strategy:<br>
<br>
1. read the "end of central directory record", which is the last 22 bytes of<br>
the file when the archive has no trailing comment (otherwise, scan backwards<br>
from the end for its signature). I think it should be possible to do this<br>
without allocating memory for the whole file.<br>
<br>
2. parse that to determine the offset of the central directory.<br>
<br>
3. seek to the offset of the central directory and parse it. This will give<br>
you a list of file headers. Each file header will tell you the name of a file<br>
in the archive, how it is compressed, and where to find it (its offset) in the<br>
file.<br>
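<br>
Steps 1 and 2 could be sketched roughly as follows. This is only a minimal<br>
sketch, not code from zip-archive: the names Eocd, leInt, findEocd, parseEocd<br>
and readEocd are hypothetical, it assumes a single-disk archive without ZIP64<br>
extensions, and it reads at most the last 22 + 65535 bytes so that an optional<br>
trailing archive comment is tolerated.<br>
<pre>
```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, (.|.))
import System.IO

-- Fields of interest from the end of central directory (EOCD) record.
-- Hypothetical type, not part of zip-archive.
data Eocd = Eocd
  { eocdEntries :: Int  -- total number of central directory entries
  , eocdSize    :: Int  -- size of the central directory in bytes
  , eocdOffset  :: Int  -- offset of the central directory from the file start
  } deriving (Eq, Show)

-- Decode an n-byte little-endian unsigned integer starting at offset off.
leInt :: Int -> Int -> B.ByteString -> Int
leInt off n bs =
  foldr (\b acc -> (acc `shiftL` 8) .|. fromIntegral b) 0
        (B.unpack (B.take n (B.drop off bs)))

-- EOCD signature 0x06054b50 as it appears on disk (little-endian).
eocdSig :: B.ByteString
eocdSig = B.pack [0x50, 0x4b, 0x05, 0x06]

-- Scan backwards for the last EOCD signature that still leaves room
-- for the 22 fixed bytes of the record.
findEocd :: B.ByteString -> Maybe Int
findEocd buf = go (B.length buf - 22)
  where
    go i
      | i < 0 = Nothing
      | B.take 4 (B.drop i buf) == eocdSig = Just i
      | otherwise = go (i - 1)

-- Parse the EOCD record out of a buffer holding the tail of the archive.
parseEocd :: B.ByteString -> Maybe Eocd
parseEocd buf = do
  i <- findEocd buf
  let r = B.drop i buf
  return (Eocd (leInt 10 2 r) (leInt 12 4 r) (leInt 16 4 r))

-- Read only the tail of the file, never the whole archive.
readEocd :: FilePath -> IO (Maybe Eocd)
readEocd path = withBinaryFile path ReadMode $ \h -> do
  size <- hFileSize h
  let tailLen = min size (22 + 65535)  -- fixed record plus maximum comment
  hSeek h SeekFromEnd (negate tailLen)
  buf <- B.hGet h (fromIntegral tailLen)
  return (parseEocd buf)
```
</pre>
Calling readEocd on an archive path would then give the entry count and the<br>
central directory offset needed for step 3. A more careful version would also<br>
check that the comment length stored in the record matches the bytes that<br>
follow it.<br>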
<br>
At this point you'd have the list of files, and enough information to seek to<br>
any file and read it from the archive. The API could be changed to allow lazy<br>
reading of a single file without reading all of them.<br>
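<br>
To illustrate what the list of file headers might look like once parsed, here<br>
is a rough sketch that walks a central directory held in a strict ByteString.<br>
The names CdEntry, leInt, cdSig and parseCentralDir are hypothetical and not<br>
part of zip-archive; the sketch assumes no ZIP64 extensions and decodes file<br>
names byte-for-byte, which is only right for ASCII/Latin-1 names.<br>
<pre>
```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Bits (shiftL, (.|.))

-- One central directory file header, reduced to the fields needed to
-- list an entry and later seek to it. Hypothetical type.
data CdEntry = CdEntry
  { ceName           :: String  -- file name (bytes decoded one-to-one)
  , ceMethod         :: Int     -- compression method (0 = stored, 8 = deflate)
  , ceCompressedSize :: Int     -- compressed size in bytes
  , ceLocalOffset    :: Int     -- offset of the local file header in the archive
  } deriving (Eq, Show)

-- Decode an n-byte little-endian unsigned integer starting at offset off.
leInt :: Int -> Int -> B.ByteString -> Int
leInt off n bs =
  foldr (\b acc -> (acc `shiftL` 8) .|. fromIntegral b) 0
        (B.unpack (B.take n (B.drop off bs)))

-- Central directory file header signature 0x02014b50 (little-endian).
cdSig :: B.ByteString
cdSig = B.pack [0x50, 0x4b, 0x01, 0x02]

-- Walk the central directory, producing entries lazily. Parsing stops
-- at the first position that does not start with a header signature.
parseCentralDir :: B.ByteString -> [CdEntry]
parseCentralDir bs
  | B.length bs < 46 || B.take 4 bs /= cdSig = []
  | otherwise = entry : parseCentralDir (B.drop total bs)
  where
    nameLen    = leInt 28 2 bs
    extraLen   = leInt 30 2 bs
    commentLen = leInt 32 2 bs
    -- 46 fixed bytes, then the three variable-length fields
    total      = 46 + nameLen + extraLen + commentLen
    entry = CdEntry
      { ceName           = BC.unpack (B.take nameLen (B.drop 46 bs))
      , ceMethod         = leInt 10 2 bs
      , ceCompressedSize = leInt 20 4 bs
      , ceLocalOffset    = leInt 42 4 bs
      }
```
</pre>
A file listing would then be map ceName over the result, and reading a single<br>
file would mean seeking to ceLocalOffset, skipping the local header, and<br>
decompressing ceCompressedSize bytes according to ceMethod.<br>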
<br>
I don't think these changes would be too difficult, since you wouldn't have to<br>
change any of the functions that do the binary parsing -- it would just be a<br>
matter of changing the top-level functions.<br>
<br>
I don't have time to do this right now, but if one of you wants to tackle the<br>
problem, patches are more than welcome! There's some documentation on the ZIP<br>
format in comments in the source code.<br>
<br>
John<br>
<br>
<br>
+++ Neil Brown [Aug 10 10 12:35 ]:<br>
<div><div></div><div class="h5">> On 10/08/10 00:29, Pieter Laeremans wrote:<br>
> >Hello,<br>
> ><br>
> >I'm trying some haskell scripting. I'm writing a script to print<br>
> >some information<br>
> >from a zip archive. The zip-archive library does look nice but<br>
> >the performance of zip-archive/lazy bytestring<br>
> >doesn't seem to scale.<br>
> ><br>
> >Executing :<br>
> ><br>
> > eRelativePath $ head $ zEntries archive<br>
> ><br>
> >on an archive of around 12 MB with around 20 files yields<br>
> ><br>
> >Stack space overflow: current size 8388608 bytes.<br>
> ><br>
> ><br>
> >The script in question can be found at :<br>
> ><br>
> ><a href="http://github.com/plaeremans/HaskellSnipplets/blob/master/ZipList.hs" target="_blank">http://github.com/plaeremans/HaskellSnipplets/blob/master/ZipList.hs</a><br>
> ><br>
> >I'm using the latest version of haskell platform. Are these<br>
> >libraries not production ready,<br>
> >or am I doing something terribly wrong ?<br>
><br>
> I downloaded your program and compiled it (GHC 6.12.1, zip-archive<br>
> 0.1.1.6, bytestring 0.9.1.5). I ran it on the JVM src.zip (20MB,<br>
> ~8000 files) and it sat there for a minute (67s), taking 2.2% memory<br>
> according to top, then completed successfully. Same behaviour with<br>
> -O2. Which compares very badly in time to the instant return when I<br>
> ran unzip -l on the same file, but I didn't see any memory problems.<br>
> Presumably your archive is valid and works with unzip and other<br>
> tools?<br>
><br>
> Thanks,<br>
><br>
> Neil.<br>
><br>
</div></div><div><div></div><div class="h5">> _______________________________________________<br>
> Haskell-Cafe mailing list<br>
> <a href="mailto:Haskell-Cafe@haskell.org">Haskell-Cafe@haskell.org</a><br>
> <a href="http://www.haskell.org/mailman/listinfo/haskell-cafe" target="_blank">http://www.haskell.org/mailman/listinfo/haskell-cafe</a><br>
</div></div></blockquote></div><br><br clear="all"><br>-- <br>Pieter Laeremans <<a href="mailto:pieter@laeremans.org">pieter@laeremans.org</a>><br><br>"The future is here. It's just not evenly distributed yet." W. Gibson<br>
</div>