[Haskell-cafe] NLP libraries and tools?

Aleksandar Dimitrov aleks.dimitrov at googlemail.com
Thu Jul 7 09:38:00 CEST 2011


On Wed, Jul 06, 2011 at 07:27:10PM -0700, wren ng thornton wrote:
> I definitely agree with the iteratees comment, but I'm curious about the
> leaks you mention. I haven't run into leakiness issues (that I'm aware of)
> in my use of ByteStrings for NLP.

The issue is this: a strict ByteString sliced out of a larger chunk keeps a
pointer to that chunk's buffer, so the whole chunk stays alive for as long as
the slice does. The chunk is probably bigger than you'd want to keep in memory
if you, say, only wanted to keep one or two words. In my case, the chunk was
some 65K (that was my Iteratee chunk size).
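
A minimal sketch of the usual workaround, assuming the bytestring package:
copy each word you want to keep into its own buffer with `copy`, so that it
no longer points into the big chunk. The names `wordsCopied` and `chunk` are
made up for illustration, and the chunk here is only a stand-in for whatever
an iteratee would hand over.

```haskell
import qualified Data.ByteString.Char8 as B

-- | Split a chunk into words and copy each word into a fresh buffer,
-- so the result does not point into the (much larger) source chunk.
wordsCopied :: B.ByteString -> [B.ByteString]
wordsCopied = map B.copy . B.words

main :: IO ()
main = do
  -- Hypothetical stand-in for one ~65K chunk of corpus text.
  let chunk  = B.pack (unwords (replicate 10000 "corpus"))
      shared = take 2 (B.words chunk)      -- slices into the chunk's buffer
      copied = take 2 (wordsCopied chunk)  -- independent buffers once evaluated
  print (shared == copied)     -- same contents either way
  print (map B.length copied)  -- each copy holds only its own bytes
```

The copies are made lazily here, so the chunk is only released once the
copied words have actually been evaluated (for example by putting them into a
strict container).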

There's a thread about it here, where I was fairly desperate in trying to find a
solution to space-behaviour I couldn't understand at all: http://bit.ly/rharIV

The thread is fairly big, and in the aftermath Johan Tibell wrote two very
nice posts about the memory consumption of his unordered-containers (which I
found invaluable) and of common data types:
blog: http://blog.johantibell.com/

But I think, with today's RAM, this only shows if you try to train models on
huge corpora, like Baroni et al.'s *WAC corpora (which I was using.)

Regards,
Aleks

PS: Another nice thing about iteratees is that writing attoparsec parsers is
often easy, bordering on trivial, and that one can transform them into
iteratees. No need to write your own parsing iteratee (which can be a bit of a
pain in the butt because of all the continuations and the… sometimes
idiosyncratic documentation that I just couldn't wrap my head around. Might also
just be me, though.)
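
To give an idea of how small such a parser can be, here is a sketch of a
whitespace tokenizer, assuming the attoparsec package with the module names of
its current releases. The names `token` and `tokens` are made up for
illustration, and the parser is run directly with `parseOnly` rather than
through an iteratee.

```haskell
import Control.Applicative (many)
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.ByteString.Char8 as B

-- | One token: a non-empty run of non-whitespace characters.
token :: A.Parser B.ByteString
token = A.takeWhile1 (not . A.isSpace)

-- | Zero or more tokens separated (and optionally surrounded) by whitespace.
tokens :: A.Parser [B.ByteString]
tokens = many (A.skipSpace *> token) <* A.skipSpace

main :: IO ()
main =
  -- prints: Right ["the","quick","brown","fox"]
  print (A.parseOnly tokens (B.pack "  the quick  brown fox "))
```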