[Haskell-cafe] NLP libraries and tools?

Aleksandar Dimitrov aleks.dimitrov at googlemail.com
Wed Jul 6 23:58:49 CEST 2011

On Wed, Jul 06, 2011 at 09:32:27AM -0700, wren ng thornton wrote:
> On 7/6/11 9:27 AM, Dmitri O.Kondratiev wrote:
> > Hi,
> > Continuing my search of Haskell NLP tools and libs, I wonder if the
> > following Haskell libraries exist (googling them does not help):
> > 1) End of Sentence (EOS) Detection. Break text into a collection of
> > meaningful sentences.
> Depending on how you mean, this is either fairly trivial (for English) or
> an ill-defined problem. For things like determining whether the "."
> character is intended as a full stop vs part of an abbreviation; that's
> trivial.

I disagree. It's not trivial in the sense of being a solved problem. It is
trivial only in the sense that one would usually take a list of known
abbreviations and just compare. That, however, only says that the most common
approach is trivial, not that the problem is.

There are cases where, for example, an abbreviation and a full stop will
coincide. In these cases, you'll often need full-blown parsing or at least a
well-trained maxent classifier.
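
To make that concrete, here is a minimal sketch of the list-based approach, and
of the case where it breaks. Everything in it is made up for illustration: the
names (knownAbbreviations, endsSentence, splitSentences) and the tiny
abbreviation list are assumptions of mine, not taken from any existing library.

```haskell
import Data.Char (toLower)

-- Hypothetical, deliberately tiny abbreviation list (an assumption for this
-- sketch); a usable list would be far longer and language-specific.
knownAbbreviations :: [String]
knownAbbreviations = ["dr.", "mr.", "mrs.", "etc.", "e.g.", "i.e."]

-- A token closes a sentence if it ends in '.', '!' or '?' and is not,
-- ignoring case, one of the known abbreviations.
endsSentence :: String -> Bool
endsSentence tok =
  not (null tok)
    && last tok `elem` ".!?"
    && map toLower tok `notElem` knownAbbreviations

-- Split on whitespace, then cut after every token that closes a sentence.
-- Any trailing tokens without final punctuation form a last sentence.
splitSentences :: String -> [String]
splitSentences = go [] . words
  where
    go acc [] = [unwords (reverse acc) | not (null acc)]
    go acc (t : ts)
      | endsSentence t = unwords (reverse (t : acc)) : go [] ts
      | otherwise      = go (t : acc) ts
```

With this, splitSentences "Dr. Smith arrived. He sat down." gives the two
sentences you'd expect. But "We bought apples, pears, etc. Then we left." comes
back as a single sentence, because "etc." is both an abbreviation and the end
of the sentence, and a plain list lookup cannot tell the difference.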

There are other problems: ordinals, acronyms which themselves also have periods
in them, weird names (like Yahoo!) and initials, to name a few. This is only for
English and similar languages, mind you.

> But for general sentence breaking, how do you intend to deal with
> quotations? What about when news articles quote someone uttering a few
> sentences before the end-quote marker? So far as I'm aware, there's no
> satisfactory definition of what the solution should be in all reasonable
> cases. A "sentence" isn't really very well-defined in practice.

As long as you have one routine and stick to it, you don't need a formal
definition every linguist will agree on. Computational linguists (and their
tools), more often than not, just need a dependable solution, not a correct one.

> > 2) Part-of-Speech (POS) Tagging. Assign part-of-speech information to each
> > token.
> There are numerous approaches to this problem; do you care about the
> solution, or will any one of them suffice?
> I've been working over the last year+ on an optimized HMM-based POS
> tagger/supertagger with online tagging and anytime n-best tagging. I'm
> planning to release it this summer (i.e., by the end of August), though
> there are a few things I'd like to polish up before doing so. In
> particular, I want to make the package less monolithic. When I release it
> I'll make announcements here and on the nlp@ list.

I'm very interested in your progress! Keep us posted :-)
