[Haskell-cafe] parsing machine-generated natural text

Evan Martin martine at danga.com
Fri May 19 21:35:15 EDT 2006


For a toy project I want to parse the output of a program.  The
program runs on someone else's machine and mails me the results, so I
only have access to the output it generates.

Unfortunately, the output is intended to be human-readable, and this
makes parsing it a bit of a pain.  Here are some sample lines from its
output:

France: Army Marseilles SUPPORT Army Paris -> Burgundy.
Russia: Fleet St Petersburg (south coast) -> Gulf of Bothnia.
England:     4 Supply centers,  3 Units:  Builds   1 unit.
The next phase of 'dip' will be Movement for Fall of 1901.

I've been using Parsec and it's felt rather complicated.  For example,
a "location" is a series of words, possibly including parentheses,
that ends when the next word is SUPPORT.  And that "Supply centers"
line ends up being code filled with stuff like
"char ':'; skipMany space".
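
To make the complaint concrete, here is a minimal Parsec sketch of the
two cases above.  It is only one possible shape, not the parser I
actually have: the types (Unit, Order, Supply) and the helpers (lexeme,
symbol, number, locWord) are names invented for this example, and the
grammar is guessed from the four sample lines only, so other order
types or wordings are not handled.  The idea is to push the
"skipMany space" noise into a single lexeme helper and to treat
SUPPORT and "->" as words that end a location.

```haskell
import Data.Char (isSpace)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical result types, invented for this sketch.
data Unit = Army | Fleet deriving (Show, Eq)

data Order
  = Move Unit String String     -- unit, from, to
  | Support Unit String Order   -- supporting unit, its location, supported order
  deriving (Show, Eq)

data Supply = Supply
  { power   :: String
  , centers :: Int
  , units   :: Int
  , builds  :: Int
  } deriving (Show, Eq)

-- Run a parser, then swallow any trailing whitespace.
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

-- A fixed piece of text followed by optional whitespace.
symbol :: String -> Parser String
symbol = lexeme . try . string

number :: Parser Int
number = lexeme (read <$> many1 digit)

-- One word of a location name.  Backtracks (via try) when the word is
-- a keyword, so the enclosing many1 stops cleanly in front of it.
locWord :: Parser String
locWord = lexeme . try $ do
  w <- many1 (satisfy (\c -> not (isSpace c) && c `notElem` ".:"))
  if w `elem` ["SUPPORT", "->"] then unexpected w else return w

-- "St Petersburg (south coast)" comes out as a single string.
location :: Parser String
location = unwords <$> many1 locWord

unit :: Parser Unit
unit = (Army <$ symbol "Army") <|> (Fleet <$ symbol "Fleet")

order :: Parser Order
order = do
  u    <- unit
  from <- location
  (Move u from <$> (symbol "->" *> location))
    <|> (Support u from <$> (symbol "SUPPORT" *> order))

powerName :: Parser String
powerName = lexeme (many1 letter) <* symbol ":"

-- e.g. "France: Army Marseilles SUPPORT Army Paris -> Burgundy."
orderLine :: Parser (String, Order)
orderLine = (,) <$> powerName <*> order <* symbol "."

-- e.g. "England:     4 Supply centers,  3 Units:  Builds   1 unit."
supplyLine :: Parser Supply
supplyLine = Supply
  <$> powerName
  <*> number <* symbol "Supply" <* symbol "centers,"
  <*> number <* symbol "Units:"
  <*> (symbol "Builds" *> number) <* symbol "unit"
```

With these definitions, parse orderLine "" applied to the France line
should give a Support order wrapping the Paris -> Burgundy move, and
parse supplyLine "" applied to the England line should give the counts
4, 3 and 1.  It is still noticeably more code than a couple of regular
expressions.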

I also have a separate parser, written in JavaScript with a bunch of
regular expressions, and it's far shorter than my Haskell one.  That
makes sense: munging this sort of text feels to me more like a regexp
job than a careful parsing job.
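
For comparison, the "munging" style can be roughly imitated in Haskell
without a regex library.  The sketch below is not a translation of the
JavaScript parser; it swaps in plain list functions from base (break,
span, isInfixOf) for the regular expressions, and the function names
are made up for this example.  It handles only the "Supply centers"
line, and it assumes the fixed wording seen in the sample.

```haskell
import Data.Char (isDigit)
import Data.List (isInfixOf)

-- Split "Power: rest" at the first colon and drop the padding after it.
splitPower :: String -> Maybe (String, String)
splitPower line = case break (== ':') line of
  (p, ':' : rest) | not (null p) -> Just (p, dropWhile (== ' ') rest)
  _                              -> Nothing

-- Collect every run of digits, roughly what a global \d+ match returns.
digitRuns :: String -> [Int]
digitRuns s = case dropWhile (not . isDigit) s of
  "" -> []
  s' -> let (ds, rest) = span isDigit s' in read ds : digitRuns rest

-- Recognise a supply summary by its wording and pull out the three
-- counts: supply centers, units, builds.
supplyCounts :: String -> Maybe (String, Int, Int, Int)
supplyCounts line = do
  (p, rest) <- splitPower line
  case (digitRuns rest, "Supply centers" `isInfixOf` rest) of
    ([c, u, b], True) -> Just (p, c, u, b)
    _                 -> Nothing
```

This is shorter than the combinator version because it ignores most of
the line, which is also why it is less careful: any line with a colon,
the phrase "Supply centers" and exactly three numbers would be
accepted.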

I'm considering writing a preprocessing stage in Ruby or Perl that
munges those output lines into something a bit more
"machine-readable", but before I do that I thought I'd ask here
whether anyone has any pointers, hints, or better ideas.

