[Haskell-cafe] Extracting structured data in XML into records

Johan Tibell johan.tibell at gmail.com
Sat Feb 24 03:22:12 EST 2007


Hi!

I'm trying to extract HCards (http://microformats.org/wiki/hcard) from
HTML documents. HCard is a microformat. Microformats is an attempt to
add semantic information to XML documents without adding any new tags.
This is done by adding semantic information in class attributes
instead (see the 'testXml' string below).

I'm trying to find a good way to extract HCards into Haskell records.
To do this I need to map XML elements with certain attribute values
onto record fields. Some of the elements are optional in the XML and I
represent that using Maybe fields in my record. The order of the
elements is not guaranteed only the way they are nested. This makes it
more difficult to first extract the fields I want into a list of
Strings and then map that onto my record since I need to tag each
string with the value it represents.

So my question is. How can I write the function
'extractElementsIntoRecords' below. Or, perhaps HXT is the wrong tool
for the job and I should be trying to walk the DOM tree instead?

> module HCard where
>
> import Text.XML.HXT.Arrow
>
> data HCard = HCard
>     {
>       familyName :: String,
>       givenName :: String
>       org :: Maybe String
>       url :: Maybe String
>     } deriving Show
>
> parseHCards xml = runX $ parseXml xml
>
> parseXml xml =
>     readString [(a_parse_html, v_1)] xml >>>
>     deep (hasClassName "vcard") >>>
>     extractElementsIntoRecords
>
> extractElementsIntoRecords = undefined
>
> hasClassName s = hasAttrValue "class" (elem s . words)
>
> testXml =
>     "<div class=\"vcard red\">" ++
>     " <a class=\"url\" href=\"http://tantek.com/\">" ++
>     "  <span class=\"n\" style=\"display:none\">" ++
>     "   <span class=\"family-name\">Çelik</span>" ++
>     "   <span class=\"given-name\">Tantek</span>" ++
>     "  </span>" ++
>     "  <span class=\"fn\">Tantek Çelik</span>" ++
>     " </a>" ++
>     " <div class=\"org\">" ++
>     "  <span class=\"organization-name\">Technorati</span>" ++
>     " </div>" ++
>     "</div>"
>
> main = parseHCards textXml

Cheers,

Johan Tibell


More information about the Haskell-Cafe mailing list