[Haskell-cafe] Unescaping with HaXmL (or anything else!)

Yitzchak Gale gale at sefer.org
Tue Apr 1 11:12:44 EDT 2008


On Fri, Mar 28, 2008 at 4:26 AM, Anton van Straaten wrote:
> I want to unescape an encoded XML or HTML string, e.g. converting &quot;
>  to the quote character, etc.
>  Since I'm using HaXml anyway, I tried using xmlUnEscapeContent with no
>  luck

Hi Anton,

I only noticed your post today, sorry for the delay.

I also need this. In fact, it seems to me that it would be
generally useful. I hope that simple functions to escape/unescape
a string will be added to the API.
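
To give a rough idea of what such a function might look like, here is a
minimal sketch that uses only base and does not go through HaXml at all.
It assumes that only the five predefined XML entities and numeric
character references need handling; anything else is left untouched.
The names unescapeBasic, decode and fromCode are made up for this
illustration and are not part of the HaXml API.

```haskell
import Data.Char (chr, isDigit, isHexDigit)
import Numeric (readHex)

-- | Hypothetical helper (not part of HaXml): unescape the five predefined
-- XML entities and numeric character references. Unknown or malformed
-- references are copied through unchanged.
unescapeBasic :: String -> String
unescapeBasic [] = []
unescapeBasic ('&' : rest) =
  case break (== ';') rest of
    -- A complete "&name;" reference that we know how to decode.
    (name, ';' : after) | Just c <- decode name -> c : unescapeBasic after
    -- Anything else: keep the ampersand and carry on.
    _ -> '&' : unescapeBasic rest
unescapeBasic (c : rest) = c : unescapeBasic rest

-- | Decode the text between '&' and ';', if it is a reference we support.
decode :: String -> Maybe Char
decode "lt"   = Just '<'
decode "gt"   = Just '>'
decode "amp"  = Just '&'
decode "quot" = Just '"'
decode "apos" = Just '\''
decode ('#' : 'x' : hex)
  | not (null hex), all isHexDigit hex = fromCode (fst (head (readHex hex)))
decode ('#' : dec)
  | not (null dec), all isDigit dec = fromCode (read dec)
decode _ = Nothing

-- | Turn a code point into a Char, rejecting values outside the Char range.
fromCode :: Integer -> Maybe Char
fromCode n
  | n >= 0 && n <= 0x10FFFF = Just (chr (fromInteger n))
  | otherwise               = Nothing
```

This only covers plain strings. It knows nothing about DTD-defined
entities or CDATA sections, which is where going through a real XML
parser such as HaXml still matters.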

In the meantime, you are right that it is a bit tricky
to do this in HaXml. Besides the wrappers that you found
to be needed, there are two other issues:

One issue is that you need to lex and then parse the text first.
If you tell HaXml that your string is a CString, it
will believe you and just use the text the way it is without
any further processing.

The other issue is that HaXml's lexer currently can only
deal with XML content that begins with an XML tag. (I've
pointed this out to Malcolm Wallace, the author of HaXml.)
So in order to use it, you need to wrap your content in a
tag and then unwrap it after parsing.

The code below works for me (obviously it would be better to
remove the "error" calls):

Regards,
Yitz

import Text.XML.HaXml
import Text.XML.HaXml.Parse (xmlParseWith, document)
import Text.XML.HaXml.Lex (xmlLex)

-- Unescape a string by round-tripping it through HaXml: wrap it in a
-- dummy tag, lex, parse, unwrap the tag, then unescape the content.
unEscapeXML :: String -> String
unEscapeXML = concatMap ctext . xmlUnEscapeContent stdXmlEscaper .
              unwrapTag .
              either error id . fst . xmlParseWith document .
              -- the first argument of xmlLex is the source name that
              -- HaXml reports in lexer and parser error messages
              xmlLex "oops, lexer failed" . wrapWithTag "t"
  where
    ctext (CString _ txt _)         = txt
    ctext (CRef (RefEntity name) _) = '&' : name ++ ";" -- skipped by escaper
    ctext (CRef (RefChar num) _)    = '&' : '#' : show num ++ ";" -- ditto
    ctext _                         = error "oops, can't unescape non-cdata"
    wrapWithTag t s = concat ["<", t, ">", s, "</", t, ">"]
    unwrapTag (Document _ _ (Elem _ _ c) _) = c
    unwrapTag _                             = error "oops, not wrapped"

