HXT
From HaskellWiki
| (19 intermediate revisions not shown.) | |||
| Line 3: | Line 3: | ||
[[Category:Tools]] | [[Category:Tools]] | ||
[[Category:Tutorials]] | [[Category:Tutorials]] | ||
| + | [[Category:Libraries]] | ||
== A gentle introduction to the Haskell XML Toolbox == | == A gentle introduction to the Haskell XML Toolbox == | ||
| Line 34: | Line 35: | ||
All packages are available on hackage. | All packages are available on hackage. | ||
| - | ;[http://hackage.haskell.org/package/hxt hxt]:The package [http://hackage.haskell.org/package/hxt hxt] forms the core of the toolbox. It contains a validating XML parser and a HTML parser, which tries to read any text as HTML, a DSL for processing, | + | ;[http://hackage.haskell.org/package/hxt hxt]:The package [http://hackage.haskell.org/package/hxt hxt] forms the core of the toolbox. It contains a validating XML parser, an HTML parser that tries to read any text as HTML, a DSL for processing, transforming and generating XML/HTML, and so-called picklers for conversion between XML and native Haskell data. |
| + | ;[http://hackage.haskell.org/package/HandsomeSoup HandsomeSoup]: HandsomeSoup adds CSS selectors to HXT. | ||
;[http://hackage.haskell.org/package/hxt-http hxt-http]: Native HTTP support is contained in [http://hackage.haskell.org/package/hxt-http hxt-http] and depends on package [http://hackage.haskell.org/package/HTTP HTTP]. | ;[http://hackage.haskell.org/package/hxt-http hxt-http]: Native HTTP support is contained in [http://hackage.haskell.org/package/hxt-http hxt-http] and depends on package [http://hackage.haskell.org/package/HTTP HTTP]. | ||
;[http://hackage.haskell.org/package/hxt-curl hxt-curl]:HTTP support via libCurl and package [http://hackage.haskell.org/package/curl curl] is in [http://hackage.haskell.org/package/hxt-curl hxt-curl]. | ;[http://hackage.haskell.org/package/hxt-curl hxt-curl]:HTTP support via libCurl and package [http://hackage.haskell.org/package/curl curl] is in [http://hackage.haskell.org/package/hxt-curl hxt-curl]. | ||
| Line 42: | Line 44: | ||
;[http://hackage.haskell.org/package/hxt-relaxng hxt-relaxng]: The XPath-, XSLT- and RelaxNG-extensions are separated into [http://hackage.haskell.org/package/hxt-xpath hxt-xpath], [http://hackage.haskell.org/package/hxt-xslt hxt-xslt] and [http://hackage.haskell.org/package/hxt-relaxng hxt-relaxng]. | ;[http://hackage.haskell.org/package/hxt-relaxng hxt-relaxng]: The XPath-, XSLT- and RelaxNG-extensions are separated into [http://hackage.haskell.org/package/hxt-xpath hxt-xpath], [http://hackage.haskell.org/package/hxt-xslt hxt-xslt] and [http://hackage.haskell.org/package/hxt-relaxng hxt-relaxng]. | ||
;Basic packages:There are some basic functionalities that are not only of interest for HXT but can also be useful for other, non-XML/HTML-related projects. These have been separated too. | ;Basic packages:There are some basic functionalities that are not only of interest for HXT but can also be useful for other, non-XML/HTML-related projects. These have been separated too. | ||
| - | ;[http://hackage.haskell.org/package/hxt-charproperties hxt-charproperties]: | + | ;[http://hackage.haskell.org/package/hxt-charproperties hxt-charproperties]: defines XML- and Unicode character class properties. |
| - | ;[http://hackage.haskell.org/package/hxt-unicode hxt-unicode]: | + | ;[http://hackage.haskell.org/package/hxt-unicode hxt-unicode]:contains decoding functions from various encoding schemes to Unicode. Unlike most of those available on hackage, these functions are lazy even in the case of encoding errors (thanks to Henning Thielemann). |
| - | ;[http://hackage.haskell.org/package/hxt-regex-xmlschema hxt-regex-xmlschema]: | + | ;[http://hackage.haskell.org/package/hxt-regex-xmlschema hxt-regex-xmlschema]: contains a lightweight and efficient regex-library. There is full Unicode support, the standard syntax defined in the XML-Schema doc is supported, and there are extensions available for intersection, difference, exclusive OR. The package is self contained, no other regex library is required. The Wiki page [[Regular expressions for XML Schema]] describes the theory behind this regex library and the extensions and gives some usage examples. |
| + | ;[http://hackage.haskell.org/package/hxt-cache hxt-cache]: A cache for storing parsed XML/HTML pages in binary form. This is used in the Holumbus search engine framework and the Hayoo! API search for speeding up the repeated indexing of pages. | ||
=== Installation === | === Installation === | ||
When installing hxt with cabal, one does not have to deal with all the | When installing hxt with cabal, one does not have to deal with all the | ||
| - | basic packages. Just a <code>cabal install hxt</code> does the work for the core toolbox. When HTTP access is | + | basic packages. Just a |
| + | |||
| + | <code>cabal install hxt</code> | ||
| + | |||
| + | does the work for the core toolbox. When HTTP access is required, install at least one of | ||
the packages hxt-curl or hxt-http. All other packages can be installed | the packages hxt-curl or hxt-http. All other packages can be installed | ||
on demand any time later. | on demand any time later. | ||
| + | |||
| + | === Upgrade from HXT versions < 9.0 === | ||
| + | |||
| + | HXT-9 is not backward compatible. The splitting into smaller | ||
| + | packages required some internal reorganisation and changes of some type | ||
| + | declarations. | ||
| + | To use the main features of the core package, add an | ||
| + | |||
| + | <haskell> | ||
| + | import Text.XML.HXT.Core | ||
| + | </haskell> | ||
| + | |||
| + | to your sources, instead of <hask>Text.XML.HXT.Arrow</hask>. | ||
| + | |||
| + | The second major change concerns configuration and option handling. | ||
| + | This was done previously by lists of key-value pairs implemented as strings. | ||
| + | The growing number of options and the untyped option values have led to | ||
| + | unreliable code. With HXT-9, options are represented by functions with | ||
| + | type-safe argument types instead of strings. Code that sets options has to be | ||
| + | modified when switching to the new version. | ||
| + | |||
| + | Examples [[#copyXML|copyXML]] and | ||
| + | [[#Pattern for a main program|Pattern for a main program]] show | ||
| + | the new form of options. | ||
== The basic concepts == | == The basic concepts == | ||
| Line 92: | Line 123: | ||
We will do this abstraction later, when introducing arrows. Many of the functions in the following motivating examples can be generalised this way. But for getting the idea, the <hask>XmlFilter</hask> is sufficient. | We will do this abstraction later, when introducing arrows. Many of the functions in the following motivating examples can be generalised this way. But for getting the idea, the <hask>XmlFilter</hask> is sufficient. | ||
| - | The filter functions are used so frequently, that the idea of defining a domain specific language with filters as the basic processing units comes up. In such a DSL the basic filters are predicates, selectors, constructors and transformers, all working on the HXT DOM tree structure. For a DSL it becomes necessary to define an appropriate set of combinators for building more complex functions from simpler ones. Of course filter composition, like (.), becomes one of the most frequently used combinators. There are more complex filters for traversal of a whole tree and selection or transformation of several nodes. We will see a few first examples in the following part. | + | The filter functions are used so frequently, that the idea of defining a domain specific language with filters as the basic processing units comes up. In such a DSL the basic filters are predicates, selectors, constructors and transformers, all working on the HXT DOM tree structure. For a DSL it becomes necessary to define an appropriate set of combinators for building more complex functions from simpler ones. Of course filter composition, like <hask>(.)</hask>, becomes one of the most frequently used combinators. There are more complex filters for traversal of a whole tree and selection or transformation of several nodes. We will see a few first examples in the following part. |
The first task is to build filters from pure functions, to define a lift operator. Pure functions are lifted to filters in the following way: | The first task is to build filters from pure functions, to define a lift operator. Pure functions are lifted to filters in the following way: | ||
| Line 290: | Line 321: | ||
=== Arrows === | === Arrows === | ||
| - | We've already seen, that the filters <hask>a -> [b]</hask> are a very | + | We've already seen that the filters <hask>a -> [b]</hask> are a very powerful and sometimes more elegant way to process XML than pure functions. This is the good news. The bad news is that filters are not general enough. Of course we sometimes want to do some I/O, and we want to stay on the filter level. So we need something like |
| - | powerful and sometimes a more elegant way to process XML than pure | + | |
| - | function. This is the good news. The bad news is, that filter are not | + | |
| - | general enough. Of course we sometimes want to do some I/O and we want | + | |
| - | to stay in the filter level. So we need something like | + | |
<haskell> | <haskell> | ||
| Line 302: | Line 329: | ||
for working in the IO monad. | for working in the IO monad. | ||
| - | Sometimes it's appropriate to thread some state through the computation | + | Sometimes it's appropriate to thread some state through the computation like in state monads. This leads to a type like |
| - | like in state monads. This leads to a type like | + | |
<haskell> | <haskell> | ||
| Line 309: | Line 335: | ||
</haskell> | </haskell> | ||
| - | And in real world applications we need both extensions at the same | + | And in real world applications we need both extensions at the same time. Of course I/O is necessary but usually there are also some global options and variables for controlling the computations. In HXT, for instance there are variables for controlling trace output, options for setting the default encoding scheme for input data and a base URI for accessing documents, which are addressed in a content or in a DTD part by relative URIs. So we need something like |
| - | time. Of course I/O is necessary but usually there are also some | + | |
| - | global options and variables for controlling the computations. In HXT, | + | |
| - | for instance there are variables for controlling trace output, options | + | |
| - | for setting the default encoding scheme for input data and a base URI | + | |
| - | for accessing documents, which are addressed in a content or in a DTD | + | |
| - | part by relative URIs. So we need something like | + | |
<haskell> | <haskell> | ||
| Line 321: | Line 341: | ||
</haskell> | </haskell> | ||
| - | We want to work with all four filter variants, and in the future | + | We want to work with all four filter variants, and in the future perhaps with even more general filters, but of course not with four sets of filter names, e.g. <hask>deep, deepST, deepIO, deepIOST</hask>. |
| - | perhaps with even more general filters, but of course not with four | + | |
| - | sets of filter names, e.g. <hask>deep, deepST, deepIO, deepIOST</hask>. | + | |
| - | This is the point where <hask>newtype</hask>s and <hask>class</hask>es | + | This is the point where <hask>newtype</hask>s and <hask>class</hask>es come in. Classes are needed for overloading names and <hask>newtype</hask>s are needed to declare instances. Furthermore, the restriction of <hask>XmlTree</hask> as argument and result type is not necessary and hinders reuse in many cases. |
| - | come in. Classes are needed for overloading names and | + | |
| - | <hask>newtype</hask>s are needed to declare instances. Further the | + | |
| - | restriction of <hask>XmlTree</hask> as argument and result type is | + | |
| - | not neccessary and hinders reuse in many cases. | + | |
| - | A filter discussed above has all features of an arrow. Arrows are | + | A filter discussed above has all features of an arrow. Arrows are introduced for generalising the concept of functions and function combination to more general kinds of computation than pure functions. |
| - | introduced for generalising the concept of functions and function | + | |
| - | combination to more general kinds of computation than pure functions. | + | |
| - | A basic set of combinators for arrows is defined in the classes in the | + | A basic set of combinators for arrows is defined in the classes in the <hask>Control.Arrow</hask> module, containing the above mentioned <hask>(>>>), (<+>), arr</hask>. |
| - | <hask>Control.Arrow</hask> module, containing the above mentioned <hask>(>>>), (<+>), arr</hask>. | + | |
| - | In HXT the additional classes for filters working with lists as result type are | + | In HXT the additional classes for filters working with lists as result type are defined in <hask>Control.Arrow.ArrowList</hask>. The choice operators are in <hask>Control.Arrow.ArrowIf</hask>, tree filters, like <hask>getChildren, deep, multi, ...</hask> in <hask>Control.Arrow.ArrowTree</hask> and the elementary XML specific filters in <hask>Text.XML.HXT.XmlArrow</hask>. |
| - | defined in <hask>Control.Arrow.ArrowList</hask>. The choice operators are | + | |
| - | in <hask>Control.Arrow.ArrowIf</hask>, tree filters, like <hask>getChildren, deep, multi, ...</hask> in | + | |
| - | <hask>Control.Arrow.ArrowTree</hask> and the elementary XML specific | + | |
| - | filters in <hask>Text.XML.HXT.XmlArrow</hask>. | + | |
| - | In HXT there are four types instantiated with these classes for | + | In HXT there are four types instantiated with these classes for pure list arrows, list arrows with a state, list arrows with IO and list arrows with a state and IO. |
| - | pure list arrows, list arrows with a state, list arrows with IO | + | |
| - | and list arrows with a state and IO. | + | |
<haskell> | <haskell> | ||
| Line 358: | Line 363: | ||
</haskell> | </haskell> | ||
| - | The first one and the last one are those used most frequently in the | + | The first one and the last one are those used most frequently in the toolbox, and of course there are lifting functions for converting general arrows into more specific arrows. |
| - | toolbox, and of course there are lifting functions for converting | + | |
| - | general arrows into more specific arrows. | + | |
| - | Don't worry about all these conceptual details. Let's have a look into some | + | Don't worry about all these conceptual details. Let's have a look into some ''Hello world'' examples. |
| - | ''Hello world'' examples. | + | |
== Getting started: Hello world examples == | == Getting started: Hello world examples == | ||
| Line 369: | Line 371: | ||
=== copyXML === | === copyXML === | ||
| - | The first complete example is a program for | + | The first complete example is a program for copying an XML document |
| - | copying an XML document | + | |
<haskell> | <haskell> | ||
| Line 397: | Line 398: | ||
</haskell> | </haskell> | ||
| - | The interesting part of this example is | + | The interesting part of this example is the call of <hask>runX</hask>. <hask>runX</hask> executes an arrow. This arrow is one of the more powerful list arrows with IO and a HXT system state. |
| - | the call of <hask>runX</hask>. <hask>runX</hask> executes an | + | |
| - | arrow. This arrow is one of the more powerful list arrows with IO and | + | |
| - | a HXT system state. | + | |
| - | The arrow itself is a composition of <hask>readDocument</hask> and | + | The arrow itself is a composition of <hask>readDocument</hask> and <hask>writeDocument</hask>. |
| - | <hask>writeDocument</hask>. | + | |
| - | <hask>readDocument</hask> is an arrow for reading, DTD processing and | + | <hask>readDocument</hask> is an arrow for reading, DTD processing and validation of documents. Its behaviour can be controlled by a list of system options. Here we turn off the validation step. The <hask>src</hask>, a file name or an URI is read and parsed and a document tree is built. |
| - | validation of documents. Its behaviour can be controlled by a list of | + | |
| - | system options. Here we turn off the validation step. The <hask>src</hask>, a file | + | |
| - | name or an URI is read and parsed and a document tree is built. | + | |
| - | The input option <hask>withCurl []</hask> enables reading via HTTP. | + | The input option <hask>withCurl []</hask> enables reading via HTTP. For using this option, the extra package hxt-curl must be installed, and <hask>withCurl</hask> must be imported by <hask>import Text.XML.HXT.Curl</hask>. |
| - | For using this option, the extra package hxt-curl must be installed, | + | If only file access is necessary, this option and the import can be dropped. In that case the program does not depend on the libCurl binding. |
| - | and <hask>withCurl</hask> must be imported by | + | |
| - | <hask>import Text.XML.HXT.Curl</hask>. | + | |
| - | If only file access is necessary, this option and the import | + | |
| - | can be dropped. In that case the program does not depend on | + | |
| - | the libCurl binding. | + | |
| - | The | + | The tree read in is ''piped'' into the output arrow. This one again is controlled by a set of system options. The <hask>withIndent</hask> option controls the output formatting (here indentation is switched on), and <hask>withOutputEncoding</hask> sets the output encoding to ISO Latin1. |
| - | tree read in is ''piped'' into the output arrow. This one again is | + | |
| - | controlled by a set of system options. The <hask>withIndent</hask> option | + | |
| - | controlls the output formatting, here indentation is switche on, | + | |
| - | the <hask>withOutputEncoding</hask> | + | |
| - | is set to IOS Latin1 | + | |
| - | + | ||
| - | + | ||
| - | We've omitted here the boring stuff of option parsing and error | + | <hask>writeDocument</hask> converts the tree into a string and writes it to the <hask>dst</hask>. |
| - | handling. | + | |
| + | We've omitted here the boring stuff of option parsing and error handling. | ||
Compilation and a test run look like this: | Compilation and a test run look like this: | ||
| Line 444: | Line 427: | ||
</pre> | </pre> | ||
| - | The mini XML document in file <tt>hello.xml</tt> is read and | + | The mini XML document in file <tt>hello.xml</tt> is read and a document tree is built. Then this tree is converted into a string and written to standard output (filename: <tt>-</tt>). It is decorated with an XML declaration containing the version and the output encoding. |
| - | a document tree is built. Then this tree is converted into a string | + | |
| - | and written to standard output (filename: <tt>-</tt>). It is decorated | + | |
| - | with an XML declaration containing the version and the output | + | |
| - | encoding. | + | |
| - | For processing HTML documents there is a HTML parser, which tries to | + | For processing HTML documents there is a HTML parser, which tries to parse and interpret rather anything as HTML. The HTML parser can be selected by calling |
| - | parse and interpret rather anything as HTML. The HTML parser can be | + | |
| - | selected by calling | + | |
<hask>readDocument [withParseHTML yes, ...]</hask> | <hask>readDocument [withParseHTML yes, ...]</hask> | ||
| - | The available read and write options can be found in the hxt | + | The available read and write options can be found in the hxt module <hask>Text.XML.HXT.Arrow.XmlState.SystemConfig</hask> |
| - | module <hask>Text.XML.HXT.Arrow.XmlState.SystemConfig</hask> | + | |
=== Pattern for a main program === | === Pattern for a main program === | ||
| Line 469: | Line 445: | ||
import Text.XML.HXT.Core | import Text.XML.HXT.Core | ||
| + | import Text.XML.HXT.... -- further HXT packages | ||
import System.IO | import System.IO | ||
| Line 502: | Line 479: | ||
processChildren (processDocumentRootElement `when` isElem) -- (1) | processChildren (processDocumentRootElement `when` isElem) -- (1) | ||
>>> | >>> | ||
| - | writeDocument [] dst | + | writeDocument [] dst -- (3) |
>>> | >>> | ||
getErrStatus | getErrStatus | ||
| Line 520: | Line 497: | ||
In line (0) the system is configured with the list of options. | In line (0) the system is configured with the list of options. | ||
These options are then used as defaults for all read and write operations. | These options are then used as defaults for all read and write operations. | ||
| - | The options can be overwritten for single read | + | The options can be overwritten for single read/write calls |
| - | by putting config options into the parameter list of the calls. | + | by putting config options into the parameter list of the |
| + | read/write function calls. | ||
The interesting line is (1). | The interesting line is (1). | ||
| Line 544: | Line 522: | ||
The structure of the internal document tree can be made visible | The structure of the internal document tree can be made visible | ||
| - | e.g. by adding the option | + | e.g. by adding the option <hask>withShowTree yes</hask> to the |
| - | <hask>writeDocument</hask> arrow. This will emit the tree in a readable | + | <hask>writeDocument</hask> arrow in (3). |
| + | This will emit the tree in a readable | ||
text representation instead of the real document. | text representation instead of the real document. | ||
In the next section we will give examples for the | In the next section we will give examples for the | ||
<hask>processDocumentRootElement</hask> arrow. | <hask>processDocumentRootElement</hask> arrow. | ||
| + | |||
| + | === Tracing === | ||
| + | |||
| + | There are tracing facilities to observe the actions performed | ||
| + | and to show intermediate results | ||
| + | |||
| + | <haskell> | ||
| + | application :: SysConfigList -> String -> String -> IOSArrow b Int | ||
| + | application cfg src dst | ||
| + | = configSysVars (withTrace 1 : cfg) -- (0) | ||
| + | >>> | ||
| + | traceMsg 1 "start reading document" -- (1) | ||
| + | >>> | ||
| + | readDocument [] src | ||
| + | >>> | ||
| + | traceMsg 1 "document read, start processing" -- (2) | ||
| + | >>> | ||
| + | processChildren (processDocumentRootElement `when` isElem) | ||
| + | >>> | ||
| + | traceMsg 1 "document processed" -- (3) | ||
| + | >>> | ||
| + | writeDocument [] dst | ||
| + | >>> | ||
| + | getErrStatus | ||
| + | </haskell> | ||
| + | |||
| + | In (0) the system trace level is set to 1; at the default level 0 | ||
| + | all trace messages are suppressed. The three trace messages (1)-(3) | ||
| + | will be issued, and readDocument and writeDocument will also | ||
| + | log their activities. | ||
| + | |||
| + | How a whole document and the internal tree structure can be traced, | ||
| + | is shown in the following example | ||
| + | |||
| + | <haskell> | ||
| + | ... | ||
| + | >>> | ||
| + | processChildren (processDocumentRootElement `when` isElem) | ||
| + | >>> | ||
| + | withTraceLevel 4 (traceDoc "resulting document") -- (1) | ||
| + | >>> | ||
| + | ... | ||
| + | </haskell> | ||
| + | |||
| + | In (1) the trace level is locally set to the highest level 4. | ||
| + | traceDoc will then issue the trace message, the document formatted | ||
| + | as XML, and the internal DOM tree of the document. | ||
== Selection examples == | == Selection examples == | ||
| Line 561: | Line 587: | ||
selectAllText :: ArrowXml a => a XmlTree XmlTree | selectAllText :: ArrowXml a => a XmlTree XmlTree | ||
selectAllText | selectAllText | ||
| - | = deep | + | = deep isText |
</haskell> | </haskell> | ||
<hask>deep</hask> traverses the whole tree, stops the traversal when | <hask>deep</hask> traverses the whole tree, stops the traversal when | ||
| - | a node is a text node (<hask> | + | a node is a text node (<hask>isText</hask>) and returns all the text nodes. |
There are two other traversal operators <hask>deepest</hask> and <hask>multi</hask>. | There are two other traversal operators <hask>deepest</hask> and <hask>multi</hask>. | ||
In this case, where the selected nodes are all leaves, these would give the same result. | In this case, where the selected nodes are all leaves, these would give the same result. | ||
| Line 578: | Line 604: | ||
selectAllTextAndAltValues | selectAllTextAndAltValues | ||
= deep | = deep | ||
| - | ( | + | ( isText -- (1) |
<+> | <+> | ||
( isElem >>> hasName "img" -- (2) | ( isElem >>> hasName "img" -- (2) | ||
| Line 601: | Line 627: | ||
selectAllTextAndRealAltValues | selectAllTextAndRealAltValues | ||
= deep | = deep | ||
| - | ( | + | ( isText |
<+> | <+> | ||
( isElem >>> hasName "img" | ( isElem >>> hasName "img" | ||
| Line 658: | Line 684: | ||
root [] [helloWorld] -- (1) | root [] [helloWorld] -- (1) | ||
>>> | >>> | ||
| - | writeDocument [ | + | writeDocument [withIndent yes] "hello.xml" -- (2) |
</haskell> | </haskell> | ||
| Line 666: | Line 692: | ||
<hask>writeDocument</hask> and its variants always expect | <hask>writeDocument</hask> and its variants always expect | ||
a whole document tree with such a root node. Before writing, the document is | a whole document tree with such a root node. Before writing, the document is | ||
| - | indented (<hask> | + | indented (<hask>withIndent yes</hask>) by inserting extra whitespace |
text nodes, and an XML declaration with version and encoding is added. If the indent option is not given, the whole document would appear on a single line: | text nodes, and an XML declaration with version and encoding is added. If the indent option is not given, the whole document would appear on a single line: | ||
| Line 1,257: | Line 1,283: | ||
More complex and complete examples of HXT in action | More complex and complete examples of HXT in action | ||
can be found in [[HXT/Practical]] | can be found in [[HXT/Practical]] | ||
| + | |||
| + | === The Complete Guide To Working With HTML === | ||
| + | |||
| + | Tutorial and Walkthrough: http://adit.io/posts/2012-04-14-working_with_HTML_in_haskell.html | ||
Revision as of 22:19, 24 April 2012
1 A gentle introduction to the Haskell XML Toolbox
The Haskell XML Toolbox (HXT) is a collection of tools for processing XML with Haskell. The core component of the Haskell XML Toolbox is a domain specific language consisting of a set of combinators for processing XML trees in a simple and elegant way. The combinator library is based on the concept of arrows. The main component is a validating and namespace-aware XML parser that supports the XML 1.0 standard almost fully. Extensions are a validator for RelaxNG and an XPath evaluator.
2 Background
The Haskell XML Toolbox is based on the ideas of HaXml and HXML, but introduces a more general approach for processing XML with Haskell. HXT uses a generic data model for representing XML documents, including the DTD subset, entity references, CData parts and processing instructions. This data model makes it possible to use tree transformation functions as a uniform design of XML processing steps from parsing, DTD processing, entity processing, validation, namespace propagation, content processing and output.
HXT has grown over the years. Components for XPath, XSLT, validation with RelaxNG, picklers for conversion from/to native Haskell data, lazy parsing with tagsoup, input via curl and native Haskell HTTP and others have been added. This has led to a rather large package with a lot of dependencies.
To make the toolbox more modular and to reduce the dependencies on other packages, hxt has been split into various smaller packages since version 9.0.0.
3 Resources
3.1 Home Page and Repository
- HXT
- The project home for HXT
- HXT on GitHub
- The git source repository on github for all HXT packages
3.2 Packages
All packages are available on hackage.
- hxt
- The package hxt forms the core of the toolbox. It contains a validating XML parser, an HTML parser that tries to read any text as HTML, a DSL for processing, transforming and generating XML/HTML, and so-called picklers for conversion between XML and native Haskell data.
- HandsomeSoup
- HandsomeSoup adds CSS selectors to HXT.
- hxt-http
- Native HTTP support is contained in hxt-http and depends on package HTTP.
- hxt-curl
- HTTP support via libCurl and package curl is in hxt-curl.
- hxt-tagsoup
- The lazy tagsoup parser can be found in package hxt-tagsoup, only this package depends on Neil Mitchell's tagsoup.
- hxt-xpath
- hxt-xslt
- hxt-relaxng
- The XPath-, XSLT- and RelaxNG-extensions are separated into hxt-xpath, hxt-xslt and hxt-relaxng.
- Basic packages
- There are some basic functionalities that are not only of interest for HXT but can also be useful for other, non-XML/HTML-related projects. These have been separated too.
- hxt-charproperties
- defines XML- and Unicode character class properties.
- hxt-unicode
- contains decoding functions from various encoding schemes to Unicode. Unlike most of those available on hackage, these functions are lazy even in the case of encoding errors (thanks to Henning Thielemann).
- hxt-regex-xmlschema
- contains a lightweight and efficient regex library. There is full Unicode support, the standard syntax defined in the XML-Schema doc is supported, and there are extensions available for intersection, difference and exclusive OR. The package is self-contained; no other regex library is required. The Wiki page Regular expressions for XML Schema describes the theory behind this regex library and the extensions and gives some usage examples.
- hxt-cache
- A cache for storing parsed XML/HTML pages in binary form. This is used in the Holumbus search engine framework and the Hayoo! API search for speeding up the repeated indexing of pages.
3.3 Installation
When installing hxt with cabal, one does not have to deal with all the basic packages. Just a
cabal install hxt
does the work for the core toolbox. When HTTP access is required, install at least one of the packages hxt-curl or hxt-http. All other packages can be installed on demand any time later.
3.4 Upgrade from HXT versions < 9.0
HXT-9 is not backward compatible. The splitting into smaller packages required some internal reorganisation and changes of some type declarations. To use the main features of the core package, add an
import Text.XML.HXT.Core
to your sources, instead of Text.XML.HXT.Arrow.
The second major change concerns configuration and option handling. This was done previously by lists of key-value pairs implemented as strings. The growing number of options and the untyped option values have led to unreliable code. With HXT-9, options are represented by functions with type-safe argument types instead of strings. Code that sets options has to be modified when switching to the new version.
Examples copyXML and Pattern for a main program show the new form of options.
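The difference between the two option styles can be sketched in miniature. The following is only an illustration of the idea, not HXT code: the `Config` record, `fromAttributes`, `setValidate`, `setIndent` and `applyOptions` are hypothetical names invented for this sketch.

```haskell
-- Hypothetical miniature of the two option styles.
-- None of these names are taken from HXT.
data Config = Config
  { cfgValidate :: Bool
  , cfgIndent   :: Bool
  } deriving (Show, Eq)

defaultConfig :: Config
defaultConfig = Config { cfgValidate = True, cfgIndent = False }

-- Old style: untyped key-value pairs. A misspelled key or a
-- malformed value is silently ignored and the default is kept.
type Attributes = [(String, String)]

fromAttributes :: Attributes -> Config
fromAttributes kvs = Config
  { cfgValidate = flag "validate" (cfgValidate defaultConfig)
  , cfgIndent   = flag "indent"   (cfgIndent   defaultConfig)
  }
  where
    flag key def = case lookup key kvs of
      Just "1" -> True
      Just "0" -> False
      _        -> def

-- New style: every option is a function on the configuration,
-- so the compiler checks option names and argument types.
type Option = Config -> Config

setValidate, setIndent :: Bool -> Option
setValidate b c = c { cfgValidate = b }
setIndent   b c = c { cfgIndent   = b }

-- Apply the options from left to right to the defaults.
applyOptions :: [Option] -> Config
applyOptions = foldl (flip ($)) defaultConfig
```

With the string-based style, `fromAttributes [("validte", "0")]` compiles and quietly leaves validation switched on, whereas a misspelled `setValidate` is rejected by the compiler.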
4 The basic concepts
4.1 The basic data structures
Processing of XML is a task of processing tree structures. This can be done in Haskell in a very elegant way by defining an appropriate tree data type, a Haskell DOM (document object model) structure. The tree structure in HXT is a rose tree with a special XNode data type for storing the XML node information.
The generally useful tree structure (NTree) is separated from the node type (XNode). This allows for reusing the tree structure and the tree traversal and manipulation functions in other applications.
data NTree a  = NTree a [NTree a]     -- rose tree

data XNode    = XText String          -- plain text node
              | ...
              | XTag QName XmlTrees   -- element name and list of attributes
              | XAttr QName           -- attribute name
              | ...

type QName    = ...                   -- qualified name

type XmlTree  = NTree XNode
type XmlTrees = [XmlTree]
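To get a feeling for this structure, here is a minimal, self-contained sketch that builds a tiny document by hand. It uses simplified stand-ins for the HXT types: `QName` is replaced by `String`, the elided constructors are left out, and the helpers `mkText'`, `mkElem'` and `size` are hypothetical names made up for this sketch.

```haskell
-- Simplified stand-ins for the HXT types shown above (assumption:
-- QName is replaced by String; other XNode constructors are omitted).
data NTree a = NTree a [NTree a]
  deriving (Show, Eq)

data XNode
  = XText String            -- plain text node
  | XTag String [XmlTree]   -- element name and list of attributes
  | XAttr String            -- attribute name
  deriving (Show, Eq)

type XmlTree  = NTree XNode
type XmlTrees = [XmlTree]

-- Hypothetical helper: build a text node (a leaf).
mkText' :: String -> XmlTree
mkText' s = NTree (XText s) []

-- Hypothetical helper: build an element without attributes.
mkElem' :: String -> XmlTrees -> XmlTree
mkElem' name = NTree (XTag name [])

-- <html><body>hello</body></html>
helloDoc :: XmlTree
helloDoc = mkElem' "html" [mkElem' "body" [mkText' "hello"]]

-- Count all nodes of any rose tree. It works for every node type,
-- which is the point of separating NTree from XNode.
size :: NTree a -> Int
size (NTree _ cs) = 1 + sum (map size cs)
```

Because `size` only looks at the `NTree` structure, it can be reused unchanged for trees with other node types.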
4.2 The concept of filters
Selecting, transforming and generating trees often requires routines, which compute not only a single result tree, but a (possibly empty) list of (sub-)trees. This leads to the idea of XML filters like in HaXml. Filters are functions, which take an XML tree as input and compute a list of result trees.
type XmlFilter = XmlTree -> [XmlTree]
More generally we can define a filter as
type Filter a b = a -> [b]
The first task is to build filters from pure functions, to define a lift operator. Pure functions are lifted to filters in the following way:
Predicates are lifted by mapping False to the empty list and True to the single element list, containing the input tree.
p :: XmlTree -> Bool            -- pure function
p t = ...

pf :: XmlTree -> [XmlTree]      -- or XmlFilter
pf t
    | p t       = [t]
    | otherwise = []
The general lifting function for predicates is isA:
isA :: (a -> Bool) -> (a -> [a])
isA p x
    | p x       = [x]
    | otherwise = []
A predicate for filtering text nodes looks like this
isXText :: XmlFilter            -- XmlTree -> [XmlTree]
isXText t@(NTree (XText _) _) = [t]
isXText _                     = []
Transformers -- functions that map a tree into another tree -- are lifted in a trivial way:
f :: XmlTree -> XmlTree
f t = expr(t)

ff :: XmlTree -> [XmlTree]
ff t = [expr(t)]
Partial functions, functions that can't always compute a result, are usually lifted to totally defined filters:
f :: XmlTree -> XmlTree
f t
    | p t       = expr(t)
    | otherwise = error "f not defined"

ff :: XmlFilter
ff t
    | p t       = [expr(t)]
    | otherwise = []
This is a rather comfortable situation: with these filters we don't have to deal with illegal argument errors. Illegal arguments are just mapped to the empty list.
When processing trees, a function often has no result, exactly one result, or more than one result. Such functions, returning a set of results, are often (a bit imprecisely) called nondeterministic functions. These functions, e.g. selecting all children or all grandchildren of a node, are exactly our filters. In this context lists rather than sets of values are the appropriate result type, because the ordering in XML is important and duplicates are possible.
Working with filters is rather similar to working with binary relations, and working with relations is natural and comfortable, as database people know very well.
The first two examples of nondeterministic functions are selecting the children and the grandchildren of an XmlTree, which can be implemented by
getChildren :: XmlFilter
getChildren (NTree n cs)
    = cs

getGrandChildren :: XmlFilter
getGrandChildren (NTree n cs)
    = concat [ getChildren c | c <- cs ]
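The definitions of this section can be tried out without HXT. The following is a minimal sketch, assuming a simplified XNode with only two node kinds and plain String element names (the real HXT types are richer); sample is a hypothetical test tree.

```haskell
-- Simplified stand-ins for HXT's NTree and XNode (assumption: two node kinds only).
data NTree a = NTree a [NTree a]
  deriving (Eq, Show)

data XNode
  = XText String   -- plain text node
  | XTag String    -- element node, name only (attributes omitted)
  deriving (Eq, Show)

type XmlTree   = NTree XNode
type XmlFilter = XmlTree -> [XmlTree]

-- Lift a predicate to a filter: False gives [], True gives [input].
isA :: (a -> Bool) -> (a -> [a])
isA p x
  | p x       = [x]
  | otherwise = []

-- Predicate filter for text nodes, built with isA.
isXText :: XmlFilter
isXText = isA isTextNode
  where
    isTextNode (NTree (XText _) _) = True
    isTextNode _                   = False

-- Nondeterministic function: all direct children, in document order.
getChildren :: XmlFilter
getChildren (NTree _ cs) = cs

-- Hypothetical sample document: <hello><haskell>world</haskell>!</hello>
sample :: XmlTree
sample =
  NTree (XTag "hello")
    [ NTree (XTag "haskell") [NTree (XText "world") []]
    , NTree (XText "!") []
    ]
```

Applying isXText to every child of sample, for instance with concatMap isXText (getChildren sample), keeps only the text node "!"; the element child is mapped to the empty list.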
4.3 Filter combinators
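Composition of filters (like function composition) is the most important combinator. We will use the infix operator (>>>) for filter composition. It is defined as follows:
(>>>) :: XmlFilter -> XmlFilter -> XmlFilter
(f >>> g) t = concat [g t' | t' <- f t]
With (>>>) the definition of getGrandChildren becomes rather simple:
getGrandChildren :: XmlFilter
getGrandChildren = getChildren >>> getChildren
Selecting all text children of a node can be formulated just as easily:
getTextChildren :: XmlFilter
getTextChildren = getChildren >>> isXText
The second basic combinator is the sum operator (<+>), which concatenates the results of two filters applied to the same input:
(<+>) :: XmlFilter -> XmlFilter -> XmlFilter
(f <+> g) t = f t ++ g t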
Combining elementary filters with (>>>) and (<+>) leads to more complex functionality. For example, selecting all text nodes within two levels of depth (in left to right order) can be formulated with:
getTextChildren2 :: XmlFilter
getTextChildren2 = getChildren >>> ( isXText <+> ( getChildren >>> isXText ) )
Exercise: Are these filters equivalent or what's the difference between the two filters?
getChildren >>> ( isXText <+> ( getChildren >>> isXText ) )

( getChildren >>> isXText ) <+> ( getChildren >>> getChildren >>> isXText )
Of course we need choice combinators. The first idea is an if-then-else filter, built up from three simpler filters. But often it's easier and more elegant to work with simpler binary combinators for choice. So we will introduce the simpler ones first.
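One of these choice combinators is called orElse and is defined as follows:
orElse :: XmlFilter -> XmlFilter -> XmlFilter
orElse f g t
    | null res1 = g t
    | otherwise = res1
    where
    res1 = f t
If f computes a non-empty result list, this list is the result; otherwise g is applied to the input. Two other simple choice combinators, usually written in infix notation, are guards and when:
guards :: XmlFilter -> XmlFilter -> XmlFilter
guards g f t
    | null (g t) = []
    | otherwise  = f t

when :: XmlFilter -> XmlFilter -> XmlFilter
when f g t
    | null (g t) = [t]
    | otherwise  = f t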
These choice operators become useful when transforming and manipulating trees.
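To get a feeling for these combinators, here is a small self-contained sketch that generalises them from XmlFilter to an arbitrary Filter a b and applies them to integers instead of trees. The toy filters halves and double and the fixity declarations are assumptions made for this illustration only.

```haskell
type Filter a b = a -> [b]

infixr 1 >>>
infixr 5 <+>

-- Composition: feed every result of f into g.
(>>>) :: Filter a b -> Filter b c -> Filter a c
(f >>> g) x = concat [ g y | y <- f x ]

-- Sum: results of f followed by results of g.
(<+>) :: Filter a b -> Filter a b -> Filter a b
(f <+> g) x = f x ++ g x

-- Use g only if f yields nothing.
orElse :: Filter a b -> Filter a b -> Filter a b
orElse f g x
  | null res  = g x
  | otherwise = res
  where
    res = f x

-- Apply f only if the guard g succeeds, else fail.
guards :: Filter a b -> Filter a c -> Filter a c
guards g f x
  | null (g x) = []
  | otherwise  = f x

-- Apply f only if g succeeds, else pass the input through unchanged.
when :: Filter a a -> Filter a b -> Filter a a
when f g x
  | null (g x) = [x]
  | otherwise  = f x

isA :: (a -> Bool) -> Filter a a
isA p x = [ x | p x ]

-- Toy filters (hypothetical): a partial one and a total one.
halves :: Filter Int Int
halves n = [ n `div` 2 | even n ]

double :: Filter Int Int
double n = [ n * 2 ]
```

With these definitions, (halves >>> halves) 12 yields [3], (halves <+> double) 4 yields [2,8], (halves `orElse` double) 3 yields [6], and (double `when` isA even) 3 leaves the odd input unchanged as [3].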
4.4 Tree traversal filter
A very basic operation on tree structures is the traversal of all nodes and the selection and/or transformation of nodes. These traversal filters serve as control structures for processing whole trees. They correspond to the map and fold combinators for lists.
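The simplest traversal filter does a top down search of all nodes with a special feature. This filter, called deep, is defined as follows:
deep :: XmlFilter -> XmlFilter
deep f = f `orElse` (getChildren >>> deep f)
Because of the use of orElse, the descent into a subtree stops as soon as f succeeds at its root.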
Example: Selecting all plain text nodes of a document can be formulated with:
deep isXText
Example: Selecting all "top level" tables in an HTML document looks like this:
deep (isElem >>> hasName "table")
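A variant of deep, called multi, performs a complete search: the traversal does not stop when a node is found, so nested matches are selected as well.
multi :: XmlFilter -> XmlFilter
multi f = f <+> (getChildren >>> multi f)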
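The difference between the two traversal filters shows up with nested matches. The sketch below is a self-contained model with a simplified Tree type (element names only); doc is a hypothetical document with one table nested in another.

```haskell
-- Simplified tree: element name and children (an assumption, not HXT's XmlTree).
data Tree = Node String [Tree]
  deriving (Eq, Show)

type Filter = Tree -> [Tree]

getChildren :: Filter
getChildren (Node _ cs) = cs

hasName :: String -> Filter
hasName n t@(Node m _) = [ t | n == m ]

(>>>) :: Filter -> Filter -> Filter
(f >>> g) t = concatMap g (f t)

(<+>) :: Filter -> Filter -> Filter
(f <+> g) t = f t ++ g t

orElse :: Filter -> Filter -> Filter
orElse f g t =
  let r = f t
  in if null r then g t else r

-- deep stops descending at the first match on each path,
-- multi keeps searching below matching nodes.
deep, multi :: Filter -> Filter
deep  f = f `orElse` (getChildren >>> deep f)
multi f = f <+> (getChildren >>> multi f)

-- Hypothetical document: a table nested inside a cell of another table.
doc :: Tree
doc = Node "body" [ Node "table" [ Node "td" [ Node "table" [] ] ] ]
```

Here deep (hasName "table") doc returns only the outer table, while multi (hasName "table") doc returns both the outer and the inner one.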
4.5 Arrows
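We've already seen that filters of type a -> [b] are a powerful way to process XML. But they are not general enough: sometimes we want to do some I/O and still stay at the filter level. So we need something like
type XmlIOFilter = XmlTree -> IO [XmlTree]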
for working in the IO monad.
Sometimes it's appropriate to thread some state through the computation like in state monads. This leads to a type like
type XmlStateFilter state = state -> XmlTree -> (state, [XmlTree])
And in real world applications we need both extensions at the same time. Of course I/O is necessary but usually there are also some global options and variables for controlling the computations. In HXT, for instance there are variables for controlling trace output, options for setting the default encoding scheme for input data and a base URI for accessing documents, which are addressed in a content or in a DTD part by relative URIs. So we need something like
type XmlIOStateFilter state = state -> XmlTree -> IO (state, [XmlTree])
A filter discussed above has all features of an arrow. Arrows are introduced for generalising the concept of functions and function combination to more general kinds of computation than pure functions.
A basic set of combinators for arrows is defined in the classes in the Control.Arrow module. In HXT there are four types instantiated with these classes: pure list arrows, list arrows with a state, list arrows with IO, and list arrows with a state and IO.
newtype LA       a b = LA    { runLA    :: (a -> [b]) }

newtype SLA    s a b = SLA   { runSLA   :: (s -> a -> (s, [b])) }

newtype IOLA     a b = IOLA  { runIOLA  :: (a -> IO [b]) }

newtype IOSLA  s a b = IOSLA { runIOSLA :: (s -> a -> IO (s, [b])) }
The first one and the last one are those used most frequently in the toolbox, and of course there are lifting functions for converting general arrows into more specific arrows.
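How a list filter becomes an arrow can be sketched with the classes from base alone. The following is an illustrative reimplementation of the pure list arrow, not the HXT source; only the newtype itself is taken from the text above.

```haskell
import Prelude hiding (id, (.))
import Control.Category
import Control.Arrow

newtype LA a b = LA { runLA :: a -> [b] }

-- Composition of list arrows: apply g to every result of f.
instance Category LA where
  id          = LA (\x -> [x])
  LA g . LA f = LA (\x -> concatMap g (f x))

-- arr lifts a pure function to a filter with exactly one result.
instance Arrow LA where
  arr f        = LA (\x -> [f x])
  first (LA f) = LA (\(x, y) -> [ (x', y) | x' <- f x ])

-- The always failing arrow and the sum of two arrows.
instance ArrowZero LA where
  zeroArrow = LA (const [])

instance ArrowPlus LA where
  LA f <+> LA g = LA (\x -> f x ++ g x)

-- Example arrow: increment, then produce the value and ten times the value.
example :: LA Int Int
example = arr (+ 1) >>> (LA (\x -> [x, x * 10]) <+> zeroArrow)
```

With these instances the generic (>>>) and (<+>) from Control.Arrow behave like the filter combinators of the previous sections; runLA example 1 evaluates to [2,20].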
Don't worry about all these conceptual details. Let's have a look into some Hello world examples.
5 Getting started: Hello world examples
5.1 copyXML
The first complete example is a program for copying an XML document
module Main
where

import Text.XML.HXT.Core
import Text.XML.HXT.Curl -- use libcurl for HTTP access
                         -- only necessary when reading http://...

import System.Environment

main :: IO ()
main
    = do
      [src, dst] <- getArgs
      runX ( readDocument [withValidate no
                          ,withCurl []
                          ] src
             >>>
             writeDocument [withIndent yes
                           ,withOutputEncoding isoLatin1
                           ] dst
           )
      return ()
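The withCurl option enables reading via HTTP. If only file access is necessary, this option and the import of Text.XML.HXT.Curl can be dropped. In that case the program does not depend on the libCurl binding.
The tree read in is piped into the output arrow. This one again is controlled by a set of system options: withIndent yes switches on indentation, and withOutputEncoding isoLatin1 selects the output encoding. We've omitted here the boring stuff of option parsing and error handling.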
Compilation and a test run look like this:
hobel > ghc --make -o copyXml CopyXML.hs
hobel > cat hello.xml
<hello><haskell>world</haskell></hello>
hobel > copyXml hello.xml -
<?xml version="1.0" encoding="ISO-8859-1"?>
<hello>
  <haskell>world</haskell>
</hello>
hobel >
The mini XML document in file hello.xml is read and a document tree is built. Then this tree is converted into a string and written to standard output (filename: -). It is decorated with an XML declaration containing the version and the output encoding.
For processing HTML documents there is an HTML parser, which tries to parse and interpret almost anything as HTML. The HTML parser can be selected by calling readDocument with the option withParseHTML yes.
5.2 Pattern for a main program
A more realistic pattern for a simple Unix filter like program has the following structure:
module Main
where

import Text.XML.HXT.Core
import Text.XML.HXT.... -- further HXT packages

import System.IO
import System.Environment
import System.Console.GetOpt
import System.Exit

main :: IO ()
main
    = do
      argv <- getArgs
      (al, src, dst) <- cmdlineOpts argv
      [rc] <- runX (application al src dst)
      if rc >= c_err
         then exitWith (ExitFailure (0-1))
         else exitWith ExitSuccess

-- | the dummy for the boring stuff of option evaluation,
-- usually done with 'System.Console.GetOpt'

cmdlineOpts :: [String] -> IO (SysConfigList, String, String)
cmdlineOpts argv
    = return ([withValidate no], argv!!0, argv!!1)

-- | the main arrow

application :: SysConfigList -> String -> String -> IOSArrow b Int
application cfg src dst
    = configSysVars cfg                                           -- (0)
      >>>
      readDocument [] src
      >>>
      processChildren (processDocumentRootElement `when` isElem)  -- (1)
      >>>
      writeDocument [] dst                                        -- (3)
      >>>
      getErrStatus

-- | the dummy for the real processing: the identity filter

processDocumentRootElement :: IOSArrow XmlTree XmlTree
processDocumentRootElement
    = this -- substitute this by the real application
This program has the same functionality as our first example, but it separates the arrow from the boring option evaluation and return code computation.
In line (0) the system is configured with the list of options. These options are then used as defaults for all read and write operations. The options can be overridden for single read/write calls by putting config options into the parameter list of the read/write function calls.
The interesting line is (1).
readDocument generates a tree structure with a so-called extra root node. This root node is a node above the XML document root element. It is necessary because of possible other nodes on the same tree level as the XML root, for instance comments, processing instructions or whitespace.
Furthermore the artificial root node serves for storing meta information about the document in the attribute list, like the document name, the encoding scheme, the HTTP transfer headers and other information.
To process the real XML root element, we have to take the children of the root node, select the XML root element and process it, but leave all other children unchanged. This is done with processChildren and the when choice operator. processChildren applies a filter elementwise to all children of a node. All results from processing the list of children form the children of the result node.
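The interplay of processChildren and when can be modelled in a few lines of plain Haskell. This is a sketch on a simplified tree type with element and comment nodes only; rename is a hypothetical stand-in for the real processing arrow.

```haskell
-- Simplified document tree (an assumption): elements and comments only.
data Tree
  = Elem String [Tree]
  | Comment String
  deriving (Eq, Show)

type Filter = Tree -> [Tree]

isElem :: Filter
isElem t@(Elem _ _) = [t]
isElem _            = []

-- f `when` g: apply f where g succeeds, leave the node unchanged otherwise.
when :: Filter -> Filter -> Filter
when f g t
  | null (g t) = [t]
  | otherwise  = f t

-- Apply f to every child; all results together form the new child list.
processChildren :: Filter -> Filter
processChildren f (Elem n cs) = [Elem n (concatMap f cs)]
processChildren _ t           = [t]

-- Hypothetical processing step: rename an element.
rename :: String -> Filter
rename new (Elem _ cs) = [Elem new cs]
rename _   t           = [t]

-- Extra root node with a comment and the document root element as children.
root :: Tree
root = Elem "/" [Comment "header", Elem "hello" []]
```

Evaluating processChildren (rename "bye" `when` isElem) root renames the root element to bye and keeps the comment in place, which mirrors line (1) of the program above.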
The structure of the internal document tree can be made visible, e.g. by adding the option withShowTree yes to writeDocument. This will emit the tree in a readable text representation instead of the real document.
In the next section we will give examples for the
5.3 Tracing
There are tracing facilities to observe the actions performed and to show intermediate results
application :: SysConfigList -> String -> String -> IOSArrow b Int
application cfg src dst
    = configSysVars (withTrace 1 : cfg)                -- (0)
      >>>
      traceMsg 1 "start reading document"              -- (1)
      >>>
      readDocument [] src
      >>>
      traceMsg 1 "document read, start processing"     -- (2)
      >>>
      processChildren (processDocumentRootElement `when` isElem)
      >>>
      traceMsg 1 "document processed"                  -- (3)
      >>>
      writeDocument [] dst
      >>>
      getErrStatus
In (0) the system trace level is set to 1. At the default level 0 all trace messages are suppressed. The three trace messages (1)-(3) will be issued, and readDocument and writeDocument will also log their activities.
How a whole document and the internal tree structure can be traced, is shown in the following example
...
>>>
processChildren (processDocumentRootElement `when` isElem)
>>>
withTraceLevel 4 (traceDoc "resulting document")       -- (1)
>>>
...
In (1) the trace level is locally set to the highest level 4. traceDoc will then issue the trace message, the document formatted as XML, and the internal DOM tree of the document.
6 Selection examples
6.1 Selecting text from an HTML document
Selecting all the plain text of an XML/HTML document can be formulated with
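selectAllText :: ArrowXml a => a XmlTree XmlTree
selectAllText
    = deep isText
deep traverses the whole tree and stops the descent at every text node (isText). In this case, where the selected nodes are all leaves, the complete search with multi would give the same result.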
6.2 Selecting text and ALT attribute values
Let's take a bit more complex task: We want to select all text, but also the values of the alt attributes of image tags.
selectAllTextAndAltValues :: ArrowXml a => a XmlTree XmlTree
selectAllTextAndAltValues
    = deep
      ( isText                       -- (1)
        <+>
        ( isElem >>> hasName "img"   -- (2)
          >>>
          getAttrValue "alt"         -- (3)
          >>>
          mkText                     -- (4)
        )
      )
The whole tree is searched for text nodes (1) and for image elements (2), from the image elements the alt attribute values are selected as plain text (3), this text is transformed into a text node (4).
6.3 Selecting text and ALT attribute values (2)
Let's refine the above filter one step further. The text from the alt attributes shall be marked in the output by surrounding double square brackets. Empty alt values shall be ignored.
selectAllTextAndRealAltValues :: ArrowXml a => a XmlTree XmlTree
selectAllTextAndRealAltValues
    = deep
      ( isText
        <+>
        ( isElem >>> hasName "img"
          >>>
          getAttrValue "alt"
          >>>
          isA significant            -- (1)
          >>>
          arr addBrackets            -- (2)
          >>>
          mkText
        )
      )
    where
    significant :: String -> Bool
    significant = not . all (`elem` " \n\r\t")

    addBrackets :: String -> String
    addBrackets s
        = " [[ " ++ s ++ " ]] "
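This example shows two combinators for building arrows from pure functions.
The first one, isA, lifts the predicate significant and thereby removes all empty or whitespace-only alt values (1). The second one, arr, lifts the editing function addBrackets to the arrow level (2).
7 Document construction examples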
7.1 The Hello World document
The first document, of course, is a Hello World document:
helloWorld :: ArrowXml a => a XmlTree XmlTree
helloWorld
    = mkelem "html" []                     -- (1)
      [ mkelem "head" []
        [ mkelem "title" []
          [ txt "Hello World" ]            -- (2)
        ]
      , mkelem "body"
        [ sattr "class" "haskell" ]        -- (3)
        [ mkelem "h1" []
          [ txt "Hello World" ]            -- (4)
        ]
      ]
To write this document to a file use the following arrow
root [] [helloWorld]                           -- (1)
>>>
writeDocument [withIndent yes] "hello.xml"     -- (2)
When this arrow is executed, the helloWorld document is wrapped into a so-called root node (1). This complete document is written to "hello.xml" (2).
writeDocument always expects a whole document tree with such a root node. Before writing, the document is indented by inserting extra whitespace text nodes, and an XML declaration with version and encoding is added. If the indent option is not given, the whole document would appear on a single line:
<?xml version="1.0" encoding="UTF-8"?>
<html>
<head>
<title>Hello World</title>
</head>
<body class="haskell">
<h1>Hello World</h1>
</body>
</html>
The code can be shortened a bit by using some of the convenience functions:
helloWorld2 :: ArrowXml a => a XmlTree XmlTree
helloWorld2
    = selem "html"
      [ selem "head"
        [ selem "title"
          [ txt "Hello World" ]
        ]
      , mkelem "body"
        [ sattr "class" "haskell" ]
        [ selem "h1"
          [ txt "Hello World" ]
        ]
      ]
In the above two examples the arrow input is totally ignored, because of the use of the constant arrow txt.
7.2 A page about all images within a HTML page
A slightly more interesting task is the construction of a page containing a table of all images within a page, including image URLs, geometry and ALT attributes.
The program for this has a frame similar to the helloWorld program, but the rows of the table must be filled in from the input document. In the first step we will generate a table with a single column containing the URL of the image.
imageTable :: ArrowXml a => a XmlTree XmlTree
imageTable
    = selem "html"
      [ selem "head"
        [ selem "title"
          [ txt "Images in Page" ]
        ]
      , selem "body"
        [ selem "h1"
          [ txt "Images in Page" ]
        , selem "table"
          [ collectImages           -- (1)
            >>>
            genTableRows            -- (2)
          ]
        ]
      ]
    where
    collectImages                   -- (1)
        = deep ( isElem
                 >>>
                 hasName "img"
               )
    genTableRows                    -- (2)
        = selem "tr"
          [ selem "td"
            [ getAttrValue "src" >>> mkText ]
          ]
With (1) the image elements are collected, and with (2) the HTML code for an image element is built.
Applied to http://www.haskell.org/ we get the following result (at the time of writing this page):
<html>
<head>
<title>Images in Page</title>
</head>
<body>
<h1>Images in Page</h1>
<table>
<tr>
<td>/haskellwiki_logo.png</td>
</tr>
<tr>
<td>/sitewiki/images/1/10/Haskelllogo-small.jpg</td>
</tr>
<tr>
<td>/haskellwiki_logo_small.png</td>
</tr>
</table>
</body>
</html>
When generating HTML, there are often constant parts within the page, in the example e.g. the page header. It's possible to write these parts as a string containing plain HTML and then read this with a simple XML contents parser called xread. The example above could then be rewritten as
imageTable
    = selem "html"
      [ pageHeader
      , ...
      ]
    where
    pageHeader
        = constA "<head><title>Images in Page</title></head>"
          >>>
          xread
    ...
xread does not run in the IO monad, so it can be used in any context, but therefore its error handling is very limited.
7.3 A page about all images within a HTML page: 1. Refinement
The next refinement step is the extension of the table such that it contains four columns: one for the image itself, one for the URL, one for the geometry and one for the ALT text. The extended genTableRows has the following form:
genTableRows
    = selem "tr"
      [ selem "td"                            -- (1)
        [ this                                -- (1.1)
        ]
      , selem "td"                            -- (2)
        [ getAttrValue "src"
          >>>
          mkText
          >>>
          mkelem "a"                          -- (2.1)
          [ attr "href" this ]
          [ this ]
        ]
      , selem "td"                            -- (3)
        [ ( getAttrValue "width"
            &&&                               -- (3.1)
            getAttrValue "height"
          )
          >>>
          arr2 geometry                       -- (3.2)
          >>>
          mkText
        ]
      , selem "td"                            -- (4)
        [ getAttrValue "alt"
          >>>
          mkText
        ]
      ]
    where
    geometry :: String -> String -> String
    geometry "" "" = ""
    geometry w h   = w ++ "x" ++ h
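In (1) the identity arrow this inserts the whole image element into the first column (1.1). (2) is the column from the previous example, but the URL has been made active by embedding it in an A-element (2.1). In (3) there are two new combinators: (&&&) (3.1) applies two arrows to the same input and combines the results into a pair, and arr2 (3.2) lifts a binary function to an arrow accepting a pair of values. So width and height are combined into a geometry spec. (4) adds the ALT text.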
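The behaviour of (&&&) and arr2 for list arrows can be illustrated with plain list functions. The sketch below is a model, not the HXT implementation; Attrs is a hypothetical association list standing in for an image element's attributes, and getAttrValue is modelled to return the empty string for a missing attribute.

```haskell
type Filter a b = a -> [b]

(>>>) :: Filter a b -> Filter b c -> Filter a c
(f >>> g) x = concatMap g (f x)

-- Apply two filters to the same input and pair up all results.
(&&&) :: Filter a b -> Filter a c -> Filter a (b, c)
(f &&& g) x = [ (y, z) | y <- f x, z <- g x ]

-- Lift a binary function to a filter on pairs.
arr2 :: (b -> c -> d) -> Filter (b, c) d
arr2 f (y, z) = [f y z]

-- Hypothetical stand-in for an element's attribute list.
type Attrs = [(String, String)]

-- Attribute lookup that always succeeds, with "" as default.
getAttrValue :: String -> Filter Attrs String
getAttrValue n attrs = [ maybe "" id (lookup n attrs) ]

geometry :: String -> String -> String
geometry "" "" = ""
geometry w  h  = w ++ "x" ++ h

imgGeometry :: Filter Attrs String
imgGeometry = (getAttrValue "width" &&& getAttrValue "height") >>> arr2 geometry
```

For the attribute list [("width","10"),("height","20")] the filter imgGeometry yields ["10x20"], and for an image without geometry attributes it yields [""].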
7.4 A page about all images within a HTML page: 2. Refinement
The generated HTML page is not yet very useful, because it usually contains relative HREFs to the images, so the links do not work. We have to transform the SRC attribute values into absolute URLs. This can be done with the following code:
imageTable2 :: IOStateArrow s XmlTree XmlTree
imageTable2
    = ...
      ...
      , selem "table"
        [ collectImages
          >>>
          mkAbsImageRef                          -- (1)
          >>>
          genTableRows
        ]
      ...

mkAbsImageRef :: IOStateArrow s XmlTree XmlTree  -- (1)
mkAbsImageRef
    = processAttrl ( mkAbsRef                    -- (2)
                     `when`
                     hasName "src"               -- (3)
                   )
    where
    mkAbsRef                                     -- (4)
        = replaceChildren
          ( xshow getChildren                    -- (5)
            >>>
            ( mkAbsURI `orElse` this )           -- (6)
            >>>
            mkText                               -- (7)
          )
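The image elements are edited with mkAbsImageRef before the table rows are built (1). This arrow uses the global system state of HXT, in which the base URL of a document is stored. For editing the SRC attribute value, the attribute list of the image elements is processed with processAttrl (2), and only the SRC attributes are changed (3). The real work is done in (4): the attribute value is converted into a string with xshow getChildren (5) and transformed into an absolute URL with mkAbsURI. If this fails, the value remains unchanged (6). The resulting String value is converted into a text node forming the new attribute value node (7).
Because of the use of the global HXT state in mkAbsURI, mkAbsImageRef and imageTable2 need the more specialised type IOStateArrow s XmlTree XmlTree.
8 Transformation examples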
8.1 Decorating external references of an HTML document
In the following examples, we want to decorate the external references in an HTML page with a small icon, as is done in many wikis. For this task the document tree has to be traversed, and all parts except the interesting A-elements remain unchanged. At the end of the list of children of an A-element we add an image element.
Here is the first version:
addRefIcon :: ArrowXml a => a XmlTree XmlTree
addRefIcon
    = processTopDown                      -- (1)
      ( addImg                            -- (2)
        `when`
        isExternalRef                     -- (3)
      )
    where
    isExternalRef                         -- (4)
        = isElem
          >>>
          hasName "a"
          >>>
          hasAttr "href"
          >>>
          getAttrValue "href"
          >>>
          isA isExtRef
        where
        isExtRef                          -- (4.1)
            = isPrefixOf "http:"          -- or something more precise

    addImg
        = replaceChildren                 -- (5)
          ( getChildren                   -- (6)
            <+>
            imgElement                    -- (7)
          )

    imgElement
        = mkelem "img"                    -- (8)
          [ sattr "src" "/icons/ref.png"  -- (9)
          , sattr "alt" "external ref"
          ]
          []                              -- (10)
The traversal of the tree is done with processTopDown (1). This arrow applies an arrow to all nodes of the whole document tree. The transformation arrow applies addImg (2) to all A-elements with an external reference (3),(4). isExternalRef uses a somewhat simplified test (4.1) for external URLs. addImg replaces the children (5) of these A-elements by selecting the current children (6) and adding an image element (7). The image element is constructed with mkelem (8), which takes an element name, a list of arrows for computing the attributes and a list of arrows for computing the contents. The content of the image element is empty (10). The attributes are constructed with sattr (9), which ignores the arrow input and builds an attribute from the name-value pair of arguments.
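The top down traversal itself is a short recursive definition. The following self-contained sketch models it on a simplified tree type with element names only; addIcon is a hypothetical simplification of addImg that ignores attributes.

```haskell
data Tree = Node String [Tree]
  deriving (Eq, Show)

type Filter = Tree -> [Tree]

(>>>) :: Filter -> Filter -> Filter
(f >>> g) t = concatMap g (f t)

when :: Filter -> Filter -> Filter
when f g t
  | null (g t) = [t]
  | otherwise  = f t

hasName :: String -> Filter
hasName n t@(Node m _) = [ t | n == m ]

processChildren :: Filter -> Filter
processChildren f (Node n cs) = [Node n (concatMap f cs)]

-- Edit the current node first, then recurse into the (new) children.
processTopDown :: Filter -> Filter
processTopDown f = f >>> processChildren (processTopDown f)

-- Hypothetical simplification of addImg: append an img child.
addIcon :: Filter
addIcon (Node n cs) = [Node n (cs ++ [Node "img" []])]

doc :: Tree
doc = Node "p" [ Node "a" [Node "b" []], Node "i" [Node "a" []] ]
```

processTopDown (addIcon `when` hasName "a") doc appends an img node to both a elements, also to the one nested inside i, and leaves all other nodes unchanged.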
8.2 Transform external references into absolute references
In the following example we will develop a program for editing an HTML page such that all references to external documents (images, hypertext refs, style refs, ...) become absolute references. We will see some new, but very useful combinators in the solution.
The task seems to be rather trivial. In a tree traversal all references are edited with respect to the document base. But in HTML there is a BASE element, allowed in the content of HEAD, with an HREF attribute which defines the document base. Again, this href can be a relative URL.
We start the development with the editing arrow. This gets the real document base as argument.
mkAbsHRefs :: ArrowXml a => String -> a XmlTree XmlTree
mkAbsHRefs base
    = processTopDown editHRef                        -- (1)
    where
    editHRef
        = processAttrl                               -- (3)
          ( changeAttrValue (absHRef base)           -- (5)
            `when`
            hasName "href"                           -- (4)
          )
          `when`
          ( isElem >>> hasName "a" )                 -- (2)
        where
        absHRef :: String -> String -> String        -- (5)
        absHRef base url
            = fromMaybe url . expandURIString url $ base
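The tree is traversed (1) and for every A element (2) the attribute list is processed (3). All HREF attribute values (4) are manipulated by changeAttrValue, called with a string function (5). expandURIString is a pure function defined in HXT for computing an absolute URI. In this first step we only edit A-HREF attribute values. We will refine this later.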
The second step is the complete computation of the base URL.
computeBaseRef :: IOStateArrow s XmlTree String
computeBaseRef
    = ( ( ( isElem >>> hasName "html"      -- (0)
            >>>
            getChildren                    -- (1)
            >>>
            isElem >>> hasName "head"      -- (2)
            >>>
            getChildren                    -- (3)
            >>>
            isElem >>> hasName "base"      -- (4)
            >>>
            getAttrValue "href"            -- (5)
          )
          &&&
          getBaseURI                       -- (6)
        )
        >>> expandURI                      -- (7)
      )
      `orElse` getBaseURI                  -- (8)
Input to this arrow is the HTML element. (0) to (5) is the arrow for selecting the HREF value of the BASE element. Parallel to this, the system base URL is read with getBaseURI (6). The resulting pair of strings is piped into expandURI (7). This arrow fails in the absence of a BASE element; in this case we take the plain document base (8). The selection of the BASE element is not yet very handy. We will define a more general and elegant function later, allowing an element path as selection argument.
In the third step, we will combine the two arrows. For this we will use a new combinator, ($<). The need for it is the following: We need the arrow input (the document) two times, once for computing the document base, and a second time for editing the whole document, and of course we want to compute the extra string parameter for editing with the arrow defined above.
The combined arrow, our main arrow, looks like this
toAbsRefs :: IOStateArrow s XmlTree XmlTree
toAbsRefs
    = mkAbsHRefs $< computeBaseRef     -- (1)
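In (1) the arrow input is first piped into computeBaseRef; the result is then used by mkAbsHRefs as extra string parameter when processing the document. This pattern occurs rather frequently, so ($<) becomes very useful.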
Programming with arrows is one style of point free programming. Point free programming often becomes unhandy when values are used more than once.
One solution is the special arrow syntax supported by ghc and others, similar to the do notation for monads. But for many simple cases the ($<) combinator is sufficient.
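For plain list filters the idea behind ($<) fits in one line. The sketch below is a model of the combinator's meaning, not the HXT definition; Doc, computeBase and mkAbs are hypothetical stand-ins for the document, computeBaseRef and mkAbsHRefs, and example.org is a placeholder.

```haskell
type Filter a b = a -> [b]

-- Compute a parameter from the input with g, then run the
-- parameterised filter f on the same input.
($<) :: (c -> Filter a b) -> Filter a c -> Filter a b
(f $< g) x = concat [ f c x | c <- g x ]

-- Hypothetical document model: a base URL and a list of relative links.
type Doc = (String, [String])

computeBase :: Filter Doc String
computeBase (base, _) = [base]

mkAbs :: String -> Filter Doc Doc
mkAbs base (b, links) = [(b, map (base ++) links)]

toAbs :: Filter Doc Doc
toAbs = mkAbs $< computeBase
```

Applied to ("http://example.org/", ["a.html", "b.png"]), toAbs prefixes every link with the base, so the input document is used twice: once for computing the parameter and once as the value being edited.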
To complete the development of the example, a last step is necessary: the removal of the redundant BASE element.
toAbsRefs :: IOStateArrow s XmlTree XmlTree
toAbsRefs
    = ( mkAbsHRefs $< computeBaseRef )
      >>>
      removeBaseElement

removeBaseElement :: ArrowXml a => a XmlTree XmlTree
removeBaseElement
    = processChildren
      ( processChildren
        ( none                               -- (1)
          `when`
          ( isElem >>> hasName "base" )
        )
        `when`
        ( isElem >>> hasName "head" )
      )
In this function the children of the HEAD element are searched for a BASE element. This is removed by applying the null arrow none (1) to the input, which always returns the empty list.
The computeBaseRef function defined above contains an arrow pattern for selecting the right subtree that is rather common in HXT applications:
isElem >>> hasName n1
>>>
getChildren
>>>
isElem >>> hasName n2
...
>>>
getChildren
>>>
isElem >>> hasName nm
For this pattern we will define a convenient function creating the arrow for selection
getDescendents :: ArrowXml a => [String] -> a XmlTree XmlTree
getDescendents
    = foldl1 (\ x y -> x >>> getChildren >>> y)     -- (1)
      .
      map (\ n -> isElem >>> hasName n)              -- (2)
The name list is mapped to the element checking arrow (2), and the resulting list of arrows is folded with getChildren into a single arrow (1). With this function, computeBaseRef becomes simpler and more readable:
computeBaseRef :: IOStateArrow s XmlTree String
computeBaseRef
    = ( ( ( getDescendents ["html","head","base"]   -- (1)
            >>>
            getAttrValue "href"                     -- (2)
          )
          ...
    ...
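The fold in getDescendents can be checked on a small model. This sketch uses a simplified tree with element names only, so hasName alone plays the role of isElem >>> hasName; doc is a hypothetical document. Note that foldl1 is undefined for an empty name list.

```haskell
data Tree = Node String [Tree]
  deriving (Eq, Show)

type Filter = Tree -> [Tree]

(>>>) :: Filter -> Filter -> Filter
(f >>> g) t = concatMap g (f t)

getChildren :: Filter
getChildren (Node _ cs) = cs

hasName :: String -> Filter
hasName n t@(Node m _) = [ t | n == m ]

-- Build one selection arrow from an element path (non-empty list of names).
getDescendents :: [String] -> Filter
getDescendents
  = foldl1 (\x y -> x >>> getChildren >>> y)
  . map hasName

-- Hypothetical document with a base element in head and another one in body.
doc :: Tree
doc =
  Node "html"
    [ Node "head" [Node "base" [], Node "title" []]
    , Node "body" [Node "base" []]
    ]
```

getDescendents ["html","head","base"] doc selects exactly one node, the base element below head; the one below body does not match the path.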
An even more general and flexible technique is the use of XPath expressions for selecting document parts, provided by the package hxt-xpath:
computeBaseRef
    = ( ( ( getXPathTrees "/html/head/base"     -- (1)
            >>>
            getAttrValue "href"                 -- (2)
          )
    ...
Even the attribute selection can be expressed by XPath, so (1) and (2) can be combined into
computeBaseRef
    = ( ( xshow (getXPathTrees "/html/head/base/@href")
    ...
The extra xshow is required here to convert the XPath result, an XmlTree, into a string.
XPath defines a full language for selecting parts of an XML document. Sometimes it's rather comfortable to make selections of this type, but XPath evaluation in general is more expensive in time and space than a simple combination of arrows, as we've seen in getDescendents.
8.3 Transform external references into absolute references: Refinement
In the above example only A-HREF URLs are edited. Now we extend this to other element-attribute combinations.
mkAbsRefs :: ArrowXml a => String -> a XmlTree XmlTree
mkAbsRefs base
    = processTopDown ( editRef "a" "href"          -- (2)
                       >>>
                       editRef "img" "src"         -- (3)
                       >>>
                       editRef "link" "href"       -- (4)
                       >>>
                       editRef "script" "src"      -- (5)
                     )
    where
    editRef en an                                  -- (1)
        = processAttrl ( changeAttrValue (absHRef base)
                         `when`
                         hasName an
                       )
          `when`
          ( isElem >>> hasName en )
        where
        absHRef :: String -> String -> String
        absHRef base url
            = fromMaybe url . expandURIString url $ base
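The arrow applied to every element is extended to a sequence of editRef arrows (2)-(5), where editRef (1) is parameterised by the element name and the attribute name. The document is still traversed only once.
To process all possible HTML elements, this sequence should be extended by further element-attribute pairs.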
This can further be simplified into
mkAbsRefs :: ArrowXml a => String -> a XmlTree XmlTree
mkAbsRefs base
    = processTopDown editRefs
    where
    editRefs
        = foldl (>>>) this
          .
          map (\ (en, an) -> editRef en an)
          $
          [ ("a", "href")
          , ("img", "src")
          , ("link", "href")
          , ("script", "src")         -- and more
          ]
    editRef
        = ...
The fold foldl (>>>) this is defined in HXT as seqA, so the above code can be simplified to
mkAbsRefs :: ArrowXml a => String -> a XmlTree XmlTree
mkAbsRefs base
    = processTopDown editRefs
    where
    editRefs
        = seqA . map (uncurry editRef) $ ...
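The sequencing fold can again be modelled with plain list filters. In this sketch seqA is written out as foldl (>>>) this; the triple Attr (element name, attribute name, value) and the fixed base http://example.org/ are assumptions made for the illustration, standing in for the real attribute processing.

```haskell
type Filter a = a -> [a]

this :: Filter a
this x = [x]

(>>>) :: Filter a -> Filter a -> Filter a
(f >>> g) x = concatMap g (f x)

-- Run a list of filters one after the other.
seqA :: [Filter a] -> Filter a
seqA = foldl (>>>) this

-- Hypothetical flat model of one attribute: (element, attribute, value).
type Attr = (String, String, String)

-- Make the value absolute if element and attribute name match.
editRef :: String -> String -> Filter Attr
editRef en an (e, a, v)
  | e == en && a == an = [(e, a, "http://example.org/" ++ v)]
  | otherwise          = [(e, a, v)]

editRefs :: Filter Attr
editRefs
  = seqA . map (uncurry editRef)
  $ [ ("a", "href")
    , ("img", "src")
    ]
```

editRefs ("img", "src", "x.png") rewrites the value to an absolute URL, while an attribute such as ("p", "class", "x") passes through all editRef steps unchanged.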
9 More complex examples
9.1 Serialization and deserialisation to/from XML
Examples can be found in HXT/Conversion of Haskell data from/to XML
9.2 Practical examples of HXT
More complex and complete examples of HXT in action can be found in HXT/Practical
9.3 The Complete Guide To Working With HTML
Tutorial and Walkthrough: http://adit.io/posts/2012-04-14-working_with_HTML_in_haskell.html
