[Haskell-cafe] character encoding problems with hxt (I think)

nadine.and.henry at pobox.com nadine.and.henry at pobox.com
Sat Oct 13 16:37:05 CEST 2012

Dear Haskellers,

I'm trying to write some code that grabs countries and provinces from the
iso_3166 files on Linux systems.  I seem to be running into some kind of
character encoding problem.  file says iso_3166_2.xml is a utf8 file, and
isutf8 agrees, but when I run the following code, it crashes.

uft8Copy makes a byte for byte copy as expected.
noCrash read and writes the document without crashing, but the accented
characters in the strings show up garbled.  Just search for "DE" and you'll
see what I mean.  crash (on my system, (Debian testing)) produces the error
message below.

Can anyone enlighten me on what is going on?

Thanks in advance.
Henry Laxen

{-# LANGUAGE Arrows #-}
import Text.XML.HXT.Core
import Data.List
import qualified System.IO.UTF8 as U

isoFile = "/usr/share/xml/iso-codes/iso_3166_2.xml"

countZerosInLines = length . filter (\x -> x == '0') . concat

utf8Copy = do
  xml <- U.readFile isoFile
  let doc = readString [withValidate no] xml
  U.writeFile "iso_written.xml" xml
  putStrLn "Ran utf8Copy" 

noCrash ::  IO ()
noCrash = do
  lines <- runX (readDocument [withValidate no] isoFile  >>>
            writeDocumentToString [withShowTree yes] )
  -- mapM_ putStrLn lines            
  print $ countZerosInLines lines
  putStrLn "Ran noCrash" 

crash :: IO ()
crash = do
  xml <- U.readFile isoFile
  let doc = readString [withValidate no] xml
  src <- runX $ doc //> writeDocumentToString [withShowTree yes]
  print $ countZerosInLines src
  putStrLn "Ran crash (didn't)"

Ran utf8Copy
Ran noCrash

error: UTF-8 encoding error at input position 2640: InvalidLaterByte 1

error: UTF-8 encoding error at input position 2646: InvalidLaterByte 1
*** Exception: Enum.toEnum{Word8}: tag (363) is outside of bounds (0,255)

main = do

More information about the Haskell-Cafe mailing list