[Haskell-cafe] Data.Text UTF-8 question

Albert Y. C. Lai trebla at vex.net
Fri Aug 31 22:31:26 CEST 2012


On 12-08-31 01:59 AM, jeff p wrote:
> I have a sample file (attached) which I cannot read into Text:
>
>      Prelude Control.Applicative> Data.Text.IO.readFile "foo"
>      *** Exception: utf8.txt: hGetContents: invalid argument (invalid
> byte sequence)
>
>      Prelude Control.Applicative> Data.Text.Encoding.decodeUtf8 <$>
> Data.ByteString.Char8.readFile "foo"
>      "*** Exception: Cannot decode byte '\x6e':
> Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream

The file has these bytes at offsets 0x55 to 0x5A:

0x4D 0x61 0x72 0x74 0xED 0x6E

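This is clearly not UTF-8: 0xED opens a three-byte sequence and must be
followed by continuation bytes in the range 0x80 to 0xBF, but the next
byte is 0x6E, which is why the decoder complains about '\x6e'. Read as
ISO-8859-1, the same bytes spell "Martín".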

"Martín" in UTF-8 is 0x4D 0x61 0x72 0x74 0xC3 0xAD 0x6E, which is one
byte longer, because "í" (U+00ED) is encoded as the two bytes 0xC3 0xAD.
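
To check the byte arithmetic yourself, here is a minimal sketch. It
assumes the text and bytestring packages and a version of text that
provides decodeLatin1. The names latin1Bytes, asText and utf8Bytes are
made up for illustration, and the bytes are typed in by hand rather
than read from your file.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- The six bytes at offsets 0x55 to 0x5A (hypothetical name).
latin1Bytes :: B.ByteString
latin1Bytes = B.pack [0x4D, 0x61, 0x72, 0x74, 0xED, 0x6E]

-- Decode as ISO-8859-1: each byte becomes the code point of the same value,
-- so this decoder cannot fail.
asText :: T.Text
asText = TE.decodeLatin1 latin1Bytes

-- Re-encode the same text as UTF-8; U+00ED becomes the two bytes 0xC3 0xAD.
utf8Bytes :: B.ByteString
utf8Bytes = TE.encodeUtf8 asText
```

Loading this in GHCi and comparing B.unpack latin1Bytes with B.unpack
utf8Bytes should show the extra byte. If the whole file is ISO-8859-1,
applying decodeLatin1 to the result of Data.ByteString.readFile is one
way to get it into Text.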

As Gregory Collins says, different UTF-8 decoders handle errors
differently. Some abort; others substitute a replacement character
(typically U+FFFD).
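
Both behaviours are available in the text package, so you can pick one
explicitly. A rough sketch, assuming decodeUtf8' and decodeUtf8With
from Data.Text.Encoding and lenientDecode from
Data.Text.Encoding.Error; the names sampleBytes, strict and lenient are
made up for illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (UnicodeException, lenientDecode)

-- The same six ISO-8859-1 bytes as in the sample file (hypothetical name).
sampleBytes :: B.ByteString
sampleBytes = B.pack [0x4D, 0x61, 0x72, 0x74, 0xED, 0x6E]

-- "Abort" style, but reported as a Left value instead of a thrown exception.
strict :: Either UnicodeException T.Text
strict = TE.decodeUtf8' sampleBytes

-- "Fill in" style: invalid input is replaced with U+FFFD.
lenient :: T.Text
lenient = TE.decodeUtf8With lenientDecode sampleBytes
```

Here strict should be a Left, and lenient should start with "Mart" and
contain U+FFFD. Exactly which bytes get replaced may differ between
versions of text, so I would not rely on the precise output.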



