[Haskell-cafe] UTF-8 BOM

Wed Jan 5 02:41:36 CET 2011

On Tue, Jan 4, 2011 at 7:08 PM, Tony Morris <tonymorris at gmail.com> wrote:
> I am reading files with System.IO.readFile. Some of these files start
> with a UTF-8 Byte Order Marker (0xef 0xbb 0xbf). For some functions that
> process this String, this causes choking so I drop the BOM as shown
> below. This feels particularly hacky, but I am not in control of many of
> these functions (that perhaps could use ByteString with a better solution).
>
> I'm wondering if there is a better way of achieving this goal. Thanks
> for any tips.
>
>
> dropBOM ::
>  String
>  -> String
> dropBOM [] =
>  []
> dropBOM s@(x:xs) =
>  let unicodeMarker = '\65279' -- U+FEFF, the BOM as a decoded Char
>  in if x == unicodeMarker then xs else s
>
> readBOMFile ::
>  FilePath
>  -> IO String
> readBOMFile p =
>  dropBOM `fmap` readFile p
>

Are you thinking that the BOM should be automatically stripped from
UTF-8 text at some low level, if present?

I was thinking about it, and I was deeply conflicted about the idea.
Then I read the unicode.org BOM FAQ[1], and I'm still conflicted.

I'm thinking that it would be correct behavior to drop the BOM from
the start of a UTF-8 stream, even at a pretty low level. The FAQ seems
to allow a leading BOM as a signature that identifies the stream as
UTF-8 (although it isn't a reliable means of identifying one).
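
As a rough sketch of what dropping the BOM at the handle level could
look like: GHC's System.IO exports a utf8_bom encoding, which decodes
like utf8 but ignores a BOM at the start of the input. The helper name
readFileSkippingBOM below is made up for illustration, and the sketch
assumes the files really are UTF-8.

```haskell
import System.IO

-- Hypothetical helper (not part of any library): read a file as UTF-8,
-- letting the decoder skip a leading BOM (EF BB BF) if one is present.
readFileSkippingBOM :: FilePath -> IO String
readFileSkippingBOM path = do
  h <- openFile path ReadMode
  -- utf8_bom behaves like utf8, except that a BOM at the very start of
  -- the input is ignored rather than decoded to U+FEFF.
  hSetEncoding h utf8_bom
  -- Lazy read: the handle is closed once the contents are fully consumed.
  hGetContents h
```

With this, the String handed to the downstream functions should never
start with '\65279', so a separate dropBOM pass would not be needed.
Files without a BOM are read unchanged.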

But I'm no Unicode expert.

Antoine

[1] http://unicode.org/faq/utf_bom.html
