[Haskell-cafe] Encoding of Haskell source files

Michael Snoyman michael at snoyman.com
Mon Apr 4 12:22:02 CEST 2011


Firstly, I personally would love to insist on using UTF-8 and be done with
it. I see no reason to bother with other character encodings.

2011/4/4 Roel van Dijk <vandijk.roel at gmail.com>

> 2011/4/4 Colin Adams <colinpauladams at googlemail.com>:
> > Not from looking with your eyes perhaps. Does that matter? Your text
> editor,
> > and the compiler, can surely figure it out for themselves.
> I am not aware of any algorithm that can reliably infer the character
> encoding used by just looking at the raw data. Why would people bother
> with stuff like <?xml version="1.0" encoding="UTF-8"?> if
> automatically figuring out the encoding was easy?
>
> There *is* an algorithm for determining the encoding of an XML file based
on a combination of the BOM (Byte Order Marker) and an assumption that the
file will start with a XML declaration (i.e., <?xml ... ?>). But this isn't
capable of determining every possible encoding on the planet, just
distinguishing amongst varieties of UTF-(8|16|32)/(big|little) endian and
EBCIDC. It cannot tell the difference between UTF-8, Latin-1, and
Windows-1255 (Hebrew), for example.


> > There aren't many Unicode encoding formats
> From casually scanning some articles about encodings I can count at
> least 70 character encodings [1].
>
> I think the implication of "Unicode encoding formats" is something in the
UTF family. An encoding like Latin-1 or Windows-1255 can be losslessly
translated into Unicode codepoints, but it's not exactly an encoding of
Unicode, but rather a subset of Unicode.


> > and there aren't very many possibilities for the
> > leading characters of a Haskell source file, are there?
> Since a Haskell program is a sequence of Unicode code points the
> programmer can choose from up to 1,112,064 characters. Many of these
> can legitimately be part of the interface of a module, as function
> names, operators or names of types.
>
> My guess is that a large subset of Haskell modules start with one of left
brace (starting with comment or language pragma), m (for starting with
module), or some whitespace character. So it *might* be feasible to take a
guess at things. But as I said before: I like UTF-8. Is there anyone out
there who has a compelling reason for writing their Haskell source in
EBCDIC?

Michael
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20110404/d0748519/attachment.htm>


More information about the Haskell-Cafe mailing list