Proposal: Define UTF-8 to be the encoding of Haskell source files

Tue Apr 5 00:48:25 CEST 2011

Per the Haskell Prime process, I would like to make an official
proposal [1].

* Proposal

The Haskell 2010 language specification states that "Haskell uses the
Unicode character set" [2], but it does not say which encoding should
be used. Strictly speaking, this means that Haskell source files
cannot be reliably exchanged at the byte level.
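
To illustrate the ambiguity, here is a minimal sketch that uses only
base. The helper names encodeLatin1 and encodeUtf8 are made up for
this example and are not part of any library. It encodes the same
character under ISO-8859-1 and under UTF-8 and shows that the bytes
differ, so a reader who guesses the wrong encoding sees different
source text.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: ISO-8859-1 covers only U+0000..U+00FF,
-- using exactly one byte per character.
encodeLatin1 :: Char -> Maybe [Word8]
encodeLatin1 c
  | n <= 0xFF = Just [fromIntegral n]
  | otherwise = Nothing
  where
    n = ord c

-- Hypothetical helper: UTF-8 encoding of a single character.
-- For brevity it does not reject surrogate code points.
encodeUtf8 :: Char -> [Word8]
encodeUtf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading bits of the code point, placed in the lead byte.
    hi s = fromIntegral (n `shiftR` s)
    -- Six payload bits in a continuation byte of the form 10xxxxxx.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

main :: IO ()
main = do
  -- U+00E9 (e with acute accent), written as an escape so that this
  -- file itself stays ASCII-only.
  print (encodeLatin1 '\xE9')  -- prints Just [233]
  print (encodeUtf8 '\xE9')    -- prints [195,169]
```

The single byte 233 (0xE9) is a complete character in ISO-8859-1 but
an incomplete sequence in UTF-8, which is why the encoding has to be
fixed for byte-level exchange to be reliable.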

I propose to make UTF-8 the only allowed encoding for Haskell source
files. Implementations must discard an initial Byte Order Mark (BOM)
if present [3].
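
As a rough sketch of what this could mean for an implementation, the
following base-only code drops a leading UTF-8 BOM (the bytes EF BB
BF) and then decodes the rest strictly, rejecting anything that is
not well-formed UTF-8. The names utf8Bom, stripBom, decodeUtf8 and
decodeSource are hypothetical, and a list of bytes is used only to
keep the example self-contained. It is not how any existing compiler
reads its input.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.List (stripPrefix)
import Data.Maybe (fromMaybe)
import Data.Word (Word8)

-- The UTF-8 encoding of U+FEFF, the byte order mark.
utf8Bom :: [Word8]
utf8Bom = [0xEF, 0xBB, 0xBF]

-- Discard one initial BOM if present; otherwise return the input.
stripBom :: [Word8] -> [Word8]
stripBom bytes = fromMaybe bytes (stripPrefix utf8Bom bytes)

-- Strict UTF-8 decoder: fails on bad lead bytes, truncated or
-- malformed continuation bytes, overlong forms, surrogates, and
-- code points above U+10FFFF.
decodeUtf8 :: [Word8] -> Either String String
decodeUtf8 [] = Right []
decodeUtf8 (b : bs)
  | b < 0x80               = (chr (fromIntegral b) :) <$> decodeUtf8 bs
  | b >= 0xC2 && b <= 0xDF = multi 1 (b .&. 0x1F) 0x80
  | b >= 0xE0 && b <= 0xEF = multi 2 (b .&. 0x0F) 0x800
  | b >= 0xF0 && b <= 0xF4 = multi 3 (b .&. 0x07) 0x10000
  | otherwise              = Left "invalid lead byte"
  where
    -- n: number of continuation bytes, leadBits: payload of the lead
    -- byte, minCp: smallest code point allowed for this length.
    multi :: Int -> Word8 -> Int -> Either String String
    multi n leadBits minCp =
      let (conts, rest) = splitAt n bs
          step acc c = (acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F)
          cp :: Int
          cp = foldl step (fromIntegral leadBits) conts
      in if length conts /= n || any (\c -> c .&. 0xC0 /= 0x80) conts
           then Left "truncated or invalid continuation byte"
           else if cp < minCp || cp > 0x10FFFF
                     || (cp >= 0xD800 && cp <= 0xDFFF)
             then Left "overlong form, surrogate, or out of range"
             else (chr cp :) <$> decodeUtf8 rest

-- Strip the BOM at the byte level, then decode.
decodeSource :: [Word8] -> Either String String
decodeSource = decodeUtf8 . stripBom

main :: IO ()
main = do
  print (decodeSource [0xEF, 0xBB, 0xBF, 0x78])  -- prints Right "x"
  -- A lone 0xE9 (Latin-1 e-acute) is not valid UTF-8, so this is a Left.
  print (decodeSource [0xE9])
```

Stripping the BOM before decoding keeps U+FEFF out of the character
stream that the lexer sees. Only one leading BOM is removed, matching
the wording "an initial Byte Order Mark".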

* Pros

- Ensures that Haskell source can be reliably exchanged at the byte
  level.
- Disallows implicit ISO-8859-* encodings in source code, ensuring
  portability.
- Little or no implementation burden for compiler writers.

* Cons

- Existing code relying on a non-UTF-8, locale- or
  implementation-specific encoding will need conversion. (This is only
  relevant for Hugs-only code.)

* Implementation status

** GHC
"GHC assumes that source files are ASCII or UTF-8 only, other
encodings are not recognised. However, invalid UTF-8 sequences will be
ignored in comments, so it is possible to use other encodings such as
Latin-1, as long as the non-comment source code is ASCII only." [4]