[Haskell-beginners] hGetContents, unicode and linux

Michael Snoyman michael at snoyman.com
Sun Nov 28 08:53:47 EST 2010


On Sun, Nov 28, 2010 at 10:35 AM, Yitzchak Gale <gale at sefer.org> wrote:
> I wrote:
>>> In any case, you still need to have the correct encoding
>>> set on the handles as before.
>
> Michael Snoyman wrote:
>> ...it does *not* address invalid byte sequences (AFAIK),
>> which can be dealt with using the bytestring/text decoding
>> combination.
>
> Well, using the standard interface, you have three choices
> on how to handle invalid byte sequences - drop them,
> use a replacement character, or throw an exception, with
> the third choice being the default. You specify that choice
> when you set the encoding. See the documentation for
> System.IO for more details.
>
> However, those choices are implemented via GNU iconv,
> so on Windows you only have the default behavior.
>
> Also, in certain special situations - like if you need to be able
> to specify the replacement character yourself, or if you need
> in-band exceptions (e.g. a stream of Either error character),
> then the options do seem limited currently.
>
> You might still need to fall back on the old bytestring hack
> in those cases. If you find yourself in that situation, it might
> be a good idea to push the maintainers of System.IO and
> Data.Text to continue to improve support for encodings in the
> standard libraries.
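
For the in-band case, here is a minimal sketch of the bytestring/text
combination I had in mind. It assumes a reasonably recent text package
(decodeUtf8' is newer than decodeUtf8With), and the helper names
decodeStrict, decodeLenient and decodeWith are made up for illustration.
Note that decodeUtf8' gives one Either for the whole input, not a stream
of Either values per character.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', decodeUtf8With)
import Data.Text.Encoding.Error (UnicodeException, lenientDecode, replace)

-- In-band error: Left if the input contains an invalid byte sequence.
decodeStrict :: B.ByteString -> Either UnicodeException T.Text
decodeStrict = decodeUtf8'

-- Replace each invalid byte with U+FFFD, the Unicode replacement character.
decodeLenient :: B.ByteString -> T.Text
decodeLenient = decodeUtf8With lenientDecode

-- Replace each invalid byte with a caller-chosen character.
decodeWith :: Char -> B.ByteString -> T.Text
decodeWith c = decodeUtf8With (replace c)

main :: IO ()
main = do
  -- Sample input (an assumption): 'a', the invalid byte 0xFF, 'b'.
  let bad = B.pack [0x61, 0xFF, 0x62]
  putStrLn (either (const "invalid input") T.unpack (decodeStrict bad))
  print (decodeLenient bad)
  print (decodeWith '?' bad)
```

In real code the ByteString would come from B.readFile or B.hGetContents
rather than a literal.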

I hadn't realized that the standard libraries offered so much
sophistication in their approach to file encodings. I'll have to look
into it more thoroughly.
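
From a first read of the System.IO documentation, the choice appears to
be made with an iconv-style suffix on the encoding name passed to
mkTextEncoding. A rough sketch, assuming a GHC version and platform where
these suffixes are supported; readUtf8 and the sample file name are made
up for illustration:

```haskell
import Control.Exception (evaluate)
import System.IO

-- Read a whole file as UTF-8. The suffix selects the handling of invalid
-- byte sequences, as described in the mkTextEncoding documentation:
--   ""            throw an exception (the default)
--   "//IGNORE"    drop them
--   "//TRANSLIT"  substitute a replacement character
readUtf8 :: String -> FilePath -> IO String
readUtf8 suffix path = do
  enc <- mkTextEncoding ("UTF-8" ++ suffix)
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc
    s <- hGetContents h
    _ <- evaluate (length s)  -- force the lazy read before the handle closes
    return s

main :: IO ()
main = do
  -- Hypothetical sample file containing 'a', the invalid byte 0xFF, 'b'.
  let path = "invalid-utf8.txt"
  withBinaryFile path WriteMode $ \h -> hPutStr h ['a', '\xFF', 'b']
  readUtf8 "//IGNORE" path >>= print
  readUtf8 "//TRANSLIT" path >>= print
```

Calling readUtf8 "" on the same file should raise an IOException instead,
which is the default behavior mentioned above.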

Michael

