[Haskell-cafe] Strings and utf-8

Reinier Lamers reinier.lamers at phil.uu.nl
Wed Nov 28 07:34:30 EST 2007


Duncan Coutts wrote:

>On Tue, 2007-11-27 at 18:38 +0000, Paul Johnson wrote:
>
>>Brandon S. Allbery KF8NH wrote:
>>
>>>However, the IO system truncates [characters] to 8 bits.
>>
>>Should this be considered a bug?
>
>A design problem.
>
>>I presume that it's because <stdio.h> was defined in the days of
>>ASCII-only strings, and the functions in System.IO are defined in
>>terms of <stdio.h>.  But does this need to be the case in the future?
>
>When it's phrased as "truncates to 8 bits" it sounds so simple; surely
>all we need to do is not truncate to 8 bits, right?
>
>The problem is, what encoding should it pick? UTF-8, 16, 32, EBCDIC? How
>would people specify that they really want to use a binary file?
>Whatever we change, it'll break programs that use the existing meanings.
>
>One sensible suggestion many people have made is that H98 file IO should
>use the locale encoding and do Unicode/String <-> locale conversion. So
>that'd all be text files. Then openBinaryFile would be used for binary
>files. Of course then we'd need control over setting the encoding and
>what to do on encountering encoding errors.
>
Wouldn't it be sensible to stop using the H98 file I/O operations for
binary files altogether? A Char represents a Unicode code point and is
not the right data type for a byte from a binary stream. Anyone who
wants binary I/O would have to use Data.ByteString.* and Data.Binary.
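
To make the Char-versus-byte mismatch concrete, here is a minimal sketch. The helper name truncateToByte is made up for illustration; it only mimics the "truncate to 8 bits" behaviour discussed above and is not the actual code inside System.IO.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: keep only the low 8 bits of a code point,
-- imitating the truncation described earlier in the thread.
truncateToByte :: Char -> Word8
truncateToByte = fromIntegral . ord

main :: IO ()
main = do
  -- '\955' is GREEK SMALL LETTER LAMDA (U+03BB).
  print (ord '\955')            -- 955, does not fit in one byte
  print (truncateToByte '\955') -- 187, the original character is lost
```

A Word8 (or a ByteString of them) cannot lose information this way, because it never claims to hold more than a byte.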

So you would use System.IO.hPutStr to write a text string, and
Data.ByteString.hPutStr to write a sequence of bytes. A good
implementation of the former could probably be built on top of the
latter.
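
A rough sketch of what "text output in terms of byte output" might look like follows. Everything here is an assumption for illustration: the names encodeChar and hPutStrUtf8 are invented, the encoding is hard-wired to UTF-8 rather than taken from the locale, and the file name is arbitrary. It is not how System.IO is implemented.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import qualified Data.ByteString as B
import Data.Char (ord)
import Data.Word (Word8)
import System.IO (Handle, IOMode (WriteMode), withBinaryFile)

-- Hypothetical helper: UTF-8 encode a single code point.
-- Surrogate code points are not rejected here; a real encoder should
-- decide what to do with them.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    hi s   = fromIntegral (n `shiftR` s)                    -- leading bits
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F) -- 10xxxxxx byte

-- Hypothetical text writer layered on the byte writer.
hPutStrUtf8 :: Handle -> String -> IO ()
hPutStrUtf8 h = B.hPutStr h . B.pack . concatMap encodeChar

main :: IO ()
main = do
  withBinaryFile "utf8-demo.out" WriteMode $ \h -> do
    hPutStrUtf8 h "\955x"              -- text, encoded explicitly
    B.hPutStr h (B.pack [0x00, 0xFF])  -- raw bytes, written unchanged
  bytes <- B.readFile "utf8-demo.out"
  print (B.unpack bytes)               -- [206,187,120,0,255]
```

The point of the split is that the encoding decision lives in exactly one place (the text layer), while the byte layer stays encoding-agnostic.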

Reinier

