[Haskell-cafe] Core packages and locale support

Fri Jun 25 17:05:47 EDT 2010

* Jason Dagit <dagit at codersbase.com> [2010-06-25 10:09:21-0700]
> On Thu, Jun 24, 2010 at 11:42 PM, Roman Cheplyaka <roma at ro-che.info> wrote:
> 
> > * Jason Dagit <dagit at codersbase.com> [2010-06-24 20:52:03-0700]
> > > On Sat, Jun 19, 2010 at 1:06 AM, Roman Cheplyaka <roma at ro-che.info> wrote:
> > >
> > > > While ghc 6.12 finally has proper locale support, core packages (such as
> > > > unix) still use withCString and therefore work incorrectly when argument
> > > > (e.g. file path) is not ASCII.
> > > >
> > >
> > > Pardon me if I'm misunderstanding withCString, but my understanding of unix
> > > paths is that they are to be treated as strings of bytes.  That is, unlike
> > > windows, they do not have an encoding predefined.  Furthermore, you could
> > > have two filepaths in the same directory with different encodings due to
> > > this.
> > >
> > > In this case, what would be the correct way of handling the paths?
> > > Converting to a Haskell String would require knowing the encoding, right?
> > > My reasoning is that Haskell Char type is meant to correspond to code
> > > points so putting them into a string means you have to know their code point
> > > which is different from their (multi-)byte value right?
> > >
> > > Perhaps I have some details wrong?  If so, please clarify.
> >
> > Jason,
> >
> > you got everything right here. So, as you said, there is a mismatch
> > between representation in Haskell (list of code points) and
> > representation in the operating system (list of bytes), so we need to
> > know the encoding. Encoding is supplied by the user via locale
> > (https://secure.wikimedia.org/wikipedia/en/wiki/Locale), particularly
> > LC_CTYPE variable.
> >
> > The problem with encodings is not new -- it was already solved e.g. for
> > input/output.
> >
> 
> This is the part where I don't understand the problem well.  I thought that
> with IO the program assumes the locale of the environment but that with
> filepaths you don't know what locale (more specifically which encoding) they
> were created with.  So if you try to treat them as having the locale of the
> current environment you run the risk of misunderstanding their encoding.

Sure you do. But there is no other source of encoding information apart
from the current locale. So UNIX (currently) puts the responsibility for
keeping filename encodings consistent with the locale on the user.
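
To make the mismatch between code points and bytes concrete, here is a
minimal sketch. It assumes a modern GHC whose base package provides
GHC.Foreign, where the marshalling functions take the encoding as an
explicit argument (this module did not exist in 6.12). The helper names
encodeString and decodeBytes are made up for this illustration.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Array (peekArray0, withArray0)
import Foreign.Ptr (castPtr)
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (TextEncoding, latin1, utf8)

-- Hypothetical helper: the bytes the OS would see for a String
-- when it is marshalled under the given encoding.
encodeString :: TextEncoding -> String -> IO [Word8]
encodeString enc s =
  GHC.withCString enc s $ \cstr -> peekArray0 0 (castPtr cstr)

-- Hypothetical helper: interpret raw bytes (e.g. a file name as stored
-- on disk) under the given encoding. The bytes must not contain 0.
decodeBytes :: TextEncoding -> [Word8] -> IO String
decodeBytes enc bytes =
  withArray0 0 bytes $ \p -> GHC.peekCString enc (castPtr p)

demo :: IO ()
demo = do
  asUtf8   <- encodeString utf8   "\233"   -- U+00E9, e with acute accent
  asLatin1 <- encodeString latin1 "\233"
  print asUtf8     -- [195,169]
  print asLatin1   -- [233]
  -- Same bytes, wrong guess about the encoding: two Chars instead of one.
  wrong <- decodeBytes latin1 asUtf8
  print (length wrong)
```

Calling demo from main shows that one Char becomes different byte
sequences under different encodings, and that decoding with the wrong
encoding silently yields a different String. A program that receives only
the bytes cannot tell which reading was intended; it can only follow the
locale.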

It's hard to give convincing examples demonstrating these semantics,
because UNIX userspace is mostly written in C, where char is just a
byte, so most programs don't bother with encoding and decoding.

The difference between IO and filenames is vague -- what if you pipe ls(1)
to some program? Since ls does no recoding, encoding filenames
differently from the locale is a bad idea.
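
As a rough illustration of how closely the two are tied, the sketch below
asks the GHC runtime which encodings it derived from the environment. It
assumes a modern base package that exports both getLocaleEncoding and
getFileSystemEncoding from GHC.IO.Encoding; the names encodingNames and
report are hypothetical.

```haskell
import GHC.IO.Encoding (getFileSystemEncoding, getLocaleEncoding)

-- Names of the encodings the runtime picked up from the environment:
-- the first is the default for text-mode handles, the second is the one
-- used when marshalling file paths and similar OS strings.
encodingNames :: IO (String, String)
encodingNames = do
  loc <- getLocaleEncoding
  fs  <- getFileSystemEncoding
  return (show loc, show fs)

-- Print both names so they can be compared under different LC_* settings.
report :: IO ()
report = do
  (loc, fs) <- encodingNames
  putStrLn ("locale encoding:      " ++ loc)
  putStrLn ("file system encoding: " ++ fs)
```

Running report under different LC_CTYPE settings should show both names
following the locale, which matches the point above: the output of ls and
the file names it prints are expected to be in the same encoding.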

By the way, GTK (which internally uses UTF-8 for strings) treats this
problem differently -- it has a special variable G_FILENAME_ENCODING and
also G_BROKEN_FILENAMES (which means that filenames are encoded as the
locale says). I have no clue how their G_* variables are better than our
conventional LC_* variables, though.
http://www.gtk.org/api/2.6/glib/glib-Character-Set-Conversion.html

-- 
Roman I. Cheplyaka :: http://ro-che.info/
"Don't let school get in the way of your education." - Mark Twain