CWString

John Meacham john at repetae.net
Thu Aug 28 03:52:45 EDT 2003


On Thu, Aug 28, 2003 at 05:34:09AM +0100, Glynn Clements wrote:
> John Meacham wrote:
> > > > > > In our new implementation of Data.Char.isUpper and friends, I
> > > > > > made the simplifying assumption that Char==wchar_t==Unicode.
> > > > > > With glibc, this appears to be valid as long as (a) you set LANG
> > > > > > to something other than "C" or "POSIX", and (b) you call
> > > > > > setlocale() first.
> > > > > The glibc Info file says:
> > > > > 	The wide character character set always is UCS4, at least on
> > > > > 	GNU systems.
> > > > yes. with glibc, wchar_t is always unicode no matter what the locale.
> > > > better yet, all ISO C implementations  define a handy C symbol to test
> > > > for this. if __STDC_ISO_10646__ is defined then wchar_t is always
> > > > unicode no matter what.
> > > 
> > > Sure, but as I've been saying, the implementation of glibc doesn't do
> > > this.  In the C or POSIX locale, the ctype macros only recognise ASCII.
> >  
> > > Should this be considered a bug in glibc?
> > 
> > hmm.. how odd. I would consider it a bug, I think. I don't have a copy
> > of the ISO spec handy but will be sure to look up whether that is
> > conforming... It is certainly a malfeature if it is not a bug...
> 
> It certainly isn't a violation of ANSI/ISO C; that simply states that
> "The behavior of these functions is affected by the LC_CTYPE category
> of the current locale". It's perfectly legal for the implementation to
> use different wide encodings depending upon the locale.

No, glibc #defines __STDC_ISO_10646__, so wchar_t values are guaranteed to
hold UCS4 values always, independent of locale. LC_CTYPE only affects
which multibyte encoding is used. What was curious was that the character
classification routines changed behavior based on LC_CTYPE (despite the
encoding still being UCS4).
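
A minimal C sketch of the two points above, assuming a hosted C99
implementation: wchar_is_iso10646 reports whether __STDC_ISO_10646__ is
defined, and is_upper_in (a hypothetical helper, not part of any library)
asks iswupper about one character under a named LC_CTYPE locale, so the
locale dependence can be observed directly. The 256-byte buffer for the
saved locale name is an arbitrary choice for the sketch.

```c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

/* 1 if the implementation promises that wchar_t values are ISO 10646
   code points in every locale, 0 if it makes no such promise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: classify wc with iswupper() under the named
   LC_CTYPE locale, then restore the previous locale.
   Returns 1 or 0, or -1 if the requested locale could not be selected. */
int is_upper_in(const char *locale_name, wint_t wc)
{
    char saved[256];
    const char *current = setlocale(LC_CTYPE, NULL);
    int result;

    /* Copy the name: later setlocale() calls may overwrite the string. */
    snprintf(saved, sizeof saved, "%s", current ? current : "C");

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    result = iswupper(wc) != 0;
    setlocale(LC_CTYPE, saved);
    return result;
}
```

Comparing is_upper_in("C", 0x00C9) with the result under a UTF-8 locale
that happens to be installed (the name varies by system) is one way to see
whether the classification changes while the wchar_t encoding stays the
same.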

This might actually make sense for the classification routines dealing
with upper and lower case, since I believe that might depend on the
language you are expressing. However, other character classification
routines (such as wcwidth) should not depend on the current locale.

It is unclear what the correct thing for a Haskell implementation to do
is. Possibilities are:
1) determine some locale independent semantics for the classification
functions and implement that
2) guarantee the validity of character classification routines only when
the character is representable in the current locale
3) link against another library such as libunicode which provides its
own classification routines (this could be done optionally at compile
time...)

Alternatively, split the classification routines into locale dependent and
independent ones, guarantee the locale independent ones will always work,
and use one of the two above solutions for the rest...
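
As an illustration of possibility 1, here is a minimal C sketch of a
classification function that never consults the locale. The name
unicode_is_upper and the table are hypothetical, and the table only covers
ASCII and Latin-1 to keep it short; a real one would be generated from the
Unicode Character Database.

```c
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of code points. */
struct cp_range { uint32_t first, last; };

/* Illustrative only: uppercase letters in ASCII and Latin-1.
   Must stay sorted and non-overlapping for the binary search below. */
static const struct cp_range upper_ranges[] = {
    { 0x0041, 0x005A },  /* A..Z */
    { 0x00C0, 0x00D6 },  /* A-grave .. O-diaeresis */
    { 0x00D8, 0x00DE },  /* O-stroke .. Thorn (U+00D7 is the multiplication sign) */
};

/* Binary search over the ranges; the current locale is never consulted. */
int unicode_is_upper(uint32_t cp)
{
    size_t lo = 0;
    size_t hi = sizeof upper_ranges / sizeof upper_ranges[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < upper_ranges[mid].first)
            hi = mid;
        else if (cp > upper_ranges[mid].last)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}
```

The same shape would work for the other locale independent predicates,
each with its own table.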

In any case, solution 2 seems to be what we have now, which is probably
an okay interim solution as long as we add an isRepresentable to determine
whether a Char can be expressed in the current locale, and hence whether we
can trust the classification functions... I have an implementation of one
in the CWString library I posted earlier...
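
For illustration only (this is not the CWString code), one way such a
check could look at the C level is to ask wcrtomb whether the character
can be converted to the current locale's multibyte encoding. The name
is_representable is hypothetical, and the sketch assumes wchar_t holds the
code point, as it does when __STDC_ISO_10646__ is defined.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical C-level analogue of isRepresentable: returns 1 if wc can
   be encoded in the multibyte encoding of the current LC_CTYPE locale,
   0 if wcrtomb() reports an encoding error. */
int is_representable(wchar_t wc)
{
    char buf[MB_LEN_MAX];
    mbstate_t state;

    memset(&state, 0, sizeof state);  /* start in the initial shift state */
    return wcrtomb(buf, wc, &state) != (size_t)-1;
}
```

A program that has not called setlocale() runs in the "C" locale, so
there the check is expected to accept ASCII and reject something like
U+20AC.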

Either way, anything is better than the current 'ignore the locale'
situation :)

        John




-- 
---------------------------------------------------------------------------
John Meacham - California Institute of Technology, Alum. - john at foo.net
---------------------------------------------------------------------------


