[Haskell-cafe] regex-pcre is not working with UTF-8

José Romildo Malaquias j.romildo at gmail.com
Wed Aug 22 16:56:58 CEST 2012


On Tue, Aug 21, 2012 at 05:50:44PM -0300, José Romildo Malaquias wrote:
> On Tue, Aug 21, 2012 at 04:05:28PM +0100, Chris Kuklewicz wrote:
> > I do not have time to test this myself right now.  But I will unravel my code a
> > bit for you.
> > 
> > > By November 2011 it worked without problems in my application. Now that
> > > I have resumed developing the application, I have been faced with this
> > > behaviour. As it used to work before, I believe it is a bug in
> > > regex-pcre or libpcre.
> > 
> > I believe it may be a problem in String <-> ByteString conversion.  The "base"
> > library may have changed and your LOCALE information may be different or may be
> > being used differently by "base".
> > 
> > > The (temporary) workaround I found is to convert the strings to
> > > byte-strings before matching, and then convert the results back to
> > > strings. With byte-strings it works well.
> > 
> > That is an excellent sign that it is your LOCALE settings being picked up by
> > GHC's "base" package, see explanation below.
[...]
> I have written an application to test those things. There are 2 source
> files: test.hs and seestr.c, which are attached.
> 
> The test does the following:
> 
>    1. shows the getForeignEncoding
> 
>    2. uses a C function to show the characters from a String (using
>       withCString) and from a ByteString (using useAsCString)
> 
>    3. matches a PCRE regular expression using String and ByteString
> 
> The test is run twice, with different LANG settings, and its output
> follows.
[...]
> As can be seen, regular expression matching does not work with
> en_US.UTF-8. But it works with en_US.ISO-8859-1.
> 
> The test shows that withCString is working as expected too. This
> may suggest the problem is really with regex-pcre.
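
For readers without the attachments, steps 1 and 2 of the test can be
approximated in Haskell alone. This is only a rough sketch, not the
attached test.hs: the helper names (cStringBytes, viaString,
viaByteString, hex) are made up for illustration, the bytes are read
back with peekArray0 instead of the C function from seestr.c, and it
assumes ghc-7.4 or later (for getForeignEncoding) plus the bytestring
package.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Word (Word8)
import Foreign.C.String (CString, withCString)
import Foreign.Marshal.Array (peekArray0)
import Foreign.Ptr (Ptr, castPtr)
import GHC.IO.Encoding (getForeignEncoding)
import Numeric (showHex)

-- Read the bytes of a NUL-terminated C string, excluding the terminator.
cStringBytes :: CString -> IO [Word8]
cStringBytes p = peekArray0 0 (castPtr p :: Ptr Word8)

-- Bytes handed to C when a String is marshalled with withCString
-- (this path depends on the foreign encoding in recent GHC versions).
viaString :: String -> IO [Word8]
viaString s = withCString s cStringBytes

-- Bytes handed to C when the text travels as a ByteString.
-- BC.pack keeps only the low 8 bits of each Char, so no locale is involved.
viaByteString :: String -> IO [Word8]
viaByteString s = B.useAsCString (BC.pack s) cStringBytes

-- Render a byte list as space-separated hexadecimal codes.
hex :: [Word8] -> String
hex = unwords . map (\b -> showHex b "")

main :: IO ()
main = do
  enc <- getForeignEncoding
  putStrLn ("foreign encoding: " ++ show enc)
  viaString "pa\237s" >>= putStrLn . ("String     : " ++) . hex
  viaByteString "pa\237s" >>= putStrLn . ("ByteString : " ++) . hex
```

Running it under different LANG settings should show whether the String
path and the ByteString path hand the same bytes to C.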

The previous tests were run on a Gentoo Linux system with ghc-7.4.1.

I have also run the tests on Fedora 17 with ghc-7.0.4, which does not
have the bug. The sources are attached. The test output follows:

   $ LANG=en_US.ISO-8859-1 && ./test 
   testing with String
   code:       70, char: p
   code:       61, char: a
   code: ffffffed, char: 
   code:       73, char: s
   result: 4

   testing with ByteString
   code:       70, char: p
   code:       61, char: a
   code: ffffffed, char: 
   code:       73, char: s
   result: 4

   regex            : país:(.*)
   text             : país:Brasil
   String match     : [["pa\237s:Brasil","Brasil"]]
   ByteString match : [["pa\237s:Brasil","Brasil"]]


   $ LANG=en_US.UTF-8 && ./test
   testing with String
   code:       70, char: p
   code:       61, char: a
   code: ffffffed, char: 
   code:       73, char: s
   result: 4

   testing with ByteString
   code:       70, char: p
   code:       61, char: a
   code: ffffffed, char: 
   code:       73, char: s
   result: 4

   regex            : país:(.*)
   text             : país:Brasil
   String match     : [["pa\237s:Brasil","Brasil"]]
   ByteString match : [["pa\237s:Brasil","Brasil"]]


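Clearly withCString has changed between ghc-7.0.4 and ghc-7.4.1. Even
under en_US.UTF-8 the String test above shows the single byte ed for
the accented character, so it seems that with ghc-7.0.4 withCString
does not obey the UTF-8 locale and generates a Latin-1 C string.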
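
To make the difference concrete without depending on LANG, the encoding
can be pinned explicitly. The sketch below is only an illustration
under assumptions: bytesWith is a made-up helper, and it relies on
GHC.Foreign.withCString, which takes the TextEncoding as an argument
and should be available in the base shipped with ghc-7.4. The character
í is U+00ED, which is the single byte ed in Latin-1 and the two bytes
c3 ad in UTF-8.

```haskell
import Control.Monad (forM_)
import Data.Word (Word8)
import Foreign.Marshal.Array (peekArray0)
import Foreign.Ptr (Ptr, castPtr)
import qualified GHC.Foreign as GF
import GHC.IO.Encoding (TextEncoding, latin1, utf8)
import Numeric (showHex)

-- Marshal a String to a C string with an explicit encoding and
-- read back the raw bytes (without the terminating NUL).
bytesWith :: TextEncoding -> String -> IO [Word8]
bytesWith enc s =
  GF.withCString enc s $ \p -> peekArray0 0 (castPtr p :: Ptr Word8)

main :: IO ()
main =
  forM_ [("latin1", latin1), ("utf8", utf8)] $ \(name, enc) -> do
    bytes <- bytesWith enc "pa\237s"
    -- Print the encoding name followed by the byte codes in hexadecimal.
    putStrLn (name ++ ": " ++ unwords (map (\b -> showHex b "") bytes))
```

If regex-pcre receives bytes produced the second way but compiles and
matches them as single-byte characters, a mismatch like the one
reported under ghc-7.4.1 would be plausible, though I have not
confirmed that this is what happens inside the library.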

Regards,

Romildo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.hs
Type: text/x-haskell
Size: 1551 bytes
Desc: not available
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20120822/4d9bb102/attachment.hs>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: seestr.c
Type: text/x-c
Size: 202 bytes
Desc: not available
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20120822/4d9bb102/attachment.bin>

