[xmonad] spawn functions are not unicode safe

Gwern Branwen gwern0 at gmail.com
Thu Jan 15 12:04:11 EST 2009


On Thu, Jan 15, 2009 at 11:04 AM, Khudyakov Alexey wrote:
> On Thursday 15 January 2009 16:53:49 Roman Cheplyaka wrote:
>> RFC 3629 [1] states:
>>
>>    o  UTF-8 strings can be fairly reliably recognized as such by a
>>       simple algorithm, i.e., the probability that a string of
>>       characters in any other encoding appears as valid UTF-8 is low,
>>       diminishing with increasing string length.
>>
>> However, no references to the algorithm itself are given.
>>
>> Google brought me this sample algorithm [2].
>> It's probably worth implementing something like that and including it in
>> utf8-string if it's not already there.
>>
>>   1. http://www.ietf.org/rfc/rfc3629.txt
>>   2. http://mail.nl.linux.org/linux-utf8/1999-09/msg00110.html
>
> Something like this? (Code below.) The algorithm is trivial: check for
> impossible byte combinations. If there are no such bytes, pairs, etc., the
> byte sequence is probably a UTF-8 encoded string.
>
> But the problem is not with decoding Unicode strings, i.e. not with functions
> like
> fromUnicode :: [Word8] -> [Char]
> but with encoding strings. A Char represents a Unicode symbol, so
> everything is OK at this point. However, Unix system calls know nothing about
> Unicode and accept (char*), or [Word8] in Haskell terminology.
>
> The conversion from [Char] to [Word8] is the problem. It arises whenever
> Haskell needs to pass a string to the outside world. Currently each Char is
> simply truncated to one byte regardless of its value. That is why an `encode'
> function is needed. executeFile is not the only function affected.
>
>> import Control.Monad
>> import Data.Word
>> import Data.Bits
>> import Data.Maybe
>>
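>> is11,is10,is0x :: Word8 -> Bool
>> is11 b = (b `shiftR` 6) == 3   -- lead byte of a multi-byte sequence (11xxxxxx)
>> is10 b = (b `shiftR` 6) == 2   -- continuation byte (10xxxxxx)
>> is0x b = b < 0x80              -- plain ASCII byte (0xxxxxxx)
>>
>> -- Test whether a pair of adjacent bytes is allowed in a UTF8 encoded string.
>> -- NOTE: part of this condition was lost in the archive; what follows is a
>> -- plausible reconstruction, not necessarily the original. It rejects bytes
>> -- that never occur in UTF-8 (0xF8 and above), a continuation byte after an
>> -- ASCII byte, and a lead byte not followed by a continuation byte.
>> validPair :: Word8 -> Word8 -> Maybe Word8
>> validPair a b = if (b < 0xF8) && not ((is0x a && is10 b) ||
>>                                      (is11 a && (not $ is10 b)))
>>                 then Just b
>>                 else Nothing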
>>
>> -- Check whether a sequence of bytes is a UTF8 encoded string. Note that
>> -- this check is probabilistic. If the function returns False, the string
>> -- is not UTF8. If it returns True, the string may still fail to decode.
>> isUTF8 :: [Word8] -> Bool
>> isUTF8 = isJust . foldM validPair 0
>>
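
The truncation described in the quoted message is easy to reproduce in isolation. Below is a minimal sketch, assuming the truncation behaves like `fromIntegral . ord`, i.e. only the low 8 bits of the code point survive. The names `truncateChar` and `roundTrip` are made up for illustration and are not actual GHC or unix package functions.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical helper: keep only the low 8 bits of the code point,
-- which is what "Char simply truncated to one byte" amounts to.
truncateChar :: Char -> Word8
truncateChar = fromIntegral . ord

-- Truncate every character, then read the bytes back as Latin-1
-- characters to see what the receiving program would be handed.
roundTrip :: String -> String
roundTrip = map (chr . fromIntegral . truncateChar)
```

Under that assumption ASCII passes through unchanged, while Cyrillic я (U+044F) loses its high byte and arrives as 0x4F, the letter "O".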

Perhaps we're over-thinking all this. Is it a problem in any way to
run encodeString over a String that is just normal ASCII (that is, no
funky Unicode)?
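
UTF-8 encodes every code point below 0x80 as the same single byte, so in principle encoding a pure-ASCII String is a no-op. Below is a minimal sketch of that property, assuming the result has the same shape as what encodeString returns (a String holding one byte per Char). The names `encodeCharSketch` and `encodeStringSketch` are stand-ins written for this example, not the utf8-string implementation.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)

-- UTF-8 encode one code point into 1-4 bytes, each byte stored as a Char.
-- Surrogates are not rejected here, to keep the sketch short.
encodeCharSketch :: Char -> String
encodeCharSketch c
  | n < 0x80    = [chr n]
  | n < 0x800   = [chr (0xC0 .|. (n `shiftR` 6)), cont n]
  | n < 0x10000 = [chr (0xE0 .|. (n `shiftR` 12)), cont (n `shiftR` 6), cont n]
  | otherwise   = [ chr (0xF0 .|. (n `shiftR` 18)), cont (n `shiftR` 12)
                  , cont (n `shiftR` 6), cont n ]
  where
    n = ord c
    -- Continuation byte: 10xxxxxx carrying the low six bits of x.
    cont x = chr (0x80 .|. (x .&. 0x3F))

-- Encode a whole String; ASCII characters take the first guard above
-- and come out unchanged.
encodeStringSketch :: String -> String
encodeStringSketch = concatMap encodeCharSketch
```

With this sketch, `encodeStringSketch "xterm"` is `"xterm"`, while Cyrillic я (U+044F) becomes the two bytes 0xD1 0x8F.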

Eric: could we just mindlessly call encodeString on everything going
into spawn/safeSpawn?

--
gwern

