[xmonad] spawn functions are not unicode safe

Gwern Branwen gwern0 at gmail.com
Thu Jan 15 12:04:11 EST 2009

On Thu, Jan 15, 2009 at 11:04 AM, Khudyakov Alexey wrote:
> On Thursday 15 January 2009 16:53:49 Roman Cheplyaka wrote:
>> RFC 3629 [1] states:
>>    o  UTF-8 strings can be fairly reliably recognized as such by a
>>       simple algorithm, i.e., the probability that a string of
>>       characters in any other encoding appears as valid UTF-8 is low,
>>       diminishing with increasing string length.
>> However, no references to the algorithm itself are given.
>> Google brought me this sample algorithm [2].
>> It is probably worth implementing something like that and including it in
>> utf8-string, if it's not already there.
>>   1. http://www.ietf.org/rfc/rfc3629.txt
>>   2. http://mail.nl.linux.org/linux-utf8/1999-09/msg00110.html
> Something like this? (code below) The algorithm is trivial: check for
> impossible byte combinations. If there are no such bytes, pairs, etc., the
> byte sequence is probably a UTF-8 encoded string.
> But the problem is not with decoding Unicode strings, i.e. not with
> functions like
> fromUnicode :: [Word8] -> [Char]
> but with encoding them. A Char represents a Unicode symbol, so
> everything is OK at this point. However, Unix system calls know nothing about
> Unicode and accept (char*), or [Word8] in Haskell terminology.
> The conversion from [Char] to [Word8] is the problem. It arises whenever
> Haskell needs to pass a string to the outside world. Currently each Char is
> simply truncated to one byte regardless of its value. That is why an `encode'
> function is needed. executeFile is not the only function affected.
>> import Control.Monad
>> import Data.Word
>> import Data.Bits
>> import Data.Maybe
>> is11,is10,is0x :: Word8 -> Bool
>> is11 b = (b `shiftR` 6) == 3
>> is10 b = (b `shiftR` 6) == 2
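>> is0x b = b < 0x80
>>
>> -- Test if pair allowed in UTF8 encoded string.
>> -- (Part of the condition below was lost in the archive; the 0xFE bound
>> -- and the first clause are a reconstruction.)
>> validPair :: Word8 -> Word8 -> Maybe Word8
>> validPair a b = if (b < 0xFE) && not ((is0x a && is10 b) ||
>>                                      (is11 a && (not $ is10 b)))
>>                 then Just b
>>                 else Nothing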
>> -- Check whether a sequence of bytes is a UTF8 encoded string. Note that
>> -- this check is probabilistic. If the function returns False, the string
>> -- is not UTF8. If it returns True, the string may still fail to decode.
>> isUTF8 :: [Word8] -> Bool
>> isUTF8 = isJust . foldM validPair 0

Perhaps we're over-thinking all this. Is it a problem in any way to
run encodeString over a String that is just normal ASCII (that is, no
funky Unicode)?

Eric: could we just mindlessly call encodeString on everything going
into spawn/safeSpawn?
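
To make the question concrete, here is a minimal sketch of what "encode everything" would amount to. It uses only base: utf8Encode is a hand-written stand-in for encodeString from utf8-string (Codec.Binary.UTF8.String), which I'd expect to behave the same way, and spawnEncoded is a hypothetical wrapper that takes the spawn function as an argument, so nothing here depends on xmonad itself.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)

-- | UTF-8 encode a String. Every Char of the result is in the range
-- 0..255, so truncating each Char to one byte afterwards loses nothing.
-- Stand-in for encodeString; not the utf8-string implementation.
utf8Encode :: String -> String
utf8Encode = concatMap (map chr . encodeChar . ord)
  where
    encodeChar :: Int -> [Int]
    encodeChar c
      | c < 0x80    = [c]                                 -- ASCII: unchanged
      | c < 0x800   = [ 0xC0 .|. (c `shiftR` 6), cont c ]
      | c < 0x10000 = [ 0xE0 .|. (c `shiftR` 12)
                      , cont (c `shiftR` 6), cont c ]
      | otherwise   = [ 0xF0 .|. (c `shiftR` 18)
                      , cont (c `shiftR` 12)
                      , cont (c `shiftR` 6), cont c ]
    -- Continuation byte: 10xxxxxx carrying the low six bits of x.
    cont :: Int -> Int
    cont x = 0x80 .|. (x .&. 0x3F)

-- | Hypothetical wrapper: encode the command, then hand it to whatever
-- spawn-like function is supplied (spawn, safeSpawn's program name, ...).
spawnEncoded :: (String -> m a) -> String -> m a
spawnEncoded spawnFn = spawnFn . utf8Encode
```

Plain ASCII goes through the first guard untouched, so utf8Encode "xterm -e mutt" is the same string it started as. The one thing to watch for is encoding twice: a string that has already been encoded contains Chars in 0x80..0xFF, and a second pass would expand each of those into two bytes.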
