[xmonad] spawn functions are not unicode safe

Khudyakov Alexey alexey.skladnoy at gmail.com
Thu Jan 15 11:04:04 EST 2009


On Thursday 15 January 2009 16:53:49 Roman Cheplyaka wrote:
> RFC 3629 [1] states:
>
>    o  UTF-8 strings can be fairly reliably recognized as such by a
>       simple algorithm, i.e., the probability that a string of
>       characters in any other encoding appears as valid UTF-8 is low,
>       diminishing with increasing string length.
>
> However, no references to the algorithm itself are given.
>
> Google brought me this sample algorithm [2].
> Probably it's worth to implement something like that and include into
> utf8-string if it's not already there.
>
>   1. http://www.ietf.org/rfc/rfc3629.txt
>   2. http://mail.nl.linux.org/linux-utf8/1999-09/msg00110.html

Something like this? (See the isUTF8 code below.) The algorithm is trivial: check
for impossible byte combinations. If there are no such bytes, pairs, etc., the byte
sequence is probably a UTF-8 encoded string.

But the problem is not with decoding Unicode strings, i.e. not with functions like
fromUnicode :: [Word8] -> [Char]
but with encoding strings. A Char represents a Unicode code point, so
everything is OK at that point. However, Unix system calls know nothing about
Unicode and accept (char*), or [Word8] in Haskell terminology.

And the conversion from [Char] to [Word8] is the problem. It arises whenever Haskell
needs to pass a string to the outside world. Currently each Char is simply truncated
to one byte regardless of its value. That is why an `encode' function is
needed. executeFile is not the only function affected.
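
To make the difference concrete, here is a minimal sketch of truncation versus
UTF-8 encoding. The names truncateChar, encodeChar and encode are hypothetical,
chosen for illustration; this is not the code from utf8-string or from xmonad,
only the bit layout described in RFC 3629. It does not reject surrogate code
points (U+D800..U+DFFF), which a real encoder should.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: keep only the low byte of the code point,
-- which is the lossy behaviour described above.
truncateChar :: Char -> Word8
truncateChar = fromIntegral . ord

-- Hypothetical helper: encode one Char as 1 to 4 UTF-8 bytes.
-- Assumption: surrogates are not filtered out here.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. top 6, cont 0]
  | n < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- Payload bits of the lead byte (the guards keep them in range).
    top s  = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six payload bits.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Encode a whole String before handing it to a byte-oriented API.
encode :: String -> [Word8]
encode = concatMap encodeChar
```

For U+03BB (Greek small lambda), encodeChar yields the two bytes 0xCE 0xBB,
while truncateChar yields the single byte 0xBB, which is not valid UTF-8 on
its own.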

> import Control.Monad
> import Data.Word
> import Data.Bits
> import Data.Maybe
> 
> is11,is10,is0x :: Word8 -> Bool
> is11 b = (b `shiftR` 6) == 3
> is10 b = (b `shiftR` 6) == 2
> is0x b = b < 128
> 
> -- Test whether a pair of adjacent bytes is allowed in a UTF-8 encoded string.
> validPair :: Word8 -> Word8 -> Maybe Word8 
> validPair a b = if (b < 254) && not ((is0x a && is10 b) ||
>                                      (is11 a && (not $ is10 b)))
>                 then Just b
>                 else Nothing
> 
> -- Check whether a sequence of bytes is a UTF-8 encoded string. Note that this
> -- check is probabilistic: if the function returns False, the string is
> -- not UTF-8; if it returns True, the string may still fail to decode.
> isUTF8 :: [Word8] -> Bool
> isUTF8 = isJust . foldM validPair 0
> 
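
Because the check above only looks at adjacent pairs, it accepts some
sequences that cannot be decoded, for example a lone lead byte [0xC3] or a
three-byte lead with only one continuation byte [0xE2, 0x82]. A slightly
stricter sketch counts the continuation bytes that each lead byte announces.
The names followers, isCont and isUTF8Strict are hypothetical, and this is
still not a full validator: overlong forms, surrogates and code points above
U+10FFFF are not rejected.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- Number of continuation bytes announced by a lead byte,
-- or Nothing if the byte cannot start a sequence.
followers :: Word8 -> Maybe Int
followers b
  | b < 0x80           = Just 0   -- 0xxxxxxx: ASCII
  | b .&. 0xE0 == 0xC0 = Just 1   -- 110xxxxx
  | b .&. 0xF0 == 0xE0 = Just 2   -- 1110xxxx
  | b .&. 0xF8 == 0xF0 = Just 3   -- 11110xxx
  | otherwise          = Nothing  -- stray continuation byte, 0xF8..0xFF

-- Continuation bytes have the form 10xxxxxx.
isCont :: Word8 -> Bool
isCont b = b .&. 0xC0 == 0x80

-- Accept only if every lead byte is followed by exactly the
-- number of continuation bytes it announces.
isUTF8Strict :: [Word8] -> Bool
isUTF8Strict []     = True
isUTF8Strict (b:bs) = case followers b of
  Nothing -> False
  Just n  ->
    let (cs, rest) = splitAt n bs
    in length cs == n && all isCont cs && isUTF8Strict rest
```

This trades the one-line fold for explicit recursion; whether the extra
strictness is worth it depends on how the result is used.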

