[Haskell-cafe] getting crazy with character encoding

Andrea Rossato mailing_list at istitutocolli.org
Wed Sep 12 10:18:43 EDT 2007


Suppose that, on a Linux system in a UTF-8 locale, you create a file
whose name contains non-ASCII characters. For instance:
touch abèèè

Now, I would expect the output of a shell command such as
"ls ab*"
to be a string/list of 5 chars. Instead I find it to be a list of 8
chars.

That is to say, each non-ASCII character is read as 2 characters, as
if the string were an ISO-8859-1 string. In fact, the string is
actually treated as an ISO-8859-1 string. But when I print it, it is
displayed correctly.
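
The mismatch between 5 and 8 can be reproduced without any file or shell. What follows is only a minimal sketch: rawName holds the 8 bytes that UTF-8 uses for the name above, one byte per Char, and decodeUtf8Bytes is a hypothetical helper (not a library function) that reassembles the multi-byte sequences. It does not reject overlong or otherwise invalid sequences, so it is an illustration rather than a conforming decoder.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)

-- The 8 bytes UTF-8 uses for the file name "abèèè", one byte per Char.
-- U+00E8 (è) is encoded as the two bytes 195 168 (0xC3 0xA8).
rawName :: String
rawName = "ab\195\168\195\168\195\168"

-- Hypothetical helper: treat each Char as a byte and reassemble UTF-8
-- sequences of 1 to 4 bytes. Anything that does not look like a
-- well-formed sequence is passed through unchanged.
decodeUtf8Bytes :: String -> String
decodeUtf8Bytes [] = []
decodeUtf8Bytes (c : cs)
  | b < 0x80              = c : decodeUtf8Bytes cs
  | b >= 0xC0 && b < 0xE0 = multi 1 (b .&. 0x1F)
  | b >= 0xE0 && b < 0xF0 = multi 2 (b .&. 0x0F)
  | b >= 0xF0 && b < 0xF8 = multi 3 (b .&. 0x07)
  | otherwise             = c : decodeUtf8Bytes cs
  where
    b = ord c
    -- n continuation bytes are expected after the lead byte
    multi n lead =
      let (conts, rest) = splitAt n cs
          code = foldl step lead conts
      in if length conts == n && all isCont conts && code <= 0x10FFFF
           then chr code : decodeUtf8Bytes rest
           else c : decodeUtf8Bytes cs
    -- continuation bytes have the bit pattern 10xxxxxx
    isCont x = ord x .&. 0xC0 == 0x80
    -- each continuation byte contributes its low 6 bits
    step acc x = (acc `shiftL` 6) .|. (ord x .&. 0x3F)

main :: IO ()
main = do
  print (length rawName)                    -- 8: one Char per byte
  print (length (decodeUtf8Bytes rawName))  -- 5: one Char per code point
```

Printing rawName with putStrLn on a byte-oriented handle writes those same 8 bytes back out, which a UTF-8 terminal would then render as the original name; that would be consistent with the string looking right when printed while having the wrong length.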

I don't understand what's wrong and, what is worse, I don't understand
what I should be studying to understand what I'm doing wrong.

After reading about character encoding and the way the Linux kernel
manages file names, I would expect that a file name set in a UTF-8
locale would be read by a locale-aware application as a UTF-8 string,
with each character a Unicode code point that can be represented by a
Haskell Char. What's wrong with that?

Thanks for your kind attention.


Here is the code to test my problem. Before creating the file, remember
to set the LANG environment variable. Something like:
export LANG="en_US.utf8" 
should be fine. (Check your available locales with "locale -a")

import System.Process
import System.IO
import Control.Monad

main = do
  l <- fmap lines $ runProcessWithInput "/bin/bash" [] "ls ab*"
  putStrLn (show l)
  mapM_ putStrLn l
  mapM_ (putStrLn . show . length) l

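-- Run cmd with args, feed it input on stdin, and return everything it
-- writes to stdout.
runProcessWithInput :: FilePath -> [String] -> String -> IO String
runProcessWithInput cmd args input = do
  (pin, pout, perr, ph) <- runInteractiveProcess cmd args Nothing Nothing
  hPutStr pin input
  hClose pin
  output <- hGetContents pout
  -- hGetContents is lazy: force the whole output before closing the handle
  when (output==output) $ return ()
  hClose pout
  hClose perr
  waitForProcess ph
  return output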
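
One possible way to get 5 chars per name is to tell the handles which encoding to use. The following is a sketch under stated assumptions, not a definitive fix: it assumes a GHC whose base library (4.2 or later) exports hSetEncoding and utf8 from System.IO, and the name runProcessWithInputUtf8 is made up for this example. Setting the encoding explicitly makes the result independent of the current locale.

```haskell
import System.IO
import System.Process (runInteractiveProcess, waitForProcess)

-- Hypothetical variant of runProcessWithInput: identical, except that the
-- pipes to and from the child process are explicitly marked as UTF-8.
-- Assumes base >= 4.2 (hSetEncoding and utf8 in System.IO).
runProcessWithInputUtf8 :: FilePath -> [String] -> String -> IO String
runProcessWithInputUtf8 cmd args input = do
  (pin, pout, perr, ph) <- runInteractiveProcess cmd args Nothing Nothing
  hSetEncoding pin utf8    -- encode what we send to the child
  hSetEncoding pout utf8   -- decode what the child sends back
  hPutStr pin input
  hClose pin
  output <- hGetContents pout
  -- force the lazy output before closing the handle
  length output `seq` return ()
  hClose pout
  hClose perr
  _ <- waitForProcess ph
  return output

main :: IO ()
main = do
  hSetEncoding stdout utf8
  l <- fmap lines (runProcessWithInputUtf8 "/bin/bash" [] "ls ab*")
  mapM_ putStrLn l
  mapM_ (print . length) l
```

With the handles decoded this way, each è should arrive as a single Char, so length counts code points rather than bytes.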
