UTF-8

From HaskellWiki

(Difference between revisions)
(question - what about other string encodings?)
 
(4 intermediate revisions by one user not shown)
[[Category:Code]]
A small example showing how to read and write UTF-8 in Haskell.

Do whatever you want; it's going in the public domain (Eric Kow on 2007-02-02 says so, anyway)

<haskell>
> module Main where
>
> import Control.Monad (mapM_)
> import Data.Word (Word8)
> import Foreign.Marshal.Array (allocaArray, peekArray, pokeArray)
> import System.Environment (getArgs)
> import System.IO (hFileSize, Handle, hGetBuf, hPutBuf, openBinaryFile, hClose,
>                   IOMode(ReadMode, WriteMode))
</haskell>

We're going to be using the 2002 UTF-8 implementation by Sven Moritz Hallberg. It happens to be the one that darcs uses ( http://abridgegame.org/repos/darcs/UTF8.lhs ). Note that Pugs also has a UTF-8 library of its own, which I believe handles ByteStrings, but I'm sticking with this one because it's what I know.
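
To make concrete what an encode function of this kind has to do, here is a minimal sketch of UTF-8 encoding written directly from the standard bit layout. It is not Hallberg's code: the names encodeChar and encodeString are made up for illustration, and unlike a real library it does no validation (for instance, it will happily encode surrogate code points).

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Encode one code point as 1 to 4 UTF-8 bytes (hypothetical helper).
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]                           -- 0xxxxxxx
  | n < 0x800   = [lead 0xC0 6, cont 0]                      -- 110xxxxx 10xxxxxx
  | n < 0x10000 = [lead 0xE0 12, cont 6, cont 0]             -- 1110xxxx + 2 continuation bytes
  | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]    -- 11110xxx + 3 continuation bytes
  where
    n = ord c
    -- Leading byte: a length marker plus the top bits of the code point.
    lead marker s = fromIntegral (marker .|. (n `shiftR` s))
    -- Continuation byte: 10xxxxxx carrying six payload bits.
    cont s = fromIntegral (0x80 .|. ((n `shiftR` s) .&. 0x3F))

-- | Encode a whole string by encoding each character in turn.
encodeString :: String -> [Word8]
encodeString = concatMap encodeChar
```

In GHCi, encodeChar '\x20AC' (the euro sign) gives the three bytes 0xE2, 0x82, 0xAC, while plain ASCII characters come out as a single byte each.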

<haskell>
> import UTF8
</haskell>

What we want to show is that we can both read and write UTF-8. We do this by reading a file in, reversing every one of its lines, and writing it back out with the extra extension '.rev'. We'll do this for every filename that is passed in on the command line.

<haskell>
> main :: IO ()
> main =
>  do args <- getArgs
>     mapM_ reverseUTF8File args
>
> reverseUTF8File :: FilePath -> IO ()
> reverseUTF8File f =
>   do fb <- readFileBytes f
>      case decode fb of
>        (cs, []) -> writeFileBytes (f ++ ".rev") $ encode $ reverseLines cs
>        (_, xs)  -> fail $ show xs   -- report any decoding errors
>   where
>     reverseLines = unlines . map reverse . lines
</haskell>

For this to work, we need to have some helper functions for reading and writing [Word8]. It would be nice if there were some standard functions for reading and writing [Word8] in files. (Note: I grabbed half of this off a post on one of the Haskell mailing lists)

<haskell>
> readFileBytes :: FilePath -> IO [Word8]
> readFileBytes f =
>   do h <- openBinaryFile f ReadMode
>      hsize <- fromIntegral `fmap` hFileSize h
>      ws <- hGetBytes h hsize
>      hClose h   -- safe here: peekArray has already copied the bytes out
>      return ws
>
> writeFileBytes :: FilePath -> [Word8] -> IO ()
> writeFileBytes f ws =
>   do h <- openBinaryFile f WriteMode
>      hPutBytes h (length ws) ws
>      hClose h
>
> -- Read up to c bytes from the handle into a temporary buffer.
> hGetBytes :: Handle -> Int -> IO [Word8]
> hGetBytes h c = allocaArray c $ \p ->
>   do c' <- hGetBuf h p c
>      peekArray c' p
>
> -- Copy the bytes into a temporary buffer and write them to the handle.
> hPutBytes :: Handle -> Int -> [Word8] -> IO ()
> hPutBytes h c ws = allocaArray c $ \p ->
>   do pokeArray p ws
>      hPutBuf h p c
</haskell>
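
The wish for standard functions for reading and writing [Word8] is arguably met by the bytestring package. As a sketch, assuming bytestring is available (it ships with GHC), the two file helpers could be written as follows; the names simply mirror the hand-written versions and are not part of any library.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- | Read a whole file as a list of bytes. B.readFile is strict, so the
-- file is fully read and closed before the list is returned.
readFileBytes :: FilePath -> IO [Word8]
readFileBytes = fmap B.unpack . B.readFile

-- | Write a list of bytes to a file, replacing any previous contents.
writeFileBytes :: FilePath -> [Word8] -> IO ()
writeFileBytes path = B.writeFile path . B.pack
```

Going through a list of Word8 is convenient for a small example like this one, though for large files it is probably better to keep the data as a ByteString throughout.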

Latest revision as of 02:22, 22 July 2008


The simplest solution seems to be to use the utf8-string package ( http://hackage.haskell.org/cgi-bin/hackage-scripts/package/utf8-string ) from Galois. It provides a drop-in replacement for System.IO.

What about other string encodings?
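
One possible answer, assuming a GHC whose System.IO exports hSetEncoding (base 4.2 or later, which is newer than this page): set the encoding on the handle and let the runtime do the conversion. The sketch below uses only base; readFileAs, writeFileAs and transcode are hypothetical names chosen for this example.

```haskell
import System.IO

-- | Read a whole text file, decoding it with the given encoding.
readFileAs :: TextEncoding -> FilePath -> IO String
readFileAs enc path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc
    s <- hGetContents h
    -- Force the whole string before withFile closes the handle.
    length s `seq` return s

-- | Write a string to a file, encoding it with the given encoding.
writeFileAs :: TextEncoding -> FilePath -> String -> IO ()
writeFileAs enc path s =
  withFile path WriteMode $ \h -> do
    hSetEncoding h enc
    hPutStr h s

-- | Convert a file from one encoding to another.
transcode :: TextEncoding -> TextEncoding -> FilePath -> FilePath -> IO ()
transcode from to src dst = readFileAs from src >>= writeFileAs to dst
```

For instance, transcode latin1 utf8 "in.txt" "out.txt" would rewrite a Latin-1 file as UTF-8. System.IO also offers mkTextEncoding for looking up an encoding by name, but which names are accepted depends on the platform.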

== Example ==

If we use a function from System.IO.UTF8, we should also hide the equivalent one from the Prelude; otherwise the two names clash and any use of them is ambiguous. (Alternatively, we could import the UTF8 module qualified.)

> import System.IO.UTF8
> import Prelude hiding (readFile, writeFile)
> import System.Environment (getArgs)

The readFile and writeFile functions are used in exactly the same way as their Prelude counterparts:

> main :: IO ()
> main =
>  do args <- getArgs
>     mapM_ reverseUTF8File args
 
> reverseUTF8File :: FilePath -> IO ()
> reverseUTF8File f =
>   do c <- readFile f
>      writeFile (f ++ ".rev") $ reverseLines c
>   where
>     reverseLines = unlines . map reverse . lines
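
The pure part of the example is easy to try on its own in GHCi. Here is a small sketch (the sample strings are made up) of what reverseLines does, including two details that are easy to miss: unlines always adds a trailing newline, and reverse works on code points, so a combining accent ends up in front of its base letter.

```haskell
-- The same definition as in reverseUTF8File, lifted to the top level.
reverseLines :: String -> String
reverseLines = unlines . map reverse . lines

-- '\252' is U+00FC (u with diaeresis); '\769' is U+0301 (combining acute accent).
sample1, sample2, sample3 :: String
sample1 = "abc\n\252ber\n"  -- two lines, one with a non-ASCII character
sample2 = "abc"             -- no trailing newline in the input
sample3 = "e\769\n"         -- 'e' followed by a combining accent

-- reverseLines sample1 == "cba\nreb\252\n"
-- reverseLines sample2 == "cba\n"      (a trailing newline is added)
-- reverseLines sample3 == "\769e\n"    (the accent now precedes the 'e')
```

Because the file has already been decoded to a String, each multi-byte character is reversed as a unit; reversing the raw bytes instead would scramble every non-ASCII character.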