UTF-8

From HaskellWiki

(Difference between revisions)
m
Current revision (02:22, 22 July 2008)
(question - what about other string encodings?)
 
(9 intermediate revisions not shown.)
Line 1: Line 1:
[[Category:Code]]
[[Category:Code]]
-
A small example showing how to read and write UTF-8 in Haskell.
 
-
Do whatever you want; it's going in the public domain (Eric Kow on 2007-02-02 says so, anyway)
+
The simplest solution seems to be to use the [http://hackage.haskell.org/cgi-bin/hackage-scripts/package/utf8-string utf8-string package] from Galois. It
 +
provides a drop-in replacement for System.IO
-
<haskell>
+
''What about other string encodings?''
-
> module Main where
+
-
 
+
-
> import Control.Monad (mapM_)
+
-
> import Data.Word (Word8)
+
-
> import Foreign.Marshal.Array (allocaArray, peekArray, pokeArray)
+
-
> import System.Environment (getArgs)
+
-
> import System.IO (hFileSize, Handle, hGetBuf, hPutBuf, openBinaryFile,
+
-
> IOMode(ReadMode, WriteMode))
+
-
</haskell>
+
-
We're going to be using the 2002 UTF-8 implementation by Sven Moritz Hallberg. It happens to be the one that darcs uses ( http://abridgegame.org/repos/darcs/UTF8.lhs ). Note that Pugs also has a UTF-8 library of its own, which if I believe to handle ByteStrings.
+
== Example ==
 +
If we use a function from System.IO.UTF8, we should also hide the equivalent one from the Prelude. (Alternatively, we could import the UTF8 module qualified)
<haskell>
<haskell>
-
> import UTF8
+
> import System.IO.UTF8
 +
> import Prelude hiding (readFile, writeFile)
 +
> import System.Environment (getArgs)
</haskell>
</haskell>
-
We perform the demonstration on a list of files, specified as command line arguments. What we want to show is that we can both read and write UTF-8, so the demonstration will be of reading a file in, reverse every one of its
+
The readFile and writeFile functions are the same as before...
-
lines, and writing it back out with the extension '.reversed'
+
<haskell>
<haskell>
Line 32: Line 25:
> reverseUTF8File :: FilePath -> IO ()
> reverseUTF8File :: FilePath -> IO ()
> reverseUTF8File f =
> reverseUTF8File f =
-
> do fb <- readFileBytes f
+
> do c <- readFile f
-
> case decode fb of
+
> writeFile (f ++ ".rev") $ reverseLines c
-
> (cs, []) -> writeFileBytes (f ++ ".reverse") $ encode $ reverseLines cs
+
-
> (_, xs) -> fail $ show xs
+
> where
> where
> reverseLines = unlines . map reverse . lines
> reverseLines = unlines . map reverse . lines
-
</haskell>
 
-
 
-
For this to work, we need to have some helper functions for reading and
 
-
writing [Word8]. It would be nice is if there were some standard functions for reading and writing [Word8] in files.
 
-
 
-
<haskell>
 
-
> readFileBytes :: FilePath -> IO [Word8]
 
-
> readFileBytes f =
 
-
> do h <- openBinaryFile f ReadMode
 
-
> hsize <- fromIntegral `fmap` hFileSize h
 
-
> hGetBytes h hsize
 
-
>
 
-
> writeFileBytes :: FilePath -> [Word8] -> IO ()
 
-
> writeFileBytes f ws =
 
-
> do h <- openBinaryFile f WriteMode
 
-
> hPutBytes h (length ws) ws
 
-
 
-
> hGetBytes :: Handle -> Int -> IO [Word8]
 
-
> hGetBytes h c = allocaArray c $ \p ->
 
-
> do c' <- hGetBuf h p c
 
-
> peekArray c' p
 
-
>
 
-
> hPutBytes :: Handle -> Int -> [Word8] -> IO ()
 
-
> hPutBytes h c ws = allocaArray c $ \p ->
 
-
> do pokeArray p ws
 
-
> hPutBuf h p c
 
</haskell>
</haskell>

Current revision


The simplest solution seems to be to use the utf8-string package from Galois. It provides System.IO.UTF8, a drop-in replacement for the String I/O functions in System.IO.

What about other string encodings?
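
One possible answer, as a minimal sketch rather than a recommendation: assuming a GHC whose base library exposes hSetEncoding and the TextEncoding values in System.IO (this is not part of utf8-string, and it is not covered by the original article), the encoding can be chosen per handle. The helper names readFileWith, writeFileWith and transcode are made up for this illustration.

```haskell
import System.IO

-- Hypothetical helpers; the names are not from any library.

-- | Read a whole file, decoding its bytes with the given encoding.
readFileWith :: TextEncoding -> FilePath -> IO String
readFileWith enc path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc
    s <- hGetContents h
    -- Force the lazy string before withFile closes the handle.
    length s `seq` return s

-- | Write a string to a file, encoding it with the given encoding.
writeFileWith :: TextEncoding -> FilePath -> String -> IO ()
writeFileWith enc path s =
  withFile path WriteMode $ \h -> do
    hSetEncoding h enc
    hPutStr h s

-- | Re-encode a file, for example from Latin-1 to UTF-8.
transcode :: TextEncoding -> TextEncoding -> FilePath -> FilePath -> IO ()
transcode from to src dst = readFileWith from src >>= writeFileWith to dst
```

For example, transcode latin1 utf8 "in.txt" "out.txt" would rewrite a Latin-1 file as UTF-8. System.IO also offers mkTextEncoding for looking an encoding up by name; which names are available beyond the UTF family depends on the platform.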

Example

If we use a function from System.IO.UTF8, we should also hide the equivalent one from the Prelude; otherwise any use of the name is ambiguous. (Alternatively, we could import System.IO.UTF8 qualified.)

> import System.IO.UTF8
> import Prelude hiding (readFile, writeFile)
> import System.Environment (getArgs)

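The readFile and writeFile functions are used exactly like their Prelude counterparts; the only difference is that file contents are decoded from and encoded to UTF-8.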

> main :: IO ()
> main =
>  do args <- getArgs
>     mapM_ reverseUTF8File args
 
> reverseUTF8File :: FilePath -> IO ()
> reverseUTF8File f =
>   do c <- readFile f
>      writeFile (f ++ ".rev") $ reverseLines c
>   where
>     reverseLines = unlines . map reverse . lines
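
Note that reverseLines works on decoded characters, not on raw bytes. The following sketch tries to show why that matters. It uses a small hand-written encoder (encodeChar and encodeString are hypothetical names written for this illustration, not the utf8-string API) and does not reject surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Encode a single character as UTF-8 bytes (illustrative helper only).
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [lead 0xC0 6, cont 0]
  | n < 0x10000 = [lead 0xE0 12, cont 6, cont 0]
  | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading byte: a length tag plus the highest bits of the code point.
    lead tag s = fromIntegral (tag .|. (n `shiftR` s))
    -- Continuation byte: 10xxxxxx carrying six bits of the code point.
    cont s = fromIntegral (0x80 .|. ((n `shiftR` s) .&. 0x3F))

-- | Encode a whole string as UTF-8 bytes.
encodeString :: String -> [Word8]
encodeString = concatMap encodeChar

-- Reversing characters and then encoding keeps each character intact.
goodBytes :: [Word8]
goodBytes = encodeString (reverse "n\233")

-- Reversing the encoded bytes puts a continuation byte first,
-- which is not valid UTF-8.
badBytes :: [Word8]
badBytes = reverse (encodeString "n\233")
```

Here "n\233" is the two-character string "né". goodBytes begins with the lead byte of é, while badBytes begins with its continuation byte, so a file reversed byte by byte would no longer decode.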