patch applied (packages/base): Rewrite of the IO library, including Unicode support

Simon Marlow marlowsd at gmail.com
Fri Jun 12 11:32:56 EDT 2009


Fri Jun 12 06:56:31 PDT 2009  Simon Marlow <marlowsd at gmail.com>
  * Rewrite of the IO library, including Unicode support
  Ignore-this: fbd43ec854ac5df442e7bf647de8ca5a
  
  Highlights:
  
  * Unicode support for Handle I/O:
  
    ** Automatic encoding and decoding using a per-Handle encoding.
  
    ** The encoding defaults to the locale encoding (only on Unix 
       so far, perhaps Windows later).
  
    ** Built-in UTF-8, UTF-16 (BE/LE), and UTF-32 (BE/LE) codecs.
  
    ** iconv-based codec for other encodings on Unix
  
  * Modularity: the low-level IO interface is exposed as a type class
    (IODevice, in GHC.IO.Device) so you can build your own low-level IO
    providers and make Handles from them.
  
  * Newline translation: instead of being Windows-specific wired-in
    magic, the translation from \r\n -> \n and back again is available
    on all platforms and is configurable for reading/writing
    independently.
  
  
  Unicode-aware Handles
  ~~~~~~~~~~~~~~~~~~~~~
  
  This is a significant restructuring of the Handle implementation with
  the primary goal of supporting Unicode character encodings.
  
  The only change to the existing behaviour is that by default, text IO
  is done in the prevailing locale encoding of the system (except on
  Windows [1]).  
  
  Handles created by openBinaryFile use the Latin-1 encoding, as do
  Handles placed in binary mode using hSetBinaryMode.
  
  We provide a way to change the encoding for an existing Handle:
  
     GHC.IO.Handle.hSetEncoding :: Handle -> TextEncoding -> IO ()
  
  and various encodings (from GHC.IO.Encoding):
  
     latin1,
     utf8,
     utf16, utf16le, utf16be,
     utf32, utf32le, utf32be,
     localeEncoding,
  
  and a way to look up other encodings:
  
     GHC.IO.Encoding.mkTextEncoding :: String -> IO TextEncoding
  
  (it's system-dependent whether the requested encoding will be
  available).
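  
  As a rough illustration of the functions above, the following sketch
  writes a String to a file with an explicit encoding and reads it back
  with the same one.  The helper roundTrip and the file names are made up
  for the example; only hSetEncoding, utf8 and mkTextEncoding come from
  the API described here, and whether "UTF-16LE" can be looked up by name
  depends on the system.

```haskell
import GHC.IO.Encoding (TextEncoding, mkTextEncoding, utf8)
import GHC.IO.Handle (hSetEncoding)
import System.IO

-- Hypothetical helper: write a String using the given encoding, then
-- read it back using the same encoding.
roundTrip :: TextEncoding -> FilePath -> String -> IO String
roundTrip enc path str = do
  withFile path WriteMode $ \h -> do
    hSetEncoding h enc
    hPutStr h str
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc
    s <- hGetContents h
    -- Force the lazy read before withFile closes the Handle.
    length s `seq` return s

main :: IO ()
main = do
  let sample = "na\239ve caf\233\n"   -- contains U+00EF and U+00E9
  -- A built-in encoding.
  s1 <- roundTrip utf8 "demo-utf8.txt" sample
  print (s1 == sample)
  -- An encoding looked up by name; availability is system-dependent.
  enc <- mkTextEncoding "UTF-16LE"
  s2 <- roundTrip enc "demo-utf16le.txt" sample
  print (s2 == sample)
```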
  
  We may want to export these from somewhere more permanent; that's a
  topic for a future library proposal.
  
  Thanks to suggestions from Duncan Coutts, it's possible to call
  hSetEncoding even on buffered read Handles, and the right thing
  happens.  So we can read from text streams that include multiple
  encodings, such as an HTTP response or email message, without having
  to turn buffering off (though there is a penalty for switching
  encodings on a buffered Handle, as the IO system has to do some
  re-decoding to figure out where it should start reading from again).
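  
  A minimal sketch of switching encodings part-way through a buffered
  read, assuming a made-up file layout of one Latin-1 header line followed
  by a UTF-8 body line.  The function readMixed and the file name are
  hypothetical; the test file is written through a binary Handle so that
  each Char is emitted as a single byte.

```haskell
import GHC.IO.Encoding (latin1, utf8)
import GHC.IO.Handle (hSetEncoding)
import System.IO

-- Hypothetical helper: read one header line as Latin-1, then one body
-- line as UTF-8, from the same buffered Handle.
readMixed :: FilePath -> IO (String, String)
readMixed path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h latin1
    header <- hGetLine h
    -- The Handle is still buffered; the IO library re-decodes what it
    -- has already buffered to find where to resume reading.
    hSetEncoding h utf8
    body <- hGetLine h
    return (header, body)

main :: IO ()
main = do
  -- "\xC3\xA9" is the UTF-8 byte sequence for U+00E9.
  withBinaryFile "mixed.txt" WriteMode $ \h ->
    hPutStr h "Content-Type: text/plain; charset=utf-8\n\xC3\xA9t\xC3\xA9\n"
  (header, body) <- readMixed "mixed.txt"
  putStrLn header
  -- Compare against the decoded form rather than printing non-ASCII text.
  print (body == "\233t\233")
```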
  
  If there is a decoding error, it is reported when an attempt is made
  to read the offending character from the Handle, as you would expect.
  
  Performance varies.  For "hGetContents >>= putStr" I found the new
  library was faster on my x86_64 machine, but slower on an x86.  On the
  whole I'd expect things to be a bit slower due to the extra
  decoding/encoding, but probably not noticeably.  If performance is
  critical for your app, then you should be using bytestring and text
  anyway.
  
  [1] Note: locale encoding is not currently implemented on Windows
  because the built-in Win32 APIs for encoding/decoding are not
  sufficient for our purposes.  Ask me for details.  Offers of help
  gratefully accepted.
  
  
  Newline Translation
  ~~~~~~~~~~~~~~~~~~~
  
  In the old IO library, text-mode Handles on Windows had automatic
  translation from \r\n -> \n on input, and the opposite on output.  It
  was implemented using the underlying CRT functions, which meant that
  there were certain odd restrictions, such as read/write text handles
  needing to be unbuffered, and seeking not working at all on text
  Handles.
  
  In the rewrite, newline translation is now implemented in the upper
  layers, as it needs to be since we have to perform Unicode decoding
  before newline translation.  This means that it is now available on
  all platforms, which can be quite handy for writing portable code.
  
  For now, I have left the behaviour as it was, namely \r\n -> \n on
  Windows, and no translation on Unix.  However, another reasonable
  default (similar to what Python does) would be to do \r\n -> \n on
  input, and convert to the platform-native representation (either \r\n
  or \n) on output.  This is called universalNewlineMode (below).
  
  The API is as follows.  It is available from GHC.IO.Handle for now;
  again, this is something we will probably want to try to get into
  System.IO at some point:
  
  -- | The representation of a newline in the external file or stream.
  data Newline = LF    -- ^ "\n"
               | CRLF  -- ^ "\r\n"
               deriving Eq
  
  -- | Specifies the translation, if any, of newline characters between
  -- internal Strings and the external file or stream.  Haskell Strings
  -- are assumed to represent newlines with the '\n' character; the
  -- newline mode specifies how to translate '\n' on output, and what to
  -- translate into '\n' on input.
  data NewlineMode 
    = NewlineMode { inputNL :: Newline,
                      -- ^ the representation of newlines on input
                    outputNL :: Newline
                      -- ^ the representation of newlines on output
                   }
               deriving Eq
  
  -- | The native newline representation for the current platform
  nativeNewline :: Newline
  
  -- | Map "\r\n" into "\n" on input, and "\n" to the native newline
  -- representation on output.  This mode can be used on any platform, and
  -- works with text files using any newline convention.  The downside is
  -- that @readFile a >>= writeFile b@ might yield a different file.
  universalNewlineMode :: NewlineMode
  universalNewlineMode  = NewlineMode { inputNL  = CRLF, 
                                        outputNL = nativeNewline }
  
  -- | Use the native newline representation on both input and output
  nativeNewlineMode    :: NewlineMode
  nativeNewlineMode     = NewlineMode { inputNL  = nativeNewline, 
                                        outputNL = nativeNewline }
  
  -- | Do no newline translation at all.
  noNewlineTranslation :: NewlineMode
  noNewlineTranslation  = NewlineMode { inputNL  = LF, outputNL = LF }
  
  
  -- | Change the newline translation mode on the Handle.
  hSetNewlineMode :: Handle -> NewlineMode -> IO ()
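  
  To make the effect of the modes concrete, here is a small sketch that
  reads the same CRLF-terminated file twice, once with
  noNewlineTranslation and once with universalNewlineMode.  The helper
  linesWith and the file name are invented for the example.

```haskell
import GHC.IO.Handle (NewlineMode, hSetNewlineMode, noNewlineTranslation,
                      universalNewlineMode)
import System.IO

-- Hypothetical helper: read all lines of a file using the given
-- newline mode.
linesWith :: NewlineMode -> FilePath -> IO [String]
linesWith mode path =
  withFile path ReadMode $ \h -> do
    hSetNewlineMode h mode
    s <- hGetContents h
    -- Force the lazy read before withFile closes the Handle.
    length s `seq` return (lines s)

main :: IO ()
main = do
  -- Write through a binary Handle so the "\r\n" pairs reach the file
  -- untouched on every platform.
  withBinaryFile "crlf.txt" WriteMode $ \h -> hPutStr h "one\r\ntwo\r\n"
  -- No translation: the '\r' characters stay in the Strings.
  linesWith noNewlineTranslation "crlf.txt" >>= print
  -- Universal mode: "\r\n" is mapped to "\n" on input.
  linesWith universalNewlineMode "crlf.txt" >>= print
```

  With no translation each line keeps its trailing '\r'; with
  universalNewlineMode the lines come back without it.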
  
  
  
  IO Devices
  ~~~~~~~~~~
  
  The major change here is that the implementation of the Handle
  operations is separated from the underlying IO device, using type
  classes.  File descriptors are just one IO provider; I have also
  implemented memory-mapped files (good for random-access read/write)
  and a Handle that pipes output to a Chan (useful for testing code that
  writes to a Handle).  New kinds of Handle can be implemented outside
  the base package, for instance someone could write bytestringToHandle.
  A Handle is made using mkFileHandle:
  
  -- | makes a new 'Handle'
  mkFileHandle :: (IODevice dev, BufferedIO dev, Typeable dev)
                => dev -- ^ the underlying IO device, which must support
                       -- 'IODevice', 'BufferedIO' and 'Typeable'
                -> FilePath
                       -- ^ a string describing the 'Handle', e.g. the file
                       -- path for a file.  Used in error messages.
                -> IOMode
                       -- ^ The mode in which the 'Handle' is to be used
                -> Maybe TextEncoding
                       -- ^ text encoding to use, if any
                -> NewlineMode
                       -- ^ newline translation mode
                -> IO Handle
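  
  As a sketch of how mkFileHandle might be called, the following builds a
  second Handle on top of the standard output file descriptor, using the
  FD device from GHC.IO.FD.  The function name mkStdoutHandle and the
  description string are made up; a custom device would be passed in the
  same position as FD.stdout, provided it has the required instances.

```haskell
import GHC.IO.Encoding (utf8)
import qualified GHC.IO.FD as FD
import GHC.IO.Handle (mkFileHandle, noNewlineTranslation)
import System.IO (Handle, IOMode (WriteMode), hFlush, hPutStrLn)

-- Hypothetical helper: make a new write Handle over the process's
-- standard output descriptor, with UTF-8 encoding and no newline
-- translation.
mkStdoutHandle :: IO Handle
mkStdoutHandle =
  mkFileHandle FD.stdout "<stdout via mkFileHandle>" WriteMode
               (Just utf8) noNewlineTranslation

main :: IO ()
main = do
  h <- mkStdoutHandle
  hPutStrLn h "written through a Handle made by mkFileHandle"
  -- Flush rather than close: the descriptor is shared with the usual
  -- stdout Handle.
  hFlush h
```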
  
  This also means that someone can write a completely new IO
  implementation on Windows based on native Win32 HANDLEs, and
  distribute it as a separate package (I really hope somebody does
  this!).
  
  This restructuring isn't as radical as previous designs.  I haven't
  made any attempt to make a separate binary I/O layer, for example
  (although hGetBuf/hPutBuf do bypass the text encoding and newline
  translation).  The main goal here was to get Unicode support in, and
  to allow others to experiment with making new kinds of Handle.  We
  could split up the layers further later.
  
  
  API changes and Module structure
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  
  NB. GHC.IOBase and GHC.Handle are now DEPRECATED (they are still
  present, but are just re-exporting things from other modules now).
  For 6.12 we'll want to bump base to version 5 and add a base4-compat.
  For now I'm using #if __GLASGOW_HASKELL__ >= 611 to avoid deprecation
  warnings.
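  
  A guard of this kind might look roughly like the following.  This is
  only a sketch: it assumes hFlush is the function being imported and
  that the old GHC.Handle module also exports it; client code will guard
  whichever imports it actually needs.

```haskell
{-# LANGUAGE CPP #-}
module Main (main) where

import System.IO (stdout)

#if __GLASGOW_HASKELL__ >= 611
-- New location after the rewrite.
import GHC.IO.Handle (hFlush)
#else
-- Old location, now deprecated (assumed here to export hFlush too).
import GHC.Handle (hFlush)
#endif

main :: IO ()
main = do
  putStrLn "flushed via the version-appropriate import"
  hFlush stdout
```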
  
  I split modules into smaller parts in many places.  For example, we
  now have GHC.IORef, GHC.MVar and GHC.IOArray containing the
  implementations of IORef, MVar and IOArray respectively.  This was
  necessary for untangling dependencies, but it also makes things easier
  to follow.
  
  The new module structure for the IO-related parts of the base
  package is:
  
  GHC.IO
     Implementation of the IO monad; unsafe*; throw/catch
  
  GHC.IO.IOMode
     The IOMode type
  
  GHC.IO.Buffer
     Buffers and operations on them
  
  GHC.IO.Device
     The IODevice and RawIO classes.
  
  GHC.IO.BufferedIO
     The BufferedIO class.
  
  GHC.IO.FD
     The FD type, with instances of IODevice, RawIO and BufferedIO.
  
  GHC.IO.Exception
     IO-related Exceptions
  
  GHC.IO.Encoding
     The TextEncoding type; built-in TextEncodings; mkTextEncoding
  
  GHC.IO.Encoding.Types
  GHC.IO.Encoding.Iconv
  GHC.IO.Encoding.Latin1
  GHC.IO.Encoding.UTF8
  GHC.IO.Encoding.UTF16
  GHC.IO.Encoding.UTF32
     Implementation internals for GHC.IO.Encoding
  
  GHC.IO.Handle
     The main API for GHC's Handle implementation, provides all the Handle
     operations + mkFileHandle + hSetEncoding.
  
  GHC.IO.Handle.Types
  GHC.IO.Handle.Internals
  GHC.IO.Handle.Text
     Implementation of Handles and operations.
  
  GHC.IO.Handle.FD
     Parts of the Handle API implemented by file-descriptors: openFile,
     stdin, stdout, stderr, fdToHandle etc.
  

     ./GHC/Handle.hs -> ./GHC/IO/Handle/Internals.hs
     ./GHC/Handle.hs-boot -> ./GHC/IO/Handle.hs-boot
     ./GHC/IO.hs -> ./GHC/IO/Handle/Text.hs
     ./GHC/IOBase.lhs -> ./GHC/IO.hs
    M ./Control/Concurrent.hs -4 +2
    M ./Control/Concurrent/MVar.hs -1 +1
    M ./Control/Exception.hs -1 +1
    M ./Control/Exception/Base.hs -3 +4
    M ./Control/Monad/ST.hs -1 +1
    M ./Control/OldException.hs -3 +5
    M ./Data/HashTable.hs -3 +3
    M ./Data/IORef.hs -2 +4
    M ./Data/Typeable.hs -3 +7
    M ./Foreign/C/Error.hs -1 +3
    M ./Foreign/C/String.hs -1 +1
    M ./Foreign/C/Types.hs -1 +1
    M ./Foreign/Concurrent.hs -1 +1
    M ./Foreign/ForeignPtr.hs -1 +1
    M ./Foreign/Marshal/Alloc.hs -1 +2
    M ./Foreign/Marshal/Array.hs -1 +1
    M ./Foreign/Marshal/Error.hs -1 +2
    M ./Foreign/Marshal/Pool.hs -2 +2
    M ./Foreign/Marshal/Utils.hs -1 +1
    M ./Foreign/Ptr.hs -1 +1
    M ./Foreign/Storable.hs -1 +1
    R ./Foreign/Storable.hs-boot
    M ./GHC/Conc.lhs -130 +18
    M ./GHC/ConsoleHandler.hs -14 +14
    M ./GHC/ForeignPtr.hs -1 +2
    A ./GHC/Handle.hs
    A ./GHC/IO/
    M ./GHC/IO.hs -689 +35
    A ./GHC/IO/Buffer.hs
    A ./GHC/IO/BufferedIO.hs
    A ./GHC/IO/Device.hs
    A ./GHC/IO/Encoding/
    A ./GHC/IO/Encoding.hs
    A ./GHC/IO/Encoding/Iconv.hs
    A ./GHC/IO/Encoding/Latin1.hs
    A ./GHC/IO/Encoding/Types.hs
    A ./GHC/IO/Encoding/UTF16.hs
    A ./GHC/IO/Encoding/UTF32.hs
    A ./GHC/IO/Encoding/UTF8.hs
    A ./GHC/IO/Exception.hs
    A ./GHC/IO/Exception.hs-boot
    A ./GHC/IO/FD.hs
    A ./GHC/IO/Handle/
    A ./GHC/IO/Handle.hs
    M ./GHC/IO/Handle.hs-boot -4 +3
    A ./GHC/IO/Handle/FD.hs
    A ./GHC/IO/Handle/FD.hs-boot
    M ./GHC/IO/Handle/Internals.hs -1484 +434
    M ./GHC/IO/Handle/Text.hs -429 +416
    A ./GHC/IO/Handle/Types.hs
    A ./GHC/IO/IOMode.hs
    A ./GHC/IOArray.hs
    A ./GHC/IOBase.hs
    A ./GHC/IORef.hs
    A ./GHC/MVar.hs
    M ./GHC/Stable.lhs -1 +1
    M ./GHC/Storable.lhs -1 +1
    M ./GHC/TopHandler.lhs -2 +5
    M ./GHC/Weak.lhs -1 +1
    M ./Prelude.hs -1 +2
    M ./System/Environment.hs -1 +2
    M ./System/Exit.hs -1 +2
    M ./System/IO.hs -3 +6
    M ./System/IO/Error.hs -1 +3
    M ./System/IO/Unsafe.hs -1 +1
    M ./System/Mem/StableName.hs -1 +1
    M ./System/Posix/Internals.hs -23 +13
    M ./base.cabal -1 +22
    M ./include/HsBase.h -10

View patch online:
http://darcs.haskell.org/packages/base/_darcs/patches/20090612135631-12142-d835b8c74b1502f494a0fda3d6117ac6273229af.gz

