Library/Streams

From HaskellWiki
Revision as of 18:01, 4 February 2006 by JaredUpdike (talk | contribs) (English prose, first dozen paragraphs down to "Overview of Stream Transformers")

I have developed a new I/O library that IMHO is so sharp that it can eventually replace the current I/O facilities based on using Handles. The main advantage of the new library is its strong modular design using typeclasses. The library consists of small independent modules, each implementing one type of stream (file, memory buffer, pipe) or one part of common stream functionality (buffering, Char encoding, locking). 3rd-party libs can easily add new stream types and new common functionality. Other benefits of the new library include support for streams functioning in any monad, Hugs and GHC compatibility, high speed and an easy migration path from the existing I/O library.

The Streams library is heavily based on the HVIO module written by John Goerzen. I especially want to thank John for his clever design and implementation. Really, I just renamed HVIO to Stream and presented this as my own work. :) Further development direction was inspired by the "New I/O library" written by Simon Marlow.

The key concept of the lib is the Stream class, whose interface mimics the familiar interface for Handles, just with "h" replaced by "v" in function names: vGetContents, vSeek, vIsEOF, vClose and so on. This means that you already know how to use any stream! The Stream interface currently has 8 implementations, covering a Handle itself, raw files, pipes, memory buffers and string buffers. Future plans include support for memory-mapped files, sockets, circular memory buffers for interprocess communication and UArray-based streams.

By themselves, these Stream implementations are rather simple. Basically, to implement a new stream, it's enough to provide vPutBuf/vGetBuf operations, or even just vGetChar/vPutChar. The latter way, although inefficient, allows us to implement streams that can work in any monad. The StringReader and StringBuffer streams use this to provide string-based Stream class implementations for both the IO and ST monads. And, yes, you can use the full power of Stream operations inside the ST monad!
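
To make the "any monad" idea concrete, here is a minimal sketch that does not use the library itself. The class `CharStream`, the type `StringBuf` and the functions `putCharS`, `putStrS` and `render` are hypothetical names invented for this illustration; the real Stream class has many more operations and may be organized differently.

```haskell
{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances #-}

import Control.Monad.ST (ST, runST)
import Data.STRef (STRef, newSTRef, readSTRef, modifySTRef')

-- Hypothetical, cut-down stand-in for the Stream class: only
-- character output, parameterized over the monad m it runs in.
class Monad m => CharStream m h where
  putCharS :: h -> Char -> m ()

-- A derived operation, written once for every instance and monad.
putStrS :: CharStream m h => h -> String -> m ()
putStrS h = mapM_ (putCharS h)

-- A string buffer living in the ST monad. Characters are stored in
-- reverse order so that appending one character is cheap.
newtype StringBuf s = StringBuf (STRef s String)

instance CharStream (ST s) (StringBuf s) where
  putCharS (StringBuf ref) c = modifySTRef' ref (c :)

-- A pure function that uses stream-style output internally.
render :: [Int] -> String
render xs = runST $ do
  ref <- newSTRef ""
  let buf = StringBuf ref
  mapM_ (\x -> putStrS buf (show x) >> putCharS buf ';') xs
  reverse <$> readSTRef ref
```

Because only `putCharS` has to be supplied, an instance for IO (for example one backed by an IORef) would get `putStrS` for free in the same way.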

All additional functionality is implemented via Stream Transformers, which are just parameterized Streams, whose parameters also implement the Stream interface. This allows you to apply any number of stream transformers to the raw stream and then use the result as an ordinary Stream. For example:

         h <- openRawFD "test" WriteMode
                  >>= bufferBlockStream
                  >>= withEncoding utf8
                  >>= withLocking

This code creates a new FD, which represents a raw file, and then adds buffering, Char encoding and locking functionality to this Stream. The resulting type of "h" is something like this:

WithLocking (WithEncoding (BufferedBlockStream FD))

The complete type, as well as all the intermediate types, implements the Stream interface. Each transformer intercepts operations corresponding to its nature, and passes the rest through. For example, the encoding transformer intercepts only vGetChar/vPutChar operations and translates them to the sequences of vGetByte/vPutByte calls of the lower-level stream. The locking transformer just wraps any operation in the locking wrapper.

We can trace, for example, the execution of a "vPutBuf" operation on the above-constructed Stream. First, the locking transformer acquires a lock and then passes this call to the next level. Then the encoding transformer does nothing and passes this call to the next level. The buffering transformer flushes the current buffer and passes the call further. Finally, FD itself performs the operation after all these preparations, and on the return path the locking transformer releases its lock. As another example, the vPutChar call on this Stream is transformed (after locking) into several vPutByte calls by the encoding transformer, and these bytes go to the buffer in the buffering transformer, with or without a subsequent call to the FD's vPutBuf.
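
The "intercept some operations, pass the rest through" pattern can be sketched without the library as follows. Everything here is an assumption made for illustration: the two-method class `Stream`, the in-memory `Sink`, the toy fixed-width `TwoByte` encoding (only meaningful for code points below 65536) and the MVar-based `Locked` wrapper are not the library's real definitions.

```haskell
import Control.Concurrent.MVar (MVar, newMVar, withMVar)
import Data.Char (ord)
import Data.IORef (IORef, newIORef, readIORef, modifyIORef')
import Data.Word (Word8)

-- Hypothetical, cut-down Stream class with just two operations.
class Stream h where
  sPutByte :: h -> Word8 -> IO ()
  sPutChar :: h -> Char -> IO ()

-- Base stream: collects bytes in memory, newest first.
newtype Sink = Sink (IORef [Word8])

instance Stream Sink where
  sPutByte (Sink ref) b = modifyIORef' ref (b :)
  sPutChar sink c       = sPutByte sink (fromIntegral (ord c))

-- Toy encoding transformer: every Char becomes two bytes (high, low).
newtype TwoByte h = TwoByte h

instance Stream h => Stream (TwoByte h) where
  sPutByte (TwoByte h) b = sPutByte h b          -- passed through untouched
  sPutChar (TwoByte h) c = do                    -- intercepted
    let n = ord c
    sPutByte h (fromIntegral (n `div` 256))
    sPutByte h (fromIntegral (n `mod` 256))

-- Locking transformer: wraps every operation in the same lock.
data Locked h = Locked (MVar ()) h

instance Stream h => Stream (Locked h) where
  sPutByte (Locked m h) b = withMVar m (\_ -> sPutByte h b)
  sPutChar (Locked m h) c = withMVar m (\_ -> sPutChar h c)

withTwoByte :: h -> IO (TwoByte h)
withTwoByte = pure . TwoByte

withLock :: h -> IO (Locked h)
withLock h = do
  m <- newMVar ()
  pure (Locked m h)

-- Stack the transformers and return the bytes that reached the sink.
demoStack :: IO [Word8]
demoStack = do
  ref <- newIORef []
  h   <- pure (Sink ref) >>= withTwoByte >>= withLock  -- h :: Locked (TwoByte Sink)
  sPutChar h 'A'   -- locked, then encoded as the two bytes 0 and 65
  sPutByte h 7     -- locked, then passed straight through
  reverse <$> readIORef ref
```

Each layer only needs its parameter to be a `Stream`, which is what lets the layers be combined freely.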

As you can see, stream transformers really are independent of each other. This allows you to use them on any stream and in any combination (but you should apply them in proper order - buffering, then Char encoding, then locking). As a result, you can apply to the stream only the transformers that you really need. If you don't use the stream in multiple threads, you don't need to apply the locking transformer. If you don't use any encodings other than Latin-1 -- or don't use text I/O at all -- you don't need an encoding transformer. Moreover, you may not even need to know anything about the UserData transformer until you actually need to use it :)

Both streams and stream transformers can be implemented by 3rd-party libraries. Streams and transformers from arbitrary libraries will seamlessly work together as long as they properly implement the Stream interface. My future plans include implementation of an on-the-fly (de)compression transformer and I will be happy to see 3rd-party transformers that intercept vGetBuf/vPutBuf calls and use select(), kqueue() and other methods to overlap I/O operations.

A quick comment about speed: it's fast enough -- 12-70 MB/s (depending on the type of operation) on a 1 GHz CPU. Compared to the old Handles, this library shows up to a 60x speed improvement. The library includes benchmarking code in the file "Examples/StreamsBenchmark.hs".

The library is currently at the beta stage. It contains a number of known minor problems and an unknown number of yet-to-be-discovered bugs. It is not properly documented, doesn't include QuickCheck tests, is not cabalized, and not all "h*" operations have "v*" equivalents yet. If anyone wants to join this effort in order to help fix these oddities and prepare the lib for inclusion in the standard libraries suite, I would be really happy. :) I will also be happy (although much less ;) to see bug reports and suggestions about its interface and internal organization. It's just a first public version, so we can still change everything here!

Overview of Stream Transformers

Now for a small overview of the transformers and streams implemented so far.

There are 3 buffering transformers. Each buffering transformer implements support for vGetByte, vPutChar, vGetContents and other byte- and text-oriented operations for streams that by themselves support only vGetBuf/vPutBuf (or vReceiveBuf/vSendBuf) operations. This is implemented, of course, by using an intermediate buffer.

The first transformer can be applied to any stream supporting vGetBuf/vPutBuf. It is applied by the "bufferBlockStream" operation. The well-known vSetBuffering/vGetBuffering operations are intercepted by this transformer and used to control the buffer size. At the moment, only BlockBuffering is implemented; LineBuffering and NoBuffering are only planned.

The other two transformers can be applied to streams that implement the vReceiveBuf/vSendBuf operations, that is, streams whose data reside in memory, including in-memory streams and memory-mapped files. In these cases, the buffering transformer doesn't need to allocate a buffer itself; it just requests from the underlying stream the address and size of the next available portion of data. Nevertheless, the final result is the same - we get support for all byte- and text-oriented I/O operations. The "bufferMemoryStream" operation can be applied to a memory-based stream to add buffering to it. The "bufferMemoryStreamUnchecked" operation (which implements the third buffering transformer) can be used instead if you can guarantee that I/O operations can't overflow the buffer in use.
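
As a rough illustration of block buffering, the sketch below adds a byte-level operation on top of a stream that accepts only whole blocks. It is not the library's implementation: the names `Raw`, `Buffered`, `bufferBlockSketch`, `putByteSketch` and `flushSketch` are hypothetical, and Haskell lists stand in for the raw memory buffers a real implementation would use.

```haskell
import Control.Monad (unless, when)
import Data.IORef (IORef, newIORef, readIORef, writeIORef, modifyIORef')
import Data.Word (Word8)

-- Underlying stream: accepts only whole blocks and records each one
-- (newest first) so the result can be inspected afterwards.
newtype Raw = Raw (IORef [[Word8]])

newRaw :: IO Raw
newRaw = Raw <$> newIORef []

rawPutBuf :: Raw -> [Word8] -> IO ()
rawPutBuf (Raw ref) block = modifyIORef' ref (block :)

blocksWritten :: Raw -> IO [[Word8]]
blocksWritten (Raw ref) = reverse <$> readIORef ref

-- Buffering layer: pending bytes (newest first) plus a size limit.
data Buffered = Buffered
  { bufRaw     :: Raw
  , bufSize    :: Int
  , bufPending :: IORef [Word8]
  }

bufferBlockSketch :: Int -> Raw -> IO Buffered
bufferBlockSketch size raw = Buffered raw size <$> newIORef []

-- Hand any pending bytes to the underlying stream as one block.
flushSketch :: Buffered -> IO ()
flushSketch b = do
  pending <- readIORef (bufPending b)
  unless (null pending) $ rawPutBuf (bufRaw b) (reverse pending)
  writeIORef (bufPending b) []

-- Byte-level write: store the byte, flush once the buffer is full.
putByteSketch :: Buffered -> Word8 -> IO ()
putByteSketch b x = do
  modifyIORef' (bufPending b) (x :)
  pending <- readIORef (bufPending b)
  when (length pending >= bufSize b) (flushSketch b)

-- Ten single-byte writes with a 4-byte buffer, then a final flush.
demoBuffering :: IO [[Word8]]
demoBuffering = do
  raw <- newRaw
  b   <- bufferBlockSketch 4 raw
  mapM_ (putByteSketch b) [1 .. 10]
  flushSketch b
  blocksWritten raw
```

The underlying stream sees three block writes instead of ten single-byte writes, which is the whole point of the buffering layer.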

Encoding

The Char encoding transformer encodes each Char written to the stream as a sequence of bytes, implementing UTF and other encodings. This transformer can be applied to any stream implementing the vGetByte/vPutByte operations, and in return it implements vGetChar/vPutChar and all other text-oriented operations. The transformer is applied to a stream by the "withEncoding encoding" operation, where `encoding` may be `latin1`, `utf8` or any other encoding that you (or a 3rd-party lib) have implemented. Look at the "Data.CharEncoding" module to see how to implement new encodings. The encoding of a stream created with the "withEncoding" operation can be changed at any moment with "vSetEncoding" and queried with "vGetEncoding". See examples of their usage in the file "Examples/CharEncoding.hs".
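
For readers who want to see what "a Char becomes a sequence of bytes" means in practice, here is a standalone sketch of UTF-8 encoding for a single Char. It is not taken from "Data.CharEncoding"; the names `encodeUtf8Char` and `putCharVia` are hypothetical, and the sketch does not reject surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one Char as 1 to 4 UTF-8 bytes.
encodeUtf8Char :: Char -> [Word8]
encodeUtf8Char c
  | n < 0x80    = [byte n]
  | n < 0x800   = [0xC0 .|. byte (n `shiftR` 6), cont n]
  | n < 0x10000 = [0xE0 .|. byte (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  | otherwise   = [ 0xF0 .|. byte (n `shiftR` 18)
                  , cont (n `shiftR` 12)
                  , cont (n `shiftR` 6)
                  , cont n ]
  where
    n      = ord c
    byte   = fromIntegral :: Int -> Word8
    -- Continuation byte: the bits 10 followed by the low six bits of x.
    cont x = 0x80 .|. byte (x .&. 0x3F)

-- How a transformer could build a char-level write out of a
-- byte-level write supplied by the lower-level stream.
putCharVia :: Monad m => (Word8 -> m ()) -> Char -> m ()
putCharVia putByte = mapM_ putByte . encodeUtf8Char
```

For example, `encodeUtf8Char 'A'` yields the single byte 0x41, while the euro sign U+20AC yields the three bytes 0xE2, 0x82, 0xAC.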

Locking

The locking transformer ensures that the stream can be safely shared by several threads. You already know enough about its basic usage - "withLocking" applies this transformer to the stream and all the required locking is performed automagically. You can also use the "lock" operation to explicitly hold the lock across multiple operations:

 lock h $ \h -> do
   savedpos <- vTell h
   vSeek h AbsoluteSeek 100
   vPutStr h ":-)"
   vSeek h AbsoluteSeek savedpos
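
One way such a "lock" operation could work is sketched below, using an MVar as the mutex. This is an assumption for illustration only: `Locked`, `withLockingSketch` and `lockSketch` are hypothetical names, and the library's real locking transformer may be built differently. In this sketch the callback receives the unwrapped stream, so operations inside the locked region do not try to take the same lock again.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
  (MVar, newEmptyMVar, newMVar, putMVar, takeMVar, withMVar)
import Control.Monad (replicateM_)
import Data.IORef (newIORef, readIORef, writeIORef)

-- Hypothetical wrapper: an underlying stream h guarded by a mutex.
data Locked h = Locked (MVar ()) h

withLockingSketch :: h -> IO (Locked h)
withLockingSketch h = do
  m <- newMVar ()
  pure (Locked m h)

-- Hold the lock for a whole group of operations. withMVar releases
-- the lock again even if the action throws an exception.
lockSketch :: Locked h -> (h -> IO a) -> IO a
lockSketch (Locked m h) action = withMVar m (\_ -> action h)

-- Four threads each perform 1000 read-then-write steps on a shared
-- counter; the lock keeps every step atomic.
demoCounter :: IO Int
demoCounter = do
  ref  <- newIORef (0 :: Int)
  h    <- withLockingSketch ref
  done <- newEmptyMVar
  let worker = do
        replicateM_ 1000 $ lockSketch h $ \r -> do
          n <- readIORef r
          writeIORef r (n + 1)
        putMVar done ()
  replicateM_ 4 (forkIO worker)
  replicateM_ 4 (takeMVar done)
  readIORef ref
```

Without the lock, the separate read and write could interleave between threads and lose updates; the same reasoning applies to the save-position, seek, write, seek-back sequence above.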

Overview of Stream Types

And now to the implemented stream types. Handle is an instance of the Stream class, with a straightforward implementation. You can use the Char encoding transformer with Handles. Although Handles implement buffering and locking by themselves, you may also be interested in applying these transformers to the Handle type. This has benefits - "bufferBlockStream" works faster than the internal Handle buffering, and the locking transformer enables use of the "lock" operation to create a lock around a sequence of operations. Moreover, the locking transformer should be used to ensure proper multi-threaded operation of a Handle with added encoding or buffering facilities.

FD

The new method of using files, independent of the existing I/O library, is implemented with the FD type. FD is just an Int representing a POSIX file descriptor, and the FD type implements only the basic Stream I/O operations - vGetBuf and vPutBuf. So, to create a full-featured FD-based stream, you need to apply a buffering transformer. Therefore, the library defines two ways to open files with FD - openRawFD/openRawBinaryFD just create an FD, while openFD/openBinaryFD create an FD and immediately apply a buffering transformer (bufferBlockStream) to it. In most cases you will use the latter operations. Both pairs mimic the arguments and behaviour of the well-known Handle operations openFile/openBinaryFile, so you already know how to use them. Other transformers may then be applied as you need. So, the above-mentioned example can be abbreviated to:

         h <- openFD "test" WriteMode
                  >>= withEncoding utf8
                  >>= withLocking

Thus, to switch from the existing I/O library to using Streams, you need only to replace "h" with "v" in the names of Handle operations, replace openFile/openBinaryFile calls with openFD/openBinaryFD, and add the "withLocking" transformer to files used in multiple threads. That's all!


MemBuf

MemBuf is a stream type that keeps its contents in a memory buffer. There are two kinds of MemBuf you can create - you can either open an existing memory buffer with "openMemBuf ptr size" or create a new one with "createMemBuf initsize". A MemBuf opened by "openMemBuf" will never be resized or moved in memory, and will not be freed by "vClose". A MemBuf created by "createMemBuf" will grow as needed, can be manually resized by the "vSetFileSize" operation, and is automatically freed by "vClose".

Actually, raw MemBufs are created by the createRawMemBuf and openRawMemBuf operations, while createMemBuf/openMemBuf incorporate an additional "bufferMemoryStream" call (as you should remember, buffering adds vGetChar, vPutStr and other text- and byte-oriented I/O operations on top of vReceiveBuf and vSendBuf). You can also apply the Char encoding and locking transformers to these streams.

FunctionsMemoryStream

The fourth Stream type allows you to implement arbitrary streams just by providing 3 functions that implement the vReceiveBuf, vSendBuf and cleanup operations. It seems that this Stream type is of interest only for my own program, and it is mainly worth studying as an example of creating 3rd-party Stream types. It is named "FunctionsMemoryStream"; see the sources if you are interested.

The four remaining Stream types were part of the HVIO module, and I copy their descriptions from there:

In addition to Handle, there are several pre-defined stream types for your use. 'StringReader' is a particularly interesting one. At creation time, you pass it a String. Its contents are read lazily whenever a read call is made. It can be used, therefore, to implement filters (simply initialize it with the result from, say, a map over hGetContents from another Stream object), codecs, and simple I/O testing. Because it is lazy, it need not hold the entire string in memory. You can create a 'StringReader' with a call to 'newStringReader'.

'StringBuffer' is a similar type, but with a different purpose. It provides a full interface like Handle (it supports read, write and seek operations). However, it maintains an in-memory buffer with the contents of the file, rather than an actual on-disk file. You can access the entire contents of this buffer at any time. This can be quite useful for testing I/O code, or for cases where existing APIs use I/O, but you prefer a String representation. Note however that this stream type is very inefficient. You can create a 'StringBuffer' with a call to 'newStringBuffer'.

One significant improvement over the original HVIO library is that 'StringReader' and 'StringBuffer' can work not only in the IO monad, but also in the ST monad.

Finally, there are pipes. These pipes are analogous to the Unix pipes that are available from System.Posix, but don't require Unix and work only in Haskell. When you create a pipe, you actually get two Stream objects: a 'PipeReader' and a 'PipeWriter'. You must use the 'PipeWriter' in one thread and the 'PipeReader' in another thread. Data that's written to the 'PipeWriter' will then be available for reading with the 'PipeReader'. The pipes are implemented completely with existing Haskell threading primitives, and require no special operating system support. Unlike Unix pipes, these pipes cannot be used across a fork(). Also unlike Unix pipes, these pipes are portable and interact well with Haskell threads. A new pipe can be created with a call to 'newHVIOPipe'.
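
The idea of a pipe built purely from Haskell threading primitives can be sketched with a Chan from the base library. This is only an illustration and not the HVIO implementation: `PipeReaderSketch`, `PipeWriterSketch`, `newPipeSketch` and the other names below are hypothetical, and end-of-stream is signalled by sending `Nothing`.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)

-- Both ends share one channel; Nothing marks the end of the stream.
newtype PipeWriterSketch = PipeWriterSketch (Chan (Maybe Char))
newtype PipeReaderSketch = PipeReaderSketch (Chan (Maybe Char))

newPipeSketch :: IO (PipeReaderSketch, PipeWriterSketch)
newPipeSketch = do
  ch <- newChan
  pure (PipeReaderSketch ch, PipeWriterSketch ch)

pipePutStr :: PipeWriterSketch -> String -> IO ()
pipePutStr (PipeWriterSketch ch) = mapM_ (writeChan ch . Just)

pipeClose :: PipeWriterSketch -> IO ()
pipeClose (PipeWriterSketch ch) = writeChan ch Nothing

-- Read characters until the writer closes its end. readChan blocks
-- while the pipe is empty, which is why the writer usually runs in
-- another thread.
pipeGetAll :: PipeReaderSketch -> IO String
pipeGetAll (PipeReaderSketch ch) = go []
  where
    go acc = do
      x <- readChan ch
      case x of
        Nothing -> pure (reverse acc)
        Just c  -> go (c : acc)

-- Writer in a forked thread, reader in the calling thread.
demoPipe :: IO String
demoPipe = do
  (r, w) <- newPipeSketch
  _ <- forkIO (pipePutStr w "hello" >> pipeClose w)
  pipeGetAll r
```

As in the description above, nothing here depends on operating system pipes, so the same code works on any platform that supports Haskell threads.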