[Haskell-cafe] Network.HTTP+ByteStrings Interface--Or: How to shepherd handles and go with the flow at the same time?

Mon May 28 15:20:36 EDT 2007

On Thu, May 24, 2007 at 10:17:49PM +0100, Jules Bean wrote:
> I've been having something of a discussion on #haskell about this but
> I had to go off-line and, in any case, it's a complicated issue, and I
> may be able to be more clear in an email.
> 
> The key point under discussion was what kind of interface the HTTP
> library should expose: synchronous, asynchronous? Lazy, strict?
> 
> As someone just pointed out, "You don't like lazy IO, do you?". Well,
> that's a fair characterisation. I think unsafe lazy IO is a very very
> cute hack, and I'm in awe of some of the performance results which
> have been achieved, but I think the disadvantages are underestimated.

> Of course, there is a potential ambiguity in the phrase 'lazy IO'. You
> might interpret 'lazy IO' quite reasonably to refer to any programming
> style in which the IO is performed 'as needed' by the rest of the
> program. So, to be clear, I'm not raising a warning flag about that
> practice in general, which is a very important programming
> technique. I'm raising a bit of a warning flag over the particular
> practice of achieving this in a way which conceals IO inside thunks
> which have no IO in their types: i.e. using unsafeInterleaveIO or even
> unsafePerformIO.
> 
> Why is this a bad idea? Normally evaluating a haskell expression can
> have no side-effects. This is important because, in a lazy language,
> you never quite know[*] when something's going to be evaluated. Or if
> it will. Side-effects, on the other hand, are important things (like
> launching nuclear missiles) and it's rather nice to be precise about
> when they happen. One particular kind of side effect which is slightly
> less cataclysmic (only slightly) is the throwing of an exception. If
> pure code, which is normally guaranteed to "at worst" fail to
> terminate can suddenly throw an exception from somewhere deep in its
> midst, then it's extremely hard for your program to work out how far
> it has got, and what it has done, and what it hasn't done, and what it
> should do to recover. On the other hand, no failure may occur, but the
> data may never be processed, meaning that the IO is never 'finished'
> and valuable system resources are locked up forever. (Consider a naive
> program which reads only the first 1000 bytes of an XML document
> before getting an unrecoverable parse failure. The socket will never
> be closed, and system resources will be consumed permanently.)
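
To make the mechanism under discussion concrete, here is a minimal sketch
of IO concealed inside a thunk whose type does not mention IO. The name
deferredEffectDemo is hypothetical, and an IORef write stands in for a real
read from a socket; the sketch assumes a current GHC with the base library.

```haskell
import Control.Exception (evaluate)
import Data.IORef (newIORef, readIORef, writeIORef)
import System.IO.Unsafe (unsafeInterleaveIO)

-- Hypothetical demo: returns whether the hidden effect had run
-- (before forcing the value, after forcing the value).
deferredEffectDemo :: IO (Bool, Bool)
deferredEffectDemo = do
  ran <- newIORef False
  -- The write is deferred into the thunk for 'x'. The type of 'x' is
  -- plain Int, so nothing warns the consumer that forcing it does IO.
  x <- unsafeInterleaveIO (writeIORef ran True >> return (42 :: Int))
  before <- readIORef ran
  _ <- evaluate x            -- forcing 'x' is what triggers the write
  after <- readIORef ran
  return (before, after)

main :: IO ()
main = deferredEffectDemo >>= print
```

If 'x' were never forced, the write would never happen at all, which is
the same shape as the socket that is never read to the end.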

Yes, obviously lazy IO needs to be done with care, but pure functions
always consume resources, and lazy IO is not unique in this regard.  It
does change the nature of the resources consumed, but that's all.  No
function is guaranteed to "at worst" fail to terminate: any function can
also fail with error, or run out of stack space.

It seems that your real problem here is that sockets aren't freed when
programs exit.  I suppose that's a potential problem, but it doesn't seem
like a critical one.  I assume firefox has already permanently consumed
gobs of system resources, and it hasn't bothered me yet... except for the
memory, and that's fortunately not permanent.  (Incidentally, couldn't
atexit be used to clean up sockets in case of unclean exiting?)

Obviously lazy IO can only be used with IO operations that are considered
"safe" by the programmer (usually read operations), but for those
operations, when the programmer declares himself to not care when the
reading is actually done, lazy IO is a beautiful thing.  In particular, it
allows the writing of modular reusable functions.  That's actually a Good
Thing... and as long as write operations are the only ones that require
cleanup, it's also perfectly safe.
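
As an illustration of the modularity point, here is a minimal sketch of a
pure consumer that knows nothing about where its input comes from. The
name countNewlines is hypothetical; the sketch assumes the bytestring
package and reads standard input lazily with L.getContents.

```haskell
import Data.Int (Int64)
import qualified Data.ByteString.Lazy.Char8 as L

-- Pure and reusable: the same function works on a file, a socket
-- stream, or an in-memory string, because it only sees a ByteString.
countNewlines :: L.ByteString -> Int64
countNewlines = L.count '\n'

main :: IO ()
main = do
  -- The input is read chunk by chunk as countNewlines demands it,
  -- so the whole stream need not be held in memory at once.
  input <- L.getContents
  print (countNewlines input)
```

The producer (L.getContents here) can be swapped for any other source of
a lazy ByteString without touching the consumer.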

> Trivial programs may be perfectly content to simply bail out if an
> exception is thrown. That's very sensible behaviour for a small
> 'pluggable' application (most of the various unix command line
> utilities all bail out silently or nearly silently on SIGPIPE, for
> example). However this is not acceptable behaviour in a complex
> program, clearly. There may be resources which need to be released,
> there may be data which needs saving, there may be reconstruction to
> be attempted on whatever it was that 'broke'.
>
> Error handling and recovery is hard. Always has been. One of the
> things that simplifies such issues is knowing "where" exceptions can
> occur. It greatly simplifies them. In haskell they can only occur in
> the IO monad, and they can only occur in rather specific ways: in most
> cases, thrown by particular IO primitives; they can also be thrown
> 'To' you by other threads, but as the programmer, that's your
> problem!

This is irrelevant to the question of lazy IO or not lazy IO.  As you say,
all errors happen in the IO monad, and that's true with or without lazy IO,
since ultimately IO is the only consumer of lazy data.  Proper use of
bracket catches all errors (modulo bugs in bracket, and signals being
thrown... but certainly all calls to error), and you can do that at the top
level, if you like.
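
A minimal sketch of that claim, assuming a current base library: the list
named poisoned is a hypothetical stand-in for lazily read data that fails
partway through, and bracketDemo is likewise a made-up name. The failure
surfaces in IO at the point where the data is forced, so the release
action of bracket still runs.

```haskell
import Control.Exception (ErrorCall, bracket, evaluate, try)
import Data.IORef (newIORef, readIORef, writeIORef)

-- Stand-in for lazy input whose tail raises an error when demanded.
poisoned :: [Int]
poisoned = [1, 2, 3] ++ error "read failed"

-- Returns (did the consumer fail?, did the release action run?).
bracketDemo :: IO (Bool, Bool)
bracketDemo = do
  released <- newIORef False
  result <- try (bracket (return ())
                         (\_ -> writeIORef released True)
                         (\_ -> evaluate (sum poisoned)))
              :: IO (Either ErrorCall Int)
  wasReleased <- readIORef released
  return (either (const True) (const False) result, wasReleased)

main :: IO ()
main = bracketDemo >>= print
```

Note that this only works because the data is forced inside the bracket
(via evaluate); a lazy value that escapes the bracket unevaluated would be
forced later, outside its protection.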

The downside in error checking when using lazy IO is just that the part of
your program where errors pop up becomes less deterministic.  However,
since errors can happen at any time even without lazy IO, this is only a
question of probability of errors showing up at certain times (think out of
memory conditions, signals thrown, etc).  Well-designed programs will be
written robustly.  (Yes, that's a truism, but it's one you seem to be
forgetting.)

> Ok. Five paragraphs of advocacy is plenty. If anyone is still reading
> now, then they must be either really interested in this problem, or
> really bored. Either way, it's good to have you with me! These issues
> are explained rather more elegantly by Oleg in [1].
...
> Given these three pairs of options, what need is there for an unsafe
> lazy GET?  What niche does it fill that is not equally well filled by
> one of these?
> 
> Program conciseness, perhaps. The kind of haskell oneliner whose
> performance makes us so (justly) proud. In isolation, though I don't
> find that a convincing argument; not with the disadvantages taken also
> into account.
> The strongest argument then is that you have a 'stream processing'
> function, that is written 'naively' on [Word8] or Lazy ByteString, and
> wants to run as data is available, yet without wasting space. I'm
> inclined to feel that, if you really want to be able to run over 650M
> files, and you want to run as data is available, then you in practice
> want to be able to give feedback to the rest of your application on
> your progress so far; i.e., L.ByteString -> a is actually too simple a
> type anyway.

Yes, this is the argument for lazy IO, and it's a valid one.  Any
adequately powerful interface can be used to implement a lazy IO function,
and people will do so, whether or not it makes you happy.  It'd be nice to
have it in the library itself.

Program conciseness is a real issue.  Simple effective APIs make for useful
libraries, and the simplest API is likely to be the most commonly used.  If
the simplest API is strict, then that means that there'll most often be
*no* feedback until the download is complete.  A lazy download means that
feedback can be provided instantly, as the data is consumed.  True, you
need to include some feedback logic in your consumer, but that's where
you'll almost certainly want it anyhow.  And in many cases the feedback
could come for free, in the form of output.
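
A minimal sketch of feedback logic living in the consumer, assuming the
bytestring package. The name consumeWithFeedback is hypothetical, and
standard input stands in for a lazily downloaded response body.

```haskell
import Control.Monad (foldM)
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import System.IO (hPutStrLn, stderr)

-- Walk the chunks of a lazy ByteString, reporting the running total
-- on stderr after each chunk, and return the final byte count.
consumeWithFeedback :: L.ByteString -> IO Int
consumeWithFeedback body = foldM step 0 (L.toChunks body)
  where
    step total chunk = do
      let total' = total + S.length chunk
      hPutStrLn stderr ("received " ++ show total' ++ " bytes")
      return total'

main :: IO ()
main = L.getContents >>= consumeWithFeedback >>= print
```

Because the chunk list is itself produced lazily, each progress line is
printed as soon as the corresponding chunk has arrived, rather than after
the whole body is complete.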
-- 
David Roundy
Department of Physics
Oregon State University