Lost output in multithreaded program

Simon Marlow simonmar@microsoft.com
Thu, 31 Jul 2003 10:39:49 +0100


I haven't investigated in detail, but I believe your problem might be
caused by a known bug in the implementation of forkProcess, namely that
when forking from a child thread it doesn't kill the main thread.  The
workaround is to fork (in your case call popen) from the main thread,
not a child thread.

Cheers,
	Simon

> I am puzzled by the behaviour of this (stripped-down,=20
> uglified) program.=20
>   It is supposed to run a bunch of shell commands simultaneously,=20
> collate their standard output and error, and print their output as=20
> though they had run sequentially.
>=20
>      module Main where
>=20
>      import Control.Concurrent
>      import Data.Maybe
>      import GHC.Handle
>      import GHC.IO
>      import Posix (popen)
>=20
>      main =3D do
>          cmds    <- getContents >>=3D return.lines
>          outCh   <- newChan
>          mainThr <- myThreadId
>          forkIO $ do
>                  mapM_ (startCommand outCh) cmds
>                  writeChan outCh $ killThread mainThr
>          getChanContents outCh >>=3D sequence_
>=20
>      startCommand :: Chan (IO ()) -> String -> IO ()
>      startCommand ch cmd =3D do
>          -- Prevent lazy reads after forking
>          length cmd `seq` return ()
>          (out,err,pid) <- popen "/bin/sh" ["-c",cmd] Nothing
>          -- Prevent deadlock (waiting to read stdout
>          -- while the child waits to write stderr).
>          forkIO $ length err `seq` return ()  -- culprit line?
>          writeChan ch $ hPutStr stdout out
>          writeChan ch $ hFlush stdout
>          writeChan ch $ hPutStr stderr err
>          writeChan ch $ hFlush stderr
>=20
> Most of the time it works but every so often a chunk of child output=20
> gets lost.  (I'm sure I've seen it duplicating chunks of=20
> output too, but=20
> I haven't been able to isolate a test case.)  Corruption of=20
> stderr seems=20
> more common than corruption of stdout.  The corruption=20
> doesn't get more=20
> predictable if I generate a list of commands with fixed=20
> 'random' sleeps=20
> and use that (so I don't think it's a gross timing issue).  The lost=20
> data is not necessarily a prefix of the output string, nor a suffix.=20
> The lost data does not reappear later in the program's=20
> output.  The lost=20
> data *does* appear to correspond to byte sequences that are=20
> written in a=20
> single write() by the child process.
>=20
> Example:
>=20
>      $ cat test.sh
>      #!/bin/sh
>      echo -n "std"        ; sleep 1
>      echo -n "STD"    >&2 ; sleep 1
>      echo       "out"     ; sleep 1
>      echo       "ERR" >&2 ; sleep 1
>      $ for i in `seq 100`; do echo "sleep $((RANDOM % 10));=20
> ./test.sh";=20
> done | ./a.out
>    <snip>
>      stdout
>      STDERR
>      stdout
>      ERR
>      stdout
>      STDERR
>    <snip>
>      stdout
>      STDERR
>      stdout
>      STDstdout
>      STDERR
>      stdout
>      STDERR
>    <snip>
>=20
> If I comment the 'culprit line' above then the program runs without=20
> error.  This confounds me: I believe that that line should have no=20
> effect except to suck the child's stderr into the program as soon as=20
> possible.  (Am I wrong?  And is there a better way of doing this?)
>=20
> Platform: ghc-6.0-7 / RedHat 8.0 / x86.
> Compilation: ghc -package posix test.hs
>=20
> Where did my output go?  Have I run into some gotcha like the 'lazy=20
> reads after forking' thing?  Can someone tell me what I'm doing wrong?
>=20
> Hopeful thanks,
> // David
> --=20
> David Hughes
> UNIX sysadmin, Serco SA         -+-         Tel.: +41 22 767 8997
> Computing Centre, CERN          -+-        David.W.Hughes@cern.ch
>=20
> This message expresses my own opinions and should not be construed
> as the opinions of Serco (who employ me) or of CERN (where I work).
>=20
> _______________________________________________
> Glasgow-haskell-users mailing list
> Glasgow-haskell-users@haskell.org
> http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
>=20
>=20