[Haskell] Global Variables and IO initializers: A proposal and semantics

John Meacham john at repetae.net
Tue Oct 12 18:33:13 EDT 2004


Some time ago I put some thought into the 'global initializers' problem in
Haskell, but for various reasons never wrote up my conclusions.

The issues are
1) polymorphic references allow breaking of type safety
2) when do the initializers get evaluated?
3) do we need it at all?

I will address these points in no particular order.

3) Yes. The

{-# NOINLINE fooVar #-}
fooVar = unsafePerformIO $ newIORef 0

idiom is very common in real programs, and very difficult to work
around not having.
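
Spelled out as a complete program, that idiom looks roughly like this (a
minimal sketch; the tick/main names are my own, for illustration only):

-- A minimal, self-contained sketch of the global-IORef idiom.
import Data.IORef
import System.IO.Unsafe (unsafePerformIO)

{-# NOINLINE fooVar #-}                  -- ensure exactly one shared IORef
fooVar :: IORef Int
fooVar = unsafePerformIO (newIORef 0)

tick :: IO ()
tick = modifyIORef fooVar (+1)

main :: IO ()
main = tick >> tick >> readIORef fooVar >>= print   -- prints 2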
And if that is not enough, a couple more points:
* we can already do it horribly inefficiently and unsafely, and the world
  has not collapsed:
  via getting and setting strings in the environment we can create global
  variables holding Read/Show-able values,
  via writing and reading temporary files,
  and via the FFI with just a

    foreign import ccall "&global_var" global_var :: Ptr Int

  Note that we do not need any foreign code, just an object file which
  allocates the space in the bss for global_var. The fact that we can
  access and work with such space from Haskell, but have no way to
  allocate it, is quite telling that something is missing in the language.
  (A fleshed-out sketch of this workaround follows below.)
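
For concreteness, here is roughly what that FFI workaround looks like in
full. It assumes only that some object file linked into the program
defines the symbol (for example "int global_var;" on the C side); the
Haskell-side helper names are otherwise mine:

{-# LANGUAGE ForeignFunctionInterface #-}
-- Sketch of the FFI global-variable workaround; no foreign code ever runs,
-- the linked object merely reserves the storage for global_var.
module GlobalViaFFI where

import Foreign.Ptr (Ptr)
import Foreign.Storable (peek, poke)
import Foreign.C.Types (CInt)

foreign import ccall "&global_var" global_var :: Ptr CInt

setGlobal :: CInt -> IO ()
setGlobal = poke global_var

getGlobal :: IO CInt
getGlobal = peek global_var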

1) This is a real problem. A straightforward solution is to require
top-level IO bindings to be monomorphic. This is not a big restriction,
as it is the exact same restriction placed on bindings in 'do' blocks or
on lambda-bound variables. An example of the breakage it prevents is
sketched below.
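
Here is the standard unsoundness example, written with the existing
unsafePerformIO idiom standing in for the proposed syntax (names mine):

-- A polymorphic top-level reference lets a value be written at one type
-- and read back at another; requiring monomorphism rules this out.
module Unsound where

import Data.IORef
import System.IO.Unsafe (unsafePerformIO)

{-# NOINLINE badRef #-}
badRef :: IORef [a]                  -- the polymorphic type is the problem
badRef = unsafePerformIO (newIORef [])

broken :: IO Char
broken = do
  writeIORef badRef [65 :: Int]      -- used at type IORef [Int] ...
  (c:_) <- readIORef badRef          -- ... and again at type IORef [Char]
  return c                           -- c is really an Int in disguise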

2) is the tricky one. 

One proposed solution is to treat the global binding

fooVar <- newIORef 0

as equivalent to

fooVar = unsafePerformIO $ newIORef 0

with the appropriate compiler magic invoked to make sure that
optimizations do not ruin the intended effect: that newIORef 0 is
executed exactly once and its result shared by (and only by) all uses of
fooVar.

This has the advantage of being very easy to implement with GHC as it
currently is: GHC can just rewrite the binding as above, turn off CSE for
the module, and turn off inlining for the binding.

But there are strong disadvantages:

 * we can observe execution order by making the top-level bindings have
   side effects (see the sketch after this list).
 * we may need or want the optimizations we have to turn off; a Haskell
   implementation with aggressive inter-module optimization might end up
   having to turn them off for the whole program.
 * it is a semantic mess to formalize and hence to optimize (IMHO).
 * it is not optimally efficient: a thunk for 'fooVar' is still created,
   and the IORef is still allocated on the heap.
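
To make the first disadvantage concrete, a small sketch (my own example,
not from the original) of how the unsafePerformIO rewriting makes
execution order observable:

-- Under the unsafePerformIO rewriting each initializer runs whenever its
-- thunk happens to be demanded, so demand order, not the textual order of
-- the bindings, decides when (and whether) each message prints.
import System.IO.Unsafe (unsafePerformIO)

{-# NOINLINE a #-}
a :: ()
a = unsafePerformIO (putStrLn "initializing a")

{-# NOINLINE b #-}
b :: ()
b = unsafePerformIO (putStrLn "initializing b")

main :: IO ()
main = do
  b `seq` return ()   -- demands b first: "initializing b" prints here
  a `seq` return ()   -- only now does "initializing a" print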


A better way
============

The other way draws on the work done on fixIO and recursive monadic
bindings and is much preferable in my estimation. In addition, like
other syntactic sugar, it admits a simple rewriting to core Haskell
without any special optimizer/compiler magic required, and most of the
semantics work has already been done.

The following papers have background material:
Recursive Monadic Bindings:
http://www.cse.ogi.edu/PacSoft/projects/rmb/mfix.ps.gz
Semantics of fixIO:
http://www.cse.ogi.edu/PacSoft/projects/rmb/fics.ps.gz


Unlike some other rewritings, it is not recommended that this one actually
be used to compile Haskell programs (among other things, it would break
separate compilation as stated); the rewriting is used to show the
correspondence between the semantics of fixIO as given in the papers and
global bindings. An actually efficient way to implement this idea that
behaves identically and allows separate compilation follows later.


The basic idea is that your entire program behaves as if in a giant
'mdo' block, ordered in module dependency order. 


So,

module Main where

        import Bob

        fooVar <- newIORef bob
        main = readIORef fooVar >>= print

module Bob where

        bob <- return 3

is transformed into

main = mdo
        bob <- return 3
        let main' =  readIORef fooVar >>= print
        fooVar <- newIORef bob
        main'


(If we were to actually carry out this transformation, appropriate
renamings would have to occur to avoid name capture, and the let bindings
in mdo must be of the polymorphic variety (see section 3.2 of the first
paper).)

Note that ordinary values must be pushed ahead of all the global bindings
to ensure they scope properly over wherever they may be used (assuming the
polymorphic mdo notation described in the paper).


So the general transformation results in

newMain = mdo
        -- let values for module Foo
        -- global bindings for module Foo
        -- let values for mutually recursive modules Bar and Baz
        -- global bindings for modules Bar and Baz
        -- let values for module Main
        -- global bindings for module Main
        Main.main

Note that mutually recursive modules must be treated as a unit, and hence
the order in which their bindings occur (across those modules) is not well
defined. We can chalk this up to mutually recursive module oddness in
general; it should not be a problem in practice.

This has a lot of nice properties:
 * no unsafePerformIO-like tricks.
 * evaluation order of the functional code does not matter; the IO actions
   are carried out in the order stated in the modules, at initialization.
 * well-defined semantics, as given in the papers mentioned above.
 * it admits a very efficient implementation.


How to efficiently implement this:


module Foo where
        fooVar <- newIORef 0
        showsVar <- newIORef 0
        tick = modifyIORef fooVar (+1) 
        showTicks = do
                putStr "Number of ticks: "
                readIORef fooVar >>= print
                modifyIORef showsVar (+1)
 

So, the compilation proceeds as follows.

Statically allocate space for pointers to two Haskell thunks, perhaps in
the bss or the initialized data segment.

The two pointers should initially point to the equivalent of
error "Strict Loop in global bindings" (or a better error message,
perhaps incorporating a file name and line number).

The module also gets an internal procedure, say 'Foo._init_', which upon
its first run sets a flag saying it is being run, calls _init_ in each of
its imported modules, executes the IO action associated with each of its
global bindings in turn, and updates the corresponding pointers to point
to the values returned by those IO actions. If the flag saying it has
already run (or is in the process of running) is set, the _init_
procedure does nothing and returns immediately.


We then use the following main' as our entry point:
main' = do
        Main._init_
        Main.main
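
To make the scheme concrete, here is a hand-written approximation of what
the generated code for module Foo could look like; the static slots and
the run-once flag are simulated with the unsafePerformIO idiom (all names
are mine), whereas a real implementation would lay them out statically in
the object file and involve no unsafePerformIO at all:

module Foo (tick, showTicks, fooInit) where

import Data.IORef
import System.IO.Unsafe (unsafePerformIO)

-- the run-once / in-progress flag for this module's initializer
{-# NOINLINE initStarted #-}
initStarted :: IORef Bool
initStarted = unsafePerformIO (newIORef False)

-- the two static slots, initially holding the "hole" error thunk
{-# NOINLINE fooVarSlot #-}
fooVarSlot :: IORef (IORef Int)
fooVarSlot = unsafePerformIO (newIORef (error "Strict Loop in global bindings"))

{-# NOINLINE showsVarSlot #-}
showsVarSlot :: IORef (IORef Int)
showsVarSlot = unsafePerformIO (newIORef (error "Strict Loop in global bindings"))

-- what the compiler would generate as Foo._init_
fooInit :: IO ()
fooInit = do
  started <- readIORef initStarted
  if started
    then return ()                     -- already run (or running): do nothing
    else do
      writeIORef initStarted True      -- mark as in progress before recursing
      -- ... here we would call the _init_ of every imported module ...
      newIORef 0 >>= writeIORef fooVarSlot     -- run each binding's action once
      newIORef 0 >>= writeIORef showsVarSlot   -- and fill in its slot

tick :: IO ()
tick = readIORef fooVarSlot >>= \r -> modifyIORef r (+1)

showTicks :: IO ()
showTicks = do
  putStr "Number of ticks: "
  readIORef fooVarSlot >>= readIORef >>= print
  readIORef showsVarSlot >>= \r -> modifyIORef r (+1)

A compiler-generated Main._init_ would call fooInit (and the _init_ of
every other import) in the same guarded way before main runs, which is
exactly what the main' wrapper above arranges.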




Note that all the error thunks (or holes) are completely removed by the
time main starts. The only way to hit one is to create recursive bindings
that are strict in a loop, such as

x <- return $! x

which is just as well, since such a binding has no sensible value anyway.
Note that

x <- return x

is equivalent to

x = x

as we expect.
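
As an illustration of the difference, here is the same contrast written
with today's mdo/fixIO machinery standing in for the proposed top-level
bindings (RecursiveDo syntax; example mine):

{-# LANGUAGE RecursiveDo #-}
-- Lazy self-reference through the recursive knot is fine; only a binding
-- that is strict in its own result gets stuck.
module Knot where

import Data.IORef

okKnot :: IO ()
okKnot = mdo
  r  <- newIORef (1 : xs)      -- refers lazily to the value read back below
  xs <- readIORef r
  print (take 3 xs)            -- prints [1,1,1]

strictLoop :: IO ()
strictLoop = mdo
  x <- return $! (x :: Int)    -- strict in its own result: fixIO diverges or
  print x                      -- errors, the analogue of hitting the hole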

The global variables can be allocated statically, meaning we can generate
much better code, with offsets directly to our values. Strictness analysis
can be applied, and if it determines that an integral IORef is always
passed strict values, for instance, it can unbox the global int.

for the purposes of GC, the global variables can be treated like CAFs. 

So, I believe this is a clean and efficient way to allow global state in
haskell. sorry for the long post and bad grammar, I had to just sit and
punch this out or I'd never write it down :)

        John

-- 
John Meacham - ⑆repetae.net⑆john⑈ 

