Libraries and hierarchies

Simon Marlow simonmar@microsoft.com
Fri, 1 Aug 2003 11:10:26 +0100


What follows is a proposal from Simon P.J. and myself, for solving some
of the problems that have arisen with hierarchical modules in Haskell.

We think this is quite a nice solution: it decentralises the allocation
of names, allows versioning of libraries and referencing libraries by
GUID, allows relative module names, and allows moving modules within the
hierarchy without changing the source code, all without adding any extra
syntax :-)

Please tell us what you think.  This will only get adopted if everyone
agrees, because it needs all the Haskell implementations on board in
order to work.

THE PROBLEM
~~~~~~~~~~~

Problem 1: Allocating names in the hierarchy

At the moment, we have a scheme of central registration (albeit
informally on this list), along with a way for users to name libraries
based on their email address (eg. User.Com.Microsoft.Simonmar.Foo).
This is unsatisfactory, because (a) having a central registry for
names is too cathedralish, and (b) it's inconvenient to use the
email-address form because it gives rise to overly long module names.

Problem 2: Moving a module tree around

Suppose you have a tree of modules Control.Monad, Control.Monad.X,
Control.Monad.Y etc, and you want to move them from Control to some
other place Foo.Baz in the hierarchy, to give Foo.Baz.Monad,
Foo.Baz.Monad.X, etc.  At the moment you have to visit every module
and change its module header to give the correct absolute path name.

Problem 3: Long module names in imports

It's plain tiresome to have to write
	import User.Simon.Text.PrettyPrint.HughesPJ
Long path names, repeated all over the source tree, are painful.
They are particularly painful when you want to refer to another
module in the same library -- then, if you decide to put the library
somewhere else, you have to change all its internal imports.


A POSSIBLE SOLUTION
~~~~~~~~~~~~~~~~~~~
The key idea is this: there is no longer a single global hierarchy of
modules, but every site and every user has the means to populate their
own module hierarchy as they see fit, with off-the-shelf and local
libraries.

This means that when you install a package (a sub-hierarchy of modules),
you get to choose where in the global hierarchy on your system it is
rooted.  There would probably be a default, which you would most often
go along with unless it clashes with another library on your system, in
which case you might choose to site it somewhere else.  You can even
choose to site it in several places in the tree.

If you want several versions of a library installed on your system, you
can do that too, provided you site them at different places in the
hierarchy.  If you install a new version of a library, just re-site the
old one to a version-specific place, and install the new one.

eg. I install the GTK+HS library on my system.  By default, it sites
itself under

      Graphics.UI.Gtk.*
      Graphics.UI.Gtk.V0-15.*

and possibly additionally sites itself under a GUID-based root, or one
based on an API hash:

      GUID_XXXXXXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX.*

When I install a new version of GTK+HS, say version 0.16, I can replace
the existing Graphics.UI.Gtk with the new version, and I now have:

      Graphics.UI.Gtk.*  -- now refers to version 0.16
      Graphics.UI.Gtk.V0-15.*
      Graphics.UI.Gtk.V0-16.*

and two distinct GUID sites.
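
To make the GTK+HS example concrete, here is a minimal sketch of the
per-system site table and of what installing a new version does to it.
Everything named here (Site, PackageId, SiteTable, siteAt,
installVersion, and the "gtk-0.15" style of package identifier) is
invented for illustration; it is not part of any existing
implementation, and GUID sites are left out.

```haskell
-- A site is a module prefix, e.g. "Graphics.UI.Gtk".
type Site = String

-- A package identifier; the "name-version" format is an assumption
-- made for this sketch only.
type PackageId = String

-- The per-system site table: which package is grafted at which prefix.
type SiteTable = [(Site, PackageId)]

-- Graft a package at a site, replacing whatever was sited there before.
siteAt :: Site -> PackageId -> SiteTable -> SiteTable
siteAt site pkg table = (site, pkg) : filter ((/= site) . fst) table

-- Install a version: it takes over the default site and also gets a
-- version-specific site.  Version-specific sites of older versions
-- are left alone, so old code can still refer to them.
installVersion :: Site -> String -> PackageId -> SiteTable -> SiteTable
installVersion defaultSite versionTag pkg =
  siteAt (defaultSite ++ "." ++ versionTag) pkg . siteAt defaultSite pkg
```

Installing "gtk-0.15" with tag "V0-15" and then "gtk-0.16" with tag
"V0-16" at Graphics.UI.Gtk leaves three entries: the default site now
points at 0.16, and each version keeps its own version-specific site.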

You could even use this mechanism, in a per-user configuration file say,
to site the GTK+HS library directly in Gtk.*, if you're too lazy to type
Graphics.UI all the time.  This wouldn't be recommended though: any
source code you write won't compile on someone else's system that
doesn't have the same convention.

PACKAGES and DISTRIBUTION
~~~~~~~~~~~~~~~~~~~~~~~~~
A "package" is the unit of distribution.  A package includes a
sub-hierarchy of Haskell modules, as well as perhaps other stuff (such
as C header files etc).  When installing a package one specifies one or
more "sites" at which the modules are to be grafted into the module
hierarchy.  A site is just the module prefix to be used for modules in
that package.  The sub-hierarchy within a package is not changed by
re-siting.  A package comes with a list of "default sites", where it
will be installed by default.

It is an install-time error to graft a package in at a site where doing
so would cause a single module name to be defined twice.
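
The install-time check could be sketched roughly as follows.  This
assumes the installer knows the list of module names already present in
the hierarchy; the names graftedNames and graft are made up for the
example.

```haskell
import Data.List (intersect)

type ModuleName = String   -- e.g. "Foo.Bar.A.B"
type Site       = String   -- a module prefix, e.g. "Foo.Bar"

-- The absolute names a package's modules would get if grafted at a site.
graftedNames :: Site -> [ModuleName] -> [ModuleName]
graftedNames site = map (\m -> site ++ "." ++ m)

-- Try to graft a package (given by its package-relative module names)
-- into the hierarchy of already-installed absolute module names.
-- Left lists the names that would be defined twice; Right is the
-- extended hierarchy.
graft :: Site -> [ModuleName] -> [ModuleName]
      -> Either [ModuleName] [ModuleName]
graft site pkgModules installed =
  case graftedNames site pkgModules `intersect` installed of
    []      -> Right (graftedNames site pkgModules ++ installed)
    clashes -> Left clashes
```

A package sited at several places would simply run this check once per
site.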

The actual representation of a package in distributable form may vary
between implementations.  For example, Hugs may need only source files,
while GHC may distribute interface files and binaries.  All that matters
is that for any particular implementation (Hugs, say) there's a
specified way to take a package and install it at one or more sites in
that Hugs-compiler's module hierarchy.

For example, GHC's package configuration for an installed package
currently looks something like this:

   {
      name = "mylib",
      import_dirs = ["/usr/local/lib/mylib/imports"],
      ...
   }

and we suggest adding an extra field, sites:

   {
      name = "mylib",
      import_dirs = ["/usr/local/lib/mylib/imports"],
      sites = [ "Foo.Bar.MyLib", "Foo.Bar.MyLib.V2.3", "GUID_XXXX" ]
      ...
   }
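
As a rough illustration of what the extra field means, the same
configuration can be written as a Haskell record, with a helper that
lists every absolute name under which one package-relative module
becomes visible.  The record type PackageConfig and the function
sitedNames are hypothetical, and only the three fields shown above are
modelled; the elided fields are not.

```haskell
-- Only the fields that appear in the example above; the real
-- configuration has more.
data PackageConfig = PackageConfig
  { name        :: String
  , import_dirs :: [FilePath]
  , sites       :: [String]
  } deriving (Show, Eq)

mylib :: PackageConfig
mylib = PackageConfig
  { name        = "mylib"
  , import_dirs = ["/usr/local/lib/mylib/imports"]
  , sites       = ["Foo.Bar.MyLib", "Foo.Bar.MyLib.V2.3", "GUID_XXXX"]
  }

-- Every absolute name of a package-relative module, one per site.
sitedNames :: PackageConfig -> String -> [String]
sitedNames cfg relativeName = [ s ++ "." ++ relativeName | s <- sites cfg ]
```

So a module A.B inside mylib would be importable as Foo.Bar.MyLib.A.B,
Foo.Bar.MyLib.V2.3.A.B, or GUID_XXXX.A.B.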


SOURCE CODE
~~~~~~~~~~~
What about module names in the source code, and how are modules
compiled?  Suppose I am compiling a package whose default site is
Foo.Bar, and which contains modules Foo.Bar.A.B and Foo.Bar.A.C (assuming
it is installed at the default site).  I put the source code in A/B.hs
and A/C.hs, and the code would look like this:

    module A.B where
    import A.C

The implementation must obey the following rule:
	When compiling a module belonging to a package, that package
	is temporarily grafted into the root of the module hierarchy.

This means that 'import A.C' will find the module A.C from the package
being compiled.  If there is already a global module A.C, the package
module "wins"; so the global module A.C is inaccessible.  (There could
be some extra mechanism to get around this, if it seems important.)

Modules in other packages can be imported only by uttering their full
path names in the global hierarchy (of the compiler that is compiling
the package).
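
The lookup rule while compiling a package might be sketched like this.
It is only a model of the rule as stated, with module sets represented
as plain lists; Origin and resolveImport are names invented here.

```haskell
type ModuleName = String

-- Where an import was found.
data Origin = CurrentPackage | GlobalHierarchy
  deriving (Show, Eq)

-- Resolve an import while compiling a module of some package.
-- packageModules: package-relative names in the package being compiled.
-- globalModules:  absolute names in the compiler's global hierarchy.
resolveImport :: [ModuleName] -> [ModuleName] -> ModuleName -> Maybe Origin
resolveImport packageModules globalModules imp
  | imp `elem` packageModules = Just CurrentPackage   -- the package "wins"
  | imp `elem` globalModules  = Just GlobalHierarchy
  | otherwise                 = Nothing               -- no such module
```

With packageModules = ["A.B", "A.C"] and a global hierarchy that also
happens to contain A.C, 'import A.C' resolves to the current package,
and the global A.C is shadowed.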

After installing the library, the tree of modules it contains will be
grafted into the global hierarchy at possibly many places, and the
modules can then only be imported by uttering their full path names in
the global hierarchy.

Alternative design: modules in the current package could be specified
explicitly, perhaps by prefixing them with '.'.  This would avoid the
possibility of overlap between the current package and the global
hierarchy, at the expense of having to add lots of extra '.'s.
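
For comparison, a sketch of the alternative rule, under the assumption
that a package-relative import is written with a leading '.' (so
".A.C" means A.C in the current package).  Again, Origin and
resolveExplicit are invented names, and the '.' syntax is only the
suggestion made above, not something any implementation accepts.

```haskell
type ModuleName = String

data Origin = CurrentPackage | GlobalHierarchy
  deriving (Show, Eq)

-- A leading '.' selects the current package; anything else is looked
-- up in the global hierarchy only, so the two can never overlap.
resolveExplicit :: [ModuleName] -> [ModuleName] -> String -> Maybe Origin
resolveExplicit packageModules globalModules imp =
  case imp of
    '.' : rel
      | rel `elem` packageModules -> Just CurrentPackage
      | otherwise                 -> Nothing
    _
      | imp `elem` globalModules  -> Just GlobalHierarchy
      | otherwise                 -> Nothing
```

Here a global module A.C stays reachable as plain "A.C", while the
package's own A.C must be written ".A.C".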

IMPLEMENTATION
~~~~~~~~~~~~~~
How do we implement this?  For Hugs, it should be relatively
straightforward; two implementations spring to mind.  Either

  (a) transform the source files as they are installed, to
      replace package-relative module names with absolute names.
      Multiply-sited packages are implemented by copying and
      transforming the source code into several places.

  (b) have Hugs do the module fixup at load-time, and change the
      search strategy to take into account package-relative imports.
      Multiply-sited packages can be done with symbolic links, or
      straight copying of source files.
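
Option (a) could look something like the sketch below.  It is
deliberately naive: it works line by line and only recognises lines of
the form "module M ...", "import M ..." and "import qualified M ...",
with the module name separated from what follows by whitespace.  A real
installer would need a proper Haskell parser.  The function names are
made up for this example.

```haskell
type ModuleName = String
type Site       = String

-- Prefix a name with the site if it belongs to the package being
-- installed; names from other packages are already absolute.
absolutise :: Site -> [ModuleName] -> ModuleName -> ModuleName
absolutise site packageModules m
  | m `elem` packageModules = site ++ "." ++ m
  | otherwise               = m

-- Rewrite one source line.  Lines that are not recognised are
-- returned unchanged; recognised lines have their whitespace
-- normalised as a side effect of words/unwords.
fixLine :: Site -> [ModuleName] -> String -> String
fixLine site pkg line =
  case words line of
    ("module" : m : rest) ->
      unwords ("module" : fix m : rest)
    ("import" : "qualified" : m : rest) ->
      unwords ("import" : "qualified" : fix m : rest)
    ("import" : m : rest) ->
      unwords ("import" : fix m : rest)
    _ -> line
  where
    fix = absolutise site pkg

-- Rewrite a whole source file for installation at the given site.
fixSource :: Site -> [ModuleName] -> String -> String
fixSource site pkg = unlines . map (fixLine site pkg) . lines
```

Applied to the A/B.hs example with site Foo.Bar, this turns
"module A.B where" into "module Foo.Bar.A.B where" and "import A.C"
into "import Foo.Bar.A.C", and leaves imports of other packages alone.
For a multiply-sited package the transformation is run once per site.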

For GHC and other systems with compiled libraries, it's a bit trickier.
We need to make sure that each symbol in the compiled library cannot
clash with any symbol in any other library.  One way to do this is to
include the package name in each symbol, and require that package names
are unique (perhaps include a GUID in a package name).
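
A toy illustration of the idea, assuming a made-up naming scheme of
package name, module name and identifier joined by underscores.  This
is not the encoding GHC actually uses for its symbols, and as written
it is not collision-free if names themselves contain underscores; it
only shows why a unique package name keeps two otherwise identical
modules apart.

```haskell
import Data.List (intercalate)

-- Build a linker-level symbol from a package name, a package-relative
-- module name and an identifier.  The scheme is hypothetical.
symbolName :: String -> String -> String -> String
symbolName pkg modName ident =
  intercalate "_" [pkg, map dotToUnderscore modName, ident]
  where
    dotToUnderscore '.' = '_'
    dotToUnderscore c   = c
```

Two packages that both contain a module A.B defining f then produce
different symbols, e.g. mylib_A_B_f and otherlib_A_B_f, and because the
symbol depends on the package name rather than on any site, re-siting a
compiled package does not change its symbols.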