[Haskell-cafe] A distributed and replicating native Haskell database

Fri Feb 2 10:06:53 EST 2007

Joel Reymont wrote:
> Folks,
>
> Allegro Common Lisp has AllegroCache [1], a database built on B-Trees 
> that lets one store Lisp objects of any type. You can designate 
> certain slots (object fields) as key and use them for lookup. ACL used 
> to come bundled with the ObjectStore OODBMS for the same purpose but 
> then adopted a native solution.
>
> AllegroCache is not distributed or replicating but supports automatic 
> versioning. You can redefine a class and new code will store more (or 
> less) data in the database while code that uses the old schema will 
> merrily chug along.
That implies being able to put persistent code into the database. Easy 
enough in Lisp, less easy in Haskell. How do you serialize it?

As a rule, storing functions along with data is a can of worms. Either 
you actually store the code as a BLOB or you store a pointer to the 
function in memory. Either way you run into problems when you upgrade 
your software and expect the stored functions to work in the new context.
> Erlang [2] has Mnesia [3] which lets you store any Erlang term 
> ("object"). It stores records (tuples, actually) and you can also 
> designate key fields and use them for lookup. I haven't looked into 
> this deeply but Mnesia is built on top of DETS (Disk-based Term 
> Storage) which most likely also uses a form of B-Trees.
Erlang also has a very disciplined approach to code updates, which 
presumably helps a lot when functions are stored.
>
> Mnesia is distributed and replicated in real-time. There's no 
> automatic versioning with Mnesia but user code can be run to read old 
> records and write new ones.
>
> Would it make sense to build a similar type of a database for Haskell? 
> I can immediately see how versioning would be much harder as Haskell 
> is statically typed. I would love to extend recent gains in binary 
> serialization, though, to add indexing of records based on a 
> designated key, distribution and real-time replication.
I very much admire Mnesia, even though I'm not an Erlang programmer. It 
would indeed be really cool to have something like that. But Mnesia is 
built on the Erlang OTP middleware. I would suggest that Haskell needs a 
middleware with the same sort of capabilities first. Then we can build a 
database on top of it.
> What do you think?
>
> To stimulate discussion I would like to ask a couple of pointed 
> questions:
>
> - How would you "designate" a key for a Haskell data structure?
I haven't tried compiling it, but something like:

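{-# LANGUAGE MultiParamTypeClasses, FunctionalDependencies #-}
-- Each record type a determines its key type k.
class (Ord k) => DataKey a k | a -> k where
    keyValue :: a -> k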
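
To make that concrete, here is a minimal sketch of how such a class 
might be used to index records by their designated key. The Person 
type and the Table, insertRecord and lookupRecord names are invented 
for illustration, and the "table" is just an in-memory Data.Map rather 
than anything persistent.

```haskell
{-# LANGUAGE MultiParamTypeClasses, FunctionalDependencies #-}
import qualified Data.Map as Map

-- The class from the message: each record type a determines its key type k.
class (Ord k) => DataKey a k | a -> k where
    keyValue :: a -> k

-- Hypothetical record type, used only for illustration.
data Person = Person { personId :: Int, personName :: String }
    deriving (Show, Eq)

-- Designate personId as the key for Person.
instance DataKey Person Int where
    keyValue = personId

-- Assumption: a "table" is simply a Map from the designated key to the record.
type Table k a = Map.Map k a

-- Store a record under whatever key its DataKey instance designates.
insertRecord :: DataKey a k => a -> Table k a -> Table k a
insertRecord x = Map.insert (keyValue x) x

-- Look a record up by key.
lookupRecord :: Ord k => k -> Table k a -> Maybe a
lookupRecord = Map.lookup

main :: IO ()
main = do
    let table = insertRecord (Person 2 "Bob")
                  (insertRecord (Person 1 "Alice") Map.empty)
    print (lookupRecord (1 :: Int) table)
    print (lookupRecord (3 :: Int) table)
```

The functional dependency is what lets insertRecord work out the key 
type from the record type alone, so callers never have to name it.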

> - Is the concept of a schema applicable to Haskell?
The real headache is type safety. Erlang is entirely dynamically typed, 
so untyped schemas with column values looked up by name at run-time fit 
right in, and it's up to the programmer to manage schema and code 
evolution to prevent errors. Doing all this in a statically type-safe 
way is another layer of complexity and checking.

Actually this is also just another special case of the middleware case. 
If we have two processes, A and B, that need to communicate then they 
need to agree on a protocol. Part of that protocol is the data types. If 
B is a database then this reduces to the schema problem. So let's look at 
the more general problem first and see if we can solve that.

There are roughly two ways for A and B to agree on the protocol. One is 
to implement the protocol separately in A and B. If it is done correctly 
then they will work together. But this is not statically checkable 
(ignoring state machines and model checking for now). This is the Erlang 
approach, because dynamic checking is the Erlang philosophy.

Alternatively the protocol can be defined in a special purpose protocol 
module P, and A and B then import P. This is the approach taken by CORBA 
with IDL. However what happens if P is updated to P'? Does this mean 
that both A and B need to be recompiled and restarted simultaneously? 
Requiring this is a Bad Thing; imagine if every bank in the world had to 
upgrade and restart its computers simultaneously in order to upgrade a 
common protocol. (This protocol versioning problem was one of the major 
headaches with CORBA.) We would have to have P and P' live 
simultaneously, with processes negotiating the latest version of the 
protocol that they both support when they start talking. That way the 
introduction of P' does not need to be simultaneous with the withdrawal 
of P.

There is still the possibility of a run-time failure at the protocol 
negotiation stage of course, if it transpires that the two processes have 
no common protocol.
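
The negotiation step could be sketched roughly as follows. This 
assumes versions are plain integers (say P is 1 and P' is 2) and that 
each process simply advertises the set it supports; the Version and 
negotiate names are made up for the example. The Nothing result is the 
run-time failure case.

```haskell
import qualified Data.Set as Set

-- Assumption: protocol versions are plain integers, e.g. P = 1 and P' = 2.
type Version = Int

-- Pick the newest version that both peers support, or Nothing if the
-- two sets have no version in common.
negotiate :: Set.Set Version -> Set.Set Version -> Maybe Version
negotiate mine theirs = Set.lookupMax (Set.intersection mine theirs)

main :: IO ()
main = do
    let oldPeer    = Set.fromList [1]      -- still only speaks P
        newPeer    = Set.fromList [1, 2]   -- speaks both P and P'
        futurePeer = Set.fromList [3]      -- has already withdrawn P and P'
    print (negotiate newPeer newPeer)      -- both upgraded
    print (negotiate newPeer oldPeer)      -- falls back to the old protocol
    print (negotiate oldPeer futurePeer)   -- no common protocol
```

As long as upgraded processes keep the old version in their set for a 
while, nobody has to restart in lock step.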

So we need a DSL which allows the definition of data types and abstract 
protocols (i.e. who sends what to whom when) that can be imported by the 
two processes (do we need N-way protocols?) on each end of the link. If 
we could embed this in Haskell directly then so much the better, but 
something that needs preprocessing would be fine too.
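
As a very rough sketch of the embedded option, a two-party protocol 
could be described as an ordinary Haskell value and each end could 
derive its own view from it. Everything here (Party, Step, Action, 
project, and the "Query"/"Result" message names) is invented for 
illustration, and messages are identified by name only, so this shows 
the shape of the idea rather than a type-safe design.

```haskell
-- Hypothetical parties on each end of the link.
data Party = Client | Server
    deriving (Show, Eq)

-- One protocol step: who sends which message type to whom.
data Step = Step { from :: Party, to :: Party, payload :: String }
    deriving (Show, Eq)

-- A protocol is the ordered list of steps ("who sends what to whom when").
type Protocol = [Step]

-- Hypothetical example protocol: a query followed by its result.
lookupProtocol :: Protocol
lookupProtocol =
    [ Step Client Server "Query"
    , Step Server Client "Result"
    ]

-- What a single party has to do at each step it takes part in.
data Action = SendMsg String | RecvMsg String
    deriving (Show, Eq)

-- Derive one party's view of the shared protocol description.
project :: Party -> Protocol -> [Action]
project p steps = [ act | s <- steps, Just act <- [view s] ]
  where
    view s
      | from s == p = Just (SendMsg (payload s))
      | to s == p   = Just (RecvMsg (payload s))
      | otherwise   = Nothing

main :: IO ()
main = do
    print (project Client lookupProtocol)
    print (project Server lookupProtocol)
```

Because both ends project from the same value, they cannot disagree 
about the order of messages, which is the point of importing a shared 
definition.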

However there is a wrinkle here: what about "pass through" processes 
which don't interpret the data but just store and forward it? Various 
forms of protocol adapter fit this scenario, as does the database you 
originally asked about. We want to be able to have these things talk in 
a type-safe manner without needing to be compiled with every data 
structure they transmit. You could describe these things using type 
variables, so that for instance if a database table is created to store 
a datatype D then any process reading or writing the data must also use 
D, even though the database itself knows nothing more of D than the 
name. Similarly a gateway that sets up a channel for datatype D would 
not need to know anything more than the name.
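
Here is one way the type-variable idea might look, as a sketch only. 
The store holds opaque strings filed under a type name, and clients 
get a handle whose type parameter is a phantom. Store, Table, 
openTable, writeRow, readRows and the Quote type are all invented 
names; Show and Read stand in for a real serializer, and using the 
bare type name as the table name ignores module clashes. It assumes a 
GHC recent enough to derive Typeable automatically.

```haskell
import Data.Proxy (Proxy (..))
import Data.Typeable (Typeable, typeRep)
import qualified Data.Map as Map

-- The store is compiled without any client data types: it holds opaque
-- strings, grouped by the name of the type they encode.
newtype Store = Store (Map.Map String [String])

emptyStore :: Store
emptyStore = Store Map.empty

-- A typed handle to a table. The parameter d is a phantom: at run time
-- the handle carries nothing but the type's name.
newtype Table d = Table String

-- Create a handle for datatype d, named after the type itself.
openTable :: Typeable d => Proxy d -> Table d
openTable p = Table (show (typeRep p))

-- Only a value of type d can be written through a Table d.
writeRow :: Show d => Table d -> d -> Store -> Store
writeRow (Table name) x (Store m) =
    Store (Map.insertWith (++) name [show x] m)

-- Rows come back as d, newest first; the store never decodes them itself.
readRows :: Read d => Table d -> Store -> [d]
readRows (Table name) (Store m) =
    map read (Map.findWithDefault [] name m)

-- Hypothetical client datatype D.
data Quote = Quote String Double
    deriving (Show, Read, Eq)

main :: IO ()
main = do
    let quotes = openTable (Proxy :: Proxy Quote)
        store  = writeRow quotes (Quote "ACME" 12.5) emptyStore
    print (readRows quotes store)
```

Something like writeRow quotes "oops" store is rejected by the type 
checker, even though the Store code itself never mentions Quote.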

Paul.