[Haskell-cafe] Designing DSL with explicit sharing [was: I love purity, but it's killing me]

Thu Feb 14 22:51:24 EST 2008

Matthew Naylor wrote:
> it's not immediately clear (to me at least) how efficient your method
> will be in "practice".  Any method based on common sub-expression
> elimination surely must inspect every node in the flattened graph.  In
> the worst case, an acyclic graph containing n nodes could have 2^n
> nodes when flattened to a tree:
>
>   tricky 0 = constant 0
>   tricky d = add g g
>     where
>       g = tricky (d-1)

It should work quite well in practice, as demonstrated below. Once our
DSL is extended with a let form (so the user can declare the intention
to share results of computations rather than computations themselves),
tricky code becomes trivial. The code remains declarative, safe and
pure, and, save for the (cosmetic) lack of the monomorphism
restriction, Haskell98. The well-commented code is available at

	http://okmij.org/ftp/Haskell/DSLSharing.hs

First a few comments on the tricky code. It is true that (absent
programmer's proclamations and code structuring forms) a complete
method of common sub-expression elimination must inspect every
sub-expression in the program. That inspection must include at least
one comparison of a node with some other nodes. As I understand the
original problem had less to do with the number of comparison but more
to do with the cost of a single comparison. In an impure language, we
can use constant-time physical equality. It is usually provided
natively as pointer comparison, and can be trivially emulated via
mutation. The original poster used pure Haskell and the following
representation for the program graph:
 data Expression
   = Add Expression Expression
   | Sub Expression Expression
   | Variable String
   | Constant Int
   deriving Eq

Here, the derived equality operation, albeit pure, no longer takes
constant time. Comparing nodes near the root of a deep tree really
takes a while. As the previous message on DSL with sharing showed, one
can do common sub-expression elimination and the conversion of the
program tree to a DAG using a constant-time node equality operation,
without sacrificing purity.

The tricky function is a metaprogram that builds a large DSL program:
(tricky 12) yields a single DSL expression with 8191 nodes. Such
humongous expressions with thousands of components are unlikely to be
written by human programmers (although can be easily generated). Many
compilers will choke if we submit a program with one large
expression. (Incidentally, the previously posted code converts the
tree (tricky 12) to a 12-node DAG in 0.25 secs on 1GHz Pentium).

The reason we can compile much larger projects is because we structure
them, into compilation units, functions, let-blocks. The structure
gives the common sub-expression eliminator and other optimizer phases
much needed hints and so significantly limits the search space. For
example, hardly any compiler searches for common sub-expressions
across compilation units. More importantly for us here, forms like
let-expressions let the programmer declare that the results of certain
computations be shared. Although the tricky code also contains
a let-expression (masqueraded as 'where'), the sharing there occurs at
the meta-level and benefits the generator rather than the generated
program. We share code generators rather than the generated code. The
importance of placing 'let' in the right phase has been extensively
discussed in
   http://www.cs.rice.edu/~taha/publications/conference/pepm06.pdf
also in the context of a DSL (for dynamic programming).

In our case, we need to give our DSL programmer a way to state their
intention on sharing results of DSL computations. The programmer
declares certain expressions common, and thus greatly helps the
compiler as well as human readers of the code. As Chung-chieh Shan has
pointed out, we need to introduce a "let" construct in the _embedded_
language. Since our DSL already has (concrete) variables with string
names, we can extend our language thusly:
	let_ "v1" (add (constant 1) (variable "x)) 
		  (add (variable v1") (variable "v1"))
We chose a different way: higher-order abstract syntax. We use variables
of the metalanguage (that is, Haskell) as let-bound variables. Our
language is now defined as

> class Exp repr where
>    constant :: Int -> repr Int
>    variable :: String -> repr Int
>    add  :: repr Int -> repr Int -> repr Int
>    sub  :: repr Int -> repr Int -> repr Int
>    let_ :: repr a -> (repr a -> repr b) -> repr b -- like flip ($)

Here are simple programs in our language

> a = add (constant 10) (variable "i1")
> b = sub (variable "i2") (constant 2)
> c = add a b
> d = add c c
> e = add d d    -- "e" now as 16 leaf nodes.
> e' = let_ d (\x -> add x x)

The programs (e) and (e') evaluate to the same integer given the same
environment for "i1" and "i2". The two programs differ in how sharing
is declared. The program (e) uses the identifier (d) twice; even if
GHC shares the corresponding expressions rather than copies them (a
Haskell system is not obliged to share anything: sharing is not
observable in Haskell98), what GHC shares are metalanguage
computations. Although the metalanguage (Haskell) is pure, the object
language does not have to be. Indeed, the common-subexpression
elimination, when considered as an evaluation of an object expression,
is an impure evaluation. The same object expression may give, in
different contexts, different results (that is, compile to different
assembly code). In the case of (e'), we explicitly stated that "d"
is a common sub-expression. It must be executed once, with the results
shared. The difference is easy to see if we print e and e':

 *DSLSharing> test_showe
 "10 + i1 + i2 - 2 + 10 + i1 + i2 - 2 + 10 + i1 + i2 - 2 + 10 + i1 + i2 - 2"

 *DSLSharing> test_showe'
 "let v0 = 10 + i1 + i2 - 2 + 10 + i1 + i2 - 2 in v0 + v0"

Higher-order syntax for let_ (see e') seems better than that based on
concrete variable names. As test-showe' demonstrates, we can easily
convert from higher-order abstract to the concrete syntax. The
converse is much more cumbersome.

We easily extend the previously written common sub-expression
eliminator by adding the clause for let_

> instance Exp A where
>     ...
>     let_ e f = A(do
> 		 x <- unA e
> 		 unA $ f (A (return x)))

And that is it. We can write the tricky function with the explicit
sharing:

> tricky' 0 = constant 0
> tricky' d = let_ (tricky' (d-1)) (\g -> add g g)
>
> test_tricky' n = runState (unA (tricky' n)) exmap0

 *DSLSharing> test_tricky' 12
 (AAdd 12,
   ExpMaps {hashcnt = 13, ctmap = fromList [(0,0)], vrmap = fromList [],
   admap = 
    fromList [(1,(AConst 0,AConst 0)),(2,(AAdd 1,AAdd 1)),
              (3,(AAdd 2,AAdd 2)),(4,(AAdd 3,AAdd 3)),(5,(AAdd 4,AAdd 4)),
              (6,(AAdd 5,AAdd 5)),(7,(AAdd 6,AAdd 6)),(8,(AAdd 7,AAdd 7)),
              (9,(AAdd 8,AAdd 8)),(10,(AAdd 9,AAdd 9)),(11,(AAdd 10,AAdd 10)),
              (12,(AAdd 11,AAdd 11))], sumap = fromList []})
(0.01 secs, 0 bytes)

Indeed it takes negligible time. In fact,
	test_tricky' 12 takes (0.01 secs, 561860 bytes)
	test_tricky' 100 takes (0.03 secs, 1444888 bytes)
If the DAG corresponding to (test_tricky' 100) were converted to a
tree, it would have had 2^101 -1 nodes (which is more than 10^31).