restarting the discussion

Henrik Nilsson nilsson@cs.yale.edu
Thu, 08 Feb 2001 20:12:26 -0500


Hi all,

Malcolm Wallace wrote:

> 
> Hello to everyone who has joined the HaskellDoc mailing list.  We had
> a little bit of discussion before announcing the list more widely,
> but everything now seems to have stopped dead.  So it's time to get
> thoughts rolling again.  Do check the list archive on haskell.org
> to see what has already happened.
> 
> I'll start by declaring my interest in automatic documentation.

Position statements seem to be the order of the day (or week, maybe).
So here are some points on what I believe a good standard should look
like.

1. I agree with Malcolm and Jan Skibinski that the documentation
   conventions need to be lightweight. (I too dislike literate
   programming, except possibly when the aim is to write a paper
   or a book.)

2. I think the documentation standards should be able to support
   both internal and external documentation.

3. I believe a standardized, intermediate, "raw" documentation
   format would be useful.

4. I think the intermediate format should be based on XML.

I will now discuss each point in turn.

-----------------------------------------------------------------------

1. I agree with Malcolm and Jan that the documentation conventions
   need to be lightweight. (I too dislike literate programming,
   except possibly when the aim is to write a paper or a book.)

However, I think that relying solely on positional cues might be
too constraining and (in the long run) inflexible. So personally,
I think HDoc/JavaDoc-like tags are a good compromise.

To that, I also see a need to add some lightweight conventions
for markup of explanatory text. E.g. I'd like to be able to mark
variable names, emphasize a piece of text, and maybe include
small code fragments.

Jan proposes to use conventions like 'xxx' for variable names
and "zzz" for emphasis (I think). That's probably reasonable,
and I indeed use the 'xxx' convention in my own comments
sometimes. But one should be aware that this usage can conflict
with the normal meaning of the quote characters. For emphasis in
particular, other lightweight conventions like _yyy_ or *zzz*
spring to mind.

I would even find some more heavyweight conventions acceptable
for marking entire paragraphs, such as <code> and </code>.
This would be very useful for including usage examples in
external documentation, for instance.
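
As a rough illustration of the lightweight conventions discussed
above, here is a minimal sketch of an inline-markup scanner. The
type Inline and the function parseInline are hypothetical names,
and the rule that a marked span must enclose a single alphanumeric
word is an assumption made to keep the sketch small, not part of
any proposal on this list.

```haskell
import Data.Char (isAlphaNum)

-- Inline elements of explanatory comment text (hypothetical names).
data Inline
  = Plain String
  | Var String    -- written 'xxx' in the comment
  | Emph String   -- written _yyy_ in the comment
  deriving (Eq, Show)

-- Scan comment text. A quote or underscore opens a marked span only
-- if it is followed by one alphanumeric word and the matching closer;
-- otherwise the character is kept as plain text.
parseInline :: String -> [Inline]
parseInline = merge . go
  where
    go [] = []
    go (c:rest)
      | c == '\'', Just (w, rest') <- spanWord '\'' rest = Var w : go rest'
      | c == '_',  Just (w, rest') <- spanWord '_' rest  = Emph w : go rest'
      | otherwise = Plain [c] : go rest

    spanWord close s =
      case span isAlphaNum s of
        (w@(_:_), c:rest') | c == close -> Just (w, rest')
        _ -> Nothing

    -- Join adjacent plain fragments into one.
    merge (Plain a : Plain b : xs) = merge (Plain (a ++ b) : xs)
    merge (x : xs) = x : merge xs
    merge [] = []
```

With these rules an apostrophe as in "don't" stays plain text, but
"rock'n'roll" is read as a marked variable n between two plain
fragments, which is exactly the kind of conflict with the normal
meaning of the quote character mentioned above.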

I would like the possibility to include pictures (as opposed to
having to rely on ASCII graphics). Take a look at the Fudgets
documentation for examples showing how useful this can be.

Finally, I do agree with Malcolm that XML is far too heavy to be used
at this level.

-----------------------------------------------------------------------

2. I think the documentation standards should be able to support
   both internal and external documentation.

By internal documentation, I mean documentation of the source code
as such, intended for people who need to read and understand source
code, such as developers and maintainers.

By external documentation I mean documentation of interfaces, intended
for people who need to use a piece of software but who do not need
to know about the internal details. I guess this mainly applies
to library interfaces, but one could also consider manpage-style
application documentation (cf. POD from the Perl world).

Since the markup needs for internal and external documentation are
pretty similar, I don't think it will be very difficult to develop
a standard supporting both. The main thing which has to be added
is a way of declaring if a piece of documentation is for internal
or external (or maybe both) use. Having two different commenting
conventions (e.g. "{--" and "{---") would be a possibility. Another
possibility, probably more flexible, is to have some initial tag.
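
A minimal sketch of the initial-tag idea, assuming the comment text
has already been extracted from the source. The tag names @internal,
@external and @both, the Audience type, and the choice to treat
untagged comments as internal are all assumptions made for
illustration only.

```haskell
import Data.Char (isSpace)

-- Intended readership of a documentation comment (hypothetical).
data Audience = Internal | External | Both
  deriving (Eq, Show)

-- Split a comment body into its audience and the remaining text.
-- Untagged comments are treated as internal (an assumption).
classify :: String -> (Audience, String)
classify body =
  case break isSpace trimmed of
    ("@internal", rest) -> (Internal, dropWhile isSpace rest)
    ("@external", rest) -> (External, dropWhile isSpace rest)
    ("@both", rest)     -> (Both, dropWhile isSpace rest)
    _                   -> (Internal, trimmed)
  where
    trimmed = dropWhile isSpace body

-- Keep only the comments meant for the given kind of documentation.
select :: Audience -> [String] -> [String]
select target comments =
  [ text
  | (audience, text) <- map classify comments
  , audience == target || audience == Both
  ]
```

For example, select External applied to the comments
"@external A", "@internal B", "@both C" and "D" keeps only the
texts "A" and "C".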

For external documentation, it may also be useful to have a possibility
to generate documentation at different levels of detail. For instance,
for a very large library, it might be useful to have both brief
beginner documentation, more extensive programmer documentation, and
full documentation (e.g. including obsolete, deprecated features).
Again, the Fudgets documentation is a good example (and where I picked
up the idea). It would seem as if a comment classification scheme
based on initial tags easily could be adapted for this kind of use
as well.

Once the documentation comments have been classified, generating
internal or external documentation is really a tool issue. For
internal documentation, a tool would basically just have to
extract type signatures (or infer them), type definitions, class
definitions, etc. along with all internal documentation comments.

For proper external documentation, a good tool also has to take
import and export into account. For instance, a library could
be made up of a number of modules which are collected and
re-exported by one single "top-level" module. The users are
not supposed to have to know about the internal library structure,
but only see the one module. Thus, when generating documentation
for this module, the tool would have to collect documentation for
the re-exported entities from _other_ modules.
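
The re-export case could be sketched as follows, assuming that a
front end has already worked out, for each exported name, which
module defines it. ModuleInfo, Library and exportedDocs are
hypothetical names, and association lists stand in for whatever
lookup structure a real tool would use.

```haskell
import Data.Maybe (mapMaybe)

type ModuleName = String
type EntityName = String

-- What a tool is assumed to know about one module (hypothetical).
data ModuleInfo = ModuleInfo
  { localDocs :: [(EntityName, String)]      -- docs for entities defined here
  , exports   :: [(EntityName, ModuleName)]  -- exported name, defining module
  }

type Library = [(ModuleName, ModuleInfo)]

-- Collect documentation for everything a module exports, following
-- each export to the module that actually defines the entity.
-- Exports without any documentation are silently dropped here.
exportedDocs :: Library -> ModuleName -> [(EntityName, String)]
exportedDocs lib top =
  case lookup top lib of
    Nothing   -> []
    Just info -> mapMaybe resolve (exports info)
  where
    resolve (name, origin) = do
      originInfo <- lookup origin lib
      doc <- lookup name (localDocs originInfo)
      pure (name, doc)
```

Asking for the documentation of the top-level module then yields
the comments written in the internal modules, without the reader of
the generated documentation ever seeing those module names.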

------------------------------------------------------------------------

3. I believe a standardized, intermediate, "raw" documentation
   format would be useful.

I've argued above that it would be desirable to support at least
two different types of documentation. Furthermore, documentation
could conceivably be rendered in a plethora of different formats:
HTML, PDF, postscript, info, LaTeX, DocBook, etc. Different people
and organizations may even have specialized formatting needs. For
instance, assuming e.g. a HDoc/JavaDoc-like convention where the
very first sentence of a documentation comment gives a synopsis,
someone maintaining a collection of libraries (e.g. on haskell.org)
might like a tool that extracts only this information for each
library, so that someone browsing through the collection of
libraries quickly can determine whether a particular library fits
the bill or not. Or imagine an organization where all documentation
has to conform to some strictly defined, internal standard. The
possibilities are, if not endless, at least extensive. Add to this
other applications such as searching through a library (or collection
of libraries) based on type information (an old idea which often is
quite useful, but sadly neglected in today's functional programming
environments).

All tools carrying out tasks like those suggested above share a
common need: a (preferably easy) way to extract "meta" information
from source code. For example:

  * Names of all exported entities (i.e. "canonical",
    fully expanded export list).
  * Origin info for exported entities not defined locally.
  * Names of all top-level entities defined in a module.
  * For types and classes, their definitions.
  * Type signatures for functions and method instances.
  * Author-supplied documentation associated with the various
    top-level entities.
  * Maybe source code positions, or at least the name of the file
    in which something is defined.
  * Fixity declarations.
  * Perhaps even strictness signatures.
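
To make the list above concrete, here is one way the information
could be modelled. This is only a sketch: every type and field name
is hypothetical, definitions and signatures are kept as plain source
text, and strictness signatures are left out.

```haskell
data Assoc = InfixL | InfixR | InfixN
  deriving (Eq, Show)

-- A top-level entity, with its definition or signature as source text.
data Entity
  = Function String String   -- name, type signature
  | TypeDef  String String   -- name, definition
  | ClassDef String String   -- name, definition
  deriving (Eq, Show)

-- One entry of the fully expanded export list.
data Exported = Exported
  { exEntity :: Entity
  , exOrigin :: Maybe String        -- defining module, if not local
  , exDoc    :: Maybe String        -- author-supplied documentation
  , exSource :: Maybe FilePath      -- file in which it is defined
  , exFixity :: Maybe (Assoc, Int)  -- fixity declaration, if any
  } deriving (Eq, Show)

data ModuleMeta = ModuleMeta
  { modName     :: String
  , modExports  :: [Exported]
  , modTopLevel :: [String]         -- names of all top-level entities
  } deriving (Eq, Show)

entityName :: Entity -> String
entityName (Function n _) = n
entityName (TypeDef n _)  = n
entityName (ClassDef n _) = n

-- Exported names that are defined in some other module, with origin.
reExported :: ModuleMeta -> [(String, String)]
reExported m =
  [ (entityName (exEntity e), origin)
  | e <- modExports m
  , Just origin <- [exOrigin e]
  ]
```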

There are different ways to get such information. In some cases,
simple matching based on regular expressions might be enough.
Unfortunately, such solutions tend to be fragile, in particular for a
language with the lexical and syntactical conventions of Haskell
(take nested comments, for one example). It is also unclear to
what extent such solutions could be shared between different tools.
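
The nested-comment problem can be sketched in a few lines. The
first function below mimics what a simple "match up to the first
closing delimiter" pattern does; the second keeps a nesting depth.
Both are illustrations under simplifying assumptions (string
literals, line comments and pragmas are ignored), and the function
names are made up for this example.

```haskell
-- Naive approach: drop everything from "{-" to the first "-}".
stripNaive :: String -> String
stripNaive [] = []
stripNaive ('{':'-':rest) = stripNaive (dropToClose rest)
  where
    dropToClose ('-':'}':xs) = xs
    dropToClose (_:xs)       = dropToClose xs
    dropToClose []           = []
stripNaive (c:rest) = c : stripNaive rest

-- Depth-tracking approach: only text at nesting depth 0 is kept.
stripNested :: String -> String
stripNested = go (0 :: Int)
  where
    go _ [] = []
    go n ('{':'-':rest) = go (n + 1) rest
    go n ('-':'}':rest) | n > 0 = go (n - 1) rest
    go 0 (c:rest) = c : go 0 rest
    go n (_:rest) = go n rest
```

On the input "a {- b {- c -} d -} e" the naive version leaves the
tail of the outer comment behind ("a  d -} e"), whereas the
depth-tracking version returns "a  e".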

Another approach would be to provide a (simplified, specialized)
Haskell parser with a clearly defined interface making information
like what was described above available. This would no doubt prove
to be very popular for people wanting to develop various documentation
tools. But if this interface was to be standardized, e.g. in the form
of an algebraic data type in Haskell, then this would not be directly
useful for people wishing to develop using some other language. Also,
Haskell types are not very extensible, which would create all sorts
of compatibility problems if the standard was to evolve.

A third approach would be to define a standard, intermediary
documentation format which is easy to generate (once one has
the necessary information) and parse. Then, as long as at least
one tool generating this format exists, it would be relatively
straightforward to develop all sorts of formatters and other
creative applications around this.

(Looking back at the history of Haskell documentation tools, this has
actually happened at least three times: "FudgetsDoc" and HaskellDoc
both used HBC's interface files to get type information, and more
recently Jan Skibinski's source code browser which uses GHC's
interface files in a similar way. But of course, in all cases, these
tools became tied to one (or two) particular compiler(s), they
became likely to break if the format of the interface files changed,
and they were limited by the information that happened to be
available. Hence the need for a standard.)

Personally, I think a compiler would be in a good position to
generate intermediary documentation files since it has access to
all (or at least most) information that is needed. (This is also
not without precedent: Sun's Workshop C compiler can emit information
for a browsing tool, the CenterLine C compiler used to do something
similar, and asking compilers for module dependence information is
basically a simple instance of the same idea.) On the other hand,
there are some problems such as the need to respect user-supplied type
signatures (as opposed to always using the inferred ones), and the
fact that the types of non-exported entities might be thrown away
at some inconveniently early point. So not everyone likes this.

However, how intermediary documentation is generated is a secondary
issue. Having a well-specified format means that anyone who would
like to write a tool supplying such information has something to
aim at, and that anyone who is mainly interested in doing something
with such information has a good place to start from.

Finally, I believe that developing the source-level documentation
conventions and an intermediary documentation format in parallel
will be mutually beneficial. Defining the intermediary format
will force us to think about what documentation *is* (without
the need to consider specific renderings) and thus what information
needs to be provided by the commenting conventions. Conversely,
practical requirements such as the source code remaining legible will
prevent the intermediary format from becoming too unwieldy.

-----------------------------------------------------------------------

4. I think the intermediate format should be based on XML.

I think this simply because XML is a rapidly emerging standard which
was developed with precisely this kind of application (semantic markup)
in mind. A large number of tools related to XML are already available,
including some Haskell ones.
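
To give a feel for what an XML-based intermediate format might
involve, here is a small sketch that renders one documented
function as XML. The element and attribute names (function, name,
type, doc) are invented for this example and do not come from any
agreed format; the only firm requirement shown is that XML's
special characters must be escaped.

```haskell
-- One documented function (hypothetical record).
data DocEntry = DocEntry
  { entryName :: String
  , entryType :: String
  , entryDoc  :: String
  }

-- Escape the characters that are special in XML text and attributes.
escape :: String -> String
escape = concatMap esc
  where
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '&' = "&amp;"
    esc '"' = "&quot;"
    esc c   = [c]

-- Render an entry using made-up element names.
renderEntry :: DocEntry -> String
renderEntry e =
  "<function name=\"" ++ escape (entryName e) ++ "\">"
    ++ "<type>" ++ escape (entryType e) ++ "</type>"
    ++ "<doc>" ++ escape (entryDoc e) ++ "</doc>"
    ++ "</function>"
```

Note that even a plain type signature needs escaping, since the
function arrow contains the character ">".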

Best regards,

/Henrik

-- 
Henrik Nilsson
Yale University
Department of Computer Science
nilsson@cs.yale.edu