Long live String = [Char] (Was: Re: String != [Char])

Sun Mar 25 15:16:37 CEST 2012

Hi all,

Thomas Schilling wrote:

 > OK, I agree that breaking text books is a big deal.  On the other
 > hand, the lack of a good Text data type forced text books to teach bad
 > approaches to dealing with strings.  Haskell should do better.

As far as I know, none of the introductory Haskell text books
has the ambition of teaching serious text processing in Haskell.
And what they do for simple text processing for the purpose of illustration
is no worse than what one typically would do in, say, an introduction
to programming using any language, like C or Java.

So I don't buy that argument per se. But I do agree, of course,
that a good library for text processing, and with adequate language
support for making it convenient to use, is important.

 > Johan mentioned both semantic and performance problems with Strings.
 > A part he didn't stress is that Strings are also a horribly
 > memory-inefficient way of storing strings.  On 64 bit GHC systems a
 > single ASCII character needs 16 bytes of memory (i.e., an overhead of
 > 16x). A non-ASCII character (ord c > 255) actually requires 32 bytes.
 > (This is due to a de-duplication optimisation in the GHC GC).  Other
 > implementations may do better, but an abstract type would still be
 > better to enable more freedom for implementors.

Sure it's inefficient. I doubt the above is news to anyone on this list.
The point, though, is that once we're at the level of applications,
in most cases, this inefficiency is negligible.

And in the cases where it is not, the programmer will be well aware
of this and pick a better representation, or will learn about it
the hard way and be forced to pick a better representation. Just
as with processing of significant amounts of *any* data.

It simply isn't the case that the Haskell world magically would
be significantly better off in terms of performance if only everyone
was forced to use something like Text instead of String = [Char].

Moreover, the above analysis is unnecessarily pessimistic for one
(somewhat important) case: string literals. Thanks to Haskell being
lazy, it is very easy (for an implementor), if one really worries, to
arrange that string literals are stored very compactly in a binary,
only to be materialized when (and if) actually used. (I did just that
years ago in the Freja compiler: memory was significantly smaller in
those days, so I did worry :-)
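
To make the idea concrete, here is a minimal sketch of lazily
materialized literals. It does not show how Freja did it (that is not
described here); it uses GHC's unpackCString# primitive, which, as far
as I know, is roughly what GHC itself generates for an ordinary string
literal. The names greeting and firstWord are made up for illustration.

```haskell
{-# LANGUAGE MagicHash #-}

import GHC.Base (unpackCString#)

-- The literal "Hello, world"# is a primitive Addr#: a pointer to a
-- NUL-terminated byte sequence stored in the binary, one byte per
-- character here. No list cells exist until something demands them.
greeting :: String
greeting = unpackCString# "Hello, world"#

-- Laziness means only the demanded prefix is turned into cons cells.
-- (Illustrative name, not part of any library.)
firstWord :: String
firstWord = takeWhile (/= ',') greeting
```

Evaluating firstWord unpacks only the characters up to the comma; the
rest of the literal stays in its compact form unless it is demanded.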

 > Correct handling of unicode strings is a Hard Problem and String =
 > [Char] is only better if you ignore all the issues (which is certainly
 > fine a teaching environment).

Yes. Unicode is, unfortunately (partly but not exclusively out of
necessity), very complicated. I doubt one would want to discuss
this in depth in any introductory programming course.

My point was that String = [Char] is fine as far as it goes. Not that
it should be the basis for serious string processing libraries.

 > I would be happy to have a simplistic String = [Char] coexist with a
 > Text type if it weren't for the problem that so many things are biased
 > towards String.  E.g., error takes a String,

Yes. That's a bias. But is it a problem? Here we're just talking
about getting a sequence of (possibly unicode) characters to stderr.

 > Show is used everywhere and produces strings,

Show and Read are mainly used for simplistic serialisation
and deserialisation. When people really care, they tend to use
more refined approaches, e.g. proper scanners and parsers,
or binary I/O. So again, while there is certainly a bias,
it doesn't seem like a genuine problem in most cases.

I can possibly see issues for conversion from and to e.g.
built-in numeric types and various string representations,
but I can't see why solving those would necessitate
getting rid of String = [Char]. Read and Show could be
overloaded on the string type, for example (at least
given multi-parameter type classes), and/or a bit of
compiler optimization ought to be enough to dispatch such
uses of "read" and "show" to appropriate primitives of e.g.
the Text library anyway.
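
A rough sketch of what overloading Show on the string type could look
like, assuming multi-parameter type classes. Everything here is
hypothetical: ShowAs, showAs, Builder and runBuilder are names invented
for this example, and Builder (a difference list) merely stands in for
a real type such as Text so that the sketch needs nothing beyond base.

```haskell
{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances #-}

-- Hypothetical stand-in for an alternative string representation:
-- a difference list of characters with cheap concatenation.
newtype Builder = Builder (String -> String)

runBuilder :: Builder -> String
runBuilder (Builder f) = f ""

-- Hypothetical class: render a value of type a as a string of type s.
class ShowAs s a where
  showAs :: a -> s

-- Plain String = [Char] keeps working exactly as before.
instance Show a => ShowAs String a where
  showAs = show

-- The alternative representation reuses the same Show machinery here;
-- a real library would presumably dispatch to its own primitives.
instance Show a => ShowAs Builder a where
  showAs x = Builder (shows x)
```

Since nothing ties the result type to the argument type, the caller
has to fix both, e.g. (showAs (42 :: Int) :: String) or
runBuilder (showAs (42 :: Int)).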

 > the pretty printing library uses Strings,

But that is a library issue, not a language issue.

 > Read parses Strings.

See above.

The special status of Read and Show is questionable anyway.
It will hopefully be possible at some point to implement those
completely as libraries. So I'm not overly swayed by the argument
of language bias in those cases.

 > As I said, while I'm not a huge fan of having two String types
 > co-exist, I could accept it as a necessary trade-off to keep text
 > books valid and preserve backwards compatibility.

While an undue proliferation of string types would be unfortunate,
compared with the plethora of other representational choices one
is faced with when it comes to e.g. numeric types, arrays, maps,
etc., a couple of string types doesn't seem like a big deal,
especially not if one is designated the default choice for
any program that will do non-trivial text processing or aims
at doing internationalisation properly.

 > (There are also other issues with String.  For example, you can't
 > write an instance MyClass String in Haskell2010, and even with GHC
 > extensions it seems wrong and you often end up writing instances that
 > overlap with MyClass [a].)  I'm using Data.Text a lot, so I can work
 > around the issue, but unfortunately you run into a lot of issues
 > where the standard library forces the use of String, and that, I
 > believe, is wrong.
 >
 > If changing the standard library is the bigger issue, however, then
 > I'm not sure whether this discussion needs to take place on the
 > haskell-prime list or on the libraries list.

Indeed. Maybe all that's really needed at the language level
is to standardize overloading of string literals? (In a way
that avoids issues like the ones described above.)
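
For reference, a small sketch of how overloaded string literals work
today as a GHC extension (OverloadedStrings together with the IsString
class from Data.String). This is the existing extension, not a
proposal for how a standard should word it. Name, defaultName, plain
and greet are made up for the example; Name stands in for a type such
as Text.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.String (IsString (..))

-- Hypothetical stand-in for a dedicated text type.
newtype Name = Name String
  deriving (Eq, Show)

-- With OverloadedStrings, a literal "..." has type IsString a => a
-- and is elaborated to fromString "...".
instance IsString Name where
  fromString = Name

-- The same literal syntax serves both types; the signature decides.
defaultName :: Name
defaultName = "world"

plain :: String
plain = "still a list of Char"

greet :: Name -> String
greet (Name n) = "Hello, " ++ n
```

String = [Char] stays untouched in this scheme. The price is that a
literal whose type is not fixed by its context may need an annotation.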

Best,

/Henrik

-- 
Henrik Nilsson
School of Computer Science
The University of Nottingham
nhn at cs.nott.ac.uk