[Haskell-cafe] Fwd: Can I use String without "" in ghci?

Richard A. O'Keefe ok at cs.otago.ac.nz
Wed Sep 4 08:01:08 CEST 2013


On 3/09/2013, at 10:44 PM, Rustom Mody wrote:
> Whoops! my bad -- I was *thinking* 'pipes' but ended up *writing* 'IPC'   :-)
> 
> So let me restate more explicitly what I intended -- pipes, FIFOs, sockets, etc.
> IOW read/write/send/recv calls and the mathematical model represented by the (non-firstclass) pair of C data structures in those functions: <buf, len> (or count).

Yes, but none of these have anything to do with strings.

"string" has a precise meaning in C:
7.1.1#1
	A string is a contiguous sequence of characters
	terminated by and including the first null character.   The
	term multibyte string is sometimes used instead to emphasize
	special processing given to multibyte characters contained
	in the string or to avoid confusion with a wide string.  A
	pointer to a string is a pointer to its initial (lowest
	addressed) character.  The length of a string is the number
	of characters preceding the null character and the value of
	a string is the sequence of the values of the contained
	characters, in order.
7.1.1#6
	(same as #1 but string->wide string and character->wide character)
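
For concreteness, a tiny C illustration of "length", "value", and
"pointer to its initial character" as defined above (my own sketch,
not part of the standard's text):

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
	    char s[] = "abc";   /* 'a','b','c','\0' : a string                */
	    char *p  = s;       /* a pointer to its initial character         */
	    printf("%zu\n", strlen(p));  /* length: 3 characters before '\0'  */
	    printf("%s\n", p);           /* value: the sequence a, b, c       */
	    return 0;
	}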

If you are going to claim Humpty-Dumpty's privilege,
we cannot have a meaningful discussion.

Let me propose a more general definition of "string" which is
consistent with the three kinds of string natively supported by
the C <string.h> library and the four or five alternatives I've
personally used in C, with AWK, Python, JavaScript, Java, Erlang,
Ada, Smalltalk, Objective C, and PL/I.  (Haskell gets a little
fuzzy here.)

	A *string* is a *completed* sequence of characters
	which may be traversed in the natural order *and others*.

In this definition, there are four key aspects:

 - CHARACTERS.  The elements of a string belong to some finite
   set whose elements we have agreed to regard as representing
   characters.
 - SEQUENCE.  The implementation might be a multiway tree, a
   "piece table", an AVL DAG, or something more exotic, but there
   is a privileged view of it as a sequence.
 - COMPLETED.  For any particular state of the string, there is
   a present fact of the matter about what the length is and
   what each element is.
 - TRAVERSAL.  You can go from the beginning of the string to
   the end.  And you can go back again.  Palindrome tests are easy.
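
As a throwaway illustration of "traversed in the natural order *and
others*", here is a palindrome test in C over a completed string
(my own sketch, nothing canonical about it):

	#include <stdbool.h>
	#include <string.h>

	/* Walk inward from both ends at once; this needs the length
	   up front, which only a *completed* sequence can give you. */
	bool is_palindrome(const char *s)
	{
	    size_t i = 0, j = strlen(s);

	    while (i + 1 < j) {
	        if (s[i] != s[j - 1]) return false;
	        i++; j--;
	    }
	    return true;
	}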

Now here's another definition:

	A (byte, character) *stream* is a *non-completed*
	sequence of (bytes, characters) which may be traversed
	once in the natural order; repeated traversal, and
	other traversal orders, need not be possible.


Now let us look at pipes, FIFOs, sockets, &c.
 
These things aren't even close to being strings.
They are BYTE STREAMS.

- The contents are *not* characters, they are *bytes*.
  It was and remains common practice to read and write *non-textual*
  data using these interfaces.  There are portability issues in
  transputting binary data in native format, but there are serious
  performance advantages to doing so.  Interfaces like XDR make it
  straightforward to read and write arrays and records and trees
  with never a character in sight.

  The fact that the external data are *byte* sequences rather than
  *character* sequences is the reason that we now have a problem
  with having to specify the encoding of an external stream when
  we *want* characters.  In another mailing list I'm on, a problem
  came up when a data set from the US government was proclaimed
  to be ASCII but was in fact Windows CP1250 (or some such number),
  and the receiver's system didn't _have_ any locale that could
  decode that code-page.

  To put that another way, given an external byte sequence
  accessed using pipes, FIFOs, sockets, &c, if it is to be
  interpreted as bytes, there is no question about what its
  contents are, but if it is to be interpreted as characters,
  the information needed to discern what the characters _are_
  is as a rule not in that byte sequence.

- The buffers transferred in a read() or write() call are not
  strings either.  They are *chunks* of a byte sequence.  (Oh,
  and if you have wide character data inside your program, it
  is very likely to be a bad idea to transput them directly
  this way.)  write(fd, &record, sizeof record) is not uncommon.

- The size of an external byte sequence accessed using pipes,
  FIFOs, sockets, &c is not knowable through that interface.
  The information can be conveyed by some other means, but
  it cannot be trusted.  (I could _say_ that there are 400 bytes
  but _send_ 500, or 300.)  The interface is a *stream*
  interface, not a *string* interface.  (A small sketch of such
  reading follows this list.)

- Only forward traversal is possible.
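
A minimal sketch of consuming such a byte sequence in C (the choice
of standard input and the buffer size are mine):

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
	    char buf[4096];        /* a chunk of the stream, not a string */
	    ssize_t n;
	    long long total = 0;

	    /* Forward, once, chunk by chunk; the total length is only
	       known after the stream has ended (read() returning 0).  */
	    while ((n = read(0, buf, sizeof buf)) > 0)
	        total += n;
	    printf("%lld bytes\n", total);
	    return 0;
	}

Fed from a pipe, nothing in the interface tells this program how many
bytes are coming; it finds out only when read() finally returns 0.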
> 
> As an aside: modern usage types the buf as void *.  In the Version 7 unix manuals on which I grew up (and the first edition of K&R) there was no void; buf would be just 'char *buf;'

Version 7 did have void but did not have void *.
Since void * and char * are required to have identical representations,
this is a distinction without a difference.  The point of the change
was simply that any object pointer type can be converted to or from
void * **without a cast**; using void * here is just POSIX telling the
C compiler not to do any serious type checking.
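
To make the no-cast point concrete, a small sketch (the names and the
record are mine):

	#include <unistd.h>

	struct rec { int id; double x; };

	void demo(int fd)
	{
	    struct rec r = {1, 2.0};

	    /* write()'s buffer parameter is const void *, so &r needs
	       no cast; with a char * parameter, standard C would have
	       required (char *)&r.                                    */
	    write(fd, &r, sizeof r);
	}
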
> I realize this is a terminology issue:
> 
> My usage of terminology like string/file are evidently more aligned to
> http://en.wikipedia.org/wiki/Vienna_Development_Method#Collections:_Sets.2C_Mappings_and_Sequences
> file(chap 4): http://red.cs.nott.ac.uk/~rxq/files/SpecificationCaseStudies.pdf

No, your use of 'string' is *not* well aligned with VDM.

	"For example, the type definition

	 String = seq of char

	 defines a type String composed of all finite strings of characters."

This is precisely the way I am using "string": a *completed* (hence *finite*)
sequence of characters that can be traversed more than once and in more than
one way (s(i) in VDM).

Pipes and FIFOs and sockets might or might not be finite.
The output of the classic UNIX 'yes' command is not bounded,
for example.

The distinction between strings and streams is a very important one.

I have seen a programming language standards committee try to
demand that any "file" should be able to answer its length,
apparently unaware that in UNIX /dev/tty is a "file" but has no
definite "length" (and even the read()-returns-zero-at-EOF hack
doesn't work; having received such a signal you can just keep on
reading).
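
A sketch of that behaviour in C, assuming a POSIX system (this
illustration is mine):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
	    char buf[256];
	    int fd = open("/dev/tty", O_RDONLY);
	    if (fd < 0) return 1;

	    /* Typing Ctrl-D at the start of a line makes this read()
	       return 0, the conventional "end of file" signal ...     */
	    ssize_t n1 = read(fd, buf, sizeof buf);
	    /* ... yet the descriptor is still open and a further
	       read() can deliver more data.                           */
	    ssize_t n2 = read(fd, buf, sizeof buf);

	    printf("first read: %zd, second read: %zd\n", n1, n2);
	    return 0;
	}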

> Contrariwise 'file' can mean http://en.wikipedia.org/wiki/Data_set_%28IBM_mainframe%29

I have been familiar with IBM data sets for enough years to prefer the spelling
with the space in it.  They aren't strings, but they are completed
multi-traversable sequences of records.  Records might or might not be strings.

(I have seen that same programming language standards committee try
to demand that any "file" should be positionable at an arbitrary byte,
and this despite including members who habitually used VM/CMS and
others who habitually used VMS.)

So it is *true* that the UNIX innovation was to take "BYTE STREAM"
as a lingua franca between programs, but it is *false* that it used
strings.

> So let me restate (actually I didn't state it earlier!) my point in this example:
> 
> When Intel introduced these instructions in 8008 (or whatever) decades ago, it seemed like a good idea to help programmers and reduce their burden by allowing them to do some minimal arithmetic on data without burdensome conversion-to-binary functions.

Conversion to binary is not burdensome, and is not the issue.
The issue is getting the *flags* right for decimal arithmetic.

Intel's 4004 was for calculators.  Intel's 8008 was redesigned to
be more useful for calculators.  Intel's 8080 had actual hardware
support for decimal arithmetic (the Auxiliary Carry flag).  And
the 8086 was intended to be compatible with the 8080 and 8085.
Not binary compatible, but source-to-source assembler translation
was supposed to be straightforward.  The DAA instruction comes from
the 8080.
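
For the curious, here is roughly what ADD followed by DAA has to get
right, written out in C (a sketch of mine for valid packed-BCD
operands, not Intel's microcode):

	/* Add two packed-BCD bytes, e.g. bcd_add(0x19, 0x28) == 0x47.
	   The "auxiliary carry" is the carry out of the low nibble.   */
	unsigned bcd_add(unsigned a, unsigned b)
	{
	    unsigned sum = a + b;                        /* plain binary add */
	    int aux = ((a & 0x0F) + (b & 0x0F)) > 0x0F;  /* auxiliary carry  */

	    if ((sum & 0x0F) > 9 || aux)
	        sum += 0x06;                 /* repair the low decimal digit  */
	    if (sum > 0x99)
	        sum += 0x60;                 /* repair the high decimal digit */
	    return sum & 0xFF;               /* any final carry is dropped    */
	}

Without the auxiliary carry (try 0x08 + 0x08) the low digit cannot be
repaired correctly, which is why the 8080 dedicated a flag to it.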

> 4 decades on, with (Intel's very own Gordon) Moore's law ensuring our machines and networks are some 7 orders of magnitude larger, the cost-equations look different.  printf and scanf are a basic given in any C library, so optimizing them out does not optimize anything.

That's a pretty massive non-sequitur.  I speeded up my Smalltalk->C compiler by
a factor of 2 by eliminating printf().  The reason why printf() is slow has of
course *nothing* to do with number conversion: it has to do with run-time parsing
of formats.
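
To see where the time goes, compare printf("%d\n", n) with a routine
that does only the conversion (a deliberately simple sketch of mine):

	#include <stdio.h>

	/* Convert and emit an unsigned integer plus a newline, with no
	   format string to scan and dispatch on at run time.          */
	static void put_uint(unsigned n)
	{
	    char buf[32];
	    int i = sizeof buf;

	    buf[--i] = '\n';
	    do { buf[--i] = '0' + n % 10; n /= 10; } while (n != 0);
	    fwrite(buf + i, 1, sizeof buf - i, stdout);
	}

printf("%d\n", n) has to re-parse "%d\n" and dispatch on it at every
call; the conversion itself is the cheap part.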
> 
> On the other hand, having instructions -- 1-byte instructions at that -- which are almost never used is terribly inefficient:

"Terribly" inefficient?  I doubt it.  I doubt it very much indeed.
(Where is Andy Glew when you need him?)

One-byte instructions are not really any more precious than others.

> - the extra transistors in the millions of CPUs that are never used

The whole 8080 was implemented in about 6000 transistors.
The 8086 had about 29000.
A quad-core Intel Core i7 has 731,000,000.

"The extra transistors" required to support decimal arithmetic on a
modern CPU are presumably about 1/100,000th of the total.
_This_ is to worry about?

> - the instructions that are used become fatter.  Multiply by the GBs per installation multiplied by millions of installations.

We have a wide range of techniques to deal with that.
In any case, if your program is compiled for 64-bit execution,
the decimal instructions aren't _there_ and _don't_ "fatten up"
any other instructions.


> you capture pithily in your 'Strings are wrong!' Put slightly more verbosely:
> 
> Strings (or byte-arrays if you prefer) are invariably what come into and go out of your program.

Nope.  *Streams*.  And like I said, PowerShell shows that from a practical
programming point of view, it _could_ be objects.

> Brings me to the OPs question:
> 
> I want to know if it is possible to use strings without "".
> 
> If I type
> Prelude>foo bar
> by which I actually mean
> Prelude>foo "bar"
> However I don't want to type ""s.
> 
> I have noticed that if bar is predefined or it is a number, it can be used as an argument. But can other strings be used this way? Like in bash, we can use ping 127.0.0.1 where 127.0.0.1 is an argument.
> 
> If not, can foo be defined as a function so that it recognizes arguments like bar as "bar"?
> 
> It's not clear what your use-case is.

Now we are in agreement.

Alan Perlis, epigram 34:

	The string is a stark data structure and
	everywhere it is passed there is much duplication of process.
	It is a perfect vehicle for hiding information.

More information about the Haskell-Cafe mailing list