Re: Line Separator Character

From: Frank da Cruz (fdc@watsun.cc.columbia.edu)
Date: Sat May 17 1997 - 18:39:39 EDT


> There are actually several other models for files of 7-bit or 8-bit
> character codes, commonly, but misleadingly, known as ASCII text files.
>
> The original model was control of a Teletype machine, where several control
> characters called for physical movement of the mechanism. Many of the bad
> habits used in text files are survivals of this model.
>
I wouldn't call them bad habits necessarily. The primary bone of contention
here is the distinction between LF and CR...

> CRLF was *required* to initiate a new line, but CR by itself was sometimes
> used for overstriking (if BS was not available), including underlining and
> composition ...
>
Right. And LF was used by itself to go down one row.

> We then had the glass Teletype, or dumb terminal, model, which might treat
> CR and LF as on mechanical devices, or might treat them both as new line
> characters...
>
Actually I think that practically all CRTs treat CR and LF just as the TTY
did. CR positions the cursor to the left of the current row, LF moves it
down one row.
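
To make the TTY behavior concrete, here is a minimal sketch in Python of a terminal that treats CR and LF as separate motions. The function name render and the list-of-rows screen model are my own illustrative assumptions, not taken from any real terminal emulator.

```python
def render(stream):
    """Render a character stream the way a TTY-style terminal would.

    CR returns the cursor to column 0 of the current row, LF moves it
    down one row without changing the column, BS moves it left by one.
    """
    rows = [[]]          # screen contents, one list of characters per row
    row, col = 0, 0      # cursor position
    for ch in stream:
        if ch == "\r":
            col = 0
        elif ch == "\n":
            row += 1
            if row == len(rows):
                rows.append([])
        elif ch == "\b":
            col = max(0, col - 1)
        else:
            line = rows[row]
            while len(line) <= col:
                line.append(" ")
            # A CRT replaces the old character; paper would overstrike it.
            line[col] = ch
            col += 1
    return ["".join(r).rstrip() for r in rows]


if __name__ == "__main__":
    for text in ("ab\ncd", "ab\r\ncd", "abc\rX"):
        print(repr(text), "->", render(text))
```

With LF alone the second line starts where the first one ended (the familiar "staircase"); only CR followed by LF starts a new line at the left margin.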

> Now, on computers with GUIs, we have different systems that expect CR, or
> LF, or CRLF, as the new line signal, and have other interpretations of
> other control characters.
>
Really the problem started when the UNIX designers decided that it was a good
idea to have a storage model that was different than the transmission model.
This allowed some space to be saved on disk, and it made text processing
software a bit easier to write. However, it complicated the tty driver by
requiring it to substitute CRLF for LF when displaying text files, which in
turn has led to all sorts of confusion about "raw" vs "cooked" mode, etc,
and the related distinction between NVT vs binary mode in Telnet protocol.
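
As a rough illustration of the substitution described above, here is a sketch in Python of the two output paths. The function names cook_output and raw_output are hypothetical; a real tty driver does this in the kernel (POSIX termios calls the output flag ONLCR), and this sketch models only the newline mapping, nothing else the driver does.

```python
def cook_output(data: bytes) -> bytes:
    """'Cooked' output: expand every stored LF to CRLF for the terminal."""
    return data.replace(b"\n", b"\r\n")


def raw_output(data: bytes) -> bytes:
    """'Raw' output: pass the bytes through untouched."""
    return data


if __name__ == "__main__":
    stored = b"one\ntwo\n"        # how the file sits on disk
    print(cook_output(stored))    # what the terminal needs to see
    print(raw_output(stored))     # what it gets if nothing is translated
```

The same file therefore has two byte sequences depending on the mode, which is where much of the confusion comes from.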

(It is a simplification to say that UNIX was the first disk operating system
to store textual files differently than it transmitted them, but it may have
been the first *stream-oriented* one to do so -- or at least the one we
remember.)

Thus CRLF has always been the line terminator in ASCII (in the broad sense of
"not EBCDIC") text transmission. Systems that chose to use different internal
representations have had the obligation to convert back and forth during
transmission.
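
The conversion obligation can be sketched in a few lines of Python. The names to_wire, from_wire, and WIRE_EOL are assumptions made for this example, and the sketch covers only the line terminator; real protocols have further rules that are ignored here.

```python
WIRE_EOL = b"\r\n"   # line terminator used in transmission


def to_wire(stored: bytes, local_eol: bytes = b"\n") -> bytes:
    """Convert the local storage convention to CRLF before sending."""
    if local_eol == WIRE_EOL:
        return stored                       # already in wire form
    return stored.replace(local_eol, WIRE_EOL)


def from_wire(received: bytes, local_eol: bytes = b"\n") -> bytes:
    """Convert received CRLF text back to the local storage convention."""
    if local_eol == WIRE_EOL:
        return received
    return received.replace(WIRE_EOL, local_eol)


if __name__ == "__main__":
    lf_text = b"a\nb\n"      # a system that stores LF
    cr_text = b"a\rb\r"      # a system that stores CR
    print(to_wire(lf_text), to_wire(cr_text, b"\r"))
    print(from_wire(to_wire(lf_text)) == lf_text)
```

A system that stores CRLF gets both functions for free, which is the point of the paragraph above.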

It's interesting to speculate how different the world (of computing) might be
today if only a few arbitrary and perhaps whimsical decisions had been made
differently decades ago: if UNIX and several other popular platforms had used
CRLF rather than LF (or CR) as the line terminator; if DOS had used "forward
slash" (/) rather than "backward slash" (\) as the directory separator... How
many person-eons of effort have gone into addressing the consequences of these
decisions...

> HT and FF were very commonly used...
>
(And still are...) Now there's an interesting point. Unicode has addressed
the CR/LF/CRLF confusion with LS and PS, but what about formfeed? Isn't it
sometimes just as necessary to specify a hard page break as it is to specify a
hard line or paragraph break? I suppose there must be a boundary somewhere
between "Trust your rendering engine" and "Mother, Please! I'd rather do it
myself!" I don't have a copy handy, and I might be entirely wrong about this,
but isn't the Holy Koran a document that must be paginated in a specific way?
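
For what it's worth, here is a sketch in Python of how a plain-text consumer might honor all three kinds of hard break, assuming LS (U+2028) ends a line, PS (U+2029) ends a paragraph, and the old formfeed control (U+000C) is kept as the page break. The function name paginate and the nested-list result are illustrative assumptions only.

```python
LS = "\u2028"   # Unicode LINE SEPARATOR
PS = "\u2029"   # Unicode PARAGRAPH SEPARATOR
FF = "\x0c"     # formfeed, used here as a hard page break


def paginate(text):
    """Split text into pages, each a list of paragraphs, each a list of lines."""
    return [
        [paragraph.split(LS) for paragraph in page.split(PS)]
        for page in text.split(FF)
    ]


if __name__ == "__main__":
    letter = ("Dear Sir," + LS + "Re: your letter" + PS
              + "Body of the first sheet." + FF
              + "Second sheet.")
    for number, page in enumerate(paginate(letter), start=1):
        print("page", number, page)
```

Python's own str.splitlines() already treats LS, PS, and FF as line boundaries, so the only extra work is keeping the three levels distinct.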

In any case, the strong Use-A-GUI thrust of Unicode will make it increasingly
difficult for certain kinds of people to operate in the ways to which they
have become accustomed over the past decades in which plain text was "good
enough" save that one could not put lots of languages into it. For example,
today I can write a letter that spills over to one or more "second sheets" in
plain text and print it on a plain-text printer without a second thought,
using any software at all on any platform, embedding hard line, paragraph, and
page breaks in it, just as most of us still do with email (except for the page
breaks). No "templates", "wizards", "profiles", "preferences", or
"Buzzword-1.0 Compliance" involved. I can move this letter to practically any
other platform and it will still be perfectly legible and printable -- no
export or import or conversion or version skew to worry about. I think a lot
of people would be perfectly happy to do the same in a plain-text Unicode
world using plain-text Unicode terminals and printers, if there were such
things. But there's a bigger issue...

The idea that one must embed Unicode in a higher level wrapper (e.g. a
Microsoft Word document, or even HTML) to make it useful has a certain
frightening consequence: the loss of any expectancy of longevity for our new
breed of documents. These higher-level systems will be overwhelmingly
proprietary due to the vast amount of coding that must go into them, the
voracious nature of the marketplace, etc, and so formats will become obsolete
with ever-increasing frequency, and it will become ever harder to extract the
plain-text characters -- the substance -- from them. That which is perceived
at a critical moment in time to be worthy of preservation will be converted to
the new format, the rest discarded or left for decipherment by future
generations of information archaeologists. (If you don't believe this is a
problem, think about what is happening to our (physical) libraries all over
the world at this moment -- get ready to say goodbye forever to five millennia
of history that was not worth digitizing.) (And then to do it all over again
when the digital formats and media need conversion in another ten years.)
(And then again five years after that, etc...)

So let's do our part and make some effort to accommodate traditional
plain-text applications in Unicode, rather than discourage them :-)

- Crank (Oops, I mean Frank)


