Re: Line Separator Character

From: Edward Cherlin
Date: Sat May 17 1997 - 15:55:54 EDT

"Martin J. Duerst" wrote:
>On Fri, 16 May 1997, Pierre Lewis wrote:
>> Context: plain text unicode file.
>There are basically two models of plain text. The first is line-oriented,
>the second is paragraph-oriented. Email or program code is the traditional
>example of line-oriented plain text. Descriptive text as it appears in
>word processors, minus formatting, is the typical example of paragraph-
>oriented plain text.
>In traditional encoding (using CR/LF/CRLF) and in "official" Unicode
>encoding (using PS), the two models are made compatible by treating
>each line in the line-oriented plain text as a paragraph. On the other
>hand, the paragraph-oriented model can be reduced to the line-oriented
>model by splitting lines in a particular layout of the paragraph.
>This splitting is again done by paragraph separators (CR/LF/CRLF/PS),
>and not by LS.

There are actually several other models for files of 7-bit or 8-bit
character codes, commonly, but misleadingly, known as ASCII text files.

The original model was control of a Teletype machine, where several control
characters called for physical movement of the mechanism. Many of the bad
habits used in text files are survivals of this model. Others, fortunately,
have died out. (I am thinking of some of the uses of control characters in
editors meant for hard copy terminals.)

CRLF was *required* to initiate a new line, but CR by itself was sometimes
used for overstriking (if BS was not available), including underlining and
composition of APL characters, and also for imitating typewriter
overstrikes such as c| for the cent sign and some accented letters such as
u" or e`. HT and FF were very commonly used, and some others, such as SI
and SO, less so, but each of these specified a mechanical action. SI and SO
allowed a fairly standard way to control some dual-script devices including
ASCII/Arabic, ASCII/Cyrillic, APL/ASCII, and other combinations.
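
To make the Teletype model concrete, here is a minimal Python sketch of a
hard-copy terminal. It is an illustration only: the function name
render_teletype and the cell representation are assumptions, not any real
device's behavior. CR returns the carriage, LF only feeds the paper, BS
backs up one column, and anything struck on an occupied column overstrikes.

```python
def render_teletype(stream):
    """Simulate a hard-copy terminal (hypothetical model).

    Returns a list of printed lines; each line is a list of cells, and
    each cell is the string of glyphs struck at that column.
    """
    lines = [[]]
    col = 0
    for ch in stream:
        if ch == "\r":        # carriage return: back to column 0, same line
            col = 0
        elif ch == "\n":      # line feed: advance the paper, keep the column
            lines.append([])
        elif ch == "\b":      # backspace: one column to the left
            col = max(col - 1, 0)
        else:
            row = lines[-1]
            while len(row) <= col:
                row.append("")
            if ch != " ":     # a space moves the carriage without striking
                row[col] += ch
            col += 1
    return lines


# Underlining with a bare CR, then CRLF to start a new line.
print(render_teletype("abc\r_ _\r\n"))
# Composing a cent sign with BS.
print(render_teletype("c\b|"))
# LF alone leaves the carriage where it was, which is why CRLF was required.
print(render_teletype("ab\ncd"))
```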

Many devices used ASCII control characters for new purposes, so that an
ASCII character string could specify the hardware behavior needed for bold
facing and so on. The actual process of printing might call for translation
from a 'text file' to an ASCII command string file which would produce the
same printed image by other means. For example, a printer driver for a
bidirectional printer could save time by printing alternate lines in
reverse order, with LF and some spacing commands between lines.
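
The bidirectional case can be sketched in Python as follows. This is a
rough illustration under stated assumptions, not a real driver: the name
bidirectional_passes, the tuple layout, and the idea of expressing the
spacing command as a signed column offset are all invented for the example.

```python
def bidirectional_passes(lines):
    """Plan print-head passes for a bidirectional printer (hypothetical).

    Returns one (move, direction, chars) tuple per line. 'move' is the
    signed number of columns the head travels before printing; 'chars'
    is the text in the order the head strikes it.
    """
    passes = []
    head = 0  # column where the head rests after the previous pass
    for i, line in enumerate(lines):
        if i % 2 == 0:
            # Even lines print left to right, starting at column 0.
            passes.append((0 - head, "ltr", line))
            head = len(line)
        else:
            # Odd lines print right to left, starting at the line's end.
            passes.append((len(line) - head, "rtl", line[::-1]))
            head = 0
    return passes


for move, direction, chars in bidirectional_passes(["abc", "hello", "xy"]):
    print(move, direction, chars)
```

The printed image is the same as a plain left-to-right rendering; only the
command stream sent to the device differs.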

We then had the glass Teletype, or dumb terminal, model, which might treat
CR and LF as on mechanical devices, or might treat them both as new line
characters, or might do something else. At the same time, 'text files'
could still be used to control electronic printers, with varying
interpretations of some of the control characters.

Now, on computers with GUIs, we have different systems that expect CR, or
LF, or CRLF, as the new line signal, and have other interpretations of
other control characters. System software vendors are going off in all
directions inventing new misinterpretations of Unicode characters and
constructing yet other file designs.
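
As a sketch of what a reader program has to do today, the Python below
accepts CR, LF, CRLF, or PS (U+2029) as a paragraph separator and treats LS
(U+2028) as a line break inside a paragraph, following the model quoted
above. The function name to_paragraphs and the nested-list result are
assumptions made for illustration; this is one possible reading, not a
standard definition.

```python
import re

# CRLF must come first so it is consumed as one separator, not two.
_PARAGRAPH_BREAK = re.compile("\r\n|\r|\n|\u2029")
_LINE_SEPARATOR = "\u2028"


def to_paragraphs(text):
    """Split text into paragraphs, and each paragraph into lines.

    CR, LF, CRLF, and PS end a paragraph; LS ends a line within one.
    """
    return [p.split(_LINE_SEPARATOR) for p in _PARAGRAPH_BREAK.split(text)]


sample = "a\u2028b\r\nc\rd\ne\u2029f"
print(to_paragraphs(sample))
```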

We want to have a uniform, portable definition of the meaning of a file of
16-bit character codes interpreted as Unicode, or "Unicode text file" for
short. At the same time, we have several uses for such files, where
different interpretations may be desired. If we want to do this right, I
think we have to find the appropriate organization for defining such file
formats and uses, and get down to some serious and at times difficult
standard making. The Unicode character code standard does not seem to be
the right place to do this.

Edward Cherlin
Vice President
NewbieNet, Inc.
1000 members and counting, 17 May 97
Help outlaw Spam

Everything should be made as simple as possible, __but no simpler__.
Attributed to Albert Einstein
