Re: Plain Text

From: Frank da Cruz (fdc@watsun.cc.columbia.edu)
Date: Fri Jul 02 1999 - 12:02:27 EDT


> The problems we have with ASCII plain text come mainly from a small set of
> common variant practices.
>
> Using CR, LF, or CR/LF as a line or paragraph end
> Different tab spacings
> Optional line wrap
> Formfeed codes vs. computed page breaks
> BS = DEL or BS-overstrike
>
We all have dealt with these annoyances throughout our careers. They are
indeed annoying, but not impassable impediments. Also, let's not mix up:

 . File storage format
 . Interchange format
 . Data entry format

> Using CR, LF, or CR/LF as a line or paragraph end
>
As a line end:
  This is a file storage issue.

As a paragraph end:
  There is no such thing as a paragraph end or paragraph separator in
  traditional plain text.

Here I am sitting at my VT100 terminal, which is plugged in to my UNIX
computer. I type:

  This is a line

Then I push the Return key (sometimes marked Enter), which sends a Carriage
Return. I would enter a line in exactly the same way no matter what
computer was on the far end of the wire. Now:

 . The UNIX terminal driver turns the CR into a LF before giving it
   to the application. If the application is storing the line into a
   file, the file gets "This is a line<LF>". Ditto for some other
   operating systems, like AOS/VS.

 . If I had OS-9 on the far end, it would store "This is a line<CR>".

 . If I had TOPS-10, TOPS-20, RT-11, etc, on the far end, it would
   store "This is a line<CR><LF>".

 . If I had VMS, VOS, VM/CMS, MVS/TSO or other complex file system on
   the far end, who knows how the line would be stored -- it depends on
   the chosen file organization and record format.

The point is, it doesn't matter. Each platform has its own format for
internal use, but a standardized interface to the outside world. To further
demonstrate this fact, if I then tell the computer on the far end to "type"
or "cat" the file, it will, invariably, send:

  This is a line<CR><LF>

So who cares what the file format is -- except of course when we want to
transfer the file to another platform. In that case, it is the
responsibility of each file-transfer agent to convert between its peculiar
local format and the common one. And that is exactly what they do, just
as is done at the terminal/terminal-driver/data-entry level. FTP and Kermit
are two examples that show it is not that hard to convert plain-text file
record formats from one platform to another. (And in Kermit's case, the
character set too.)
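
The conversion described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the platform table
and the names to_wire and from_wire are hypothetical, not anything taken
from FTP or Kermit, and each platform is assumed to use a single fixed
line terminator (which leaves out record-oriented systems like VMS).

```python
# Hypothetical table of local line terminators (illustrative only).
LOCAL_EOL = {"unix": "\n", "os9": "\r", "tops20": "\r\n"}


def to_wire(text: str, platform: str) -> str:
    """Convert locally stored text to the common CRLF interchange form."""
    eol = LOCAL_EOL[platform]
    if eol == "\r\n":
        return text  # already in the common form
    return text.replace(eol, "\r\n")


def from_wire(text: str, platform: str) -> str:
    """Convert CRLF interchange text to the platform's local form."""
    return text.replace("\r\n", LOCAL_EOL[platform])
```

Sending from one platform to another is then from_wire(to_wire(text,
sender), receiver); neither side needs to know the other's local format.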

Of course life would have been simpler if there had been only ONE standard
text-file format used on all platforms. But the early days of computing
were a time of "Let the Hundred Flowers Bloom", and they did. Now, however,
we are in a position to start over, and it is an opportunity we are not
likely to have again.

> Different tab spacings
>
I used to say this too, but the last platform I know about that did not
assume tabstops at 1,9,17,25,... was MULTICS. Of course tabs are variable
in word processors, etc, but that is not plain text.
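
For concreteness, here is a small Python sketch of expanding tabs against
stops at columns 1,9,17,25,... The function name expand_tabs is just an
illustrative choice, and every non-tab character is assumed to occupy
exactly one column.

```python
def expand_tabs(line: str, tab_width: int = 8) -> str:
    """Replace each HT with spaces up to the next tab stop."""
    out = []
    col = 0  # 0-based column; stops at 0, 8, 16, ... (1, 9, 17 when 1-based)
    for ch in line:
        if ch == "\t":
            pad = tab_width - (col % tab_width)
            out.append(" " * pad)
            col += pad
        else:
            out.append(ch)
            col += 1
    return "".join(out)
```

Python's built-in str.expandtabs(8) gives the same result for a single
line; the loop is spelled out only to show the column arithmetic.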

> Optional line wrap
>
This is a feature of the terminal or the application, not of "plain text".
Files that do not contain line breaks and must rely on some form of
postprocessing to insert line breaks at appropriate points are not really
plain text; they are "input for a text formatter". Prior to the advent of
word processors, the idea of "long line as paragraph" never came up.

> Formfeed codes vs. computed page breaks
>
Page breaks are an issue worth discussing, and we discussed them at some
length two years ago. Basically, you can let your "rendering engine" or
printer driver insert them for you, or you can insert them yourself. One
should be allowed the choice. (Why would anybody want "hard" page breaks?
Because they are printing paychecks, invoices, envelopes, etc.)

> BS = DEL or BS-overstrike
>
This is a data entry issue, unless you mean including BS in a file for
overstriking. But in that case, there is never any confusion between BS and
DEL, since DEL is never used for that purpose. In other words, the only
confusion is at data entry, and this is entirely irrelevant to the
definition of plain text.

> >Lines are terminated at somewhere between 72 and 80 characters by
> >convention, because that's how wide terminal screens are, and before them
> >the Teletype carriage, and before that the most common kind of punchcard.
> >Or for that matter, typewriters and sheets of paper (A4 or US, take your
> >pick :-)
> >
> >To this day, we follow these conventions in newsgroups and email, although
> >now it might be more a matter of "netiquette" than necessity (as in the
> >BITNET days, when e-mail was, quite literally, 80-column card images).
>
> As long as e-mail readers cannot correctly reformat messages with bad
> line breaks                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
> (like this), it will be a matter of real necessity.
>
What does "correctly reformat messages" mean? How can your mail client read
my mind? How does it know that the message I sent you was not already
formatted exactly the way I wanted it?

Notice that to illustrate my point, I need your original formatting (above)
preserved, with the "> " quote indicators added at the left margin, and with
my emphasis added under the appropriate words. What is a "correct" mail
client supposed to do with this? Something like this?:

> As long as e-mail readers cannot correctly
    reformat messages with bad > line breaks
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^ > (like this),
    it will be a matter of real necessity.

No, a correct email client will leave it alone. Whether I want my email
reformatted by your client should be my choice, since only I know what my
intentions are in sending it.

Granted, plain text requires some minimal level of agreement, for example
that your screen is 72 (or 76, or 79) columns wide. I maintain that this
convention is universal, except for Kanji, etc, which are displayed in two
character cells each. People who use email, netnews, and other forms of
open, interplatform communication have learned these conventions. We use
them ourselves on this mailing list. Those of us who do not are often
excoriated for our antisocial behavior.

Especially when we send email or netnews in some application-specific
format, assuming that everybody else uses the same platform and applications
we do.

> >These simple conventions let us format our text exactly the way we want
> >to. We can indent or not, we can put line breaks where we want them, we
> >can have columns of numbers or other tabular presentations, mathematical
> >expressions,
>
> which actually require several hundred non-ASCII characters, unless you
> mean, as so many do, arithmetic expressions.
>
Yes, that's what I meant, thanks. (All of us here recognize the
shortcomings of ASCII -- that's why we're here! But let's not forget that
ASCII can be used to write, say, Fortran programs that can handle far more
in the way of mathematics than the repertoire of ASCII might suggest, and
that people send Fortran-like expressions back and forth in email, etc,
which could easily lose their meaning when reformatted.)

> When I want my text to stay as I wrote it, I put it into a PDF, not a text
> file. Others prefer TeX for this purpose, or PostScript.
>
My point exactly. And how do I read your PDF if I don't have a PDF reader?
(Don't say "get one" -- I'm reading your mail on a DOS PC or a PDP-11, or a
Cray supercomputer.) How do I read TeX if I don't have the software? How
do I read PostScript if I don't have a PostScript printer or rendering
engine? But the crucial point is:

      How will I read your PDF file 200 years from now, when
      PDF itself has been consigned to the "legacy" trashheap
      for the past 195 years?

> We raised the question of defining a Unicode plain text format about two
> years ago, but nothing seemed to come of it.
>
Then let's try again. Let me get the ball rolling with the following simple
suggestion for Unicode Plain-Text File and Interchange Format:

A monospaced character-cell display device is assumed for the purposes of
line breaking. Characters that are too wide for a character cell (such as
Kanji) occupy a double-width cell. Of course, Unicode Plain Text can also
be displayed on any other kind of device, in any font, monospaced or not, in
which case "all bets are off", just as they are now with traditional plain
text when displayed in a proportional font.

Conversely, it is recognized that a monospaced (or duospaced) character-cell
device might be inadequate for display of certain writing systems, such as
Arabic or Indic scripts, and in this case intelligent rendering engines
might very well be required. This should, nevertheless, be possible with
plain text, without the aid of any particular markup scheme.
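
As a rough sketch of the cell-counting assumption, the Python fragment
below measures a line in character cells. Using the Unicode East Asian
Width property (values "W" and "F") to decide which characters are
double-width is an assumption made here, not something the suggestion
specifies, and the function names are illustrative. Combining marks and
the complex scripts mentioned above are deliberately not handled.

```python
import unicodedata


def cell_width(ch: str) -> int:
    """Return 2 for East Asian Wide/Fullwidth characters, otherwise 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def line_width(line: str) -> int:
    """Width of a line in character cells on a duospaced display."""
    return sum(cell_width(ch) for ch in line)
```

For example, line_width applied to a string of two Kanji returns 4, while
the same call on a two-letter ASCII string returns 2.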

Plain text is composed only of Unicode characters, with no meta-level
of formatting information, presentation hints, etc, except:

 1. Spaces, such as U+0020 and U+00A0, which are "kept" (e.g.
    adjacent spaces are not collapsed).

 2. Horizontal Tabs are indicated by the HT character, U+0009. Tab
    stops shall be assumed every 8 columns, starting at the first. (This
    provision is primarily to facilitate conversion of ASCII and 8-bit
    text to Unicode. Alternatively, it would be OK to force all
    horizontal alignment to be accomplished by spaces.)

 3. Line breaks are indicated by Line Separator, U+2028. Preformatted
    text must break lines at column 79 or less to avoid unwanted
    reformatting. Column numbers are 1-based, relative to the left or
    right margin, according to the prevailing directionality, with
    single-width characters as the counting unit. A line break is
    required at the end of the final line if it is to be considered a
    line. (This is to allow append operations to work in the expected
    fashion.)

 4. Paragraph breaks are indicated by two successive Line Separators
    or by Paragraph Separator, U+2029.

 5. Hard page breaks are indicated by FF, U+000C.

C0 and C1 control characters other than HT and FF have no function
whatsoever in Unicode Plain Text. (If there were Unicode Horizontal Tab and
Page Break characters, we wouldn't need C0 at all; however, the UTC -- or at
least members of it, in previous discussions -- indicated that there is no
good reason to duplicate the C0 characters that are already in Unicode.)
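
To make the rules above concrete, here is a minimal Python sketch of a
reader for such a file. It is an illustration under assumptions, not a
reference implementation: the names parse and stray_controls are invented
here, the reader is lenient about a missing final Line Separator, and a
run of three or more Line Separators is simply treated as one paragraph
break plus leftovers that are dropped.

```python
import re

LS, FF = "\u2028", "\u000c"

# C0 and C1 controls other than HT (U+0009) and FF (U+000C).
STRAY_CONTROL = re.compile(
    r"[\u0000-\u0008\u000a\u000b\u000d-\u001f\u0080-\u009f]"
)
# Paragraph Separator, or two successive Line Separators.
PARAGRAPH_BREAK = re.compile(r"\u2029|\u2028\u2028")


def stray_controls(text):
    """List (offset, code point) for controls that have no function."""
    return [(m.start(), f"U+{ord(m.group()):04X}")
            for m in STRAY_CONTROL.finditer(text)]


def parse(text):
    """Split text into pages, each a list of paragraphs (lists of lines)."""
    pages = []
    for page in text.split(FF):
        paragraphs = []
        for para in PARAGRAPH_BREAK.split(page):
            # Drop empty pieces left by terminators at the edges.
            lines = [ln for ln in para.split(LS) if ln]
            if lines:
                paragraphs.append(lines)
        pages.append(paragraphs)
    return pages
```

Spaces and tabs inside each line are left untouched, in keeping with
items 1 and 2.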

A Unicode plain-text "rendering engine" shall not mess with the format of a
plain-text file except, optionally, at the user's discretion, to wrap lines
that are longer than the display or printing device. Higher-level rendering
engines, of course, can do whatever they want.

- Frank


