Re: Plain Text

From: Frank da Cruz (fdc@watsun.cc.columbia.edu)
Date: Thu Jul 01 1999 - 16:11:21 EDT


> On 1999-06-30 at 14:17 PDT, Markus Kuhn wrote:
> > The only thing that is clear about "plain text" is that it is not well
> > defined at all.
>
> On 1999-06-30 at 15:32 PDT, Frank da Cruz wrote:
> > Actually, it tends to be well-defined for each platform.
>
> In MS-DOS (or PC-DOS and other DOS variants) on the PC, it is not
> well defined, at all:
>
Not to prolong this discussion, which took place once before, at great
length, in May to July 1997...

> - '0D0A'x (CR+LF) means either line-break or paragraph separator,
>
When/if it means paragraph separator, it's not plain text. Plain text is
what you TYPE at the DOS prompt. In such files (e.g. a READ.ME file)
CRLF means Carriage Return (move the cursor to the left margin) and
Line Feed (move the cursor down one row).
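
As a rough illustration of that reading of CR and LF, here is a minimal
sketch of a toy screen model. The function name render and the list-of-rows
representation are assumptions made for this example only; it ignores line
wrapping, scrolling, and every control character other than CR and LF.

```python
def render(text: str) -> list[str]:
    """Toy screen model: CR returns the cursor to the left margin,
    LF moves it down one row, anything else is written at the cursor."""
    rows: list[list[str]] = [[]]
    row = 0
    col = 0
    for ch in text:
        if ch == "\r":
            col = 0                      # Carriage Return: back to left margin
        elif ch == "\n":
            row += 1                     # Line Feed: down one row, same column
            if row == len(rows):
                rows.append([])
        else:
            line = rows[row]
            while len(line) <= col:      # pad with blanks up to the cursor
                line.append(" ")
            line[col] = ch               # overwrite whatever was there
            col += 1
    return ["".join(r) for r in rows]
```

Under this model render("one\r\ntwo") gives two rows, "one" and "two",
while a bare CR lets later text overwrite the start of the same row.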

> - '09'x (HT) means either a tabulator (and nobody knows where the
> tab positions are supposed to be) or a line-break,
>
In DOS, when you TYPE a file at the DOS prompt, each Tab character is expanded
to enough blanks to reach the next tab stop. Tab stops are set according
to the most common convention: columns 1, 9, 17, ... (1-based).
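
The expansion rule can be sketched as follows. The function name expand_tabs
and the tab_width parameter are assumptions for this example, not anything
DOS itself exposes; the sketch handles a single line and treats every
non-Tab character as one column wide.

```python
def expand_tabs(line: str, tab_width: int = 8) -> str:
    """Replace each Tab with enough blanks to reach the next tab stop.

    With tab_width=8 the stops are at 1-based columns 1, 9, 17, ...
    (0-based columns 0, 8, 16, ...).
    """
    out = []
    col = 0  # 0-based column of the next character to be written
    for ch in line:
        if ch == "\t":
            pad = tab_width - (col % tab_width)  # always between 1 and tab_width
            out.append(" " * pad)
            col += pad
        else:
            out.append(ch)
            col += 1
    return "".join(out)
```

For example, expand_tabs("a\tb") puts "b" in 1-based column 9, with seven
blanks between the two letters.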

> - '1A'x (SUB, aka Ctrl-Z) either means end of text, or a
> right-pointing arrow; when it is used as an end-of-text marker,
> the remainder of the storage block may contain arbitrary characters
> with some programs and must contain '00'x with other programs (nice
> feature when one of the former writes a file one of the latter is
> supposed to read).
>
That's not a plain-text issue; it's a character encoding and file format
issue. Ctrl-Z as an EOF indicator is a relic of CP/M, carried forward into
DOS for compatibility, used by some apps and ignored by others.
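
For the applications that do honor it, the convention amounts to ignoring
everything from the first Ctrl-Z onward. A minimal sketch, assuming the file
contents are already in memory as bytes; the name strip_dos_eof is made up
for this example, and applications that ignore Ctrl-Z would simply skip
this step.

```python
CTRL_Z = b"\x1a"  # SUB, used as an end-of-text marker by some DOS programs

def strip_dos_eof(data: bytes) -> bytes:
    """Return the bytes before the first Ctrl-Z, or all of data if none."""
    head, _sep, _tail = data.partition(CTRL_Z)
    return head
```

Whatever follows the marker (arbitrary bytes or '00'x padding) is dropped
either way, so both kinds of writer look the same to the reader.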

Two years ago I suggested that we come up with a standard for Unicode plain
text that can be used as a baseline when converting files from DOS, UNIX, the
Macintosh, etc., to Unicode, and that says what control characters (C0, C1,
as well as Line Separator, Paragraph Separator, etc) mean in a plain-text
file or data stream. We made some good progress but eventually the discussion
fizzled out. If I can summarize it briefly:

 . Yes, but plain text in this sense is inadequate for representing
   (list of writing systems that need higher-level formatting assistance,
   rendering engines, etc.)

 . Fine, but they need that anyway. For many other languages, plain text
   is possible, and there should be no reason not to settle on a standard
   representation for it in those cases where it can be used.

If anybody would like to revisit that discussion, I've uploaded it to:

  ftp://kermit.columbia.edu/kermit/e/plain.txt

(about 300K of plain text :-)

- Frank


