Re: Plain Text [**NOT**]

From: Edward Cherlin (edward.cherlin.sy.67@aya.yale.edu)
Date: Fri Jul 02 1999 - 18:20:10 EDT


There is no point in continuing to argue about these matters. There are
numerous variants of ``plain text'' used in different ways, and no amount
of shouting that it has to be your way or nothing will change the fact that
other people have never even heard of your way as an alternative to their
way.

But keep going anyway. I'm collecting your rants to use as illustrations in
some things I am planning to write (with names suppressed to protect the
guilty, of course).

At 13:06 -0700 7/1/1999, Frank da Cruz wrote:
>> On 1999-06-30 at 14:17 PDT, Markus Kuhn wrote:
>> > The only thing that is clear about "plain text" is that it is not well
>> > defined at all.

Hear, hear.

>> On 1999-06-30 at 15:32 PDT, Frank da Cruz wrote:
>> > Actually, it tends to be well-defined for each platform.

If that were true, it would make the point that it is not well-defined
overall, since cross-platform file transfer is the key issue, whether for
ASCII or for Unicode.

>> In MS-DOS (or PC-DOS and other DOS variants) on the PC, it is not
>> well defined, at all:
>>
>Not to prolong this discussion, which took place once before, at great
>length, in May to July 1997...
>
>> - '0D0A'x (CR+LF) means either line-break or paragraph separator,
>>
>When/if it means paragraph separator it's not plain text. Plain text is
>what you TYPE at the DOS prompt. In such files (e.g. a READ.ME file)
>CRLF means Carriage Return (move the cursor to the left margin) and
>Line Feed (move the cursor down one row).

Which is precisely what I don't type. I type Enter. I know of no device
which required the user to enter a CR followed by an LF, although I have
used several which permitted it.
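
To make the cross-platform point concrete, here is a minimal sketch of the
conversion a file-transfer tool has to perform when it treats CR+LF, a bare
CR, and a bare LF all as "end of line". The function name and the choice of
LF as the default target are assumptions made for illustration only; nothing
in this thread specifies them.

```python
import re

# Hypothetical helper: treat CR+LF, bare CR, and bare LF as the same
# logical line break and rewrite each one as a single convention.
_LINE_BREAK = re.compile(rb"\r\n|\r|\n")

def normalize_newlines(data: bytes, newline: bytes = b"\n") -> bytes:
    # CR+LF is listed first in the pattern so it is consumed as one
    # break, not as a CR followed by a separate LF.
    return _LINE_BREAK.sub(lambda match: newline, data)

dos_text = b"first line\r\nsecond line\r\n"
unix_text = normalize_newlines(dos_text)            # LF-terminated lines
back_to_dos = normalize_newlines(unix_text, b"\r\n")
```

Note that the round trip is only lossless when the input used one convention
consistently. A file that mixes them deliberately (CR alone for overstriking,
say) loses that distinction.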

>> - '09'x (HT) means either a tabulator (and nobody knows where the
>> tab positions are supposed to be) or a line-break,
>>
>In DOS, when you TYPE a file at the DOS prompt, a Tab character is expanded
>to enough blanks to bring us to the next tab stop, which are set according
>to the most common convention: 1, 9, 17, ... (1-based).

I have used editors that defaulted to other tab lengths, and that allow the
user to set tabs at intervals of 4 to 8 (which is annoying, since I always
want three spaces for indenting code). The existence of a single program,
or even a hundred programs, with a specific behavior does not make that
behavior standard.
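
For reference, the expansion Frank describes (stops at columns 1, 9, 17, ...
counting from 1) can be sketched as below, with the interval left as a
parameter because that is exactly the part on which programs disagree. The
function name is hypothetical, and this is not a description of what the DOS
TYPE command does internally.

```python
def expand_tabs(line: str, tab_width: int = 8) -> str:
    """Replace each tab with blanks up to the next tab stop.

    Stops fall every tab_width columns; with 0-based columns and a
    width of 8 that is 0, 8, 16, ... (1, 9, 17, ... when 1-based).
    """
    out = []
    col = 0  # 0-based column of the next character to be written
    for ch in line:
        if ch == "\t":
            pad = tab_width - (col % tab_width)  # always at least 1
            out.append(" " * pad)
            col += pad
        else:
            out.append(ch)
            col += 1
    return "".join(out)

eight = expand_tabs("a\tb")      # "b" lands in 1-based column 9
three = expand_tabs("a\tb", 3)   # "b" lands in 1-based column 4
```

The same bytes produce two different layouts, which is the whole difficulty:
the file does not say which interval its author had in mind.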

>> - '1A'x (SUB, aka Ctrl-Z) either means end of text, or a
>> right-pointing arrow; when it is used as an end-of-text marker,
>> the remainder of the storage block may contain arbitrary characters
>> with some programs and must contain '00'x with other programs (nice
>> feature when one of the former writes a file one of the latter is
>> supposed to read).
>>
>That's not a plain-text issue, it's a character encoding and file format
>issue. Ctrl-Z as an EOF indicator is a relic of CP/M, carried forward into
>DOS for compatibility, used by some apps and ignored by others.

I think this must be the problem right here. I haven't the foggiest idea
what you mean by distinguishing plain text from character encoding and file
formats. The entire argument has been about file formats on different
platforms, and character encoding, specifically the (many-to-many) relation
between the crude formatting operations possible on a Teletype and the
control character sequences needed to produce them.
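
The Ctrl-Z case shows how quickly the file-format question intrudes. A reader
that honors the CP/M convention has to do something like the following
sketch, which stops at the first '1A'x and discards the rest of the block
whatever it contains. The function name is an assumption for illustration; a
reader that treats '1A'x as a printable arrow would skip this step entirely.

```python
CTRL_Z = b"\x1a"  # SUB, used as an end-of-text marker under CP/M and DOS

def strip_ctrl_z(data: bytes) -> bytes:
    # Keep everything before the first Ctrl-Z; if there is none,
    # the data is returned unchanged.
    idx = data.find(CTRL_Z)
    return data if idx == -1 else data[:idx]

padded_with_nuls = b"READ.ME text\r\n\x1a\x00\x00\x00"
padded_with_junk = b"READ.ME text\r\n\x1aleftover bytes"
same_text = strip_ctrl_z(padded_with_nuls) == strip_ctrl_z(padded_with_junk)
```

Both inputs yield the same text here, but only because this reader ignores
the padding. A program that insists on '00'x after the marker would reject
the second one.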

>Two years ago I suggested that we come up with a standard for Unicode plain
>text that can be used as a baseline when converting files from DOS, UNIX, the
>Macintosh, etc, to Unicode, and that says what control characters (C0, C1,
>as well as Line Separator, Paragraph Separator, etc) mean in a plain-text
>file or data stream. We made some good progress but eventually the discussion
>fizzled out. If I can summarize it briefly:
>
> . Yes, but plain text in this sense is inadequate for representing
> (list of writing systems that need higher-level formatting assistance,
> rendering engines, etc.)
>
> . Fine, but they need that anyway. For many other languages, plain text
> is possible, and there should be no reason not to settle on a standard
> representation for it in those cases where it can be used.
>
>If anybody would like to revisit that discussion, I've uploaded it to:
>
> ftp://kermit.columbia.edu/kermit/e/plain.txt
>
>(about 300K of plain text :-)
>
>- Frank

Can you summarize the discussion in the format of a draft standard so that
we can discuss it as a real proposal? Now that adequate rendering engines,
IMEs, and character encoding translators are starting to appear as
operating system components we can discuss a text format that includes
Arabic, Indic scripts, Korean hangul, math, IPA and so on as Unicode plain
text, with some sort of fallback (LTR monospaced?) for output where proper
rendering isn't available.

I would also like to see hex/Unicode editors on the model of the old
hex/ASCII sector editors, but brought up to date so that they can display
text in any standard character encoding, and a variety of numeric formats
(at least binary, octal, hex, 16-, 32-, and 64-bit signed and unsigned
integers, and the IEEE floating point formats). HEdit (freeware for
Windows) would give you some idea of what I mean, if you imagine Unicode
support added. It shows data at the cursor in several numeric formats and
ASCII, with search and replace of different size data, and either hex or
character input. I use HEdit constantly to determine what formats unknown
files are in.
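
As a rough sketch of the "data at the cursor in several numeric formats"
display, the following interprets the bytes at an offset as 16-, 32-, and
64-bit signed and unsigned integers and as IEEE single and double floats.
This is not how HEdit is implemented (I have no idea how it is); the function
name, the dictionary layout, and the little-endian default are all
assumptions for illustration.

```python
import struct

# struct format codes for the numeric views; sizes are fixed because
# an explicit byte-order prefix ("<" or ">") selects standard sizes.
NUMERIC_VIEWS = {
    "int16": "h", "uint16": "H",
    "int32": "i", "uint32": "I",
    "int64": "q", "uint64": "Q",
    "float32": "f", "float64": "d",
}

def interpret(data: bytes, offset: int = 0, byte_order: str = "<") -> dict:
    """Return every numeric view that fits in the bytes at offset."""
    views = {"hex": data[offset:offset + 8].hex(" ")}
    for name, code in NUMERIC_VIEWS.items():
        fmt = byte_order + code
        chunk = data[offset:offset + struct.calcsize(fmt)]
        if len(chunk) == struct.calcsize(fmt):  # skip views that run past the end
            views[name] = struct.unpack(fmt, chunk)[0]
    return views

two_bytes = interpret(b"\xff\xff")            # only the 16-bit views fit
one_point_zero = interpret(struct.pack("<d", 1.0))
```

A Unicode-aware version would add views that decode the same bytes under each
candidate character encoding, which is the part I am asking for.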

--
Edward Cherlin                        President
Coalition Against Unsolicited Commercial E-mail
Help outlaw Spam.       <http://www.cauce.org/>
Talk to us at             <news:comp.org.cauce>


