Re: Plain Text

From: Edward Cherlin (edward.cherlin.sy.67@aya.yale.edu)
Date: Fri Jul 02 1999 - 23:18:47 EDT


At 08:58 -0700 7/2/1999, Frank da Cruz wrote:
[failing to mention that Ed Cherlin wrote:]
>> The problems we have with ASCII plain text come mainly from a small set of
>> common variant practices.
>>
>> Using CR, LF, or CR/LF as a line or paragraph end
>> Different tab spacings
>> Optional line wrap
>> Formfeed codes vs. computed page breaks
>> BS = DEL or BS-overstrike
>>
>We all have dealt with these annoyances throughout our careers. They are
>indeed annoying, but not impassable impediments. Also, let's not mix up:
>
> . File storage format
> . Interchange format
> . Data entry format
  . Rendering options

On looking through the remainder of this message, I conclude that I
disagree with Frank's attempts to make his own limited experience
normative, but I heartily agree that his proposal for a bottom-level plain
text Unicode format is on the right track, and that it allows us to deal
with some of the issues listed above as file format issues, specifically
line and paragraph ends and other control codes. Tab stops, wrapping, and
page breaking must be left to the user's choice when rendering, since they
are not file format issues.

>> Using CR, LF, or CR/LF as a line or paragraph end
>>
>As a line end:
> This is a file storage issue.
>
>As a paragraph end:
> There is no such thing as a paragraph end or paragraph separator in
> traditional plain text.
>
>Here I am sitting at my VT100 terminal, which is plugged in to my UNIX
>computer.

Here *I* am, sitting at my Mac, and recalling what I have been doing on an
NT system and Silicon Graphics Indy and O2 computers running Irix for the
last year and a half, when I was shuttling files back and forth between
them. (The Indy is used as an embedded controller in a 750 kg laser
microscope system for semiconductor wafer inspection, and the O2 to run the
microscope software without the hardware for demos and simulations, none of
which matters to this discussion.)

>I type:
>
> This is a line
>
>Then I push the Return key (sometimes marked Enter), which sends a Carriage
>Return.

Whereas my VT100 simulator used to get its CR from the keyboard buffer,
where it was deposited after the keyboard driver translated from the
keyboard scan codes. Anyway, input technology is not at issue here.

>I would enter a line in exactly the same way no matter what
>computer was on the far end of the wire. Now:
>
> . The UNIX terminal driver turns the CR into a LF before giving it
> to the application. If the application is storing the line into a
> file, the file gets "This is a line<LF>". Ditto for some other
> operating systems, like AOS/VS.
>
> . If I had OS-9 on the far end, it would store "This is a line<CR>".
                 ^or Mac OS

> . If I had TOPS-10, TOPS-20, RT-11, etc, on the far end, it would
> store "This is a line<CR><LF>".
>
> . If I had VMS, VOS, VM/CMS, MVS/TSO or other complex file system on
> the far end, who knows how the line would be stored -- it depends on
> the chosen file organization and record format.
>
>The point is, it doesn't matter. Each platform has its own format for
>internal use, but a standardized interface to the outside world. To further
>demonstrate this fact, if I then tell the computer on the far end to "type"
>or "cat" the file, it will, invariably, send:
>
> This is a line<CR><LF>

Your cultural ignorance/sheltered life-experience is showing. *You* may
live in an environment where these changes are made automatically, but a
lot of us don't.

>So who cares what the file format is -- except of course when we want to
>transfer the file to another platform.

And since I don't use a VT100 simulator anymore, I only encounter this
issue when transferring files to another platform, and as a result I care
all the time.

>In that case, it is the
>responsibility of each file-transfer agent

When reading floppy disks?

>to convert between its peculiar
>local format and the common one. And that is exactly what they do, just
>as is done at the terminal/terminal-driver/data-entry level. FTP and Kermit
>are two examples that show it is not that hard to convert plain-text file
>record formats from one platform to another. (And in Kermit's case, the
>character set too.)
>
>Of course life would have been simpler if there had been only ONE standard
>text-file format used on all platforms. But the early days of computing
>was a time of "Let the Hundred Flowers Bloom", and they did. Now, however,
>we are in a position to start over, and it is an opportunity we are not
>likely to have again.

Yes, yes, everything *could* have been made to work, except for the parts
that couldn't, you see, because management wouldn't allow the extra time
and space required to make things portable, or worse still, was trying to
lock customers into proprietary data formats.
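
The conversion between a local line-end convention and a common one,
as discussed above, can be sketched in a few lines of Python. This is
only an illustration: the function name and the sample strings are
made up here, and it handles only CR, LF, and CR/LF, not the
record-oriented formats of VMS, MVS, and the like.

```python
def normalize_line_ends(text: str, newline: str = "\n") -> str:
    """Convert CR, LF, and CR/LF line ends to a single convention."""
    # Handle CR/LF first so that it is not counted as two line ends.
    canonical = text.replace("\r\n", "\n").replace("\r", "\n")
    return canonical if newline == "\n" else canonical.replace("\n", newline)


unix_file = "This is a line\n"      # LF, as stored on UNIX
mac_file = "This is a line\r"       # CR, as stored on OS-9 or Mac OS
tops20_file = "This is a line\r\n"  # CR/LF, as stored on TOPS-20

# Whatever the storage format, the interchange form comes out the same.
for stored in (unix_file, mac_file, tops20_file):
    print(repr(normalize_line_ends(stored, "\r\n")))
```

The hard part is not the conversion but knowing when to apply it,
which is exactly what is missing when a file arrives on a floppy disk
and not through a transfer agent.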

>> Different tab spacings
>>
>I used to say this too, but the last platform I know about that did not
>assume tabstops at 1,9,17,25,... was MULTICS. Of course tabs are variable
>in word processors, etc, but that is not plain text.

Your limited experience again. I have rarely used an editor with fixed tab
stops since about 1982 (EDLIN, IIRC). I once knew the escape sequences for
IBM, Diablo, and Qume *printing* terminal tab settings by heart.

>> Optional line wrap
>>
>This is a feature of the terminal or the application, not of "plain text".

This is a feature found in ASCII *files* which were written either with or
without explicit line breaks, requiring a choice for appropriate
rendering--a choice which the editor should be able to make, but which the
user should actually make.

>Files that do not contain line breaks and must rely on some form of
>postprocessing to insert line breaks at appropriate points are not really
>plain text; they are "input for a text formatter".

But the text editor is frequently the chosen text reformatter. You are
still claiming that text files as they occur in your computer subculture
are for some reason normative for the rest of us.

>Prior to the advent of
>word processors, the idea of "long line as paragraph" never came up.

Word processing began in the 1960s. I gather you had a later date in mind.
Did you mean specifically WYSIWYG word processors, invented at Xerox in the
late 1970s?

>> Formfeed codes vs. computed page breaks
>>
>Page breaks are an issue worth discussing, and we discussed them at some
>length two years ago. Basically, you can let your "rendering engine" or
>printer driver insert them for you, or you can insert them yourself. One
>should be allowed the choice. (Why would anybody want "hard" page breaks?
>Because they are printing paychecks, invoices, envelopes, etc.)

If we can establish that general principle and apply it to the previous
cases, the problem will be solved in short order. The application
determines the requirements for tab stops, page breaks, and paragraph or
line formatting.

>> BS = DEL or BS-overstrike
>>
>This is a data entry issue, unless you mean including BS in a file for
>overstriking. But in that case, there is never any confusion between BS and
>DEL, since DEL is never used for that purpose. In other words, the only
>confusion is at data entry, and this is entirely irrelevant to the
>definition of plain text.
>
>> >Lines are terminated at somewhere between 72 and 80 characters by
>> >convention, because that's how wide terminal screens are, and before them
>> >the Teletype carriage, and before that the most common kind of punchcard.
>> >Or for that matter, typewriters and sheets of paper (A4 or US, take your
>> >pick :-)
>> >
>> >To this day, we follow these conventions in newsgroups and email, although
>> >now it might be more a matter of "netiquette" than necessity (as in the
>> >BITNET days, when e-mail was, quite literally, 80-column card images).
>>
>> As long as e-mail readers cannot correctly reformat messages with bad
>> line breaks ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> (like this), it will be a matter of real necessity.
>>
>What does "correctly reformat messages" mean? How can your mail client read
>my mind? How does it know that the message I sent you was not already
>formatted exactly the way I wanted it?

I mean that it should have the ability to reformat such badly broken text,
to use when I decide. Right now I have to reformat such text by hand, or
leave it severely broken. Well, maybe I should learn Perl, but I prefer
that someone else learn Perl and write the routines I and many others need.
If any reader is interested, the spec is as follows.

1) Reflow paragraphs, removing extra white space, while preserving quoting
marks '>' in the left margin. Don't get confused by angle brackets in the
text.

2) Realign tables with "tab damage". Tables that are too wide should be
broken into pages, rather than having lines folded.

If you can manage those two, you're good, and I have some more little jobs
for you. E-mail users will be eternally grateful (for a week or two,
anyway, on Net time).
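
A rough Python sketch of item 1 follows, for anyone who wants a
starting point. It assumes that quote marks are a run of '>' at the
left margin, that a blank line or a change in quote depth ends a
paragraph, and that a 72-column right margin is wanted; the names
split_quote and reflow are invented for this example. It makes no
attempt at item 2, and it will happily reflow lists, tables, and
underlining that should have been left alone, so it has to stay a
command the reader invokes, not something done automatically.

```python
import re
import textwrap

# A quote prefix is a run of '>' marks (optionally followed by a space)
# at the left margin only; angle brackets later in the line are text.
QUOTE_PREFIX = re.compile(r"^((?:>\s?)*)(.*)$")


def split_quote(line):
    """Return (quote depth, text with prefix and extra white space removed)."""
    prefix, body = QUOTE_PREFIX.match(line).groups()
    return prefix.count(">"), " ".join(body.split())


def reflow(message, width=72):
    """Reflow each paragraph, keeping its quote marks in the left margin."""
    out, depth, words = [], None, []

    def flush():
        # Emit the paragraph collected so far at the current quote depth.
        if words:
            prefix = ">" * depth + (" " if depth else "")
            out.extend(textwrap.wrap(" ".join(words), width=width,
                                     initial_indent=prefix,
                                     subsequent_indent=prefix,
                                     break_long_words=False,
                                     break_on_hyphens=False))
            words.clear()

    for line in message.splitlines():
        line_depth, text = split_quote(line)
        if not text or line_depth != depth:
            flush()                       # paragraph or quote level changed
        if text:
            words.append(text)
        else:
            out.append(">" * line_depth)  # keep blank separator lines
        depth = line_depth
    flush()
    return "\n".join(out)


broken = """>> As long as e-mail readers cannot correctly reformat messages with bad
>> line
>>    breaks
>> (like this), it will be a matter of real necessity."""
print(reflow(broken))
```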

>Notice that to illustrate my point, I need your original formatting (above)
>preserved, with the "> " quote indicators added at the left margin, and with
>my emphasis added under the appropriate words. What is a "correct" mail
>client supposed to do with this? Something like this?:
>
> > As long as e-mail readers cannot correctly
> reformat messages with bad > line breaks
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^ > (like this),
> it will be a matter of real necessity.
>
>No, a correct email client will leave it alone. Whether I want my email
>reformatted by your client should be my choice, since only I know what my
>intentions are in sending it. ^^^^^^^^^

However, it actually is the recipient's choice, and you can't stop us. The
"correct" reformatting I had in mind would look like this.

>> As long as e-mail readers cannot correctly reformat messages with bad
>> line breaks (like this), it will be a matter of real necessity.

or possibly

>>As long as e-mail readers cannot correctly reformat messages with bad
>>line breaks (like this), it will be a matter of real necessity.

(**my choice**)

>Granted, plain text requires some minimal level of agreement, for example
>that your screen is 72 (or 76, or 79) columns wide. I maintain that this
>convention is universal, except for Kanji, etc, which are displayed in two
>character cells each. People who use email, netnews, and other forms of
>open, interplatform communication have learned these conventions. We use
>them ourselves on this mailing list. Those of us who do not are often
>excoriated for our antisocial behavior.

Universal, of course, except where it isn't, you know. No matter where we
set the right margin, text quoted from e-mails will break against it if it
can't be reflowed.

>Especially when we send email or netnews in some application-specific
>format, assuming that everybody else uses the same platform and applications
>we do.
>
>> >These simple conventions let us format our text exactly the way we want
>> >to. We can indent or not, we can put line breaks where we want them, we
>> >can have columns of numbers or other tabular presentations, mathematical
>> >expressions,
>>
>> which actually require several hundred non-ASCII characters, unless you
>> mean, as so many do, arithmetic expressions.
>>
>Yes, that's what I meant, thanks. (All of us here recognize the
>shortcomings of ASCII -- that's why we're here! But let's not forget that
>ASCII can be used to write, say, Fortran programs that can handle far more
>in the way of mathematics than the repertoire of ASCII might suggest, and
>that people send Fortran-like expressions back and forth in email, etc,
>which could easily lose their meaning when reformatted.)

How do you express a vector inner product in FORTRAN? In TeX it's something
like $\sum_{i=0}^n a_i \times b_i$, and in APL it's nearly "A+.xB", but
with a real times symbol.

>> When I want my text to stay as I wrote it, I put it into a PDF, not a text
>> file. Others prefer TeX for this purpose, or PostScript.
>>
>My point exactly.

No, your point was that ASCII text files stay formatted the way you write
them. That would be true, I suppose, if we agreed with you that we could
outlaw differences in tab stops, line breaking, and other options on
different platforms, because your subworld is normative and there aren't
any variant practices worthy of consideration.

>And how do I read your PDF if I don't have a PDF reader?
>(Don't say "get one" -- I'm reading your mail on a DOS PC or a PDP-11, or a
>Cray supercomputer.)

Yes, we had the same problem with SGI Irix 5.2, which doesn't support a PDF
reader. But the field engineers have Windows on their laptops, so it's only
a problem for the user manual, not the service manual, and only becomes
vitally important in paperless fabs.

>How do I read TeX if I don't have the software? How
>do I read PostScript if I don't have a PostScript printer or rendering
>engine. But the crucial point is:
>
> How will I read your PDF file 200 years from now, when
> PDF itself has been consigned to the "legacy" trashheap
> for the past 195 years?

along with ASCII, 8859, and 2022, and all of our removable storage media.
Do you know someone with a functioning Teletype paper tape reader who can
read legacy ASCII files from 1970? What would you suggest I archive my
life's work on for the ages to come (if anyone cares)?

>> We raised the question of defining a Unicode plain text format about two
>> years ago, but nothing seemed to come of it.

>Then let's try again. Let me get the ball rolling with the following simple
>suggestion for Unicode Plain-Text File and Interchange Format:
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following discusses a file format and a number of rendering options,
but fails to address interchange. UTF-8 is usually recommended for
interchange, since it avoids the Endianness question, but transfer of files
in other encodings will occur, and must be provided for.

The file format must define permitted character codes and code sequences. I
suggest that we permit any character code that can represent a character,
even if no character is defined for that code, but that we not permit
unmatched surrogate characters or codes which are defined not to have the
possibility of representing a character. Error behavior for the rendering
process when there are illegal codes or code sequences can be undefined, or
we could specify error messages and continuation policies.
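
As a sketch of what such a check might look like for 16-bit code
units, here is a short Python function. The function name is invented,
and the choice of U+FFFE and U+FFFF as the codes "defined not to have
the possibility of representing a character" is an assumption made for
this example, not a complete list. Unassigned codes pass, as suggested
above.

```python
def find_illegal_codes(units):
    """Return (index, reason) pairs for illegal 16-bit code units."""
    errors = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:                    # high surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2                               # well-formed pair
                continue
            errors.append((i, "unmatched high surrogate"))
        elif 0xDC00 <= u <= 0xDFFF:                  # low surrogate alone
            errors.append((i, "unmatched low surrogate"))
        elif u in (0xFFFE, 0xFFFF):                  # assumed non-characters
            errors.append((i, "not a character"))
        i += 1
    return errors


# 'A', a matched surrogate pair, then a stray low surrogate.
print(find_illegal_codes([0x0041, 0xD800, 0xDC00, 0xDC00]))
```

Returning a list of positions, and not stopping at the first error,
leaves the continuation policy to the caller.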

The display rendering process does not change the file, so any display
options such as word wrap, tab stops, character width, ligatures, combining
characters, and so on are orthogonal to the file format. The user can
change the text and save in the new form, but the software isn't allowed to
on its own.

Rendering behavior of control codes and other non-printing characters must
be defined.

>A monospaced character-cell display device is assumed for the purposes of
>line breaking. Characters that are too wide for a character cell (such as
>Kanjis) occupy a double-width cell.

Users may choose to display all characters in cells of the same width, or
to mix single- and double-cell display. Note that this is not the same as
half-width and full-width CJK characters, which have been defined as
separate characters.
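
Counting columns with single-width characters as the unit might be
sketched like this in Python. It assumes that the East Asian Width
property from the Unicode character database ("W" and "F" meaning
wide) is an acceptable test for "too wide for a character cell", and
that combining characters take no cell of their own; both are
assumptions of this example, and cell_width is an invented name.

```python
import unicodedata


def cell_width(text):
    """Count display cells, treating wide characters as two cells."""
    cells = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue                     # combining marks share a cell
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        cells += 2 if wide else 1
    return cells


for sample in ("plain", "漢字", "e\u0301"):
    print(repr(sample), cell_width(sample))
```

A user who prefers to show every character in a cell of the same
width would simply count one cell per non-combining character.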

>Of course, Unicode Plain Text can also
>be displayed on any other kind of device, in any font, monospaced or not, in
>which case "all bets are off", just as they are now with traditional plain
>text when displayed in a proportional font.

Specifically, we will permit rendering in ATSUI on the Mac, in Java, on
NT2K, in Plan 9, and on other platforms, all with whatever level of Unicode
rendering and fonts happen to be available, and we will specify what should
happen for missing characters, lack of BIDI capability, lack of ligatures,
etc.

>Conversely, it is recognized that a monospaced (or duospaced) character-cell
>device might be inadequate for display of certain writing systems, such as
>Arabic or Indic scripts, and in this case intelligent rendering engines
>might very well be required.

For some purposes a monospaced LTR rendering of these characters may be
useful, and is permitted as a user option and as a fallback.

>This should, nevertheless, be possible with
>plain text, without the aid of any particular markup scheme.

But with the use of Unicode markup characters, such as explicit ordering
and joining characters.

>Plain text is composed only of Unicode characters,
                                       ^printing ^including surrogate
character pairs,

>with no meta-level
>of formatting information, presentation hints, etc, except:
>
> 1. Spaces, such as U+0020 and U+00A0, which are "kept" (e.g.
> adjacent spaces are not collapsed).

including spaces defined at code points U+2000-U+200B.

> 2. Horizontal Tabs are indicated by the HT character, U+0009. Tab
> stops shall be assumed every 8 columns, starting at the first. (This
> provision is primarily to facilitate conversion of ASCII and 8-bit
> text to Unicode. Alternatively, it would be OK to force all
> horizontal alignment to be accomplished by spaces.)

As on a typewriter, we have no control of the user's tab stop settings. I
recommend that we legislate alignment of monospaced text using spaces only,
and forget HT. That's what I have taught people to do for tabular e-mail
such as resumes.
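
Converting existing tabbed text to spaces is a one-liner in Python,
shown here as a sketch. It assumes the writer's tab stops really were
every 8 columns, which is the very thing we cannot know, and detab is
an invented name. Note that str.expandtabs counts characters, not
display cells, so it is only right for single-width text.

```python
def detab(line, tab_width=8):
    """Replace each HT with spaces up to the next tab stop."""
    return line.expandtabs(tab_width)


row = "Name\tQty"
print(repr(detab(row)))      # stops at columns 1, 9, 17, ...
print(repr(detab(row, 4)))   # same text, different assumed stops
```

The two results differ, which is the argument for aligning with
spaces in the first place.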

> 3. Line breaks are indicated by Line Separator, U+2028. Preformatted
> text must break lines at column 79 or less to avoid unwanted
> reformatting.

At present software is free to truncate long lines, wrap at the last
column, or word wrap. I would recommend that we forbid truncation and allow
the user to choose wrapping style.

>Column numbers are 1-based, relative to the left or
> right margin, according to the prevailing directionality, with
> single-width characters as the counting unit. A line break is
> required at the end of the final line if it is to be considered a
> line. (This is to allow append operations to work in the expected
> fashion.)
>
> 4. Paragraph breaks are indicated by two successive Line Separators

legacy, deprecated in new software

> or by Paragraph Separator, U+2029.
>
> 5. Hard page breaks are indicated by FF, U+000C.

  6. BIDI modifiers: U+200E, LEFT-TO-RIGHT MARK; U+200F, RIGHT-TO-LEFT MARK

  7. Joining modifiers: U+200C, ZERO-WIDTH NON-JOINER; U+200D ZERO-WIDTH JOINER

  8. Combining characters: numerous accents; vowels in Hebrew, Arabic,
Indic scripts, etc.

  9. U+FEFF ZERO-WIDTH NO-BREAK SPACE=BYTE ORDER MARK should be the first
character in a Unicode text file in 16-bit encoding (is that UTF-16? I
can't keep them all straight.) BOM is not required in UTF-8 encoding.
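
The 16-bit encoding form in question is UTF-16. A reader might use the
leading U+FEFF to determine byte order roughly as follows. This is a
Python sketch with an invented function name; it looks only at the
first bytes of the file, ignores 32-bit encodings, and treats the
absence of a mark as "unknown" without guessing.

```python
import codecs


def sniff_byte_order(data: bytes) -> str:
    """Guess the encoding of a text file from its leading U+FEFF, if any."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"       # mark present but not required in UTF-8
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    return "unknown"             # no mark; the caller must decide


big_endian = codecs.BOM_UTF16_BE + "Plain text".encode("utf-16-be")
print(sniff_byte_order(big_endian))
print(sniff_byte_order(b"Plain text"))
```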

Non-normative comment:
>C0 and C1 control characters other than HT and FF have no function
>whatsoever in Unicode Plain Text. (If there were Unicode Horizontal Tab and
>Page Break characters, we wouldn't need C0 at all; however, the UTC -- or at
>least members of it, in previous discussions -- indicated that there is no
>good reason to duplicate the C0 characters that are already in Unicode.)
End comment.

>A Unicode plain-text "rendering engine" shall not mess with the format of a
                                                   \\\\\\\\\change

>plain-text file except, optionally, at the user's discretion, to wrap lines
                \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\. It may

on the display
>that are longer than the display or printing device. Higher-level rendering
                                                    ^line length

>engines, of course, can do whatever they want.

And plain text can contain any markup for such engines using Unicode
characters that is defined for a specific use, such as HTML, TeX source
code, RTF, etc.
>- Frank
Ed

The following non-printing characters may occur in the file, but will be
treated as unavailable characters.
     U+206A INHIBIT SYMMETRIC SWAPPING
     U+206B ACTIVATE SYMMETRIC SWAPPING
     U+206C INHIBIT ARABIC SHAPING
     U+206D ACTIVATE ARABIC SHAPING
     U+206E NATIONAL DIGIT SHAPES
     U+206F NOMINAL DIGIT SHAPES
Unicode Standard 2.0 describes them as "Alternate format characters (usage
strongly discouraged)"

Behavior for unavailable characters should be defined. Options include a
single glyph for any unavailable character, glyphs indicating the code
block of unavailable characters, and numeric rendering.
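
The numeric rendering option could be as simple as the following
Python sketch. The "<U+XXXX>" notation and the is_available callback
are assumptions made for this example; a real renderer would ask its
fonts which characters it can draw.

```python
def render_unavailable(text, is_available):
    """Show characters the display cannot draw as their code numbers."""
    return "".join(ch if is_available(ch) else f"<U+{ord(ch):04X}>"
                   for ch in text)


# Pretend the display can only draw ASCII.
ascii_only = lambda ch: ord(ch) < 0x80
print(render_unavailable("digits\u206e123", ascii_only))
```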

Behavior for non-printing characters with no semantic significance in plain
text should be defined. Should they be treated as unavailable characters,
or as though they aren't there?

A growing number of standards specify the use of Unicode text files,
without explicitly defining them. If we get anywhere with this, we will
have to run our proposal past these other groups, including the IETF, the
POSIX committee, programming language standards committees, etc.

--
Edward Cherlin                        President
Coalition Against Unsolicited Commercial E-mail
Help outlaw Spam.       <http://www.cauce.org/>
Talk to us at             <news:comp.org.cauce>


