Re: Plain text: Amendment 1

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Jul 02 1999 - 14:32:11 EDT


The problem I am having with Frank's suggestions boils down
essentially to this:

The Unicode concept of plain text is of a text stream consisting
only of Unicode characters, interpreted according to the rules
of the standard, and not including (or not interpreting the
inclusion of) higher-level markup, however expressed. It does
not involve specification of particular font behavior (including
monospacing), details of terminal interaction, or line length.

It is that concept of Unicode plain text that we intend and
hope will be stable for the next century. Given the text stream
itself, basic textual content should be derivable, although not
necessarily any detailed layout information.

The intended invariant is textual content, rather than document
form including textual content.

To specify invariant document form, it is clear that a higher-level
protocol must be specified. And I see Frank's Unicode plain text
proposal as just the bare-bottom, minimal common denominator for
a document description standard. In that respect it is no different
from PDF, except in complexity and faithfulness to original appearance
of a document in all details.

Some of the difficulty of this discussion, of course, derives from
the fact that the Unicode Standard unavoidably had to contain some
bare minimum of format control characters. We have had to specify
format semantics for CR, LF, TAB, VT, FF because there was no way
we were going to get from the past to the future without people
converting existing documents using these (or carrying analogous
practice into new documents); and LS and PS were added to provide
a minimum, unambiguous set of format controls to organize plain
text. Bidi format controls were added because they had to be: otherwise,
you run into situations where intended content is inexpressible, or
existing content is uninterpretable in plain text.
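
To make the role of those format controls concrete, here is a minimal
sketch in Python of splitting a text stream at the line- and
paragraph-breaking controls named above. The table and the function
name split_lines are illustrative assumptions, not anything defined by
the standard. Treating every one of these characters as a simple break
(and CR LF as a single break) is a simplification, and TAB is
deliberately left in place.

```python
# Illustrative only: the break-type format controls mentioned in the text.
LINE_BREAKS = {
    "\r": "CR",
    "\n": "LF",
    "\v": "VT",
    "\f": "FF",
    "\u2028": "LS",  # LINE SEPARATOR
    "\u2029": "PS",  # PARAGRAPH SEPARATOR
}


def split_lines(text):
    """Split text at CR, LF, CR LF, VT, FF, LS and PS, dropping the breaks."""
    lines, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch in LINE_BREAKS:
            # Assumption: a CR immediately followed by LF counts as one break.
            if ch == "\r" and i + 1 < len(text) and text[i + 1] == "\n":
                i += 1
            lines.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if current:
        lines.append("".join(current))
    return lines
```

Nothing here says how wide a line is or what font it is drawn in; the
controls only organize the stream into pieces.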

And on the other hand, the situation is muddied by plain text markup
conventions where the markup is carried around in the plain text:

<TR>
<TD>9/23/98</TD>
<TD>38 widgets sold</TD>
<TD align=right>65,416</TD>
<TD align=center>---</TD>
<TD align=right>65,416</TD>
</TR>

Where the "plain text" is:

"<TR>NLF<TD>9/23/98</TD>NLF<TD>38 widgets sold</TD>NLF<TD ali
gn=right>65,416</TD>NLF<TD align=center>---</TD>NLF<TD align=
right>65,416</TD>NLF</TR>"

But the plain text of the content is 5 strings:

"9/23/98" "38 widgets sold" "65,416" "---" "65,416"

And the full document description is, of course, not just these
5 strings, but includes the fact that they constitute a row embedded in
a table, and are aligned in specified ways within the cells in that row.
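
The three levels in this example (the marked-up plain text, the
content strings, and the layout facts) can be separated mechanically.
The following is a rough sketch using Python's standard html.parser
module; the class name CellText and the helper extract_cells are
hypothetical names chosen for illustration, and the sketch handles only
the flat <TR>/<TD> fragment shown above, not HTML tables in general.

```python
from html.parser import HTMLParser

# The marked-up plain text from the example above.
MARKUP = """<TR>
<TD>9/23/98</TD>
<TD>38 widgets sold</TD>
<TD align=right>65,416</TD>
<TD align=center>---</TD>
<TD align=right>65,416</TD>
</TR>"""


class CellText(HTMLParser):
    """Collect (content string, align attribute) for each TD element."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False
        self._align = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "td":
            self._in_cell = True
            self._align = dict(attrs).get("align")
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self.cells.append(("".join(self._buf), self._align))
            self._in_cell = False

    def handle_data(self, data):
        # Text outside a cell (the line breaks between tags) is ignored.
        if self._in_cell:
            self._buf.append(data)


def extract_cells(markup):
    parser = CellText()
    parser.feed(markup)
    parser.close()
    return parser.cells
```

The first element of each pair is the content; the second is part of
the document description, which exists only because a higher-level
protocol (here, HTML) is being interpreted on top of the plain text.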

The Unicode vision is that the character encoding standard itself
should be as robust and useful in its larger domain as the 7-bit ASCII
standard was in its own constrained textual domain.

But given the enormous complexities that are inherent in trying to
deal with *all* of the writing systems of the world, it is inevitable
that plain text *layout* conventions involving Unicode are going to
be considerably more complex than plain text *layout* conventions
involving ASCII only. At the bare minimum, for example, plain text
in Unicode *must* take bidirectional layout into account--otherwise,
you would be saying that you could express Unicode content in plain
text, as long as you avoided Hebrew, Arabic, and Syriac characters.
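
As a small illustration of why this cannot be avoided, the sketch
below uses Python's standard unicodedata module to check whether a
string contains any strongly right-to-left characters. The function
name needs_bidi is a hypothetical helper, and the check is only a
detector: it does not implement the bidirectional algorithm, and it
ignores the explicit bidi format controls and number classes.

```python
import unicodedata

# Bidirectional classes of strongly right-to-left characters:
# "R" covers Hebrew letters, "AL" covers Arabic and Syriac letters.
RTL_CLASSES = {"R", "AL"}


def needs_bidi(text):
    """Return True if any character in text is strongly right-to-left."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)
```

Any plain text for which needs_bidi returns True cannot be displayed
correctly by a renderer that only knows ASCII-style left-to-right
layout.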

In some respects, the entire content of the Unicode Standard beyond
just the code charts and names lists is an elaborate attempt to
describe what it means to deal with plain text layout and interpretation
for all of the Unicode characters. It cannot be encapsulated in
the kind of constraints that Frank has suggested, in my opinion.

--Ken


