terminology: plaintext (was Re: unicode Digest V5 #149)

From: Gregg Reynolds (unicode@arabink.com)
Date: Fri Jun 24 2005 - 15:18:38 CDT

Next message: Asmus Freytag: "Re: terminology: plaintext (was Re: unicode Digest V5 #149)"

Previous message: Tim Greenwood: "Re: Tamil Collation vs Transliteration/Transcription Enc"
In reply to: James Kass: "Re: unicode Digest V5 #149"
Next in thread: Asmus Freytag: "Re: terminology: plaintext (was Re: unicode Digest V5 #149)"
Reply: Asmus Freytag: "Re: terminology: plaintext (was Re: unicode Digest V5 #149)"
Maybe reply: Sinnathurai Srivas: "Re: terminology: plaintext (was Re: unicode Digest V5 #149)"
Maybe reply: James Kass: "Re: terminology: plaintext (was Re: unicode Digest V5 #149)"

James Kass wrote:
> Gregg Reynolds wrote,
>
>
>>The unicode definition of "plain text" works for me; it's more or less
>>mathematical and allows us to avoid metaphysics. But you surely see
>>that the definition of "rich text" is hopelessly broken and inconsistent
>>with that of plain text, no?
>
>
> Surely I can see that the definition of rich text is inconsistent
> with that of plain text. After all, if they weren't inconsistent,
> they'd be the same thing and the glossary entry for "rich text"
> could be changed to: 'see "plain text"'.

Consistent does not mean identical.
>
> But, what's hopelessly broken about it?
>

Hi James,

Sorry about getting back to you late.

I hope the following (longish) message will make clear I don't bring
this stuff up just to be curmudgeonly.

From the glossary:

"Plain Text. Computer-encoded text that consists only of a sequence of
code points from a given standard, with no other formatting or
structural information."

Not bad; but not good enough. It should say "a sequence of codepoints
*each of which has single-character semantics*...". I.e. a standard
which defines a codepoint for "red" or "skip 24 points" or "poodle"
cannot be used for plaintext.

"Rich Text. Also known as styled text. The result of adding information
to plain text. Examples of information that can be added include font
data, color, formatting information, phonetic annotations, interlinear
text, and so on. The Unicode Standard does not address the
representation of rich text. It is expected that systems and
applications will implement proprietary forms of rich text. Some public
forms of rich text are available (for example, ODA, HTML, and SGML).
When everything except primary content is removed from rich text, only
plain text should remain."

Most obvious problem: SGML is plain text, as are XML, a subset of PDF,
etc. HTML is also plaintext; it happens to have some formatting
semantics at the lexical level, but considered as a "sequence of
codepoints" it clearly meets the Unicode definition of plain text. For
that matter, isn't RTF plaintext with formatting semantics? I'm not
that familiar with it, but doesn't it use a plain text character repertoire?
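
To make the "sequence of codepoints" point concrete, here is a minimal
Python sketch. The helper name code_points and the sample string are
illustrative assumptions, not anything taken from the glossary. It
lists an HTML fragment as code points, and every one of them is an
ordinary character:

```python
import unicodedata

def code_points(text):
    """Return the string as a list of (U+XXXX, character name) pairs."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
            for ch in text]

# A small HTML fragment; the markup is made of ordinary characters.
html = "<b>hi</b>"
points = code_points(html)

for cp, name in points:
    print(cp, name)
```

The angle brackets and the solidus show up as plain characters like
any others; the bold semantics live in the HTML grammar, not in the
code points.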

The basic problem: by these definitions, plain text and rich text are in
semantically different categories. One is a sequence of code points;
the other is - what? Figure on ground? Ink on paper? Any result of
presenting plain text visually?

What can it mean to "add information" to plain text, given that plain
text is by definition a sequence of codepoints? If you add
"information" consisting of codepoints with character semantics, then
you still have plain text. If you add "information" consisting of
codepoints with non-character semantics, well then you no longer have
text of any kind. You have non-text. If you add "information" by
writing a syntax-coloring editor, you haven't added anything to the
plain text, you've added a completely separate semantic layer.

The fact that a plain text string may conform to a higher-level grammar
(like XML), even if that grammar also has an associated non-text
semantics (like HTML), doesn't change the fact that the string is plain
text.

So the important distinction is not between plain text and rich text,
but between plain text and non-text on the one hand, and text versus
representation on the other. Or at a higher level, between that family
of grammars that use plaintext at the lowest syntactic level, and those
that use non-text at the lowest level. The former includes SGML, HTML,
XML, RTF, SVG, etc. etc. The latter includes the MSWord doc format,
xls, image formats, various proprietary typesetting languages, etc. The
Unicode glossary would be improved if, instead of "The Unicode Standard
does not address the representation of rich text" it said something like
"Unicode does not impose any syntactic or semantic constraints on
higher-level grammars that use Unicode at the character text level."

This is important in the context of training. I occasionally have to
try to explain XML in 30 seconds or less to non-techy business types.
One of the crucial points (IMO) is that XML is plain text, which means
the kind of file corruption problems we often have with Word docs go
away, since we can use any one of thousands of plaintext editors to
examine and fix the docs. The contrast with .doc files is not plain v.
rich, but plain v. non-text, and therefore tool-agnostic v. vendor
dependent. The fact that the non-text elements of the .doc format may
represent formatting information is irrelevant; you can't edit them no
matter what they mean without a specialized editor.
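
A rough sketch of the plain v. non-text contrast, in Python. The
function name looks_like_plain_text is hypothetical, and the test
(decodes as UTF-8, no control characters apart from tab and line
breaks) is only a heuristic I am assuming for illustration; it is not
a Unicode definition. The non-text sample is the 8-byte signature that
opens OLE2 compound files such as legacy .doc files:

```python
import unicodedata

def looks_like_plain_text(data: bytes) -> bool:
    """Heuristic: True if data decodes as UTF-8 and contains no
    control characters other than tab, line feed, carriage return."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    allowed_controls = {"\t", "\n", "\r"}
    return all(ch in allowed_controls or unicodedata.category(ch) != "Cc"
               for ch in text)

# An XML document is just encoded characters.
xml_bytes = '<?xml version="1.0"?>\n<doc>caf\u00e9</doc>\n'.encode("utf-8")

# First 8 bytes of an OLE2 compound file (the container used by .doc).
ole_header = bytes.fromhex("D0CF11E0A1B11AE1")

xml_is_text = looks_like_plain_text(xml_bytes)
doc_is_text = looks_like_plain_text(ole_header)
print(xml_is_text, doc_is_text)
```

Any editor that can handle the encoding can open the first; the second
needs a tool that knows the container format.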

Complementary to this is the distinction between the thing and its
representation, which is where XSL stylesheets come in. XSL stylesheets
don't turn plain text into rich text; they may
generate (possibly "fancy", colorful) representations of a plain text
information asset. Such representations may themselves use a plaintext
(HTML) or a non-text (PDF) language. But the information asset remains
in plaintext. When I show somebody a hardcopy of a colorful fancied-up
PDF document generated from an XML document, I say, not "this is rich
text", but "this is a plain text document formatted with a stylesheet;
we can change it however we want without disturbing the plaintext". It
seems to me that using the terminology as you and some others recommend
would make this impossible. I just don't see how this idea of "rich
text" is really very useful.
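
A small sketch of the thing v. representation split. This is not XSL;
it is a plain Python stand-in using the standard library, and the memo
document, the render_html function and its color parameter are all
made up for illustration. The same source yields two different-looking
HTML representations, and the source itself is never touched:

```python
import xml.etree.ElementTree as ET
from html import escape

# The information asset: a plain text XML document (hypothetical).
SOURCE = "<memo><title>Budget</title><body>Q3 numbers attached.</body></memo>"

def render_html(xml_text, color):
    """Produce one HTML representation of the memo; the 'stylesheet'
    here is just the color argument."""
    root = ET.fromstring(xml_text)
    title = escape(root.findtext("title", default=""))
    body = escape(root.findtext("body", default=""))
    return f'<h1 style="color:{color}">{title}</h1><p>{body}</p>'

red = render_html(SOURCE, "red")
blue = render_html(SOURCE, "blue")

print(red)
print(blue)
```

Changing the presentation means changing the rendering step, not the
document.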

-gregg


This archive was generated by hypermail 2.1.5 : Fri Jun 24 2005 - 15:20:19 CDT