terminology: plaintext (was Re: unicode Digest V5 #149)

From: Gregg Reynolds (unicode@arabink.com)
Date: Fri Jun 24 2005 - 15:18:38 CDT

    James Kass wrote:
    > Gregg Reynolds wrote,
    >>The unicode definition of "plain text" works for me; it's more or less
    >>mathematical and allows us to avoid metaphysics. But you surely see
    >>that the definition of "rich text" is hopelessly broken and inconsistent
    >>with that of plain text, no?
    > Surely I can see that the definition of rich text is inconsistent
    > with that of plain text. After all, if they weren't inconsistent,
    > they'd be the same thing and the glossary entry for "rich text"
    > could be changed to: 'see "plain text"'.

    Consistent does not mean identical.
    > But, what's hopelessly broken about it?

    Hi James,

    Sorry about getting back to you late.

    I hope the following (longish) message will make clear I don't bring
    this stuff up just to be curmudgeonly.

    From the glossary:

    "Plain Text. Computer-encoded text that consists only of a sequence of
    code points from a given standard, with no other formatting or
    structural information."

    Not bad; but not good enough. It should say "a sequence of codepoints
    *each of which has single-character semantics*...". I.e. a standard
    which defines a codepoint for "red" or "skip 24 points" or "poodle"
    cannot be used for plaintext.

    "Rich Text. Also known as styled text. The result of adding information
    to plain text. Examples of information that can be added include font
    data, color, formatting information, phonetic annotations, interlinear
    text, and so on. The Unicode Standard does not address the
    representation of rich text. It is expected that systems and
    applications will implement proprietary forms of rich text. Some public
    forms of rich text are available (for example, ODA, HTML, and SGML).
    When everything except primary content is removed from rich text, only
    plain text should remain."

    Most obvious problem: SGML is plain text, as is XML, a subset of PDF,
    etc. HTML is also plaintext; it happens to have some formatting
    semantics at the lexical level, but considered as a "sequence of
    codepoints" it clearly meets the Unicode definition of plain text. For
    that matter, isn't RTF plaintext with formatting semantics? I'm not
    that familiar with it, but doesn't it use a plain text character repertoire?
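
    A minimal sketch of the point above, in Python: an HTML fragment,
    viewed only as a sequence of code points, is made up entirely of
    ordinary characters. The sample string and the helper name
    code_points are made up for illustration; unicodedata is the Python
    standard library module.

```python
import unicodedata

# Hypothetical sample: markup with formatting semantics at the lexical level.
html = '<p style="color: red">poodle</p>'

def code_points(text):
    """Return (U+XXXX, character name) for every code point in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

pairs = code_points(html)
for cp, name in pairs:
    print(cp, name)
```

    Nothing in the listing says "red" as a single code point; the
    formatting lives in the grammar layered on top of the characters.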

    The basic problem: by these definitions, plain text and rich text are in
    semantically different categories. One is a sequence of code points;
    the other is - what? Figure on ground? Ink on paper? Any result of
    presenting plain text visually?

    What can it mean to "add information" to plain text, given that plain
    text is by definition a sequence of codepoints? If you add
    "information" consisting of codepoints with character semantics, then
    you still have plain text. If you add "information" consisting of
    codepoints with non-character semantics, well then you no longer have
    text of any kind. You have non-text. If you add "information" by
    writing a syntax-coloring editor, you haven't added anything to the
    plain text, you've added a completely separate semantic layer.

    The fact that a plain text string may conform to a higher-level grammar
    (like XML), even if that grammar also has an associated non-text
    semantics (like HTML), doesn't change the fact that the string is plain
    text.

    So the important distinction is not between plain text and rich text,
    but between plain text and non-text on the one hand, and text versus
    representation on the other. Or at a higher level, between that family
    of grammars that use plaintext at the lowest syntactic level, and those
    that use non-text at the lowest level. The former includes SGML, HTML,
    XML, RTF, SVG, etc. etc. The latter includes the MSWord doc format,
    xls, image formats, various proprietary typesetting languages, etc. The
    Unicode glossary would be improved if, instead of "The Unicode Standard
    does not address the representation of rich text" it said something like
    "Unicode does not impose any syntactic or semantic constraints on
    higher-level grammars that use Unicode at the character text level."

    This is important in the context of training. I occasionally have to
    try to explain XML in 30 seconds or less to non-techy business types.
    One of the crucial points (IMO) is that XML is plain text, which means
    the kind of file corruption problems we often have with Word docs go
    away, since we can use any one of thousands of plaintext editors to
    examine and fix the docs. The contrast with .doc files is not plain v.
    rich, but plain v. non-text, and therefore tool-agnostic v. vendor
    dependent. The fact that the non-text elements of the .doc format may
    represent formatting information is irrelevant; you can't edit them no
    matter what they mean without a specialized editor.
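
    The plain v. non-text contrast can be sketched roughly as follows.
    This is only a heuristic, not a definition: the function name
    looks_like_plain_text and the sample XML are assumptions for
    illustration, and the eight bytes in doc_header are the signature
    that opens legacy compound-file documents such as .doc.

```python
import unicodedata

def looks_like_plain_text(data: bytes) -> bool:
    """Heuristic: decodes as UTF-8 and has no control characters
    other than tab, line feed, and carriage return."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    allowed_controls = {"\t", "\n", "\r"}
    return all(
        ch in allowed_controls or unicodedata.category(ch) != "Cc"
        for ch in text
    )

# Hypothetical XML document, stored as bytes.
xml_bytes = '<?xml version="1.0"?>\n<memo>caf\u00e9</memo>\n'.encode("utf-8")

# Signature bytes at the start of a legacy binary .doc file.
doc_header = bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1])

xml_is_text = looks_like_plain_text(xml_bytes)
doc_is_text = looks_like_plain_text(doc_header)
print(xml_is_text, doc_is_text)
```

    Any plaintext editor can open the first kind of file; the second
    needs a tool that knows the format.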

    Complementary to this is the importance of the notion of a distinction
    between the thing and its representation, which is where XSL stylesheets
    come in. XSL stylesheets don't turn plain text into rich text; they may
    generate (possibly "fancy", colorful) representations of a plain text
    information asset. Such representations may themselves use a plaintext
    (HTML) or a non-text (PDF) language. But the information asset remains
    in plaintext. When I show somebody a hardcopy of a colorful fancied-up
    PDF document generated from an XML document, I say, not "this is rich
    text", but "this is a plain text document formatted with a stylesheet;
    we can change it however we want without disturbing the plaintext". It
    seems to me that using the terminology as you and some others recommend
    would make this impossible. I just don't see how this idea of "rich
    text" is really very useful.
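
    The thing-versus-representation point can be sketched like this.
    It is not XSLT: a small Python function stands in for a stylesheet,
    and the memo document, the render_html name, and the color
    parameter are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical plain text information asset.
SOURCE = "<memo><title>Budget</title><body>Numbers are up.</body></memo>"

def render_html(xml_text: str, color: str) -> str:
    """Stand-in for a stylesheet: build an HTML representation
    of the memo without modifying the XML source."""
    root = ET.fromstring(xml_text)
    title = escape(root.findtext("title", default=""))
    body = escape(root.findtext("body", default=""))
    return f'<h1 style="color: {color}">{title}</h1><p>{body}</p>'

# Two different representations generated from the same source.
red = render_html(SOURCE, "red")
blue = render_html(SOURCE, "blue")
print(red)
print(blue)
```

    Both outputs carry the same content with different styling, and the
    source string is the same before and after.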

