Re: terminology: plaintext (was Re: unicode Digest V5 #149)

From: Sinnathurai Srivas
Date: Fri Jun 24 2005 - 19:08:49 CDT


    What happens to text that undergoes complex rendering? Does it still remain
    plain text?

    I tried to test this in the following way: I compared a linear font display
    with a non-linear font in Notepad, and then using rich text with a fully
    rendered font. It looks as though the display tries to maintain plain text
    in Notepad.
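
One way to check this independently of any font or renderer is to look at the
stored code points directly. Below is a minimal sketch in Python; the sample
string (Tamil KA followed by VOWEL SIGN O) and the helper name code_points are
my own assumptions, chosen because the vowel sign is drawn on both sides of the
consonant when shaped, while the stored sequence stays in logical order.

```python
import unicodedata


def code_points(text):
    """Return (U+XXXX, character name) pairs for each code point in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Hypothetical sample: TAMIL LETTER KA followed by TAMIL VOWEL SIGN O.
sample = "\u0B95\u0BCA"

# The listing depends only on the stored string, not on how it is displayed.
for cp, name in code_points(sample):
    print(cp, name)
```

Whatever the editor shows on screen, the listing is the same, which suggests
that rendering complexity is a property of the display and not of the text.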


    Sinnathurai Srivas

    ----- Original Message -----
    From: "Gregg Reynolds"
    To: "James Kass"
    Cc: "Unicode"
    Sent: Friday, June 24, 2005 9:18 PM
    Subject: terminology: plaintext (was Re: unicode Digest V5 #149)

    > James Kass wrote:
    >> Gregg Reynolds wrote,
    >>>The unicode definition of "plain text" works for me; it's more or less
    >>>mathematical and allows us to avoid metaphysics. But you surely see that
    >>>the definition of "rich text" is hopelessly broken and inconsistent with
    >>>that of plain text, no?
    >> Surely I can see that the definition of rich text is inconsistent
    >> with that of plain text. After all, if they weren't inconsistent,
    >> they'd be the same thing and the glossary entry for "rich text"
    >> could be changed to: 'see "plain text"'.
    > Consistent does not mean identical.
    >> But, what's hopelessly broken about it?
    > Hi James,
    > Sorry about getting back to you late.
    > I hope the following (longish) message will make clear I don't bring this
    > stuff up just to be curmudgeonly.
    > From the glossary:
    > "Plain Text. Computer-encoded text that consists only of a sequence of
    > code points from a given standard, with no other formatting or structural
    > information."
    > Not bad; but not good enough. It should say "a sequence of codepoints
    > *each of which has single-character semantics*...". I.e. a standard which
    > defines a codepoint for "red" or "skip 24 points" or "poodle" cannot be
    > used for plaintext.
    > "Rich Text. Also known as styled text. The result of adding information to
    > plain text. Examples of information that can be added include font data,
    > color, formatting information, phonetic annotations, interlinear text, and
    > so on. The Unicode Standard does not address the representation of rich
    > text. It is expected that systems and applications will implement
    > proprietary forms of rich text. Some public forms of rich text are
    > available (for example, ODA, HTML, and SGML). When everything except
    > primary content is removed from rich text, only plain text should remain."
    > Most obvious problem: SGML is plain text, as is XML, a subset of PDF,
    > etc. HTML is also plaintext; it happens to have some formatting semantics
    > at the lexical level, but considered as a "sequence of codepoints" it
    > clearly meets the Unicode definition of plain text. For that matter,
    > isn't RTF plaintext with formatting semantics? I'm not that familiar with
    > it, but doesn't it use a plain text character repertoire?
    > The basic problem: by these definitions, plain text and rich text are in
    > semantically different categories. One is a sequence of code points; the
    > other is - what? Figure on ground? Ink on paper? Any result of
    > presenting plain text visually?
    > What can it mean to "add information" to plain text, given that plain text
    > is by definition a sequence of codepoints? If you add "information"
    > consisting of codepoints with character semantics, then you still have
    > plain text. If you add "information" consisting of codepoints with
    > non-character semantics, well then you no longer have text of any kind.
    > You have non-text. If you add "information" by writing a syntax-coloring
    > editor, you haven't added anything to the plain text, you've added a
    > completely separate semantic layer.
    > The fact that a plain text string may conform to a higher-level grammar
    > (like XML), even if that grammar also has an associated non-text semantics
    > (like HTML), doesn't change the fact that the string is plain text.
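
The claim that markup is still a sequence of code points can be checked
mechanically. What follows is only a rough heuristic sketch: the function name
is_plain_text, the choice of UTF-8, and the rule about control characters are
all assumptions made for illustration, not part of any Unicode definition. The
eight-byte value is the well-known signature at the start of OLE compound
files such as legacy .doc documents.

```python
def is_plain_text(data: bytes, encoding: str = "utf-8") -> bool:
    """Heuristic: True if data decodes cleanly into code points and contains
    no control characters other than tab, line feed, and carriage return."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return all(
        ch in "\t\n\r" or (ord(ch) >= 0x20 and ord(ch) != 0x7F)
        for ch in text
    )


# HTML markup: formatting semantics, but still just characters.
html_sample = b"<p>Hello, <b>world</b></p>"

# First eight bytes of an OLE compound file (the container of old .doc files).
ole_signature = bytes.fromhex("D0CF11E0A1B11AE1")

print(is_plain_text(html_sample))
print(is_plain_text(ole_signature))
```

Under these assumptions the HTML sample passes and the binary signature does
not, which is the plain text versus non-text split described above.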
    > So the important distinction is not between plain text and rich text, but
    > between plain text and non-text on the one hand, and text versus
    > representation on the other. Or at a higher level, between that family of
    > grammars that use plaintext at the lowest syntactic level, and those that
    > use non-text at the lowest level. The former includes SGML, HTML, XML,
    > RTF, SVG, etc. etc. The latter includes the MSWord doc format, xls, image
    > formats, various proprietary typesetting languages, etc. The Unicode
    > glossary would be improved if, instead of "The Unicode Standard does not
    > address the representation of rich text" it said something like "Unicode
    > does not impose any syntactic or semantic constraints on higher-level
    > grammars that use Unicode at the character text level."
    > This is important in the context of training. I occasionally have to try
    > to explain XML in 30 seconds or less to non-techy business types. One of
    > the crucial points (IMO) is that XML is plain text, which means the kind
    > of file corruption problems we often have with Word docs go away, since we
    > can use any one of thousands of plaintext editors to examine and fix the
    > docs. The contrast with .doc files is not plain v. rich, but plain v.
    > non-text, and therefore tool-agnostic v. vendor dependent. The fact that
    > the non-text elements of the .doc format may represent formatting
    > information is irrelevant; you can't edit them no matter what they mean
    > without a specialized editor.
    > Complementary to this is the importance of the notion of a distinction
    > between the thing and its representation, which is where XSL stylesheets
    > come in. XSL stylesheets don't turn plain text into rich text; they may
    > generate (possibly "fancy", colorful) representations of a plain text
    > information asset. Such representations may themselves use a plaintext
    > (HTML) or a non-text (PDF) language. But the information asset remains in
    > plaintext. When I show somebody a hardcopy of a colorful fancied-up PDF
    > document generated from an XML document, I say, not "this is rich text",
    > but "this is a plain text document formatted with a stylesheet; we can
    > change it however we want without disturbing the plaintext". It seems to
    > me that using the terminology as you and some others recommend would make
    > this impossible. I just don't see how this idea of "rich text" is really
    > very useful.
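
The separation between the asset and its representation can be sketched in a
few lines. This is not XSLT: a plain Python function stands in for the
stylesheet, and the memo document, its element names, and the render_html
helper are hypothetical, invented only for this illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# The information asset: a plain text XML document (hypothetical content).
SOURCE = "<memo><title>Budget</title><body>Costs fell 3% this quarter.</body></memo>"


def render_html(xml_text, color):
    """Stand-in for a stylesheet: build an HTML view of the memo.

    The XML text is only read, never modified.
    """
    root = ET.fromstring(xml_text)
    title = escape(root.findtext("title", default=""))
    body = escape(root.findtext("body", default=""))
    return f'<h1 style="color:{color}">{title}</h1><p>{body}</p>'


# Two different presentations generated from the same unchanged source.
print(render_html(SOURCE, "red"))
print(render_html(SOURCE, "blue"))
```

Changing the color argument changes the presentation only; SOURCE is the same
string before and after, which is the sense in which the asset stays in
plaintext.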
    > -gregg
