Re: terminology: plaintext (was Re: unicode Digest V5 #149)

From: Sinnathurai Srivas (sisrivas@blueyonder.co.uk)
Date: Fri Jun 24 2005 - 19:08:49 CDT

Next message: Michael \(michka\) Kaplan: "Re: Tamil Collation vs Transliteration/Transcription Enc"

Previous message: Sinnathurai Srivas: "Re: Tamil Collation vs Transliteration/Transcription Enc"
Maybe in reply to: Gregg Reynolds: "terminology: plaintext (was Re: unicode Digest V5 #149)"
Next in thread: James Kass: "Re: terminology: plaintext (was Re: unicode Digest V5 #149)"

What happens to text that undergoes complex rendering? Does it still remain
plain text?

I tried to test this in the following way.
I compared a linear font display with a non-linear font in Notepad, and then
did the same using rich text with a fully rendered font. It looks as though
the display tries to maintain plain text in Notepad.

See http://www.geocities.com/avarangal/plain-text.jpg
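
To make the question concrete, here is a minimal sketch in Python (the
helper name `code_points` and the sample syllable are illustrative
assumptions, not part of the experiment above). It lists the stored code
points of a Tamil syllable whose vowel sign is normally drawn to the left of
the consonant. The stored sequence is the same whichever font or renderer
later displays it, which is the sense in which the text stays plain.

```python
import unicodedata


def code_points(text):
    """Return (U+XXXX, character name) pairs for each code point in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Hypothetical sample: TAMIL LETTER KA followed by TAMIL VOWEL SIGN E.
# A renderer typically draws the vowel sign before the consonant, but the
# stored order is consonant first, vowel sign second.
syllable = "\u0B95\u0BC6"

for cp, name in code_points(syllable):
    print(cp, name)
```

The list depends only on the string, so running it against text copied from
Notepad or from a rich-text editor should give the same result as long as the
underlying characters are the same.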

Sinnathurai Srivas

----- Original Message -----
From: "Gregg Reynolds" <unicode@arabink.com>
To: "James Kass" <jameskass@att.net>
Cc: "Unicode" <unicode@unicode.org>
Sent: Friday, June 24, 2005 9:18 PM
Subject: terminology: plaintext (was Re: unicode Digest V5 #149)

> James Kass wrote:
>> Gregg Reynolds wrote,
>>
>>>The unicode definition of "plain text" works for me; it's more or less
>>>mathematical and allows us to avoid metaphysics. But you surely see that
>>>the definition of "rich text" is hopelessly broken and inconsistent with
>>>that of plain text, no?
>>
>>
>> Surely I can see that the definition of rich text is inconsistent
>> with that of plain text. After all, if they weren't inconsistent,
>> they'd be the same thing and the glossary entry for "rich text"
>> could be changed to: 'see "plain text"'.
>
> Consistent does not mean identical.
>>
>> But, what's hopelessly broken about it?
>>
>
> Hi James,
>
> Sorry about getting back to you late.
>
> I hope the following (longish) message will make clear I don't bring this
> stuff up just to be curmudgeonly.
>
> From the glossary:
>
> "Plain Text. Computer-encoded text that consists only of a sequence of
> code points from a given standard, with no other formatting or structural
> information."
>
> Not bad; but not good enough. It should say "a sequence of codepoints
> *each of which has single-character semantics*...". I.e. a standard which
> defines a codepoint for "red" or "skip 24 points" or "poodle" cannot be
> used for plaintext.
>
> "Rich Text. Also known as styled text. The result of adding information to
> plain text. Examples of information that can be added include font data,
> color, formatting information, phonetic annotations, interlinear text, and
> so on. The Unicode Standard does not address the representation of rich
> text. It is expected that systems and applications will implement
> proprietary forms of rich text. Some public forms of rich text are
> available (for example, ODA, HTML, and SGML). When everything except
> primary content is removed from rich text, only plain text should remain."
>
> Most obvious problem: SGML is plain text, as is XML, a subset of PDF,
> etc. HTML is also plaintext; it happens to have some formatting semantics
> at the lexical level, but considered as a "sequence of codepoints" it
> clearly meets the Unicode definition of plain text. For that matter,
> isn't RTF plaintext with formatting semantics? I'm not that familiar with
> it, but doesn't it use a plain text character repertoire?
>
> The basic problem: by these definitions, plain text and rich text are in
> semantically different categories. One is a sequence of code points; the
> other is - what? Figure on ground? Ink on paper? Any result of
> presenting plain text visually?
>
> What can it mean to "add information" to plain text, given that plain text
> is by definition a sequence of codepoints? If you add "information"
> consisting of codepoints with character semantics, then you still have
> plain text. If you add "information" consisting of codepoints with
> non-character semantics, well then you no longer have text of any kind.
> You have non-text. If you add "information" by writing a syntax-coloring
> editor, you haven't added anything to the plain text, you've added a
> completely separate semantic layer.
>
> The fact that a plain text string may conform to a higher-level grammar
> (like XML), even if that grammar also has an associated non-text semantics
> (like HTML), doesn't change the fact that the string is plain text.
>
> So the important distinction is not between plain text and rich text, but
> between plain text and non-text on the one hand, and text versus
> representation on the other. Or at a higher level, between that family of
> grammars that use plaintext at the lowest syntactic level, and those that
> use non-text at the lowest level. The former includes SGML, HTML, XML,
> RTF, SVG, etc. etc. The latter includes the MSWord doc format, xls, image
> formats, various proprietary typesetting languages, etc. The Unicode
> glossary would be improved if, instead of "The Unicode Standard does not
> address the representation of rich text" it said something like "Unicode
> does not impose any syntactic or semantic constraints on higher-level
> grammars that use Unicode at the character text level."
>
> This is important in the context of training. I occasionally have to try
> to explain XML in 30 seconds or less to non-techy business types. One of
> the crucial points (IMO) is that XML is plain text, which means the kind
> of file corruption problems we often have with Word docs go away, since we
> can use any one of thousands of plaintext editors to examine and fix the
> docs. The contrast with .doc files is not plain v. rich, but plain v.
> non-text, and therefore tool-agnostic v. vendor dependent. The fact that
> the non-text elements of the .doc format may represent formatting
> information is irrelevant; you can't edit them no matter what they mean
> without a specialized editor.
>
> Complementary to this is the importance of the notion of a distinction
> between the thing and its representation, which is where XSL stylesheets
> come in. XSL stylesheets don't turn plain text into rich text; they may
> generate (possibly "fancy", colorful) representations of a plain text
> information asset. Such representations may themselves use a plaintext
> (HTML) or a non-text (PDF) language. But the information asset remains in
> plaintext. When I show somebody a hardcopy of a colorful fancied-up PDF
> document generated from an XML document, I say, not "this is rich text",
> but "this is a plain text document formatted with a stylesheet; we can
> change it however we want without disturbing the plaintext". It seems to
> me that using the terminology as you and some others recommend would make
> this impossible. I just don't see how this idea of "rich text" is really
> very useful.
>
> -gregg
>
>
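
The quoted argument, that an HTML document is itself a sequence of code
points and that removing the markup leaves the primary content, can be
sketched in Python using only the standard library. The class name
`TextOnly`, the function `strip_markup`, and the sample string are
hypothetical names chosen for illustration. This is a sketch of the idea,
not a general-purpose HTML-to-text converter.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data and ignore every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; tags themselves are skipped.
        self.parts.append(data)


def strip_markup(markup):
    """Return the character data of an HTML string with the tags removed."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


# Hypothetical sample document.
html_doc = "<p>Plain <b>text</b> remains.</p>"

# The markup is an ordinary string: it survives an encode/decode round trip
# like any other sequence of code points.
assert html_doc.encode("utf-8").decode("utf-8") == html_doc

print(strip_markup(html_doc))  # prints "Plain text remains."
```

Both the input and the output here are ordinary strings. The only difference
is that the input also conforms to a higher-level grammar.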


This archive was generated by hypermail 2.1.5 : Fri Jun 24 2005 - 19:09:45 CDT