Re: second attempt from Asmus Freytag via Unicode on 2018-10-31 (Unicode Mail List Archive)

From: Asmus Freytag via Unicode <unicode_at_unicode.org>
Date: Wed, 31 Oct 2018 11:27:00 -0700

On 10/31/2018 10:32 AM, Janusz S. Bień via Unicode wrote:

Let me remind you what plain text is according to the Unicode glossary:

    Computer-encoded text that consists only of a sequence of code
    points from a given standard, with no other formatting or structural
    information.

If you try to use this definition to decide what is and what is not a
character, you get a vicious circle.

As mentioned already by others, there is no other generally accepted
definition of plain text.

This definition becomes tautological only when you try to invoke it in making encoding decisions, that is, if you couple it with the statement that only "elements of plain text" are ever encoded.

For that purpose, you need a number of other definitions of "plain text", including the definition that plain text is the "backbone" to which you apply formatting and layout information. I personally believe that there are more 2D notations where it's quite obvious to me that what is "placed" is a text element. These are more like maps and music and less like a circuit diagram, where the elements are less text-like (I deliberately include symbols in the definition of text, but not any random graphical line art).
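The "backbone" view can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `StyleRun` record, the `styled_slices` helper and the style name "italic" are hypothetical and do not come from the Unicode Standard. The point is that the backbone is nothing but a sequence of code points, while formatting lives in a separate structure that refers to it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StyleRun:
    """Hypothetical styling record kept outside the text itself."""
    start: int   # index of the first code point covered
    end: int     # index one past the last code point covered
    style: str   # e.g. "italic"; this vocabulary is an assumption


def code_points(text: str) -> list[str]:
    """Return the plain-text backbone as a list of U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]


def styled_slices(text: str, runs: list[StyleRun]) -> list[tuple[str, str]]:
    """Pair each styled stretch of the backbone with its style name."""
    return [(text[r.start:r.end], r.style) for r in runs]


backbone = "plain text"              # only code points, no formatting
runs = [StyleRun(0, 5, "italic")]    # formatting layered on top, separately

print(code_points(backbone)[:5])
print(styled_slices(backbone, runs))
```

Dropping `runs` leaves `backbone` unchanged, which is the sense in which the styling is not part of the plain text.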

Another definition of plain text is that which contains the "readable content" of the text. As we've discussed here, this definition has edge cases; some content is traditionally left to styling. Example: some of the small words in some Scandinavian languages are routinely italicized to disambiguate their reading. Other languages use accents for this purpose, sometimes without recognizing either the accented letter as part of the alphabet, or the accented form as a dictionary entry. This nicely shows that this level of disambiguation is intuitively viewed as less orthographic, which also applies to the cases where italics are used for the same purpose.

In some contexts (Western Math) the scope of readable content is different than that of ordinary text. Therefore, this definition of "plain text" isn't universal. In principle, you could argue that your definition of readable content should apply; however, as a standard, Unicode will insist on limiting the encoding to text elements required by some common, widely shared and reasonably agreed-upon definition of plain text -- corresponding to a particular division between text elements and styling. So far, we have ordinary text, math and phonetics, but we don't have an agreement that reproducing all variations in manuscripts is in scope.
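The math case can be made concrete with a minimal Python sketch using the standard `unicodedata` module. It assumes only that U+1D44E is the mathematical italic small a; the variable names and the `describe` helper are illustrative. In mathematical notation the italic form counts as readable content and has its own code point, whereas for ordinary text the same distinction is left to styling.

```python
import unicodedata

ordinary_a = "a"                 # U+0061, used in ordinary text
math_italic_a = "\U0001D44E"     # italic a encoded for mathematical use


def describe(ch: str) -> str:
    """Format a single character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe(ordinary_a))      # U+0061 LATIN SMALL LETTER A
print(describe(math_italic_a))   # U+1D44E MATHEMATICAL ITALIC SMALL A

# Compatibility normalization folds the math letter back to the ordinary
# one, discarding a distinction that is meaningful in mathematical text.
folded = unicodedata.normalize("NFKC", math_italic_a)
print(folded == ordinary_a)      # True
```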

A./

Received on Wed Oct 31 2018 - 13:27:12 CDT
