RE: Tildes on vowels

From: Marco Cimarosti
Date: Tue Aug 13 2002 - 12:04:22 EDT

William Overington wrote:
> >2) Superscript, subscript, combining above, and other forms of
> >identifying placement of characters, are better left to markup or other
> >rendering systems and file formats (and not for a vehicle intended for
> >plain text.)
> Why? This call for markup seems to be some deeply held belief that is
> treated as if it is a law of nature. So, some people somewhere
> decided to think in terms of layers, so, that is up to them:

It is not a deeply held belief nor a law of nature: it is just an
*arbitrary* decision.

As you correctly say, "some people somewhere decided to think in terms of
layers". These people were those who invented Unicode, and their
*arbitrary* decision to split text representation into these two layers is
the foundation of Unicode.

As with all arbitrary decisions which are the foundation of something, you
either accept them or refuse that something as well.

The rules of chess are totally arbitrary but, if you refuse to play by them,
you refuse to play chess. Refusing to play chess is not necessarily a bad
thing (perhaps there are nicer games), but there is no way you can play
chess by refusing to follow the rules. Chess and Unicode are just sets of
rules: take the rules away and nothing remains.

One of the rules of Unicode is its definition of the border between plain
text and rich text, and its decision to only encode information below the

> the fact of the matter is that using individual Private Use Area
> characters for matters which are otherwise performable by a
> sequence of characters starting with a < character used to mean
> ENTER MARKUP BUBBLE rather than its specified meaning
> in the Unicode standard is perfectly reasonable. Using
> Private Use Area characters does not mean redefining the meaning
> of a character from the Unicode standard as does using < to mean

Because of the arbitrary, conventional, nature of the difference between
"plain text" and "rich text", one cannot use one's own definition: it would be
like changing the rules of chess and pretending it is still the same game.

If you want to play Unicode, the definitions are in the on-line glossary:

        "Plain Text. Computer-encoded text that consists only of a sequence
of code points from a given standard, with no other formatting or structural
information. Plain text interchange is commonly used between computer
systems that do not share higher-level protocols. (See also fancy text.)"

        "Rich Text. (See fancy text.)"

        "Fancy Text. Also known as rich text. The result of adding
additional information to plain text. Examples of information that can be
added include font data, color, formatting information, phonetic
annotations, interlinear text, and so on. The Unicode Standard does not
address the representation of fancy text. It is expected that systems and
applications will implement proprietary forms of fancy text. Some public
forms of fancy text are available (for example, ODA, HTML, and SGML). When
everything but primary content is removed from fancy text, only plain text
should remain."

Unfortunately, the plain text definition is not very well stated. The phrase
"text that consists only of a sequence of code points from a given standard"
seems to imply that even HTML is plain text (aren't there Unicode code
points for "<", ">", "&" and ";"?), and it only makes sense if you correctly
interpret the following phrase: "... with no other formatting or structural
information" as if it were written: "... with no other formatting or

And the fancy (rich) text definition too might need a gloss: <When all
primary content is removed from fancy text, only the additional information
should remain, and this additional information is called "markup">.

As you see, it is nowhere said that markup is necessarily something
beginning with "<" or any other character. The additional information
("markup") can be in any format; in fact, the definition says: "It is
expected that systems and applications will implement proprietary forms".
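
To illustrate the last sentence of the fancy text definition with one of the
public forms it names, HTML, here is a minimal Python sketch. It relies only
on the standard library's html.parser module; the names PlainTextExtractor
and strip_html are made up for this example, and it is a rough illustration
rather than a complete HTML-to-text converter.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Collects character data and ignores every tag (the markup)."""

    def __init__(self):
        # convert_charrefs=True turns references such as &lt; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags, i.e. the primary content
        self.parts.append(data)


def strip_html(fancy):
    """Return what remains of `fancy` once the HTML markup is removed."""
    parser = PlainTextExtractor()
    parser.feed(fancy)
    parser.close()
    return "".join(parser.parts)


print(strip_html("E = mc<sup>2</sup>"))  # prints "E = mc2"
```

Note that the superscript placement is lost in the result: that information
lived in the markup layer, not in the plain text.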

> [...] I am not knocking markup, [...]

Of course you aren't! Your idea of defining format controls as PUA code
points fits the above definition perfectly.

the controls IS NOT PLAIN TEXT: it is William Overington's own "proprietary
form" of rich text.

You are outside Unicode's rules not because you defined your Farmyard codes in
the PUA (which is perfectly legal, as I explain below), but because you fail
to accept (or understand) that these codes are a form of markup, and that
text containing them is a "proprietary form... of fancy text".
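
The same point can be sketched in Python for PUA-based controls. The two code
points below are hypothetical assignments chosen for this example only (they
are not the actual Farmyard codes), and the sketch assumes that every Private
Use character in the text is a format control. It uses the standard
unicodedata module, where Private Use code points have general category "Co".

```python
import unicodedata

# Hypothetical private agreement, for illustration only
SUPERSCRIPT_ON = "\uE000"
SUPERSCRIPT_OFF = "\uE001"


def is_private_use(ch):
    """True if `ch` is a Private Use code point (general category Co)."""
    return unicodedata.category(ch) == "Co"


def strip_private_use(text):
    """Remove the PUA controls, leaving only the plain text."""
    return "".join(ch for ch in text if not is_private_use(ch))


fancy = "E = mc" + SUPERSCRIPT_ON + "2" + SUPERSCRIPT_OFF
print(strip_private_use(fancy))  # prints "E = mc2"
```

Whether the controls are spelled with "<" or with PUA code points, removing
them leaves the same plain text, which is what makes them markup.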

> >3) Some of these systems are also established and standardized (either
> >dejure or de facto), so creating new methods in code points is
> >unnecessary, and given the proposed misuse of the PUA (see next point)
> >is at conflict with the goals and architecture of Unicode.
> There is no misuse of the Private Use Area in what is being
> suggested. You might think it not a good approach, but
> labelling it as misuse is unfair.

I agree with you here, and I must disagree with Tex.

Although I *do* think that your PUA format codes are not a good approach, I
also think that they are a legal usage of the PUA.

IMHO, it is fair to call it "reinventing the wheel", but not to call it
"misuse".
The only questionable usage of the PUA that I can think of is duplicating
existing characters, but that would be an absurd thing to do. Your other
proposal of defining PUA ligatures comes close to this, but not quite.

> What exactly, precisely does de facto standardized mean?

Tex is probably trying to flatter you. He thinks that your Farmyard codes
could become so successful that it would not be practically possible for
others to use the same slots for something else.

_ Marco
