Re: The Atomic Theory of Unicode

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jul 06 1999 - 22:43:29 EDT


Jonathan Coxhead has supplied a long document introducing his
Atomic Theory of Unicode.

While I appreciate that a great deal of work went into this document,
I have to reiterate my basic position that the direction that Mr.
Coxhead is proposing is misguided and at odds with the
expressed intent of the two technical committees charged with
maintaining and extending the two synchronized standards, The Unicode
Standard and ISO/IEC 10646.

What Mr. Coxhead is doing is an after-the-fact componential analysis
of the repertoire which has been encoded so far, on the premise that
if these dimensions were just systematically recognized, and if
"PRESENTATION SUGGESTION XXX" characters were encoded corresponding
to those dimensions, the standard would be "more extensible, thereby
making it more universal."

The basic problem here is that these dimensions were known 10 years
ago when the initial repertoire for Unicode 1.0 was assembled. In almost
every case consideration was given to encoding specific characters
to avoid having to encode enumerated lists of variant forms of
characters already in the basic repertoire. There were extensive
discussions of this, and what emerged was the consensus agreement about
where to draw the lines between generativity (through separate encoding
of combining marks) and enumeration (of small--or even large--defective sets of
variants).

A major consideration at the outset was the need for one-to-one convertibility
to legacy (pace, Frank) character sets. Forcing early implementations
of Unicode, particularly for Asian character sets, to have n-to-1 and
1-to-n mappings for interoperability would have hindered the rollout
of pioneering implementations and would thereby have had a hard-to-measure
but tangible negative effect on the acceptance of Unicode. This weighed
significantly in people's minds, and had an impact on the general
tolerance that the committee showed toward the acceptance of piles
of c**p characters into the repertoire.
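
As a sketch of the round-trip issue, offered purely for illustration
(using Shift-JIS as the legacy set and Python's codec machinery,
neither of which is at issue here): a 1-to-1 mapping lets legacy data
survive the trip to Unicode and back unchanged.

    # Full-width '!' in Shift-JIS maps 1-to-1 to U+FF01 FULLWIDTH
    # EXCLAMATION MARK, so the legacy bytes round-trip exactly.
    legacy = b"\x81\x49"
    assert legacy.decode("shift_jis") == "\uFF01"
    assert legacy.decode("shift_jis").encode("shift_jis") == legacy
    # Had U+FF01 not been separately encoded, this legacy character would
    # need a 1-to-n mapping (plain '!' plus some width-marking suggestion
    # character), and every converter would have to reassemble that
    # sequence to recover the original bytes.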

The term "Atomic Unicode" (aka "Cleanicode") as also been around for
nearly a decade, referring to Unicode as it "ought" to have been, without
any precomposed Latin, Greek, or Cyrillic, without piles of compatibility
characters of various styles, without ligatures, etc., etc. But while
Cleanicode has had advocates, it has never gotten off the ground
because it is so patently obvious to the implementers that they need
all the other stuff in the standard to deal with the rest of
the data in the world in reasonably straightforward ways.

There is another major consideration. Despite the existence of font
technology for dealing with combining marks, this area is still not
very well implemented for major scripts, including Latin, of course.
The concepts are straightforward, but doing all the detailed work in
the fonts and the systems that use them has proven rather daunting to
the industry. It is happening, finally, but the way forward has always
proven to be: First deal with the larger character repertoire and larger
fonts, ignoring the complexity of n-to-n character-to-glyph mapping
and the complexity of script rules in general; only later try to
support the full character/glyph model and complex scripts, once having
dealt with the "simple" part of Unicode. In this regard, "opening up"
the standard by encoding lots of presentation suggestion characters
that would introduce thousands of new equivalences between
characters and sequences would likely just backfire, adding to the
confusion and delays in dealing correctly with Unicode rendering.

> It has the side effect of giving more control to the users of the
> standard by "opening it up" so that people in special fields (e g,
> mathematics, phonetics), or those who just want strange effects in text,
> can have them without needing to petition the standardising body. This,
> I think, is what makes it more than just an exercise in classification.

This is certainly the argument that we have always used to justify
combining marks in the first place. But even for such obvious
combining marks as visible accents placed productively over, under,
or otherwise around a base character in an alphabet, there has been
enormous resistance to accepting the obvious -- and sometimes for
good implementation reasons.

IPA is already well-served by the productive possibilities available
in the standard. We will have to have the technical discussion regarding
the FUPA proposal, once it is generally available.
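
As a small illustration of that productivity (a sketch with a
present-day Unicode-aware library, Python's unicodedata, offered only
as an example): a dental [t] needs no precomposed character at all.

    import unicodedata

    # IPA dental [t]: base letter plus U+032A COMBINING BRIDGE BELOW,
    # built productively from pieces already in the standard.
    dental_t = "t\u032A"
    print([unicodedata.name(c) for c in dental_t])
    # ['LATIN SMALL LETTER T', 'COMBINING BRIDGE BELOW']

    # Nasalized [e]: base letter plus U+0303 COMBINING TILDE; form C even
    # folds it to the existing precomposed U+1EBD.
    print(unicodedata.normalize("NFC", "e\u0303") == "\u1EBD")   # True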

And as for mathematics, there has been a thorough technical airing of the
options, with good arguments on both sides. But the consensus of the
technical committee now is that completion of the standard for use in
mathematical systems is best accomplished by adding the extra alphabets
needed for math simply as lists of distinct entities -- exactly the
way the existing math systems treat them. In either case, the
mathematicians have had to "petition the standardising body", since
whichever way one went for encoding, there were characters missing.
The effort underway is to ensure that what results from this particular
"petition" is complete enough in the census of basic symbols and
math style alphabets so that full math implementations in Unicode
become possible. After that, it will simply be the normal business
of encoding the occasional oversight or oddball historic character
that someone turns up.
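
The "distinct entities" pattern is already visible in the letterlike
symbols block; a brief, purely illustrative sketch (again with Python's
unicodedata):

    import unicodedata

    # Math-styled letters encoded as characters in their own right carry
    # only compatibility (<font>) decompositions, never canonical ones,
    # so they remain distinct under canonical normalization.
    for ch in "\u2102\u210C\u211D":   # double-struck C, black-letter H, double-struck R
        print(hex(ord(ch)), unicodedata.name(ch), "->",
              unicodedata.decomposition(ch))
    # 0x2102 DOUBLE-STRUCK CAPITAL C -> <font> 0043
    # 0x210c BLACK-LETTER CAPITAL H -> <font> 0048
    # 0x211d DOUBLE-STRUCK CAPITAL R -> <font> 0052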

> My aim has been to identify the largest possible set of
> decompositions, by using (or abusing) the "markup" tags present in the
> decomposition fields of UNIDATA.TXT as explicit presentation suggestion
> characters, and by making explicit some of the information that is only
> represented in the name or visual appearance of the character. This is
> done with a mixture of existing combining characters, some new combining
> characters, and a new type of character called a PRESENTATION
> SUGGESTION.
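
The "markup" tags in question are the compatibility labels (<compat>,
<super>, <font>, <fraction>, and so on) in the decomposition field of
the data file. A minimal sketch, using Python's unicodedata purely for
illustration:

    import unicodedata

    # Canonical decompositions carry no tag; compatibility decompositions
    # carry a bracketed label naming the dimension of variation.
    print(unicodedata.decomposition("\u00E9"))  # '0065 0301' (canonical)
    print(unicodedata.decomposition("\uFB01"))  # '<compat> 0066 0069' fi ligature
    print(unicodedata.decomposition("\u00B2"))  # '<super> 0032' superscript two
    print(unicodedata.decomposition("\u00BD"))  # '<fraction> 0031 2044 0032' one half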

Note that a very explicit line had to be drawn for Latin and Cyrillic
between which diacritic modifications were appropriate for encoding as
combining marks and which were not. That line--decided 10 years ago--
is arbitrary, of course, but it is also explicit. Detached diacritics
were made combining marks; and *some* attached diacritics were made
combining marks: cedilla, ogonek, and the palatal hook made the cut,
partly because they were still glyphically peripheral, in a sense.
But various bars, strokes, other hooks, deformations, turnings,
and mirrorings did not make the cut, even though in every case one
can find some productivity in the use of these as diacritics to
extend orthographies. Can the cut be defended on principle? Probably
not. Various attempts were made to axiomatize the consensus about what
should be decomposed and what should not; it was a black hole for
endless argumentation. So in the end, we just drew a line and decided
to live with it. Arguing at this point about whether LATIN LETTER F WITH
HOOK should be given a formal canonical decomposition or not is
the character encoding equivalent of Indians and Pakistanis fighting
over who owns the glacier at Kargil in Kashmir.
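
The concrete shape of that line is easy to see in the data; a small
sketch, again with Python's unicodedata, for illustration only:

    import unicodedata

    # Cedilla and ogonek made the cut as combining marks, so these
    # letters decompose canonically...
    print(unicodedata.decomposition("\u00E7"))  # '0063 0327' c with cedilla
    print(unicodedata.decomposition("\u0105"))  # '0061 0328' a with ogonek
    # ...while bars, strokes, and hooks did not: these letters are
    # atomic, with no canonical decomposition at all.
    print(repr(unicodedata.decomposition("\u0142")))  # ''  l with stroke
    print(repr(unicodedata.decomposition("\u0192")))  # ''  f with hook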

> This note considers 3134 characters, of which 900 have canonical
> decompositions already, and are not considered further. Of the 2234
> characters left, over 1300 of them---well over half---are given new
> canonical decompositions, some of which involve one or more of 34 new
> characters, which are defined here. These characters are intended to be
> productive parts of the U C S.

It is important to keep in mind another extremely important point:
Normalization. With the rolling out of the Unicode Standard, Version 3.0,
and of the Unicode Technical Report #15, Unicode Normalization Forms,
there is a very strict dependency between the existing list of canonical
equivalences and the exact content of normalized Unicode text conforming
to UTR #15. From this point forward, assuming that Unicode normalization
form C (the one based on canonical composition) catches on, the introduction of
either new precomposed characters *or* new combining marks would not
have the consequences that their proposers intend. A new precomposed
character equivalent to some existing base form plus combining mark will
*not* appear in text normalized to form C. And the introduction of a new
combining mark that would decompose a character already encoded would result in
the decomposed representation *not* being normalized to the already
existing precomposed form. Both types of introduction would result in
apparently inexplicable failures of equality for normalized data--
and thus would be very strongly resisted by the technical committee.
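
A minimal sketch of the equality problem (Python's unicodedata again;
the new characters are hypothetical and appear only in the comments):

    import unicodedata

    # Today's canonical equivalences are exactly what form C composes:
    decomposed  = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT
    precomposed = "\u00E9"    # precomposed 'e with acute'
    assert unicodedata.normalize("NFC", decomposed) == precomposed

    # A hypothetical *new* precomposed character equivalent to an existing
    # base + mark sequence would not be produced by form C, and a
    # hypothetical new combining mark decomposing an already-encoded
    # character would not recompose to the existing precomposed form.
    # Either way, strings that are canonically equivalent on paper would
    # compare unequal after normalization.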

> I hope that some consideration can be given to these ideas. I even
> hope that they might forestall the encoding of large numbers of copies
> of the Latin alphabet into the U C S in the guise of mathematical
> symbols and phonetic characters, etc, while restoring the freedom of
> expression to these groups of people, and keeping the U C S down to a
> small and productive core.

Since I am opposed to this entire proposal in principle, I am not
going to argue the details of each individual character Mr.
Coxhead proposes as part of the analysis. I hope my response is
taken as due consideration, despite the fact that I will not engage
further in the details. The issue is not one of a failure of the
standard to provide freedom of expression. For mathematics and FUPA
there are simply some characters missing which will be added, once
the proposals have received enough technical review in UTC and in
WG2. Exactly what the lists will look like at the end of that technical
review is still an open question. But what is most unlikely to
happen is any back-revving of componential dimensions for existing
characters as independent presentation suggestion characters.
The unintended consequences of this well-meaning suggestion are
rather more severe than the problem Mr. Coxhead is attempting to
forestall by proposing these additions.

--Ken Whistler


