Re: Just if and where is the then?

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue May 04 2004 - 19:38:15 CDT


Dele,

> "No new composite values will be added". - Peter Constable
>
> The above sounds dictatorial in nature.

Peter has already explained that this is just the nature
of the current policy regarding such additions. Others in
this thread have attempted to explain the reason for the
policy. The short answer is that such additions would disturb
the stability of the definition of normalization of data
involving Unicode characters, and stability of normalization
is extremely important to many implementations of the standard.

This said, you need to understand that there is a learning
curve for people coming new to the Unicode Standard.

The existence of a policy which constrains certain kinds of
additions to the standard is not a matter of dictatorial
proclamations -- it is not something that Peter Constable or
any other individual has the power to impose.

Such policies arise out of the consensus deliberations of
the Unicode Technical Committee, which involve many different
members, jointly responsible for the technical content of
the standard. They are also endorsed in the Principles and
Procedures document for the ISO committee, JTC1/SC2/WG2,
responsible for the parallel, de jure international character
encoding standard, ISO/IEC 10646. And in that committee,
decisions are also made based on consensus after discussion
among members of many different participating national bodies.

As for the particular issue regarding characters like {e with
dot below and acute accent}, for example, the policy is not
in place as a matter of discrimination against particular
languages or orthographies.

The *glyph* for {e with dot below and acute accent} can and
should be in a font for use with a language that requires
it. Alternatively, the font and/or rendering system should be
smart enough to be able to apply diacritics correctly.

But the *characters* needed to represent this are already in
the Unicode Standard, so the text in question can *already*
be handled by the standard. Trying to introduce a single,
precomposed character to do this, instead, would just introduce
normalization issues into the standard without actually
increasing its ability to represent what you need to
represent.
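
To make the normalization point concrete, here is a small
sketch using Python's standard unicodedata module (the variable
names are only illustrative). It assumes the two mark orders
shown are the spellings a user might type. Both are canonically
equivalent, so they normalize to the same sequence, and no
single precomposed code point is involved in the result.

```python
import unicodedata

# Two spellings of {e with dot below and acute accent}:
# the combining marks are entered in different orders.
dot_then_acute = "\u0065\u0323\u0301"
acute_then_dot = "\u0065\u0301\u0323"

# Normalization reorders the marks canonically, so both
# spellings end up identical in NFD and in NFC.
nfd_a = unicodedata.normalize("NFD", dot_then_acute)
nfd_b = unicodedata.normalize("NFD", acute_then_dot)
nfc_a = unicodedata.normalize("NFC", dot_then_acute)
nfc_b = unicodedata.normalize("NFC", acute_then_dot)

same_nfd = nfd_a == nfd_b
same_nfc = nfc_a == nfc_b

# NFC composes e + dot below, then leaves the acute as a
# separate combining mark.
nfc_code_points = [f"{ord(c):04X}" for c in nfc_a]
print(same_nfd, same_nfc, nfc_code_points)
```

If a new precomposed character were added for this combination,
previously normalized data would no longer be in the expected
form, which is the stability problem described above.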

As Peter has explained, a "letter" or a "grapheme" doesn't
necessarily have a 1-to-1 relationship to the formal,
abstract character encoded in the Unicode Standard for use
in representing text.

You had one example already: "gb" is a "letter" in Edo. That
fact is important for education, for language learning, for
sorting, and various other things. But that "letter" is
represented by a sequence of *characters* already encoded
in Unicode: <0067, 0062>.

Likewise, if you have an acute accented e with dot below, that
may constitute a single accented "letter" in Edo, but it is
represented by a sequence of *characters* already encoded
in Unicode: <0065, 0323, 0301>.
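
As a quick check of the two sequences just mentioned, the
following sketch lists the code point and character name for
each element, using Python's standard unicodedata module. The
helper name describe() is made up for this illustration.

```python
import unicodedata

def describe(text):
    """Return (code point, character name) for each character."""
    return [(f"{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

edo_gb = "\u0067\u0062"        # the "letter" gb
edo_e = "\u0065\u0323\u0301"   # e with dot below and acute accent

for label, seq in (("gb", edo_gb), ("e-dot-acute", edo_e)):
    print(label)
    for code_point, name in describe(seq):
        print("  ", code_point, name)
```

One "letter" for the user, two or three characters for the
encoding: that is all the difference amounts to.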

These decisions regarding the underlying numbers representing
these elements of text are *not* required to be surfaced up
to the level of end users. Properly operating software supporting
a particular language should present the alphabetic units and
their behavior to users the way *they* expect they should
work. The fact that Unicode systems haven't gotten there in
many cases yet is an artifact of the enormous difficulty of
getting computers to work for *all* the writing systems and
languages of the world. People are working hard on the
problem, but it is a *big* problem to solve.
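
As a very rough sketch of what "presenting the alphabetic units"
could look like in software, the following Python groups a
string into user-level "letters". Everything here is an
assumption for illustration: the function name letters(), the
DIGRAPHS table (which contains only the "gb" example from
above, not a real Edo alphabet), and the simple rule of
attaching combining marks to the preceding unit. Real systems
would rely on proper language data and text segmentation rules.

```python
import unicodedata

# Hypothetical table: only the digraph mentioned in this message.
DIGRAPHS = {"gb"}

def letters(text, digraphs=DIGRAPHS):
    """Split text into user-level letters (illustrative only)."""
    text = unicodedata.normalize("NFD", text)
    units = []
    i = 0
    while i < len(text):
        # Prefer a two-character digraph when one starts here.
        if text[i:i + 2].lower() in digraphs:
            j = i + 2
        else:
            j = i + 1
        # Attach any following combining marks to the same unit.
        while j < len(text) and unicodedata.combining(text[j]):
            j += 1
        units.append(text[i:j])
        i = j
    return units

# A made-up test string, not claimed to be a real word.
sample = "gbe\u0323\u0301"
print(len(sample), "characters,", len(letters(sample)), "letters")
```

The underlying sequence has five characters, but a user of such
software would only ever see and count two letters.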

--Ken


