Re: Just if and where is the then?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue May 04 2004 - 16:14:07 CDT


From: "African Oracle" <oracle@africaservice.com>
> If a can have U+0061 and have a composite that is U+00e2...U+...
> If e can have U+0065 and have a composite that is U+00ea...U+...
>
> Then why is e with accented grave or acute and dot below cannot be assigned
> a single unicode value instead of the combinational values 1EB9 0301 and
> etc....
>
> Since UNICODE is gradually becoming a defacto, I still think it will not be
> a bad idea to have such composite values.

I think the answer is that the precomposed characters and their decompositions
come from the need to support round trips with preexisting legacy standards.
That need is what justifies canonical equivalences and normalization.
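
To make the idea of canonical equivalence concrete, here is a minimal sketch
using Python's standard unicodedata module (my choice of tool for illustration,
not something the original discussion relied on). It takes U+00EA from the
question above and compares it with the sequence e + U+0302:

```python
import unicodedata

precomposed = "\u00ea"   # LATIN SMALL LETTER E WITH CIRCUMFLEX, one code point
decomposed = "e\u0302"   # e followed by COMBINING CIRCUMFLEX ACCENT

# As raw code point sequences the two strings differ...
same_raw = precomposed == decomposed

# ...but normalization maps each spelling onto the other.
same_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_nfd = unicodedata.normalize("NFD", precomposed) == decomposed

print(same_raw, same_nfc, same_nfd)   # False True True
```

The two spellings are distinct strings but canonically equivalent, which is
what the legacy precomposed characters require.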

Beyond that, I don't think there is a preexisting African standard that would
require such a canonical equivalence. In fact, having multiple ways to encode
the same character is a pollution, though one that was needed to make Unicode
work and interoperate with widely used legacy standards.

Finally, there has been a contractual agreement between Unicode, ISO/IEC 10646
and other standards bodies to keep a "stability policy" for normalization. Due
to this policy, it is now impossible to define a canonical equivalence between
a newly encoded precombined character and a sequence composed of preexisting
base letters and diacritics.

So this means that the only way to include e-with-acute-and-dot-below would be
to include it as a new distinct code point, WITHOUT any canonical equivalence.
This is not really a problem as long as the African languages needing this
character adopt a consistent representation. But you will see immediately that
it becomes impossible to define a standard canonical equivalence between
characters entered in decomposed form and newer characters entered as a single
precombined code point. For Unicode, ISO/IEC 10646, and all other standards
which depend on Unicode and which have signed the policy agreement, these
sequences will be considered distinct, forever.
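
As a rough illustration of what "distinct forever" would mean in practice, the
sketch below uses the private-use code point U+E000 as a purely hypothetical
stand-in for such a new precombined letter (no such character is encoded; the
stand-in is my assumption). Like the proposed letter, it has no canonical
decomposition, so no normalization form ever relates it to the decomposed
sequence:

```python
import unicodedata

# Hypothetical stand-in for a new "e with acute and dot below" code point.
# U+E000 is a private-use character, chosen only because it has no
# decomposition; it is NOT an actual encoding of this letter.
STAND_IN = "\ue000"

# The decomposed spelling available today: e + dot below + acute.
SEQUENCE = "e\u0323\u0301"

has_decomposition = unicodedata.decomposition(STAND_IN) != ""

# Compare the two spellings under every normalization form.
equal_under = {
    form: unicodedata.normalize(form, STAND_IN)
    == unicodedata.normalize(form, SEQUENCE)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

print(has_decomposition)   # False
print(equal_under)         # every value is False
```

Software that compares normalized strings would therefore treat text typed with
the new code point and text typed with the sequence as different words.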

This won't be a problem if a new African standard chooses to use a single
precombined code point (such a standard should then state clearly that the
character is NOT decomposable).

The other way to create a new decomposable character would be to define a
decomposition containing at least one NEW code point. I doubt this would be
desired for the base letter e, or even for the acute accent. But it may be
possible for the dot below.

One thing counts against this last approach: with how many base letters
(possibly precombined) would we have to define a composition with such a new
African dot below character? Is the repertoire of letters with dot below
completely closed (including base letters with other diacritics)? As soon as
such a new African dot below is defined, every possible preexisting letter
would have to be included in a decomposition pair. It seems difficult to
achieve this with a repertoire of African letters which is currently not
bounded. (In the past this was not a problem, but the Unicode stability
policies will not allow this repertoire to be extended later, once such an
African dot below diacritic has been introduced in some version.)
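
To get a feel for the size of that repertoire, one can list the characters that
already decompose into a sequence containing the existing dot below, U+0323.
This is only a sketch: the function name is made up for illustration, and the
result depends on the Unicode version bundled with the Python interpreter, so
no count is given here.

```python
import sys
import unicodedata

DOT_BELOW = "\u0323"   # COMBINING DOT BELOW

def precomposed_with_dot_below():
    """Return characters whose full canonical decomposition (NFD)
    contains U+0323. Illustrative helper, not part of any standard API."""
    found = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue   # skip surrogate code points
        ch = chr(cp)
        nfd = unicodedata.normalize("NFD", ch)
        # nfd != ch excludes U+0323 itself and undecomposable characters.
        if nfd != ch and DOT_BELOW in nfd:
            found.append(ch)
    return found

letters = precomposed_with_dot_below()
for ch in letters[:10]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
print(len(letters), "characters in this Unicode version")
```

Each of these letters, plus any still unencoded combination, would need its own
pairing with a new dot below, which is the open-ended list described above.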

So the simplest approach is to not define anything, and enter these African
letters in their decomposed form (with the exception of letters with overlaying
or ligaturing diacritics, which should be encoded separately, without
decompositions).
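
A short sketch of why the decomposed form is workable today: whichever order or
mix of precomposed pieces a user types for e with acute and dot below, the
existing normalization forms bring all spellings to one representation. The
list of spellings below is my own selection for illustration.

```python
import unicodedata

# Four ways to type "e with acute and dot below" with existing code points.
spellings = [
    "e\u0323\u0301",   # e + dot below + acute
    "e\u0301\u0323",   # e + acute + dot below
    "\u00e9\u0323",    # precomposed e-acute + dot below
    "\u1eb9\u0301",    # precomposed e-dot-below + acute
]

nfc_forms = {unicodedata.normalize("NFC", s) for s in spellings}
nfd_forms = {unicodedata.normalize("NFD", s) for s in spellings}

# Both sets collapse to a single element.
print([f"U+{ord(c):04X}" for c in next(iter(nfc_forms))])  # ['U+1EB9', 'U+0301']
print([f"U+{ord(c):04X}" for c in next(iter(nfd_forms))])  # ['U+0065', 'U+0323', 'U+0301']
```

The NFC result is exactly the "1EB9 0301" sequence mentioned in the question.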

Remember this: decompositions of Unicode characters are a pollution, needed
only for supporting legacy standards and making them interoperable with or
through Unicode.

This Unicode policy won't prevent the definition of a smaller African subset
with its own charset encoding, in which these letters are represented in their
precomposed form only; it will also be possible to define such a future
standard (if there's a legitimate need for it) with complete roundtrip
compatibility with Unicode decomposed characters.

In summary, for African letters: there's no need (and it's in fact impossible
now) to encode in Unicode new letters with dots below unless the base letter is
also absent from Unicode. But barred letters are good candidates for inclusion
as isolated (not decomposable) code points.
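
For comparison, barred letters that are already encoded behave this way. The
sketch below checks three existing letters with stroke (my own picks as
examples) and confirms that none has a decomposition, and that an overlay
diacritic does not compose with its base letter:

```python
import unicodedata

barred = ["\u0268", "\u0180", "\u00f8"]   # i, b and o with stroke

decompositions = {}
for ch in barred:
    decompositions[ch] = unicodedata.decomposition(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"decomposition={decompositions[ch]!r}")

# o + COMBINING LONG SOLIDUS OVERLAY is left alone by NFC;
# it is not canonically equivalent to U+00F8.
overlay_sequence = "o\u0338"
composed = unicodedata.normalize("NFC", overlay_sequence)
print(composed == "\u00f8")   # False
```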


