Re: Letters vs. precomposed characters

From: Martin J Duerst
Date: Sat Aug 31 1996 - 06:26:21 EDT

Michael Everson wrote:

>At 08:53 1996-08-30, Martin Duerst wrote:
>(regarding whether Å (A WITH RING) is considered to be one or two things)
>>Please don't confuse internal representation and user interface
>>behaviour. It is no problem to present A-ring (and any other
>>characters, however complicated, that may be interpreted as
>>combined) as single characters to the user, while still
>>using two codes internally.
>I didn't want to give this too much time today, because it is late Friday
>afternoon and I would rather have my dinner and go to the pub.

Well, it's Saturday for me, and I am busy preparing for takeoff tomorrow
to San Jose for the Unicode conference.
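
The distinction between internal representation and user interface
behaviour in the quoted remark can be sketched in a few lines. This is
only an illustration, written in Python with its standard unicodedata
module (a present-day tool, not something discussed in this thread). The
helper name user_characters is hypothetical, and attaching combining marks
to the preceding base character is a simplification of what a real text
system has to do.

```python
import unicodedata

def user_characters(text):
    """Group each base code point with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # combining() returns a non-zero class for combining marks
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# A-ring stored as two codes: LATIN CAPITAL LETTER A + COMBINING RING ABOVE
a_ring = "A\u030A"
internal_codes = len(a_ring)                   # two codes internally
visible_units = len(user_characters(a_ring))   # one unit for the user
print(internal_codes, visible_units)
```

Cursor movement, selection and deletion would then operate on the grouped
units, while storage and interchange keep the two codes.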

>You have hit upon a serious bone of contention between Unicode encoding
>philosophy and other points of view. I was a bit shocked when Jonathan
>Rosenne said:
>>Level 1 solves the problem for a limited set of mainly European languages,
>>but the cost is very high: constant revision of the character set standard.
>>A truly "universal" character set can only be built on composition.
>>As far as I can remember, Unicode accepted pre-composed characters as part
>>of the great compromise with ISO 10646. It doesn't mean we have to think
>>of them as anything more than a pragmatic sanction.
>The underlying viewpoint here is that the "precomposed characters" in ISO
>10646 (YOUR term not OURS) were some sort of bone thrown to the Europeans
>and that everyone should just "see the light" and abandon any interest in
>revising the character set standard by adding more of them -- or perhaps
>even any interest in USING those "precomposed characters" at all.

It's not only the Europeans. The Koreans got their share recently, too.

>There are other points of view! One is that the basic alphabetic
>repertoires of natural languages should be able to be encoded with each
>character in 16 bits at Level 1. This way each letter is equivalent: A is A
>and Á is Á. ÅNGSTRÖM has eight letters, AND it has eight characters, not
>ten. Swedish is lucky -- their letters are included in the standard. Some
>other languages are not.
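
The eight-versus-ten count in the quoted paragraph can be checked with a
short sketch. It assumes Python and its standard unicodedata module, and
it uses the normalization forms NFC (precomposed) and NFD (decomposed) as
stand-ins for the two encodings under discussion; neither the tool nor
those names come from this thread.

```python
import unicodedata

# ÅNGSTRÖM written with the precomposed letters Å (U+00C5) and Ö (U+00D6)
word = "\u00C5NGSTR\u00D6M"

nfc = unicodedata.normalize("NFC", word)  # precomposed form
nfd = unicodedata.normalize("NFD", word)  # base letters + combining marks

print(len(nfc))  # number of codes when Å and Ö are single characters
print(len(nfd))  # number of codes when the ring and diaeresis are separate

# Both forms stand for the same text; recomposing gives the original back.
assert unicodedata.normalize("NFC", nfd) == nfc
```

The two extra codes in the decomposed form are the combining ring above
and the combining diaeresis.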

This is an important point. For a global standard, you need composition
anyway, and for many scripts, the term letter or character is in practice
used at various levels.
The fact that ISO 10646 defines Level 1, and that most of the countries
that have bargaining power and that can afford the (as we have seen,
marginal) price of combining characters have managed to push their
precombined characters into Unicode and therefore stay within
Level 1, can lead to two very undesirable consequences:

- Segregation into easily available Level 1 systems for the rich and
        barely available Level 2/3 systems for the poor countries.
- A vicious cycle of inclusion of more precombinations into Unicode:
        The more that can be done at Level 1, the fewer systems there are
        that support higher levels, the higher the pressure to put more
        precombinations in, and again the fewer systems that can handle
        higher levels.
This is of course also accelerated by political issues. If a country starting
to get serious about information technology discovers that all the rich
countries managed to get their compositions in, it almost automatically
thinks "me too".

>However, research on the alphabets used in
>European languages continues, and I am preparing a CEN Technical Report
>which will provide data on all of them. (There are a lot of languages.
>There aren't quite so many letters which aren't already in the standard,
>though there are some.) Some other people in other parts of the world (such
>as Taiwan and North America) have also discussed the question of Latin
>letters with me.

Such reports are definitely very valuable for checking systems for their
usability. But I think that, rather than trying to force them into Unicode,
the Europeans should take the chance offered by such combinations to
really start to think globally.

Regards, Martin.
