Letters vs. precomposed characters

From: Michael Everson (everson@indigo.ie)
Date: Fri Aug 30 1996 - 12:50:04 EDT

First I want to say that I appreciate very much the recent discussion of
how processing is not much affected by the use of combining characters.

At 08:53 1996-08-30, Martin Duerst wrote:

(regarding whether Å (A WITH RING) is considered to be one or two things)

>Please don't confuse internal representation and user interface
>behaviour. It is no problem to have A-ring (and any whatever
>complicated other characters that may be interpreted as
>combined) as single characters to the user, while still
>using two codes internally.
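
The distinction Martin draws can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name user_characters is hypothetical, and grouping by non-zero canonical combining class is a simplification of real text segmentation, which has more rules than this.

```python
import unicodedata


def user_characters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it."""
    clusters: list[str] = []
    for ch in text:
        # combining() returns a non-zero canonical combining class for
        # marks such as U+030A COMBINING RING ABOVE.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Stored with two codes for A-ring and two for o-diaeresis.
stored = "A\u030Angstro\u0308m"

print(len(stored))                   # codes used internally
print(len(user_characters(stored)))  # units presented to the user
```

The internal string holds ten codes, while the grouping step hands eight units to the user interface.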

I didn't want to give this too much time today, because it is late Friday
afternoon and I would rather have my dinner and go to the pub.


You have hit upon a serious bone of contention between Unicode encoding
philosophy and other points of view. I was a bit shocked when Jonathan
Rosenne said:

>Level 1 solves the problem for a limited set of mainly European languages,
>but the cost is very high: constant revision of the character set standard.
>A truly "universal" character set can only be built on composition.
>As far as I can remember, Unicode accepted pre-composed characters as part
>of the great compromise with ISO 10646. It doesn't mean we have to think
>of them as anything more than a pragmatic sanction.

The underlying viewpoint here is that the "precomposed characters" in ISO
10646 (YOUR term not OURS) were some sort of bone thrown to the Europeans
and that everyone should just "see the light" and abandon any interest in
revising the character set standard by adding more of them -- or perhaps
even any interest in USING those "precomposed characters" at all.

There are other points of view! One is that the basic alphabetic
repertoires of natural languages should be able to be encoded with each
character in 16 bits at Level 1. This way each letter is one character. A is A
and Á is Á. ÅNGSTRÖM has eight letters, AND it has eight characters, not
ten. Swedish is lucky -- their letters are included in the standard. Some
other languages are not. However, research on the alphabets used in
European languages continues, and I am preparing a CEN Technical Report
which will provide data on all of them. (There are a lot of languages.
There aren't quite so many letters which aren't already in the standard,
though there are some.) Some other people in other parts of the world (such
as Taiwan and North America) have also discussed the question of Latin
letters with me.
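
The eight-versus-ten count can be checked with a short sketch. It assumes Python's standard unicodedata module and takes normalization form NFC as the precomposed spelling and NFD as the fully decomposed one. The helper name utf16_units is made up for this illustration.

```python
import unicodedata

word = "ÅNGSTRÖM"

# NFC: precomposed form, one code point per letter where one is encoded.
composed = unicodedata.normalize("NFC", word)
# NFD: decomposed form, base letter followed by its combining marks.
decomposed = unicodedata.normalize("NFD", word)


def utf16_units(text: str) -> int:
    """Number of 16-bit code units in the UTF-16 encoding of text."""
    return len(text.encode("utf-16-be")) // 2


print(len(composed), utf16_units(composed))      # 8 8
print(len(decomposed), utf16_units(decomposed))  # 10 10

# The first letter alone: one code when precomposed, two when decomposed.
print([f"U+{ord(c):04X}" for c in composed[:1]])    # ['U+00C5']
print([f"U+{ord(c):04X}" for c in decomposed[:2]])  # ['U+0041', 'U+030A']
```

Both spellings fit in 16-bit units, but only the precomposed one keeps the number of characters equal to the number of letters.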

The idea is not to fill the standard with theoretical characters or every
possible linguistic combination used by specialists. But I would be
surprised to find European standardizers yielding to the pressure to
"decompose everything in sight". I suppose this means that the need to
compromise isn't over. On either side!

This is a viewpoint. It's not intended to start a flame war. I know a lot
of people have strong views about these things, myself included.

Michael Everson, Everson Gunn Teoranta
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire (Ireland)
Gutháin:  +353 1 478-2597, +353 1 283-9396
27 Páirc an Fhéithlinn; Baile an Bhóthair; Co. Átha Cliath; Éire
