Re: Why is Unicode inconsistant?

From: Michael Everson (everson@indigo.ie)
Date: Mon Oct 04 1999 - 07:42:35 EDT


At 23:58 -0700 1999-10-03, Dan Oscarsson wrote:

>Looking at the Unicode character data file I see that Unicode is
>inconsistent.

Oh dear. I doubt it is. Ken, poor soul, may have to speak to the details.
But here I speak for the principles as I understand them.

>If you look at letter: 0xD8 it cannot be decomposed,

(that's LATIN CAPITAL LETTER O WITH SLASH)

>but letter: 0xD6 can be decomposed.

(that's LATIN CAPITAL LETTER O WITH DIAERESIS)

>This is inconsistent because the glyph 0xD8 [Ø] can be decomposed
>into letter o with a combining slash.

There are two combining slashes. I am not sure why the first one, U+0337
COMBINING SHORT SOLIDUS OVERLAY, is there (unless for compatibility with
ISO 5426-2 0x4C). The other one, U+0338 COMBINING LONG SOLIDUS OVERLAY
(also in ISO 5426-2, at 0x4D), can represent the long slash found in some
typewriter-based phonetic transcriptions, to indicate lenition of
consonants and so on.
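
As a rough illustration of the point about the two slashes, the sketch
below uses Python's standard unicodedata module (whose tables reflect
whatever Unicode version the interpreter ships with, not the 1999 data
file) to look up both marks and to check that U+00D8 carries no
decomposition, so a base O followed by a combining slash is not folded
into it by normalization. The variable names are only illustrative.

```python
import unicodedata

# Names of the two combining slashes discussed above.
slash_names = {cp: unicodedata.name(chr(cp)) for cp in (0x0337, 0x0338)}
for cp, name in slash_names.items():
    print(f"U+{cp:04X} {name}")

# U+00D8 has an empty decomposition field in the character database.
o_slash = "\u00d8"
o_slash_decomposition = unicodedata.decomposition(o_slash)
print(repr(o_slash_decomposition))

# Because there is no canonical mapping, NFC leaves "O" + U+0338 as two
# code points rather than composing them into U+00D8.
composed = unicodedata.normalize("NFC", "O\u0338")
print(len(composed), composed == o_slash)
```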

>The same inconsistency exists for 0xC6 and 0xC4.

(LATIN CAPITAL LETTER AE and LATIN CAPITAL LETTER A WITH DIAERESIS
respectively)

>The glyph of letter 0xC4 [Ä] can be decomposed into letter a with a
>combining e.

There is as yet no "combining e" in the Unicode standard. Do you mean Æ ->
AE or Ä -> AE?

Use of these two letters as one is language-dependent. In a number of
languages (Latin, French, English) Æ is a typographical ligature which can
be decomposed into its constituent parts. In others (Icelandic, Danish,
Norwegian) Æ is a unitary character which cannot be broken up into its
(historical) constituent parts.
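
One way to picture this: since the split of Æ into AE depends on the
language, it cannot come from a single character-level mapping, and
indeed U+00C6 has no decomposition of any kind in the character
database. A minimal sketch follows, assuming a hypothetical helper
expand_ae and a hypothetical table keyed by ISO 639-1 language codes
(neither is part of Unicode or of any library).

```python
import unicodedata

AE_CAPITAL = "\u00c6"
AE_SMALL = "\u00e6"

# Hypothetical table: languages in which the message above treats the
# letter as a typographical ligature (Latin, French, English).
LIGATURE_LANGUAGES = {"la", "fr", "en"}

def expand_ae(text: str, lang: str) -> str:
    """Split AE ligatures only for languages that treat them as ligatures."""
    if lang in LIGATURE_LANGUAGES:
        return text.replace(AE_CAPITAL, "AE").replace(AE_SMALL, "ae")
    return text  # e.g. Danish, Norwegian, Icelandic: a unitary letter

# Normalization alone never splits the character, even in NFKD.
print(unicodedata.normalize("NFKD", AE_CAPITAL) == AE_CAPITAL)

print(expand_ae("C\u00e6sar", "la"))
print(expand_ae("F\u00e6r\u00f8erne", "da"))
```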

>It gets more inconsistent when you think about that the letter 0xC6 and 0xC4
>are the same letter, but one is a Norwegian/Danish version and the other
>Swedish.

These *characters* have a number of uses in a number of languages, not just
in Scandinavia where there is a particular identity.

>Why can one be decomposed and one not?
>The same goes for 0xD8 and 0xD6.

DIAERESIS is a productive diacritical mark which can be added to any letter
in the Latin, Greek, Cyrillic, and Georgian alphabets. Ligation of two
letters is rare; the situation where Danish Æ and French Æ are treated
differently is part of the historical richness of the Latin alphabet. It is
a little untidy, but that's the way it is. Unicode didn't invent the
problem.
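
The productivity of the diaeresis can be sketched with Python's
unicodedata module (again using the interpreter's Unicode tables): the
same mark U+0308 attaches to a base letter from any of these scripts.
Where a precomposed character exists, NFC yields one code point; where
none exists, as for the Georgian letter chosen here, the two-code-point
sequence simply stays as it is. The sample base letters are my own
choice for illustration.

```python
import unicodedata

DIAERESIS = "\u0308"  # COMBINING DIAERESIS

# One sample base letter per script.
bases = {
    "Latin A": "A",
    "Greek iota": "\u03b9",
    "Cyrillic ie": "\u0435",
    "Georgian an": "\u10d0",
}

# Length of the NFC form: 1 if a precomposed character exists, else 2.
nfc_lengths = {}
for label, base in bases.items():
    composed = unicodedata.normalize("NFC", base + DIAERESIS)
    nfc_lengths[label] = len(composed)
    print(label, [f"U+{ord(ch):04X}" for ch in composed])
```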

>Why does Unicode favor one language and not another?

It doesn't *favour* any one over another.

>Is it just that somebody thought that the glyph for 0xC4 could be chopped
>into pieces but not 0xC6?

No. It's that people recognized the inherent productivity of the COMBINING
DIAERESIS.

>It can get worse when a font is created: a letter a with a diaeresis
>may be a different glyph than the letter 0xC4 (which has no English name).

We call it "a with two dots" or "a with diaeresis" or "a umlaut".

>I have seen several bad fonts where somebody thinks that the letter
>0xC4 is a letter a with a diaeresis and just combined the two instead
>of having a true letter 0xC4.

As a font designer I know what you are talking about. You have to remember
something. All this technology is new. Not 10 years ago most of us weren't
even using laser printers, but rather dot matrix printers of varying quality
(I remember fondly my Apple ImageWriter from 1988). These are still in use!
The print quality isn't that great on those, is it? Truly good fonts need
to have excellent glyphs with precomposed forms for all glyph combinations.

>Unicode needs to understand the difference between precomposed characters
>and those that are not (0xC4 is not a precomposed character, it is
>a single letter just like 0xC6).

The entity encoded at U+00C4 can also be represented by the string U+0041
U+0308.
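
This equivalence can be checked with the normalization forms in Python's
unicodedata module. A minimal sketch (the variable names are only
illustrative): the two strings differ as raw code point sequences, but
canonical decomposition (NFD) and composition (NFC) map each onto the
other.

```python
import unicodedata

precomposed = "\u00c4"       # LATIN CAPITAL LETTER A WITH DIAERESIS
sequence = "\u0041\u0308"    # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

# A raw comparison sees two different strings.
raw_equal = precomposed == sequence

# Normalizing either side to a common form makes them compare equal.
nfd_equal = unicodedata.normalize("NFD", precomposed) == sequence
nfc_equal = unicodedata.normalize("NFC", sequence) == precomposed

print(raw_equal, nfd_equal, nfc_equal)
```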

--
Michael Everson * Everson Gunn Teoranta * http://www.indigo.ie/egt
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Guthán: +353 1 478 2597 ** Facsa: +353 1 478 2597 (by arrangement)
27 Páirc an Fhéithlinn;  Baile an Bhóthair;  Co. Átha Cliath; Éire


