Re: Why is Unicode inconsistent?

From: John Cowan (cowan@locke.ccil.org)
Date: Mon Oct 04 1999 - 12:07:21 EDT


Dan Oscarsson scripsit:

> Looking at the Unicode character data file I see that Unicode is
> inconsistent.

Obviously this needs to go in the FAQ.

> If you look at letter 0xD8, it cannot be decomposed,
> but letter 0xD6 can be decomposed.
>
> This is inconsistent because the glyph 0xD8 can be decomposed
> into letter o with a combining slash.

Combining-slash decompositions are considered to be over the top:
they're impossible to recombine at the glyph level accurately,
because the position of the slash varies randomly depending on the
base letter.
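
The two data-file entries under discussion can be inspected directly.
The following is a minimal sketch, assuming Python's standard
unicodedata module (which ships a later version of the character
database than the 1999 file, though these particular entries should be
the same); the variable names are illustrative only. It reads the
decomposition field for 0xD8 and 0xD6, then checks whether
normalization would recombine "O" plus a combining slash overlay.

```python
import unicodedata

O_STROKE = "\u00d8"      # LATIN CAPITAL LETTER O WITH STROKE
O_DIAERESIS = "\u00d6"   # LATIN CAPITAL LETTER O WITH DIAERESIS
LONG_SOLIDUS = "\u0338"  # COMBINING LONG SOLIDUS OVERLAY

# Raw decomposition field from the character database ('' means none).
stroke_field = unicodedata.decomposition(O_STROKE)
diaeresis_field = unicodedata.decomposition(O_DIAERESIS)

# NFC composes only pairs that have a canonical decomposition, so
# "O" + slash overlay stays as two code points, while "O" + diaeresis
# is composed into the single code point U+00D6.
slash_recomposed = unicodedata.normalize("NFC", "O" + LONG_SOLIDUS)
diaeresis_recomposed = unicodedata.normalize("NFC", "O" + "\u0308")

print(repr(stroke_field))         # ''
print(repr(diaeresis_field))      # '004F 0308'
print(len(slash_recomposed))      # 2
print(len(diaeresis_recomposed))  # 1
```

The empty field for 0xD8 is the deliberate line-drawing described
above, not an omission in the data file.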

        "The line must be drawn here!" -- J.-L. Picard

> The same inconsistency exists for 0xC6 and 0xC4.
> The glyph of letter 0xC4 can be decomposed into letter a with a combining e.

U+00C6 is another boundary case: letter or ligature? But it is certainly
not equivalent to "ae" except in Latin (the language, not the script).
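
How the standard treats 0xC6 and 0xC4 can be checked the same way.
This is a small sketch under the same assumption (Python's
unicodedata module, with a newer database than the 1999 one);
the variable names are made up for illustration. It shows that
neither canonical (NFD) nor compatibility (NFKD) normalization
turns 0xC6 into "AE", while 0xC4 decomposes into "A" plus a
combining diaeresis.

```python
import unicodedata

AE = "\u00c6"           # LATIN CAPITAL LETTER AE
A_DIAERESIS = "\u00c4"  # LATIN CAPITAL LETTER A WITH DIAERESIS

# 0xC6 has no decomposition of either kind, so both forms leave it alone.
nfd_ae = unicodedata.normalize("NFD", AE)
nfkd_ae = unicodedata.normalize("NFKD", AE)

# 0xC4 decomposes canonically into U+0041 followed by U+0308.
nfd_a_diaeresis = unicodedata.normalize("NFD", A_DIAERESIS)

print(nfd_ae == AE, nfkd_ae == AE)               # True True
print([hex(ord(c)) for c in nfd_a_diaeresis])    # ['0x41', '0x308']
print(unicodedata.name(AE))                      # LATIN CAPITAL LETTER AE
```

Any language-specific equivalence between the two, or between 0xC6
and "ae", would therefore have to come from a higher-level process
such as collation or transliteration rather than from normalization.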

> It gets more inconsistent when you consider that the letters 0xC6 and 0xC4
> are the same letter, but one is the Norwegian/Danish version and the other
> the Swedish.

In that context, yes. But they are not really equivalent in German, and even
less so in Finnish.

> Why does Unicode favor one language and not another?

It does not.

> It can get worse when a font is created: a letter a with a diaeresis
> may be a different glyph than the letter 0xC4 (which has no English name).

High-quality fonts are always language-specific: we have already learned
that proper Polish fonts use differently placed accents from their
Western European analogues. Unicode is concerned with *plain* text,
in other words, whatever cannot be abandoned without abandoning legibility.

> I have seen several bad fonts where somebody thinks that the letter
> 0xC4 is a letter a with a diaeresis and just combined the two instead
> of having a true letter 0xC4.

Inevitably so.

> Unicode needs to understand the difference between precomposed characters
> and those that are not (0xC4 is not a precomposed character, it is
> a single letter just like 0xC6).

No, it is font designers who need to know when precomposed glyphs
work and when they don't.

-- 
John Cowan                                   cowan@ccil.org
       I am a member of a civilization. --David Brin


