RE: Why is Unicode inconsistant?

From: Marco.Cimarosti@icl.com
Date: Mon Oct 04 1999 - 12:00:55 EDT


Dan Oscarsson said:
>> If you look att letter: 0xD8 it cannot be decomposed,
>> but letter: 0xD6 can be decomposed.
>> This is inconsistant because the glyph 0xD8 can be decomposed
>> into letter o with a combining slash.

John Cowan replied:
>Combining-slash decompositions are considered to be over the top:
>they're impossible to recombine at the glyph level accurately,
>because the position of the slash varies randomly depending on the
>base letter.

This is not very convincing. The same is true of many other combining
characters. Even a diaeresis or an acute accent looks different on
different base characters (e.g. small and capital letters).

As in other cases, rendering engines and font designers can apply valid
strategies, such as:
- Using several different <combining slash> glyphs for different contexts;
- Using precomposed glyphs for well-known sequences like <O with slash>;
- Normalizing text to use composite characters where possible;
- Accepting an ugly result in very unusual cases.
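
The normalization strategy in the list above can be sketched as follows.
This is only an illustration, assuming Python's standard `unicodedata`
module, whose character data reflects a later Unicode version than the one
under discussion; the helper name `compose_where_possible` is hypothetical.

```python
import unicodedata


def compose_where_possible(text: str) -> str:
    """Return text in NFC, so that combining sequences are replaced by
    precomposed characters wherever a canonical composite exists."""
    return unicodedata.normalize("NFC", text)


# Assumption for illustration: U+0338 COMBINING LONG SOLIDUS OVERLAY
# stands in for the <combining slash> mentioned in the thread.
o_diaeresis = "O\u0308"  # O + COMBINING DIAERESIS
o_slash = "O\u0338"      # O + COMBINING LONG SOLIDUS OVERLAY

print(len(compose_where_possible(o_diaeresis)))  # composes to U+00D6
print(len(compose_where_possible(o_slash)))      # no composite: unchanged
```

A renderer that receives the uncomposed <O, combining slash> sequence still
has to fall back on one of the other strategies in the list.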

Dan's mail is inaccurate in many details, but the statement that U+00D8
should have a canonical decomposition seems right.
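
The asymmetry under discussion can be inspected directly in the character
data. Below is a minimal sketch, again assuming Python's standard
`unicodedata` module (built from a later version of the data file than the
one Dan looked at); the helper name `canonical_decomposition` is
hypothetical.

```python
import unicodedata


def canonical_decomposition(ch: str) -> list:
    """Return the code points of the canonical decomposition of ch,
    or an empty list if the character data lists none."""
    field = unicodedata.decomposition(ch)
    # Compatibility mappings carry a tag such as "<compat>"; they are
    # not canonical decompositions, so they are skipped here.
    if not field or field.startswith("<"):
        return []
    return [int(cp, 16) for cp in field.split()]


for cp in (0xD6, 0xD8, 0xC4, 0xC6):
    parts = canonical_decomposition(chr(cp))
    shown = " ".join(f"U+{p:04X}" for p in parts) or "(none)"
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {shown}")
```

With current data, U+00D6 and U+00C4 each list a base letter followed by
U+0308, while U+00D8 and U+00C6 list nothing.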

Regards. Marco

> -----Original Message-----
> From: John Cowan [SMTP:cowan@locke.ccil.org]
> Sent: 1999 October 04, Monday 17.25
> To: Unicode List
> Subject: Re: Why is Unicode inconsistant?
>
> Dan Oscarsson scripsit:
>
> > Looking at the Unicode character data file I see that Unicode is
> > inconsistant.
>
> Obviously this needs to go in the FAQ.
>
> > If you look att letter: 0xD8 it cannot be decomposed,
> > but letter: 0xD6 can be decomposed.
> >
> > This is inconsistant because the glyph 0xD8 can be decomposed
> > into letter o with a combining slash.
>
> Combining-slash decompositions are considered to be over the top:
> they're impossible to recombine at the glyph level accurately,
> because the position of the slash varies randomly depending on the
> base letter.
>
> "The line must be drawn here!" -- J.-L. Picard
>
> > The same inconsistancy exist for 0xC6 and 0xC4.
> > The glyph of letter 0xC4 can be decomposed into letter a with a
> > combining e.
>
> U+00C4 is another boundary case: letter or ligature? But it is certainly
> not equivalent to "ae" except in Latin (the language, not the script).
>
> > It gets more inconsistant when you think about that the letter 0xC6 and
> > 0xC4 are the same letter, but one is a Norwegian/Danish version and the
> > other Swedish.
>
> In that context, yes. But they are not really equivalent in German, and
> even less so in Finnish.
>
> > Why does Unicode favor one language and an other not?
>
> It does not.
>
> > It can get worse when a font is created: a letter a with a diaeresis
> > may be a different glyph than the letter 0xC4 (which have no English
> > name).
>
> High-quality fonts are always language-specific: we have already learned
> that proper Polish fonts use differently placed accents from their
> Western European analogues. Unicode is concerned with *plain* text,
> in other words, whatever cannot be abandoned without abandoning
> legibility.
>
> > I have seen several bad fonts where somebody thinks that the letter
> > 0xC4 is a letter a with a diaeresis and just combined the two instead
> > of having a true letter 0xC4.
>
> Inevitably so.
>
> > Unicode need to understand the difference between precomposed characters
> > and those that are not (0xC4 is not a precomposed character, it is
> > a single letter just like 0xC6).
>
> No, it is font designers who need to know when precomposed glyphs
> work and when they don't.
>
> --
> John Cowan cowan@ccil.org
> I am a member of a civilization. --David Brin
