Re: Why people still want to encode precomposed letters

From: Doug Ewell (
Date: Tue Nov 18 2008 - 20:41:45 CST

    <abysta at yandex dot ru> wrote:

    > If I need a multi-character letter “s with acute”, I have to choose
    > between 015B and 0073+0301. Wouldn’t it be better not to have to
    > choose?

    In some ways, it probably would have been better. It certainly would
    have made things simpler to understand and to explain.

    However, at the time Unicode was conceived, it would have been
    impossible to persuade vendors and developers to make the switch from
    existing 8-bit character sets, such as those in the ISO 8859 family,
    unless most (if not all) of the mappings from these character sets to
    Unicode were 1-to-1.

    At the same time, the Unicode pioneers realized that the set of
    letters-with-diacritics was more or less open-ended, and it would be
    somewhere between extremely time-consuming (and inefficient) and
    downright impossible to encode them all as precomposed characters. For this
    reason and others, the combining characters were also added.

    When you choose between <015B> and <0073 0301>, you are essentially
    choosing a normalization form, and at that point, the rest of your
    decision process is fairly straightforward -- keep all your text in the
    same normalization form. This means you would not want to use both
    <015B> and <0073 0301> in the same text.
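
    As a rough illustration of why mixing the two forms causes trouble,
    here is a minimal Python sketch using the standard unicodedata module.
    The helper name same_text is made up for this example; the point is
    only that the two spellings compare unequal as raw strings and equal
    once both are brought into the same normalization form.

```python
import unicodedata

precomposed = "\u015B"   # LATIN SMALL LETTER S WITH ACUTE
decomposed = "s\u0301"   # LATIN SMALL LETTER S + COMBINING ACUTE ACCENT

def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form.

    Hypothetical helper for this example, not part of any library.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# A plain comparison sees two different code point sequences.
raw_equal = precomposed == decomposed

# Normalizing both sides first makes the comparison succeed.
normalized_equal = same_text(precomposed, decomposed)

print(raw_equal, normalized_equal)
```

    Whichever form is picked (NFC or NFD here), the comparison works as
    long as both sides use the same one.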

    Sometimes there are external influences that steer you toward one form
    or another. For example, the specifications for some protocols strongly
    recommend that you use Normalization Form C, in which you would use
    <015B> rather than <0073 0301>, but in which you would be obligated to
    use <04E9 0304> since there is no precomposed equivalent.
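
    The NFC behavior described above can be checked with a short Python
    sketch, again assuming the standard unicodedata module. The helper
    code_points is a made-up name that just lists the code points of a
    string in hex.

```python
import unicodedata

def code_points(text):
    """Return the code points of text as 4-digit hex strings.

    Hypothetical helper for this example.
    """
    return [f"{ord(ch):04X}" for ch in text]

# s + combining acute has a precomposed equivalent, so NFC composes it.
s_acute = unicodedata.normalize("NFC", "s\u0301")

# Cyrillic barred o + combining macron has no precomposed equivalent,
# so NFC leaves the sequence as two code points.
o_macron = unicodedata.normalize("NFC", "\u04E9\u0304")

print(code_points(s_acute))
print(code_points(o_macron))
```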

    Doug Ewell  *  Thornton, Colorado, USA  *  RFC 4645  *  UTN #14