Re: Why people still want to encode precomposed letters

From: Jukka K. Korpela
Date: Mon Nov 24 2008 - 12:19:27 CST

    Hans Aberg wrote:

    > Perhaps one only needs to list the combinations that belong to
    > the proper language alphabets. In Swedish that would be
    > "ijåäöÅÄÖ". Other combinations, like é, would not be as
    > important to get right in Swedish, though it is imported from the
    > French where it would appear. But it illustrates the idea.

    Technically, in the Unicode sense, “i” and “j” do not contain a diacritic
    mark but are atomic (completely non-decomposable) characters, even though a
    discussion of diacritic marks must address the issue of what happens to the dot
    in them.
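
    As a small illustration of the point about decomposability, here is a
    sketch using Python's standard unicodedata module. The helper name
    is_atomic is made up for this example; it merely checks whether the
    Unicode Character Database records a decomposition mapping for a
    character.

```python
import unicodedata

def is_atomic(ch: str) -> bool:
    """True if the character has no decomposition mapping in the UCD."""
    return unicodedata.decomposition(ch) == ""

def nfd(ch: str) -> str:
    """Canonical decomposition (NFD) of a string."""
    return unicodedata.normalize("NFD", ch)

if __name__ == "__main__":
    # "i" and "j" are compared with the precomposed Swedish letters.
    for ch in "ij\u00e5\u00e4\u00f6":
        code_points = " ".join(f"U+{ord(c):04X}" for c in nfd(ch))
        print(f"U+{ord(ch):04X} atomic={is_atomic(ch)} NFD={code_points}")
```

    Under this check, "i" and "j" come out as atomic, whereas å, ä and ö
    each decompose into a base letter followed by a combining mark.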

    The description of characters used in a language or in a locale is addressed
    in the CLDR, though very unsatisfactorily, if you ask me. It only addresses
    letters, and it defines rather arbitrarily just two character sets for a
    language.
    Surely, for example, “e” is more basically a letter in English than “é” is,
    but “é” in turn is more of an English letter than “ē” is. Moreover, the
    pragmatic reasons for defining the character repertoires contain quite
    irrelevant points like “choosing among character encodings.”

    Anyway, describing the characters commonly used in a language is useful for
    the purposes of font design. It is a difficult task, though, and
    controversial. In practice, such descriptions are probably more useful to
    people choosing between fonts than to font designers. For example, when
    choosing a font for Swedish text, you should check that å, ä, ö, é, Å, Ä, Ö,
    É all look good. This should be self-evident, but it often isn’t. Moreover,
    less common characters are even more easily ignored. Thus, lists of
    characters used in a language (at various levels of usage) are directly
    useful for constructing test documents for font testing.
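
    To make the idea of a test document concrete, here is a minimal sketch
    in Python. The list SWEDISH_TEST_CHARS only repeats the characters
    mentioned above and is not an authoritative repertoire for Swedish; the
    function name font_test_lines and the choice of "n" and "H" as
    neighbouring letters are assumptions made for this example.

```python
import unicodedata

# Illustrative only: the characters named in the message above,
# not an official character list for Swedish.
SWEDISH_TEST_CHARS = "åäöéÅÄÖÉ"

def font_test_lines(chars: str) -> list[str]:
    """Build one test line per character, set between plain letters.

    Each line shows the code point, then the character between
    lowercase "n" and between uppercase "H", so that the placement of
    the diacritic can be inspected in context.
    """
    lines = []
    # Normalize to NFC so each accented letter is a single code point.
    for ch in unicodedata.normalize("NFC", chars):
        lines.append(f"U+{ord(ch):04X} n{ch}n H{ch}H")
    return lines

if __name__ == "__main__":
    print("\n".join(font_test_lines(SWEDISH_TEST_CHARS)))
```

    The output can be pasted into any application and set in each candidate
    font; characters used at other levels of usage could be appended to the
    list in the same way.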


