RE: various ways of making a specific character

From: Philippe Verdy (
Date: Thu May 24 2007 - 11:24:25 CDT


    The Unicode normalization algorithm specifies which of these canonically
    equivalent sequences is the preferred one. As there is no evidence that any
    of them can be distinguished graphically, and as the preferred reading
    order would depend on the language using such characters with multiple
    diacritics, it should make no difference of interpretation whether you use
    one or another of the 3 possible sequences (although they may still be
    rendered differently in practice, due to implementation limits).
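
    As a rough illustration (a sketch using Python's standard unicodedata
    module, which is my choice and not something the original question
    mentions), the three candidate sequences for the letter t can be normalized
    and compared. The dictionary keys a, b, c simply mirror the labels used in
    the question quoted below:

```python
import unicodedata

# The three encodings of "t with dot below and dot above" from the question.
candidates = {
    "a": "t\u0323\u0307",  # t + COMBINING DOT BELOW + COMBINING DOT ABOVE
    "b": "\u1e6d\u0307",   # t WITH DOT BELOW + COMBINING DOT ABOVE
    "c": "\u1e6b\u0323",   # t WITH DOT ABOVE + COMBINING DOT BELOW
}

def code_points(text):
    """Return a readable 'U+XXXX U+XXXX' listing of a string."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# Normalize every candidate to the decomposed (NFD) and composed (NFC) forms.
nfd_forms = {key: unicodedata.normalize("NFD", s) for key, s in candidates.items()}
nfc_forms = {key: unicodedata.normalize("NFC", s) for key, s in candidates.items()}

for key in candidates:
    print(key, "NFD:", code_points(nfd_forms[key]), "| NFC:", code_points(nfc_forms[key]))

# Canonically equivalent sequences share a single normalized form.
print("all equivalent:", len(set(nfd_forms.values())) == 1)
```

    Comparing normalized forms, rather than the raw strings, is the usual way
    to treat such sequences as the same character in a catalog.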

    If there are multiple diacritics, and real reasons why their intended
    semantic reading order is important, then you need to use some invisible
    combining joiner when encoding the grapheme. If the renderer can exhibit
    the differences, it will follow the hints provided by these joiners when
    selecting the appropriate glyphs; but some renderers may still default to
    the same visual appearance even when an explicit combining joiner is used
    to specify the intended order.
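
    A small sketch of the effect on encoding order, assuming that the
    "combining joiner" meant here is U+034F COMBINING GRAPHEME JOINER (my
    reading, not stated above). It only shows what normalization does to the
    sequence; it says nothing about how a renderer will draw it:

```python
import unicodedata

# Assumption: the "combining joiner" is U+034F COMBINING GRAPHEME JOINER.
CGJ = "\u034f"

plain = "t\u0307\u0323"               # dot above, then dot below
joined = "t\u0307" + CGJ + "\u0323"   # same marks, with a joiner in between

plain_nfd = unicodedata.normalize("NFD", plain)
joined_nfd = unicodedata.normalize("NFD", joined)

# The joiner has combining class 0, so marks cannot be reordered across it.
print("CGJ combining class:", unicodedata.combining(CGJ))
print("plain sequence reordered:", plain_nfd != plain)
print("joined sequence preserved:", joined_nfd == joined)
```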

    But for a dot below and a dot above diacritic, I see absolutely no way in
    which they would collide, so their encoding order does not matter, and there
    should be no combining joiner encoded between these two except in cases like

    <BASE letter, combining dot above, surrounding circle, combining dot below>
    <BASE letter, combining dot below, surrounding circle, combining dot above>,

    where the combining surrounding circle would block the reordering.
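
    This blocking can be checked with a short sketch, assuming that the
    "surrounding circle" is U+20DD COMBINING ENCLOSING CIRCLE (my assumption)
    and using a hypothetical helper named canonically_equivalent:

```python
import unicodedata

def canonically_equivalent(a, b):
    """Hypothetical helper: two strings are canonically equivalent
    when their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

DOT_ABOVE = "\u0307"
DOT_BELOW = "\u0323"
CIRCLE = "\u20dd"  # assumption: COMBINING ENCLOSING CIRCLE

# Without the circle, the two orders normalize to the same sequence.
without_circle = canonically_equivalent(
    "t" + DOT_ABOVE + DOT_BELOW,
    "t" + DOT_BELOW + DOT_ABOVE,
)

# With the circle in between, the marks stay where they were encoded.
with_circle = canonically_equivalent(
    "t" + DOT_ABOVE + CIRCLE + DOT_BELOW,
    "t" + DOT_BELOW + CIRCLE + DOT_ABOVE,
)

print("equivalent without circle:", without_circle)
print("equivalent with circle:", with_circle)
```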

    The relative values of non-zero combining classes don't have any
    demonstrated semantic meaning and do not imply any forced reading order,
    except where the diacritics may collide graphically in the same position
    (this is easily seen in the Hebrew script, where you need joiners to help
    disambiguate the semantic order of a sequence of diacritics in order to
    generate the correct visual rendering).

    There are however known exceptions for some diacritics used in the Latin
    script, like the cedilla, which moves from the usual below position to the
    top-left position depending on the letter case of the base letter: for
    such extremely rare cases, it could be necessary to add joiners to avoid
    ambiguities, if the default order specified by the relative combining
    classes of the diacritics is not the correct one.

    Note that in all cases, if the multiple diacritics have the same combining
    class, then their relative encoding order in texts is significant as
    distinct orders of these diacritics are NOT canonically equivalent.
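
    A minimal sketch of this point, with the acute accent and the dot above
    chosen by me as two marks that share a combining class, and the base
    letter x picked only for illustration:

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT
DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW

# Canonical combining classes as reported by the Unicode Character Database.
classes = {
    "acute": unicodedata.combining(ACUTE),
    "dot above": unicodedata.combining(DOT_ABOVE),
    "dot below": unicodedata.combining(DOT_BELOW),
}
print(classes)

# Same class: normalization keeps the encoded order, so the two strings differ.
same_class_1 = unicodedata.normalize("NFD", "x" + ACUTE + DOT_ABOVE)
same_class_2 = unicodedata.normalize("NFD", "x" + DOT_ABOVE + ACUTE)
print("same class, orders equivalent:", same_class_1 == same_class_2)

# Different classes: normalization sorts the marks, so the two strings match.
diff_class_1 = unicodedata.normalize("NFD", "x" + DOT_ABOVE + DOT_BELOW)
diff_class_2 = unicodedata.normalize("NFD", "x" + DOT_BELOW + DOT_ABOVE)
print("different classes, orders equivalent:", diff_class_1 == diff_class_2)
```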

    Note also that Unicode does not currently specify how multiple diacritics
    stack around the base letter; generally above and below diacritics do
    implicitly stack vertically for the generic diacritics of alphabetic
    scripts, and horizontally for the diacritics of Semitic abjads.

    If you want to specify another combining mode, then you'll need to encode
    some additional combining joiner. But the currently defined combining joiner
    does not specify that; instead it just helps resolve the semantic reading
    order in a sequence of diacritics, but does not specify their relative
    layout: this is still something that Unicode needs to describe more
    formally, with additional joining properties for diacritics, and possibly
    the definition and encoding of new combining joiners.

    > -----Original Message-----
    > From: [] On behalf of Agnieszka Kasprzyk
    > Sent: Thursday, May 24, 2007 14:01
    > To:
    > Subject: various ways of making a specific character
    > Hello,
    > I work for the union catalog of Polish libraries. Our contributors use ISO
    > transliteration standards.
    > Could you explain to me how to deal with those characters from
    > transliteration standards that do not exist as precomposed characters in
    > Unicode but are combined from others, AND may be combined in a number of
    > different ways? Which is the correct way?
    > Example:
    > ISO 259: 1984 Transliteration of Hebrew characters into Latin characters
    > requires us to enter letter t with dot below and above and letter s with
    > dot below and above.
    > Now each of these characters may be built of:
    > a) letter t/s (U+0074/U+0073) + combining dot below (U+0323) + combining
    > dot above (U+0307)
    > b) letter t/s with dot below (U+1E6D/U+1E63) + combining dot above (U+0307)
    > c) letter t/s with dot above (U+1E6B/U+1E61) + combining dot below
    > (U+0323)
    > Other cases are for instance letters with two diacritics one over the
    > other.
    > Should it be base letter + upper character + lower character, or base
    > letter + character which is closer + character which is further from the
    > base letter, or if it's possible, base letter with one diacritic as one
    > character + the other diacritic as the combining character?
    > What is the rule to follow in such cases? Is there any document specifying
    > what to do?
    > I would really appreciate your help with this,
    > thank you,
    > Agnieszka Kasprzyk
    > mail:
    > NUKAT Center, Warsaw University Library, Poland
    > --------------------------------------------------------------------------
    > -------------
    > Orange informs you that this e-mail has been checked by the mail anti-virus.
    > No virus known to date by our services was detected.

    This archive was generated by hypermail 2.1.5 : Thu May 24 2007 - 11:26:43 CDT