Re: Umlaut and Tréma, was: Variation selectors and vowel marks

From: Philippe Verdy
Date: Sun Jul 25 2004 - 12:59:28 CDT

    From: "busmanus" <>
    > I am not sure about the relevance of the Meteg problem, but I do know
    > about a case, where different relative positions of the same
    > diacriticals are used for conveying a semantic distinction. In a big
    > reference work about verse metrics in the world's languages (Erika
    > Szepes - István Szerdahelyi: Verstan, published by Gondolat, Budapest,
    > 1981), when discussing quantitative metrics, a macron above a breve is
    > used for denoting a neutral syllable of the metrical pattern that is
    > more frequently filled in by a short syllable than by a long one and
    > a breve above a macron is used for the reverse, i.e. the difference in
    > the combinations provides statistical information.
    > Actually, these signs are typically (although not inevitably) spacing
    > characters, but I don't think it makes a significant difference in this
    > perspective.

    When the relative ordering of diacritics is significant and they have
    the same non-zero combining class, Unicode already has all the features
    needed to preserve both the logical/semantic and the graphical distinction,
    because normalization never reorders marks that share a combining class.
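A minimal sketch of that point, assuming Python's standard `unicodedata` module and using U+0304 COMBINING MACRON and U+0306 COMBINING BREVE as stand-ins for the metrical signs in the example (the variable names are only illustrative):

```python
import unicodedata

MACRON = "\u0304"  # COMBINING MACRON
BREVE = "\u0306"   # COMBINING BREVE

# Both marks have the same non-zero canonical combining class (230, "above").
print(unicodedata.combining(MACRON), unicodedata.combining(BREVE))

macron_first = "a" + MACRON + BREVE
breve_first = "a" + BREVE + MACRON

# Canonical reordering only sorts marks whose classes differ, so both
# sequences survive NFD unchanged and remain distinct from each other.
nfd_macron_first = unicodedata.normalize("NFD", macron_first)
nfd_breve_first = unicodedata.normalize("NFD", breve_first)

print(nfd_macron_first == macron_first)
print(nfd_breve_first == breve_first)
print(nfd_macron_first != nfd_breve_first)
```

NFD is used here rather than NFC only so that the base letter does not compose with the first mark, which would obscure the comparison.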

    However, this relative order does not specify how these diacritics stack on
    the base letter. In your example with macron and breve, both have the
    "above" combining class (230), and most renderers will stack them
    vertically, with the first above-diacritic centered below the second.

    (Some fonts or renderers could instead render them side by side, with the
    first diacritic on the starting side for the current writing direction
    and the second diacritic on the ending side; this is another stylistic
    option that would still preserve the semantic distinction visually, so it
    does not change the problem and is not a problem of Unicode itself; this case would
    happen most probably with Semitic scripts, or with Asian texts written

    The only problem arises when the semantic distinction cannot be rendered
    visually because the diacritics share the same combining class (so the same
    logical "position") but not the same visual position (in some cases, even
    in the Latin script, some above-diacritics are rendered on the
    right side rather than above).

    And we have some cases where a below-diacritic like a cedilla is preferably
    shown above-left, where it could compete with another diacritic. This is
    probably a pedantic theoretical case where the default Unicode combining
    classes are inappropriate to represent correctly the interaction between

    For these reasons, I really suggest keeping CGJ as a way to encode and force
    the relative order of diacritics, and dropping any use of CGJ for
    anything other than encoding a logical relative order of distinct logical
    pairs of diacritics which would otherwise be reordered identically,
    breaking the semantics of the text.
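As a hedged illustration of this use of CGJ, assuming Python's standard `unicodedata` module, the sketch below takes two marks with different combining classes (U+0307 COMBINING DOT ABOVE and U+0323 COMBINING DOT BELOW, chosen only as an example) and shows that U+034F COMBINING GRAPHEME JOINER, whose combining class is 0, keeps them from being reordered:

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220
CGJ = "\u034f"        # COMBINING GRAPHEME JOINER, combining class 0

above_then_below = "a" + DOT_ABOVE + DOT_BELOW
below_then_above = "a" + DOT_BELOW + DOT_ABOVE
with_cgj = "a" + DOT_ABOVE + CGJ + DOT_BELOW

# Without CGJ, canonical reordering sorts the marks by combining class,
# so both input orders collapse to the same normalized sequence.
reordered = unicodedata.normalize("NFD", above_then_below)
print(reordered == below_then_above)

# A class-0 character between the marks stops the reordering,
# so the encoded order survives normalization.
kept = unicodedata.normalize("NFD", with_cgj)
print(kept == with_cgj)
```

The two sequences without CGJ are canonically equivalent, which is exactly the loss of order the paragraph above wants CGJ to prevent.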

    I strongly suggest that CGJ not be used for anything other than forcing
    the relative order of combining characters (and, as a consequence, that
    CGJ be allowed only between two combining characters, not just before or
    after a base character; should these two sequences be acceptable, as they
    are already valid in Unicode, they will represent distinct semantics for the
    base character of the combining sequence).

    As a consequence, CGJ is inappropriate for encoding a logical/semantic
    difference between umlaut and tréma, for example (and the special treatment
    of umlaut versus tréma/diaeresis in German, or of the acute accent in
    Polish, for collation purposes makes CGJ inappropriate for encoding these
    logical distinctions...)

    The problem then remains: how can we encode logical/semantic distinctions
    between diacritics which have been unified in Unicode, but are clearly not
    unified in some languages (German and Polish are such examples)?
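To make the unification concrete, here is a small sketch, assuming Python's standard `unicodedata` module; the sample words and the helper `combining_marks` are only illustrative. It shows that a German umlaut and a French tréma decompose to the same combining character, so plain text carries nothing to tell them apart:

```python
import unicodedata

def combining_marks(text):
    """Return the names of all combining marks in the NFD form of text."""
    decomposed = unicodedata.normalize("NFD", text)
    return {unicodedata.name(ch) for ch in decomposed if unicodedata.combining(ch)}

german_word = "M\u00fcller"      # u with umlaut
french_word = "capharna\u00fcm"  # u with trema

# Both words yield the single mark U+0308 COMBINING DIAERESIS.
print(combining_marks(german_word))
print(combining_marks(french_word))
```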

    The existing variation selectors VS1..VS256 are not an option here (as they
    break default grapheme clusters, which causes a lot of trouble for text
    editors and text selection). Isn't this a place where we would really need
    some combining variation selectors (CVS1..CVS16 at least), to be used in
    applications or texts that need such distinctions?
