Re: Using Combining Double Breve and expressing characters perhaps as if struck out.

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Jul 24 2010 - 13:25:15 CDT

    "Kent Karlsson" <kent.karlsson14@telia.com> wrote:
    > On 2010-07-24 10.07, "Philippe Verdy" <verdy_p@wanadoo.fr> wrote:
    >
    > > Double diacritics have a combining property equal to zero, so they
    >
    > No, they don't. The above ones have combining class 234 and the below
    > ones have combining class 233 (other characters with the word DOUBLE
    > in them are 'double' in some other way):
    >
    > 035C;COMBINING DOUBLE BREVE BELOW;Mn;233;NSM;;;;;N;;;;;
    > ...

    Aren't they using the maximum value of the combining class? If so,
    you can still use double diacritics between two sequences containing
    a base character and any "simple" diacritic, and be sure that the
    double diacritic will be rendered above them, as it will remain in
    the last position of the normalized form.
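
    Here is a minimal sketch (Python, using the standard unicodedata
    module; the acute accent and double breve are just illustrative
    choices) that checks the combining classes involved and confirms
    that a double diacritic written after a simple diacritic on the same
    base keeps the last position under normalization:

      import unicodedata

      # Combining classes as recorded in the UCD.
      for cp in ("\u0301", "\u035C", "\u035D"):
          print(f"U+{ord(cp):04X} {unicodedata.name(cp)}: "
                f"ccc={unicodedata.combining(cp)}")
      # U+0301 COMBINING ACUTE ACCENT: ccc=230
      # U+035C COMBINING DOUBLE BREVE BELOW: ccc=233
      # U+035D COMBINING DOUBLE BREVE: ccc=234

      # <o, acute (230), double breve (234), e>: the marks are already in
      # ascending combining-class order, so canonical ordering leaves the
      # double diacritic in the last position among the marks on 'o'.
      s = "o\u0301\u035De"
      print(unicodedata.normalize("NFD", s) == s)   # True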

    Anyway, I also said that a character with combining class 0 was
    needed to add other diacritics on top of double diacritics, after
    encoding the two sequences joined by the double diacritic.

    Why such a bogus non-zero combining class was assigned to double
    diacritics is a mystery to me, as it was really not needed for
    compatibility with legacy encodings.

    These combining classes 233 and 234 serve no purpose except to
    complicate things for absolutely no benefit (including the fact that
    an additional character with combining class 0, such as CGJ, is now
    always needed to stack anything else on top of double diacritics).

    I did not realize that before (yes, I should have checked the UCD to
    verify). And given their existing behavior, this has prevented
    other, simpler encodings of texts.

    Also, I have NEVER found any occurrence where the fact that they
    have combining class 233/234 instead of 0 makes any difference,
    because double diacritics were ALWAYS encoded between the two base
    graphemes, each encoded separately, and canonical ordering preserves
    this position between the two fully encoded base graphemes in all
    cases.
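
    As a quick check of that claim (a Python sketch using unicodedata;
    the acute accents are just arbitrary simple diacritics), both NFD
    and NFC keep the double diacritic between the two fully encoded
    graphemes, because base-2 has combining class 0 and blocks any
    reordering across it:

      import unicodedata

      # <o, acute, double inverted breve, e, acute>
      s = "o\u0301\u0361e\u0301"
      for form in ("NFD", "NFC"):
          t = unicodedata.normalize(form, s)
          print(form, [f"U+{ord(c):04X}" for c in t])
      # NFD ['U+006F', 'U+0301', 'U+0361', 'U+0065', 'U+0301']
      # NFC ['U+00F3', 'U+0361', 'U+00E9']
      # In both forms the double diacritic stays between the two graphemes.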

    Note that I'm not even sure that CGJ is the right choice for
    stacking more diacritics on top of double diacritics, because it
    would mean that the additional diacritic has to be encoded just
    after the double diacritic and CGJ, but before the second grapheme,
    and this does not really work with double diacritics used between
    triplets of graphemes: where should the additional diacritics be
    placed, on the first or on the second double diacritic?

    For me, the logical ordering would require encoding the base
    graphemes first, separated by the double diacritic, and then
    encoding the additional diacritics applicable to the whole previous
    group (and so it requires adding a new virtual base to block the
    reordering).

    (1) If CGJ is used at the end of the sequence containing the two
    bases and the double diacritic, the additional diacritics will still
    attach logically and visually to the last base grapheme, and so they
    will still stack on it, below the double macron for example, even
    though their relative order is preserved.

    In this order, it is needless (or logically wrong) to use CGJ
    instead of ZWJ in a sequence like:

      <base-1, double-diacritic, base-2, CGJ, additional-diacritics>

    because in that position CGJ has no effect other than blocking the
    reordering of the additional diacritics, which are already blocked
    by base-2, so the sequence would still be interpreted as:

      <base-1, double-diacritic, base-2, additional-diacritics>

    and so the additional diacritics will be linked to base-2, and the
    double diacritic will cover the full group containing <base-1> and
    <base-2, additional diacritics>.
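
    A small check of that point (Python, unicodedata; the macron stands
    in for an arbitrary additional diacritic): with or without CGJ, the
    additional diacritic stays attached after base-2, because base-2
    already has combining class 0 and blocks reordering on its own:

      import unicodedata

      CGJ = "\u034F"   # COMBINING GRAPHEME JOINER, ccc 0
      # <o, double inverted breve, e, CGJ, macron> vs. the same without CGJ
      with_cgj    = "o\u0361e" + CGJ + "\u0304"
      without_cgj = "o\u0361e\u0304"

      for s in (with_cgj, without_cgj):
          print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", s)])
      # ['U+006F', 'U+0361', 'U+0065', 'U+034F', 'U+0304']
      # ['U+006F', 'U+0361', 'U+0065', 'U+0304']
      # Either way the macron follows 'e' (base-2); CGJ blocks nothing new.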

    (2) The only other way is to encode the additional diacritics in the
    middle of the group linked by CGJ, in this order:

      <base-1, double-diacritic, CGJ, additional diacritics..., base-2>

    but then it becomes impossible to have longer groups applying the
    double diacritic to more than two bases. This encoding using CGJ
    clearly breaks the logical assumption that the additional diacritics
    applying to a group should all be encoded AFTER the full group has
    been encoded.

    Here the additional diacritics need to be inserted at a specific
    position in the middle of the sequence (and in practice, input
    editors would have to scan back before base-2, through the
    additional diacritics and CGJ, just to find the double diacritic and
    see that any further diacritics need to be inserted there...).
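
    A sketch of that middle-insertion encoding (Python, unicodedata; the
    acute accent is just an illustrative additional diacritic), showing
    why the CGJ is needed there at all: without it, canonical ordering
    moves the acute (ccc 230) in front of the double breve (ccc 234),
    while with CGJ the encoded order of the marks is preserved:

      import unicodedata

      CGJ = "\u034F"
      # Case (2): <base-1, double-diacritic, CGJ, additional diacritic, base-2>
      with_cgj    = "o\u035D" + CGJ + "\u0301e"
      without_cgj = "o\u035D\u0301e"

      for s in (with_cgj, without_cgj):
          print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", s)])
      # ['U+006F', 'U+035D', 'U+034F', 'U+0301', 'U+0065']  # order kept by CGJ
      # ['U+006F', 'U+0301', 'U+035D', 'U+0065']            # acute reordered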

    CGJ was not intended to apply to more than one character, but only
    as a way to block some normalization reordering of combining
    characters occurring after a single base character (which always has
    combining class 0). In that position, it should only occur between
    two combining characters with non-zero combining classes, and only
    if the second one has a lower combining class than the first, and
    only if this creates a semantic or visual difference in rendered
    documents (for example because of the variable positions of the
    cedilla, which the combining classes unify as if there were only
    one).

    (3) Using ZWJ instead terminates the last base grapheme, so you can
    safely append other diacritics applying to the whole group joined by
    the double diacritic, and the whole thing is encoded very logically
    in this order:

      <base-1, double-diacritic, base-2, ZWJ, additional-diacritics>

    This will have a more consistent behavior if double diacritics or
    ZWJ are ever not supported by the renderer for creating long
    groupings. In that case, if the renderer can only draw the double
    diacritic with nothing else on top of it, the additional diacritics
    will be drawn after the sequence of the two bases and the double
    diacritic, and only the additional diacritics will be drawn like a
    defective sequence (with a dotted circle, for example).
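
    This ZWJ-based encoding is of course a proposal rather than
    standardized behavior, but its effect at the normalization level can
    already be checked; a sketch (Python, unicodedata, with a macron as
    the assumed additional diacritic) shows that ZWJ, with combining
    class 0, keeps the additional diacritic from fusing with base-2
    under NFC, whereas without it the macron composes with the 'e':

      import unicodedata

      ZWJ = "\u200D"   # ZERO WIDTH JOINER, ccc 0
      # Case (3): <base-1, double-diacritic, base-2, ZWJ, additional diacritic>
      with_zwj    = "o\u0361e" + ZWJ + "\u0304"
      without_zwj = "o\u0361e\u0304"

      for s in (with_zwj, without_zwj):
          print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", s)])
      # ['U+006F', 'U+0361', 'U+0065', 'U+200D', 'U+0304']  # macron kept apart
      # ['U+006F', 'U+0361', 'U+0113']                      # composed into ē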

    (4) With ZWJ as the base separator with combining class 0 (just like
    CGJ, which however has a more "local" usage: forcing the relative
    order of simple diacritics above a single base grapheme when it has
    to be semantically different from the canonical order) between the
    last base grapheme and the additional diacritics (which I think is
    logically better than CGJ), we could *also* have longer sequences
    such as:

      <base-1, double-diacritic, base-2, double-diacritic, base-3, ZWJ,
       additional diacritics...>

    without any ambiguity about which double diacritic should "support"
    the additional diacritics. The occurrences of double diacritics
    should be treated the same way wherever they occur; by default, in a
    simple renderer, they will overlap in the middle, except above the
    first and last base graphemes, but a smarter engine will avoid this
    overlap (when they are identical) and will draw a single longer
    diacritic covering all the base graphemes on which the double
    diacritic is encoded.

    I've still not seen encoded texts needing that, but such groupings
    with more than two base graphemes are common in the literature (for
    example when emphasizing trigrams like "sch" in German, or even
    "str" in English, or finals appended to conjugated verbs or declined
    nouns, or in phonetic notations needing longer ties to group complex
    clusters of consonants or diphthongs).
    In some cases they act like interlinear annotations (such as
    emphasized trigrams, where the tie acts like an alternate
    underlining), but in others they have a semantic value within the
    encoded text itself from which they can't be safely detached (such
    as in phonetic notations, or in mathematical notations and other
    scientific and technical formulas).
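
    To make such longer groupings concrete, here is a purely
    hypothetical helper (the function name and its interface are mine,
    illustrating only the ordering proposed above, not anything
    standardized) that builds a sequence from a list of base graphemes,
    a double diacritic, and the additional diacritics applying to the
    whole group:

      ZWJ = "\u200D"

      def tie_group(bases, double_diacritic, extra_marks=""):
          """Hypothetical builder for the proposed encoding:
          <base-1, dd, base-2, dd, ..., base-n, ZWJ, extra marks>."""
          tied = double_diacritic.join(bases)
          if extra_marks:
              return tied + ZWJ + extra_marks
          return tied

      # "sch" tied by a double inverted breve, with a macron applying to
      # the whole group (illustrative only; rendering support will vary).
      s = tie_group(["s", "c", "h"], "\u0361", "\u0304")
      print([f"U+{ord(c):04X}" for c in s])
      # ['U+0073', 'U+0361', 'U+0063', 'U+0361', 'U+0068', 'U+200D', 'U+0304']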

    Anyway, I still think that double diacritics are a "hack" inserted
    into the UCS, and they now clearly appear as an unjustified
    disunification of the diacritics: we should be able to encode the
    NORMAL (non-double) diacritics (from any Unicode block where they
    are already encoded) and apply them to an arbitrarily long group of
    characters, encoding the normal diacritics in the logical order
    after encoding the group, because:

    - most of them were added to the UCS before ZWJ was encoded;
    - this is the natural order in which they are perceived and drawn;
    - this is the natural way of interpreting the diacritics (and they
    are not necessarily "elongated");
    - the concept of groupings is inherent to the logical semantics of
    the text, and should be preserved by its encoding.

    Adding the explicit encoding of semantically significant groupings
    (which are still missing) would certainly have been more important
    than adding these disunified "double" diacritics (which also have
    their own distinct combining classes). Not only did this encoding of
    double diacritics fail to solve the problem completely within a
    general character model, it also added new exceptions and problems
    for automated text parsers and renderers.

    Philippe.


