Re: Using Combining Double Breve and expressing characters perhaps as if struck out.

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Jul 28 2010 - 16:09:11 CDT

    > Message of 26/07/10 18:45
    > From: "Markus Scherer" <markus.icu@gmail.com>
    > To: verdy_p@wanadoo.fr
    > Cc: "Unicode Mailing List" <unicode@unicode.org>
    > Subject: Re: Using Combining Double Breve and expressing characters perhaps as if struck out.
    >
    > There are 857 combining marks with combining class of 0:
    > http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[:M:]%26[:ccc%3D0:]]&abb=on&g=
    >
    > On Sat, Jul 24, 2010 at 11:25 AM, Philippe Verdy <verdy_p@wanadoo.fr> wrote:
    >
    > > "Kent Karlsson" <kent.karlsson14@telia.com> wrote:
    > > > On 2010-07-24 10.07, "Philippe Verdy" <verdy_p@wanadoo.fr> wrote:
    > > >
    > > > > Double diacritics have a combining property equal to zero, so they
    > > >
    > > > No, they don't. The above ones have combining class 234 and the below
    > > > ones have combining class 233 (other characters with the word DOUBLE
    > > > in them are 'double' in some other way):
    > > >
    > > > 035C;COMBINING DOUBLE BREVE BELOW;Mn;233;NSM;;;;;N;;;;;
    > > > ...
    > >
    > > Aren't they using the maximum value of the combining class ?
    >
    >
    > No.
    >
    > > If so,
    > > you can still use double diacritics between two sequences containing a
    > > base character and any "simple" diacritic, and be sure that the double
    > > diacritic will be rendered about them, as it will remain in the last
    > > position of the normalized form.
    > >
    >
    > No. The order of combining marks only determines their rendering order if
    > they have the same combining class value. If they have different values,
    > then their rendering is supposed to be independent of their order in the
    > text. The canonical ordering in normalization only serves processing such as
    > string comparisons.

    You've not understood what I wanted to say.

    I understand what you are explaining, but double diacritics can only be
    reordered in one case: if an upper double diacritic occurs before a
    lower diacritic (in which case normalization will reorder it; as
    there's no visible difference in the result, this reordering is safe,
    and CGJ is not required to protect it).
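
    As a rough illustration of that one reordering case, here is a minimal
    Python sketch using the standard unicodedata module. The choice of
    U+0323 COMBINING DOT BELOW as the lower diacritic is only an assumption
    made for the example; NFD is used so that no precomposed letter hides
    the mark order.

```python
import unicodedata

DOUBLE_BREVE = "\u035D"  # COMBINING DOUBLE BREVE, combining class 234
DOT_BELOW = "\u0323"     # COMBINING DOT BELOW, combining class 220 (example choice)

# An upper double diacritic typed BEFORE a lower diacritic on the same base.
typed = "o" + DOUBLE_BREVE + DOT_BELOW + "o"

# Canonical ordering sorts the marks by combining class, so the lower
# diacritic (220) moves in front of the double breve (234).
normalized = unicodedata.normalize("NFD", typed)

print([f"U+{ord(c):04X}" for c in typed])
print([f"U+{ord(c):04X}" for c in normalized])
```

    The two marks have different combining classes, so the swap should not
    change how the text is rendered.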

    But given the way they will be encoded only between base graphemes,
    there's no risk that they can be swapped by normalization or that they
    could be ordered BEFORE non-double diacritics.

    We can perfectly expect that sequences encoded with double diacritics
    will only be in that order:

    - <prependers for base 1, base 1, other simple diacritics or extenders
    for base 1 only>, then
    - <lower double diacritics, upper double diacritics>, then
    - <prependers for base 2, base 2, other simple diacritics or extenders
    for base 2 only>

    That's what I meant in saying that they have the MAXIMUM combining class
    value. There's also NO risk that stacked double diacritics will be
    reordered within the same position, so for that use, you will never
    need CGJ.
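
    The expected order listed above can be checked with a small Python
    sketch. The particular simple diacritics (U+0323 DOT BELOW and U+0301
    ACUTE) are assumptions picked for the example; the sketch only shows
    that this specific sequence is already in canonical order, not a
    general proof.

```python
import unicodedata

DOT_BELOW = "\u0323"           # simple diacritic below, combining class 220
ACUTE = "\u0301"               # simple diacritic above, combining class 230
DOUBLE_BREVE_BELOW = "\u035C"  # lower double diacritic, combining class 233
DOUBLE_BREVE = "\u035D"        # upper double diacritic, combining class 234


def ccc_list(text: str) -> list[int]:
    """Return the canonical combining class of every code point in text."""
    return [unicodedata.combining(c) for c in text]


# <base 1 + simple diacritics>, <lower double, upper double>, <base 2 + simple>
expected_order = (
    "o" + DOT_BELOW + ACUTE + DOUBLE_BREVE_BELOW + DOUBLE_BREVE + "o" + ACUTE
)

# The sequence is stable if normalization leaves it unchanged.
is_stable = unicodedata.normalize("NFD", expected_order) == expected_order

# A double diacritic typed before a simple one ends up after it.
moved = unicodedata.normalize("NFD", "o" + DOUBLE_BREVE + ACUTE)

print(ccc_list(expected_order), is_stable)
```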

    CGJ will only be needed if you want to append a non-double diacritic
    on top of a double, but given that this additional diacritic should not
    apply to the double diacritic itself, but to the whole group of base
    graphemes "joined" by the double diacritics, these additional
    non-double diacritics should be encoded AFTER this whole group, i.e.
    just after:

    - <prependers for base 2, base 2, other simple diacritics or extenders
    for base 2 only>,

    if we really want to respect the logical encoding order.

    And for this use, CGJ will be incorrect (because the additional
    diacritics will STILL be part of the base grapheme cluster 2).

    We need something else, and that's where we will need ZWJ instead, as the
    holder of additional diacritics that should stack on the whole group.

    OK, you may avoid this problem by using CGJ immediately after the
    double diacritics (i.e. also before base grapheme cluster 2), but this
    will remain illogical.
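
    For what it is worth, the effect of placing CGJ immediately after the
    double diacritic can be sketched in Python as follows. U+0301 ACUTE
    stands in for the additional non-double diacritic; that choice is an
    assumption made for the example.

```python
import unicodedata

DOUBLE_BREVE = "\u035D"  # COMBINING DOUBLE BREVE, combining class 234
ACUTE = "\u0301"         # COMBINING ACUTE ACCENT, combining class 230 (example choice)
CGJ = "\u034F"           # COMBINING GRAPHEME JOINER, combining class 0

# Without CGJ, normalization moves the acute in front of the double breve.
without_cgj = "o" + DOUBLE_BREVE + ACUTE + "o"
reordered = unicodedata.normalize("NFD", without_cgj)

# With CGJ (class 0) in between, the two marks are no longer compared,
# so the acute stays after the double breve.
with_cgj = "o" + DOUBLE_BREVE + CGJ + ACUTE + "o"
protected = unicodedata.normalize("NFD", with_cgj)

print(reordered == without_cgj)  # order changed by normalization
print(protected == with_cgj)     # order kept thanks to CGJ
```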

    Well, even the double diacritics themselves are a hack in Unicode.
    Ideally we should not even need them, and instead of using:

    - <o, DOUBLE BREVE, o>

    This should be:

    - <o, ZWJ, o, ZWJ, BREVE>

    Now you can see the problem: ZWJ has never been designed to create
    structured layout groups, when used alone.
    If layout structure grouping is needed, however, we could use variation
    selectors to qualify the ZWJ:

    - <o, ZWJ, VS1, o, ZWJ, VS1, BREVE>

    where the variation sequence <ZWJ,VS1> would mean here: horizontal
    group level 1.

    And so, we could encode the logical layout structures of Hieroglyphs
    (that require multiple levels, both horizontally, and vertically) by
    defining these variation sequences:

    HGROUP1 = <ZWJ,VS1>
    VGROUP1 = <ZWJ,VS2>
    HGROUP2 = <ZWJ,VS3>
    VGROUP2 = <ZWJ,VS4>
    HGROUP3 = <ZWJ,VS5>
    VGROUP3 = <ZWJ,VS6>
    and so on...
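
    A minimal Python sketch of how these proposed sequences could be
    generated. Note that HGROUPn and VGROUPn are NOT defined by Unicode;
    they exist only in this proposal, and the helper name group_marker and
    the limit of 8 levels (VS1 to VS16) are assumptions made for the sketch.

```python
ZWJ = "\u200D"     # ZERO WIDTH JOINER
VS1_CODE = 0xFE00  # VARIATION SELECTOR-1; VS1..VS16 are U+FE00..U+FE0F


def group_marker(direction: str, level: int) -> str:
    """Build the hypothetical <ZWJ, VSn> grouping sequence.

    direction is "H" (horizontal) or "V" (vertical); level starts at 1.
    HGROUPn uses VS(2n-1) and VGROUPn uses VS(2n), as in the list above.
    """
    if direction not in ("H", "V"):
        raise ValueError("direction must be 'H' or 'V'")
    if not 1 <= level <= 8:
        raise ValueError("only VS1..VS16 are used here, so level is 1..8")
    offset = 2 * (level - 1) + (0 if direction == "H" else 1)
    return ZWJ + chr(VS1_CODE + offset)


for level in (1, 2, 3):
    for direction in ("H", "V"):
        marker = group_marker(direction, level)
        print(f"{direction}GROUP{level}", [f"U+{ord(c):04X}" for c in marker])
```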

    With this definition, then we no longer need ANY double diacritic
    variants, we just use the standard diacritics:

    - <o, HGROUP1, o, HGROUP1, BREVE>

    instead of the "deprecated" method using:

    - <o, DOUBLE BREVE, o>

    (which won't be canonically equivalent, but does it matter?). And we
    gain a consistent encoding for "triple" diacritics or longer:

    - <o, HGROUP1, o, HGROUP1, o, HGROUP1, BREVE>

    which represents a single BREVE over a horizontal grouping of three <o>.
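
    To make the comparison concrete, here is a small Python sketch that
    builds the proposed sequences and checks the canonical-equivalence
    remark. HGROUP1 = <ZWJ, VS1> is the hypothetical sequence from this
    proposal, not anything Unicode defines, and the helper name grouped is
    likewise an assumption.

```python
import unicodedata

ZWJ = "\u200D"           # ZERO WIDTH JOINER
VS1 = "\uFE00"           # VARIATION SELECTOR-1
BREVE = "\u0306"         # COMBINING BREVE
DOUBLE_BREVE = "\u035D"  # COMBINING DOUBLE BREVE

HGROUP1 = ZWJ + VS1      # hypothetical "horizontal group level 1" marker


def grouped(bases: str, mark: str) -> str:
    """Encode <b1, HGROUP1, b2, HGROUP1, ..., mark> for the given bases."""
    return "".join(base + HGROUP1 for base in bases) + mark


proposed_double = grouped("oo", BREVE)   # <o, HGROUP1, o, HGROUP1, BREVE>
proposed_triple = grouped("ooo", BREVE)  # the same scheme with three bases
existing_double = "o" + DOUBLE_BREVE + "o"

# Two strings are canonically equivalent when their NFD forms are equal.
equivalent = unicodedata.normalize("NFD", proposed_double) == unicodedata.normalize(
    "NFD", existing_double
)
print(equivalent)
```

    As stated above, the two encodings do not normalize to the same string,
    so existing normalizers would treat them as different text.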

    And with the same tool, we can almost completely encode as well the
    Egyptian hieroglyphs. This could even be part of the standard
    character encoding model!

    Philippe.


