RE: Questions on ZWNBS - for line initial holam plus alef

From: Kent Karlsson (kentk@cs.chalmers.se)
Date: Wed Aug 06 2003 - 06:38:03 EDT

    Kenneth Whistler wrote:

    > Kent Karlsson said:
    >
    > > I see no particular *technical* problem with using WJ, though. In
    > > contrast to the suggestion of using CGJ (re. another problem)
    > > anywhere else but at the end of a combining sequence. CGJ has
    > > combining class 0, despite being invisible and not ("visually")
    > > interfering with any other combining mark. Using CGJ at a non-final
    > > position in a combining sequence puts in doubt the entire idea with
    > > combining classes and normal forms.
    >
    > Why?

    See above (I DID write the motivation!). Combining classes are generally
    assigned according to "typographic placement". Combining characters
    (except those that are really letters) that have the "same" placement,
    and "interfere typographically" are assigned the same combining class,
    while those that don't get different classes, and the relative order is
    then considered unimportant (canonically equivalent). How, then, is
    e.g. <a, ring above, cgj, dot below> supposed to be different from
    <a, dot below, cgj, ring above> (supposing all involved characters
    are fully supported), when <a, ring above, dot below> is NOT
    supposed to be much different from <a, dot below, ring above>
    (them being canonically equivalent)? An invisible combining character
    does not interfere typographically with anything, it being invisible!
    The other invisible (per se!) combining characters with combining
    class 0, the variation selectors, are ok, since their *conforming* use
    is very highly constrained. Maybe I've been wrong, but I have taken
    CGJ as similarly constrained, as it was given a semantics only when
    followed by a base character (but now it seems to have no semantics
    at all).
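
The contrast drawn above can be checked mechanically. The following is a minimal sketch, assuming Python's standard `unicodedata` module (whose data tracks whatever Unicode version the interpreter ships with); the helper name `canonically_equivalent` is made up for this illustration. It compares NFD forms, since two strings are canonically equivalent exactly when their canonical decompositions match.

```python
import unicodedata

RING_ABOVE = "\u030A"  # COMBINING RING ABOVE, combining class 230
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW, combining class 220
CGJ = "\u034F"         # COMBINING GRAPHEME JOINER, combining class 0


def canonically_equivalent(s: str, t: str) -> bool:
    """Hypothetical helper: compare the NFD forms of two strings."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


# Without CGJ the two marks may be freely reordered.
without_cgj = canonically_equivalent("a" + RING_ABOVE + DOT_BELOW,
                                     "a" + DOT_BELOW + RING_ABOVE)

# With CGJ in between, its class 0 blocks the reordering.
with_cgj = canonically_equivalent("a" + RING_ABOVE + CGJ + DOT_BELOW,
                                  "a" + DOT_BELOW + CGJ + RING_ABOVE)

print(without_cgj, with_cgj)
```

The first comparison should come out equivalent and the second not, which is the formal behaviour being questioned here on design grounds.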

    > There are any number of combining characters with combining
    > class 0, including the vast majority of Indic dependent vowels,
    > for instance.

    These are ok. They are not invisible, and the vowels should not
    reorder amongst themselves in a single combining sequence (I know,
    there is normally only one vowel per syllable, but as the Hebrew
    discussion has shown, one should not generalise too much),
    regardless of placement (before, above, below, after, before&after, ...).
    So at least they should have the same combining class, regardless
    of typographic placement. (This should have been the case also
    for the Hebrew vowels...) But class 0 (which is specially treated),
    I'm not sure if that was ideal.

    > A combining character sequence is a base character followed
    > by any number of combining characters. There is no constraint
    > in that definition that the combining characters have to
    > have non-zero combining class.

    Well, you cannot *conformantly* place a VS anywhere in a combining
    sequence! Only certain combinations of base+vs are allowed in
    any given version of Unicode. (Breaking that does not make the
    combining sequence ill-formed, or illegal, but would make it
    non-conformant, just like using an unassigned code point.)

    > Canonical reordering is scoped to stop at combining class = 0.

    (I know it is. But I confess I'm not sure why.)

    > It doesn't say that it applies to combining character sequences
    > per se. It applies to *decomposed* character sequences
    > (meaning, effectively, any sequence which has had the recursive
    > application of the decomposition mappings done).

    Yes, for the definition of normalisation. But not necessary for
    canonical equivalence. Your point?

    > Take a Myanmar example: /kau/:
    >
    > character sequence:  <1000, 1031, 102C, 1039, 200C>
    > combining?:            no   yes   yes   yes    no
    > combining classes:     0     0     0     9     0
    > comb char sequence:   ----------------------
    > canon reorder scope:  ---|  ---|  ---------|  ---|
    >
    > The combining character sequence here is: <1000, 1031, 102C, 1039>
    > The *syllable* consists of that plus the trailing ZWNJ.
    > But the relevant sequences for application of the
    > canonical reordering algorithm are each sequence starting
    > with combining class zero and continuing through any
    > sequence with combining class not zero.
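
The rows of the Myanmar example can be reproduced as a rough sketch, assuming Python's standard `unicodedata` module. The function name `reorder_scopes` is invented for this illustration and is not a term from the standard; it simply starts a new run at every character of combining class 0 and extends it over the following characters with non-zero class.

```python
import unicodedata

# /kau/ followed by ZWNJ, as in the quoted example.
seq = [0x1000, 0x1031, 0x102C, 0x1039, 0x200C]

# Canonical combining class of each code point.
classes = [unicodedata.combining(chr(cp)) for cp in seq]


def reorder_scopes(ccs):
    """Hypothetical helper: group indexes into runs that start at a
    class-0 character and continue through non-zero classes."""
    scopes = []
    for i, cc in enumerate(ccs):
        if cc == 0 or not scopes:
            scopes.append([i])      # class 0 opens a new scope
        else:
            scopes[-1].append(i)    # non-zero class joins the current scope
    return scopes


scopes = reorder_scopes(classes)
for cp, cc in zip(seq, classes):
    print(f"U+{cp:04X} ccc={cc}")
print(scopes)
```

Only the third scope (U+102C followed by U+1039) holds more than one character, so it is the only place where reordering could apply at all.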

    Formally, a character *pair* based definition is enough:
    xy ≡ yx, if 0 < cc(y) < cc(x) (and apply that repeatedly);
    no need to define any "canonical reordering scope", though
    that may be marginally more efficient in an implementation
    of normalisation (but this is getting beside the topic of this
    discussion).
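
As an illustration of the pair-based rule, here is a minimal sketch, assuming Python's standard `unicodedata` module and input that is already fully decomposed. The name `reorder_by_pairs` is made up for this example. It swaps adjacent characters x, y whenever 0 < cc(y) < cc(x) and repeats until nothing changes, with no explicit notion of a scope.

```python
import unicodedata


def reorder_by_pairs(s: str) -> str:
    """Hypothetical helper: repeatedly swap adjacent pairs x, y
    where 0 < cc(y) < cc(x), until no such pair remains."""
    chars = list(s)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            x, y = chars[i], chars[i + 1]
            if 0 < unicodedata.combining(y) < unicodedata.combining(x):
                chars[i], chars[i + 1] = y, x
                swapped = True
    return "".join(chars)


# Already-decomposed samples (no precomposed characters).
samples = [
    "a\u030A\u0323",                   # ring above (230), dot below (220)
    "a\u0323\u030A",                   # already in canonical order
    "a\u030A\u034F\u0323",             # CGJ (class 0) between the marks
    "\u1000\u1031\u102C\u1039\u200C",  # the Myanmar example
    "a\u0301\u0316\u0327",             # acute (230), grave below (220), cedilla (202)
]

# Compare against the library's canonical ordering for these samples.
agrees = all(reorder_by_pairs(s) == unicodedata.normalize("NFD", s)
             for s in samples)
print(agrees)
```

Because a class-0 character can never satisfy 0 < cc(y), and nothing is swapped past it when it is x's neighbour on the right, the "scope" boundaries fall out of the pair rule on their own.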

    > I don't see how introduction of CGJ into such sequences calls
    > any of the definitions or algorithms into question.

    No, not the algorithm, but the basic idea and design. The algorithm
    as such has no "idea" how or why the combining class numbers
    were assigned. But we humans do, or might have.

    Again, why should not <a, ring above, cgj, dot below> be canonically
    equivalent to <a, dot below, cgj, ring above>, when <a, ring above,
    dot below> is canonically equivalent to <a, dot below, ring above>?
    And I want a design answer, not a formal answer! (The latter I already
    know, and it is uninteresting.)

    Since I think <a, ring above, cgj, dot below> should be canonically
    equivalent to <a, dot below, cgj, ring above>, but cannot be made
    so (now), the only ways out seem to be to either formally deprecate
    CGJ, or at least confine it to very specific uses. Other occurrences
    would not be ill-formed or illegal, but would then be non-conforming.

            /kent k

    > --Ken
    >


