RE: Questions on ZWNBS - for line initial holam plus alef

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Aug 06 2003 - 16:19:34 EDT

    Kent Karlsson responded:

    > > > I see no particular *technical* problem with using WJ, though. In
    > > > contrast to the suggestion of using CGJ (re. another problem)
    > > > anywhere else but at the end of a combining sequence. CGJ has
    > > > combining class 0, despite being invisible and not ("visually")
    > > > interfering with any other combining mark. Using CGJ at a
    > > > non-final position in a combining sequence puts in doubt the
    > > > entire idea with combining classes and normal forms.
    > >
    > > Why?
    >
    > See above (I DID write the motivation!).

    I guess that I did not (and still do not) see the motivation for
    your final statement.

    > Combining classes are generally
    > assigned according to "typographic placement". Combining characters
    > (except those that are really letters) that have the "same" placement,
    > and "interfere typographically" are assigned the same combining class,
    > while those that don't get different classes, and the relative order is
    > then considered unimportant (canonically equivalent). How is then,
    > e.g. <a, ring above, cgj, dot below> supposed to be different from
    > <a, dot below, cgj, ring above> (supposing all involved characters
    > are fully supported), when <a, ring above, dot below> is NOT
    > supposed to be much different from <a, dot below, ring above>
    > (them being canonically equivalent)? An invisible combining character
    > does not interfere typographically with anything, it being invisible!

    The same thing can be said about any inserted invisible character,
    combining or not.

    How is: <a, ring above, null, dot below> supposed to be different from
            <a, dot below, null, ring above>
            
    How is: <a, ring above, LRM, dot below> supposed to be different from
            <a, dot below, LRM, ring above>
            
    In display, they might not be distinct, unless you were doing some kind of
    show-hidden display. Yet these sequences are not canonically
    equivalent, and the presence of an embedded control character or an
    embedded format control character would block canonical reordering.

    Of course, they *might* be distinct in rendering, depending on
    what assumptions the renderer makes about default ignorable
    characters and their interaction with combining character sequences.
    But you cannot depend on them being distinct in display -- the
    standard doesn't mandate the particulars here.
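    The non-equivalence of such sequences can be checked mechanically. A
    minimal sketch using Python's unicodedata module, with U+200E
    LEFT-TO-RIGHT MARK standing in for the LRM above (the NULL case
    behaves the same way):

```python
import unicodedata

# a + ring above (cc 230) + dot below (cc 220): canonical reordering
# sorts the marks by combining class, so both orders normalize to the
# same NFD string -- they are canonically equivalent.
plain1 = "a\u030A\u0323"   # <a, ring above, dot below>
plain2 = "a\u0323\u030A"   # <a, dot below, ring above>
assert unicodedata.normalize("NFD", plain1) == unicodedata.normalize("NFD", plain2)

# An intervening LRM (combining class 0, not a combining character)
# blocks reordering, so these are NOT canonically equivalent.
lrm1 = "a\u030A\u200E\u0323"   # <a, ring above, LRM, dot below>
lrm2 = "a\u0323\u200E\u030A"   # <a, dot below, LRM, ring above>
assert unicodedata.normalize("NFD", lrm1) != unicodedata.normalize("NFD", lrm2)
```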

    Whether you think it is *reasonable* or not that there should be
    non-canonically equivalent ways of representing the same
    visual display, sequences such as those above, including sequences
    with CGJ, are possible and allowed by the standard. They are:

       a. well-formed sequences, conformantly interpretable
       b. displayable by reasonable renderers, making reasonable
          assumptions, as visually identical
          
    I have been pointing out that use of the CGJ, which *exists* as an
    encoded character, and which has a particular set of properties
    defined, would result in the kinds of non-canonically equivalent
    ordering distinctions required in Hebrew, if inserted into vowel
    sequences.
    Those are facts about the current standard, as currently
    defined. And unless you or someone else convinces the UTC to
    establish cooccurrence constraints on CGJ or to change its
    properties, they will continue to be current facts about the
    standard.
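    The Biblical Hebrew case can be sketched concretely. The vowel pair
    below (qamats U+05B8, combining class 18, and hiriq U+05B4, combining
    class 14, on alef U+05D0) is chosen purely for illustration:

```python
import unicodedata

# Without CGJ, canonical reordering sorts hiriq (cc 14) before
# qamats (cc 18), destroying the original vowel order.
s = "\u05D0\u05B8\u05B4"          # <alef, qamats, hiriq>
assert unicodedata.normalize("NFD", s) == "\u05D0\u05B4\u05B8"

# With CGJ (U+034F, combining class 0) between the vowels,
# reordering is blocked and the order survives normalization.
t = "\u05D0\u05B8\u034F\u05B4"    # <alef, qamats, CGJ, hiriq>
assert unicodedata.normalize("NFD", t) == t
```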

    > The other invisible (per se!) combining characters with combining
    > class 0, the variation selectors, are ok, since their *conforming* use
    > is
    > very highly constrained. Maybe I've been wrong, but I have taken
    > CGJ as similarly constrained as it was given a semantics only when
    > followed by a base character (but now it seems to have no semantics
    > at all).

    There was no such constraint defined for CGJ. The current statement
    about CGJ is merely that it should be ignored in language-sensitive
    sorting and searching unless "it specifically occurs within
    a tailored collation element mapping." There is no constraint
    on what particular sequences involving CGJ could be tailored
    that way, and hence no constraint on what particular sequences
    CGJ might occur in, in Unicode plain text.

    > > A combining character sequence is a base character followed
    > > by any number of combining characters. There is no constraint
    > > in that definition that the combining characters have to
    > > have non-zero combining class.
    >
    > Well, you cannot *conformantly* place a VS anywhere in a combining
    > sequence! Only certain combinations of base+vs are allowed in
    > any given version of Unicode. (Breaking that does not make the
    > combining sequence ill-formed, or illegal, but would make it
    > non-conformant, just like using an unassigned code point.)

    Actually, it is not non-conformant like using an unassigned
    code point would be. The latter is directly subject to conformance
    clause C6:

    C6 A process shall not interpret an unassigned code point as an
       abstract character.
       
    The case for variation sequences is subtly different. Suppose
    I encounter a variation sequence <X, VS1>, where X could be
    any Unicode character. X itself is conformantly interpretable.
    VS1 itself is conformantly interpretable. The constraints are
    on the interpretation of the variation sequence itself. And
    they consist of:

      "Only the variation sequences specifically defined in the
       file StandardizedVariants.txt in the Unicode Character
       Database are sanctioned for standard use; in all other
       cases the variation selector cannot change the visual
       appearance of the preceding base character from what it
       would have had in the absence of the variation selector."
       
    In other words, you can drop VS1's to your heart's content into
    plain text, but a conformant implementation should ignore all
    of them, unless a) it is interpreting variation selectors, and
    b) it encounters a particular sequence defined in
    StandardizedVariants.txt.

    The cooccurrence constraints on VS1's are constraints on the
    *encoding committees* regarding what sequences they will or will
    not allow into StandardizedVariants.txt (for various reasons):

      "The base character in a variation sequence is never a combining
       character or a decomposable character."
       
    That means the UTC will never make such a variation sequence
    interpretable by putting it into StandardizedVariants.txt.
    *But*, a text user who drops a VS1 into Unicode plain text
    after a combining character doesn't "commit a foul" thereby --
    he has just put a character into a position that no conformant
    implementation will do other than ignore on display.
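    The conformance test described here can be mechanized: only pairs
    actually listed in StandardizedVariants.txt are interpretable. A
    sketch of a parser for the file's field layout -- the embedded sample
    lines are illustrative stand-ins for entries in the real UCD file,
    not quotations of it:

```python
# Each data line in StandardizedVariants.txt has the form:
#   <base> <selector>; <description>; # <comment>
# The SAMPLE below imitates that layout for demonstration purposes.
SAMPLE = """\
2205 FE00; zero with long diagonal stroke form; # EMPTY SET
2229 FE00; with serifs; # INTERSECTION
"""

def parse_variants(text):
    """Return the set of sanctioned (base, variation selector) pairs."""
    pairs = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop trailing comment
        if not line:
            continue
        codes = line.split(";", 1)[0].split()  # e.g. ["2205", "FE00"]
        base, selector = (int(c, 16) for c in codes)
        pairs.add((chr(base), chr(selector)))
    return pairs

pairs = parse_variants(SAMPLE)
assert ("\u2205", "\uFE00") in pairs   # sanctioned: interpretable
assert ("a", "\uFE00") not in pairs    # unsanctioned: ignore on display
```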

    > > Canonical reordering is scoped to stop at combining class = 0.
    >
    > (I know it is. But I confess I'm not sure why.)

    Because God, er...., um... Mark Davis created it that way. ;-)

    > > It doesn't say that it applies to combining character sequences
    > > per se. It applies to *decomposed* character sequences
    > > (meaning, effectively, any sequence which has had the recursive
    > > application of the decomposition mappings done).
    >
    > Yes, for the definition of normalisation. But not necessary for
    > canonical equivalence. Your point?

    Of course it is necessary for canonical equivalence:

    D24 Canonical equivalent: Two character sequences are said to be
        canonical equivalents if their full canonical decompositions
        are identical.
        
    D23 Canonical decomposition: The decomposition of a character that
        results from recursively applying the canonical mappings found
        in the names list of Section 16.1, Character Names List, and those
        described in Section 3.12, Conjoining Jamo Behavior, until no
        characters can be further decomposed, and then reordering
                                              ^^^^^^^^^^^^^^^^^^^
        nonspacing marks according to Section 3.11, Canonical Ordering
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        Behavior.
        ^^^^^^^^
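    Definitions D23 and D24 translate directly into a check: two strings
    are canonical equivalents if and only if their full canonical
    decompositions (NFD) are identical. A minimal sketch:

```python
import unicodedata

def canonically_equivalent(s1, s2):
    """D24: identical full canonical decompositions (D23 == NFD)."""
    return unicodedata.normalize("NFD", s1) == unicodedata.normalize("NFD", s2)

# Precomposed vs. decomposed forms of a-ring:
assert canonically_equivalent("\u00E5", "a\u030A")
# Mark-order differences across classes 220 and 230 are equivalent...
assert canonically_equivalent("a\u030A\u0323", "a\u0323\u030A")
# ...but an intervening CGJ (class 0) blocks the reordering step:
assert not canonically_equivalent("a\u030A\u034F\u0323", "a\u0323\u034F\u030A")
```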
        
    > > Take a Myanmar example: /kau/:
    > >
    > > character sequence:   <1000, 1031, 102C, 1039, 200C>
    > > combining?:              no   yes   yes   yes    no
    > > combining classes:        0     0     0     9     0
    > > comb char sequence:     ----------------------
    > > canon reorder scope:    ---|  ---|  ---------|  ---|
    > >
    > > The combining character sequence here is: <1000, 1031, 102C, 1039>
    > > The *syllable* consists of that plus the trailing ZWNJ.
    > > But the relevant sequences for application of the
    > > canonical reordering algorithm are each sequence starting
    > > with combining class zero and continuing through any
    > > sequence with combining class not zero.
    >
    > Formally, a character *pair* based definition is enough:
    > xy S yx, if 0 < cc(y) < cc(x) (and apply that repeatedly);
    > no need to define any "canonically reordering scope", though
    > that may be marginally more efficient in an implementation
    > of normalisation (but this is getting beside the topic of this
    > discussion).

    I'm talking about "scope" here generically. I realize that
    the algorithm is based on pair-based swapping, and there is
    no necessity to have a formally-defined scope. The point,
    however, as you recognize, is that any character with
    cc=0 will limit the scope that any sequence of pair-swappings
    can impact.
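    The pair-based swapping, and the implicit scoping it produces, can be
    written out directly. A sketch (unicodedata.combining returns the
    canonical combining class, 0 for starters and non-combining
    characters):

```python
import unicodedata

def canonical_reorder(s):
    """Repeatedly swap adjacent pairs xy -> yx when 0 < cc(y) < cc(x).
    Any cc=0 character limits how far a mark can bubble leftward,
    which is the 'scope' effect under discussion."""
    chars = list(s)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            ccx = unicodedata.combining(chars[i])
            ccy = unicodedata.combining(chars[i + 1])
            if 0 < ccy < ccx:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                swapped = True
    return "".join(chars)

# On already fully decomposed text this agrees with NFD:
s = "a\u030A\u0323"        # ring above (cc 230), dot below (cc 220)
assert canonical_reorder(s) == unicodedata.normalize("NFD", s)
# A cc=0 character (here CGJ) blocks the swap entirely:
t = "a\u030A\u034F\u0323"
assert canonical_reorder(t) == t
```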

    > > I don't see how introduction of CGJ into such sequences calls
    > > any of the definitions or algorithms into question.
    >
    > No, not the algorithm, but the basic idea and design. The algorithm
    > as such has no "idea" how or why the combining class numbers
    > were assigned. But we humans do, or might have.

    True.

    >
    > Again, why should not <a, ring above, cgj, dot below> be canonically
    > equivalent to <a, dot below, cgj, ring above>, when <a, ring above,
    > dot below> is canonically equivalent to <a, dot below, ring above>?
    > And I want a design answer, not a formal answer! (The latter I already
    > know, and is uninteresting.)

    The formal answer is the true and interesting answer!

    It shouldn't be canonically equivalent because it *isn't*
    canonically equivalent.

    But instead of obsessing about the particular case of the CGJ,
    admit that the same shenanigans can apply to any number of
    default ignorable characters which will not result in visually
    distinct renderings under normal assumptions about rendering.

    I'm detecting a deeper concern here -- that such a situation
    should not be allowed in the standard at all, as a matter
    of design and architecture. But as a matter of practicality,
    given the complexity of text representation needs in the
    Unicode Standard, I don't think you can legislate these kinds
    of edge cases away entirely.

    > Since I think <a, ring above, cgj, dot below> should be canonically
    > equivalent to <a, dot below, cgj, ring above>, but cannot be made
    > so (now), the only ways out seem to be to either formally deprecate
    > CGJ, or at least confine it to very specific uses. Other occurrences
    > would not be ill-formed or illegal, but would then be non-conforming.

    And I disagree with you, obviously. It should neither be
    deprecated nor constrained from use where it may helpfully
    solve a problem of text representation (in Biblical Hebrew).

    --Ken



    This archive was generated by hypermail 2.1.5 : Wed Aug 06 2003 - 17:09:31 EDT