RE: Questions on ZWNBS - for line initial holam plus alef

From: Kent Karlsson (kentk@cs.chalmers.se)
Date: Thu Aug 07 2003 - 12:38:12 EDT


    ...
    > > (them being canonically equivalent)? An invisible combining
    > > character does not interfere typographically with anything, it
    > > being invisible!
    >
    > The same thing can be said about any inserted invisible character,
    > combining or not.
    >
    > How is: <a, ring above, null, dot below> supposed to be different from
    > <a, dot below, null, ring above>

    The first would be an å followed by a separate dot below (under a
    space, according to p. 131 of TUS 3.0). The second would be an
    <a, dot below> with a separate ring above (over a space, according
    to TUS 3.0 p. 131).

    > How is: <a, ring above, LRM, dot below> supposed to be different from
    > <a, dot below, LRM, ring above>

    As above (yes, <a, ring above, null, dot below> would look the same
    as <a, ring above, LRM, dot below>; but neither of these is a single
    combining sequence).
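The non-equivalence being discussed here can be checked mechanically. A minimal Python sketch using the standard `unicodedata` module (U+030A ring above, ccc 230; U+0323 dot below, ccc 220; U+200E LRM, ccc 0):

```python
import unicodedata as ud

RING, DOT, LRM = "\u030A", "\u0323", "\u200E"  # ccc 230, 220, 0

# Without an intervening class-0 character, the two orders are
# canonically equivalent: NFD reorders both to <a, dot below, ring above>.
print(ud.normalize("NFD", "a" + RING + DOT) ==
      ud.normalize("NFD", "a" + DOT + RING))            # True

# With LRM (combining class 0) in between, canonical reordering is
# blocked, so the two sequences stay distinct under normalisation.
s1 = "a" + RING + LRM + DOT
s2 = "a" + DOT + LRM + RING
print(ud.normalize("NFD", s1) != ud.normalize("NFD", s2))  # True
```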

    > In display, they might not be distinct, unless you were doing
    > some kind of
    > show-hidden display. Yet these sequences are not canonically
    > equivalent, and the presence of an embedded control character or an
    > embedded format control character would block canonical reordering.
    >
    > Of course, they *might* be distinct in rendering, depending on
    > what assumptions the renderer makes about default ignorable
    > characters and their interaction with combining character sequences.
    > But you cannot depend on them being distinct in display -- the
    > standard doesn't mandate the particulars here.

    Well, it does (did?) say "should"...

    > Whether you think it is *reasonable* or not that there should be
    > non-canonically equivalent ways of representing the same
    > visual display, sequences such as those above, including sequences
    > with CGJ, are possible and allowed by the standard. They are:
    >
    > a. well-formed sequences, conformantly interpretable
    > b. could be displayed by reasonable renderers, making reasonable
    > assumptions, as visually identical
    >
    > I have been pointing out use of the CGJ, which *exists* as an encoded

    Regrettable!

    > character, and which has a particular set of properties defined,
    > would result in the kinds of non-canonically equivalent ordering
    > distinctions required in Hebrew, if inserted into vowel sequences.

    As I've mentioned, if restricted (similarly to the VS restrictions)
    to particular cases (like just before, or between, Hebrew and Arabic
    vowel marks), then OK. But only because the combining classes
    of the Arabic and Hebrew vowel marks are bizarre (read: wrong).

    ...
    > > The other invisible (per se!) combining characters with combining
    > > class 0, the variation selectors, are ok, since their *conforming*
    > > use is very highly constrained. Maybe I've been wrong, but I have
    > > taken CGJ as similarly constrained, as it was given a semantics
    > > only when followed by a base character (but now it seems to have
    > > no semantics at all).
    >
    > There was no such constraint defined for CGJ.

    While perhaps not explicitly stated as a restriction, the only
    *intended* use (after some suggestions had been dropped) was at
    the *end* of a combining character sequence.

    > The current statement
    > about CGJ is merely that it should be ignored in language-sensitive
    > sorting and searching unless "it specifically occurs within
    > a tailored collation element mapping." There is no constraint
    > on what particular sequences involving CGJ could be tailored
    > that way, and hence no constraint on what particular sequences
    > CGJ might occur in, in Unicode plain text.
    >
    > > > A combining character sequence is a base character followed
    > > > by any number of combining characters. There is no constraint
    > > > in that definition that the combining characters have to
    > > > have non-zero combining class.
    > >
    > > Well, you cannot *conformantly* place a VS anywhere in a combining
    > > sequence! Only certain combinations of base+vs are allowed in
    > > any given version of Unicode. (Breaking that does not make the
    > > combining sequence ill-formed, or illegal, but would make it
    > > non-conformant, just like using an unassigned code point.)
    >
    > Actually, it is not non-conformant like using an unassigned
    > code point would be. The latter is directly subject to conformance
    > clause C6:
    >
    > C6 A process shall not interpret an unassigned code point as an
    > abstract character.
    >
    > The case for variation sequences is subtly different. Suppose
    > I encounter a variation sequence <X, VS1>, where X could be
    > any Unicode character. X itself is conformantly interpretable.
    > VS1 itself is conformantly interpretable. The constraints are
    > on the interpretation of the variation sequence itself. And
    > they consist of:
    >
    > "Only the variation sequences specifically defined in the
    > file StandardizedVariants.txt in the Unicode Character
    > Database are sanctioned for standard use; in all other
    > cases the variation selector cannot change the visual
    > appearance of the preceding base character from what it
    > would have had in the absence of the variation selector."
    >
    > In other words, you can drop VS1's to your heart's content into
    > plain text, but a conformant implementation should ignore all
    > of them, unless a) it is interpreting variation selectors, and
    > b) it encounters a particular sequence defined in
    > StandardizedVariants.txt.

    But since they too have combining class 0, inserting them
    *between* combining characters (of non-zero combining class)
    will cause a normalisation issue (not a technical problem,
    but a principles problem).
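The same class-0 blocking effect is easy to observe with a variation selector. A small sketch (VS1 is U+FE00; the marks are illustrative choices of mine):

```python
import unicodedata as ud

VS1 = "\uFE00"                      # VARIATION SELECTOR-1
print(ud.combining(VS1))            # 0 -- combining class 0, like CGJ

# Dropped between two non-zero-class marks, VS1 blocks the canonical
# reordering that would otherwise make the two orders equivalent.
a = ud.normalize("NFD", "a\u030A" + VS1 + "\u0323")  # ring, VS1, dot
b = ud.normalize("NFD", "a\u0323" + VS1 + "\u030A")  # dot, VS1, ring
print(a != b)                       # True
```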

    > The cooccurrence constraints on VS1's are constraints on the
    > *encoding committees* regarding what sequences they will or will
    > not allow into StandardizedVariants.txt (for various reasons):
    >
    > "The base character in a variation sequence is never a combining
    > character or a decomposable character."
    >
    > That means the UTC will never make such a variation sequence
    > interpretable by putting it into StandardizedVariants.txt.

    Ideally the VSes should have been given a low non-zero combining
    class (e.g. 1)...

    > *But*, a text user who drops a VS1 into Unicode plain text
    > after a combining character doesn't "commit a foul" thereby --
    > he has just put a character into a position that no conformant
    > implementation will do other than ignore on display.

    But it does mess up (hinder) the canonical reordering that
    maybe *should* have taken place! They should be constrained
    to occur just after a base character (to make up for the design
    flaw of them getting combining class 0).

    > > > Canonical reordering is scoped to stop at combining class = 0.
    > >
    > > (I know it is. But I confess I'm not sure why.)
    >
    > Because God, er...., um... Mark Davis created it that way. ;-)

    Eh, not really the answer I expected. This particular behaviour makes
    (marginal!) sense for *enclosing* combining characters (and that means
    something visually...). I'm not so sure it makes sense for any other
    combining character (like combining vowels, or the recent flurry of
    invisible combining characters).
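For reference, the scoping being discussed can be sketched in a few lines: stable-sort each maximal run of non-zero-class characters by combining class, with every class-0 character acting as a barrier. This is an illustrative sketch, not the normative algorithm text:

```python
from unicodedata import combining

def canonical_reorder(s: str) -> str:
    """Sketch of canonical reordering: stable-sort runs of characters
    with non-zero combining class; class 0 acts as a barrier."""
    out, run = [], []
    for ch in s:
        if combining(ch) == 0:
            run.sort(key=combining)  # Python's sort is stable
            out.extend(run)
            run = []
            out.append(ch)
        else:
            run.append(ch)
    run.sort(key=combining)
    out.extend(run)
    return "".join(out)

# <a, ring above (230), dot below (220)> -> dot below moves first;
# a CGJ (class 0) between the marks stops the reordering.
assert canonical_reorder("a\u030A\u0323") == "a\u0323\u030A"
assert canonical_reorder("a\u030A\u034F\u0323") == "a\u030A\u034F\u0323"
```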

    > > Yes, for the definition of normalisation. But not necessary for
    > > canonical equivalence. Your point?
    >
    > Of course it is necessary for canonical equivalence:
    >
    > D24 Canonical equivalent: Two character sequences are said to be
    ...

    That's one way of defining canonical equivalence. There are
    equivalent(!) ways that do not go via the NFD normal forms. However,
    I wasn't really going that far. I was just saying that you determine
    whether XxyY is canonically equivalent to XyxY by just looking at
    the combining classes of the characters x and y. You need not
    compute the NFD forms of XxyY and XyxY before making that
    determination. (This is a rather immediate consequence of an
    alternate, but equivalent, definition of canonical equivalence.)
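That local determination can be written out directly. A hypothetical helper (the name is mine), assuming x and y are single characters, cross-checked against the NFD-based definition:

```python
from unicodedata import combining, normalize

def swap_is_canonical(x: str, y: str) -> bool:
    """XxyY is canonically equivalent to XyxY iff x and y both have
    non-zero combining class and their classes differ."""
    cx, cy = combining(x), combining(y)
    return cx != 0 and cy != 0 and cx != cy

# ring above (230) / dot below (220): swappable
print(swap_is_canonical("\u030A", "\u0323"))   # True
# CGJ (class 0) blocks the swap
print(swap_is_canonical("\u034F", "\u0323"))   # False

# Cross-check: local test agrees with comparing the NFD forms.
nfd_equiv = (normalize("NFD", "a\u030A\u0323") ==
             normalize("NFD", "a\u0323\u030A"))
print(nfd_equiv == swap_is_canonical("\u030A", "\u0323"))  # True
```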

    ...
    > > > I don't see how introduction of CGJ into such sequences calls
    > > > any of the definitions or algorithms into question.
    > >
    > > No, not the algorithm, but the basic idea and design. The algorithm
    > > as such has no "idea" how or why the combining class numbers
    > > were assigned. But we humans do, or might have.
    >
    > True.

    Which is one of my points!

    > > Again, why should not <a, ring above, cgj, dot below> be canonically
    > > equivalent to <a, dot below, cgj, ring above>, when <a, ring above,
    > > dot below> is canonically equivalent to <a, dot below, ring above>?
    > > And I want a design answer, not a formal answer! (The latter I
    > > already know, and it is uninteresting.)
    >
    > The formal answer is the true and interesting answer!
    >
    > It shouldn't be canonically equivalent because it *isn't*
    > canonically equivalent.

    That's just a stability answer. It does not say why CGJ was given
    (mistakenly, I'd say) combining class 0 in the first place.
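Concretely, it is CGJ's class-0 assignment that halts the reordering and breaks the equivalence:

```python
import unicodedata as ud

CGJ = "\u034F"                      # COMBINING GRAPHEME JOINER
print(ud.combining(CGJ))            # 0

s1 = "a\u030A" + CGJ + "\u0323"     # <a, ring above, CGJ, dot below>
s2 = "a\u0323" + CGJ + "\u030A"     # <a, dot below, CGJ, ring above>
# Not canonically equivalent: class 0 blocks canonical reordering.
print(ud.normalize("NFD", s1) == ud.normalize("NFD", s2))  # False
```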

    > But instead of obsessing about the particular case of the CGJ,
    > admit that the same shenanigans can apply to any number of
    > default ignorable characters which will not result in visually
    > distinct renderings under normal assumptions about rendering.

    No, this particular problem applies only to combining characters
    of class 0 that are invisible, since they betray the very idea of
    canonical reordering.

    > I'm detecting a deeper concern here -- that such a situation
    > should not be allowed in the standard at all, as a matter
    > of design and architecture. But as a matter of practicality,
    > given the complexity of text representation needs in the
    > Unicode Standard, I don't think you can legislate these kinds
    > of edge cases away entirely.

    Again, this particular problem applies only to combining
    characters of class 0 that are invisible. Yes, there are other
    cases which are, and should be, non-equivalent, but should
    look the same (except when doing "show invisibles").

    > > Since I think <a, ring above, cgj, dot below> should be canonically
    > > equivalent to <a, dot below, cgj, ring above>, but cannot be made
    > > so (now), the only ways out seem to be to either formally deprecate
    > > CGJ, or at least confine it to very specific uses. Other occurrences
    > > would not be ill-formed or illegal, but would then be
    > > non-conforming.
    >
    > And I disagree with you, obviously. It should neither be
    > deprecated nor constrained from use where it may helpfully
    > solve a problem of text representation (in Biblical Hebrew).

    Emphasis: "where it may helpfully solve a problem of text
    representation (in Biblical Hebrew)". There we can agree, even
    though I don't find that particular hack to be the best solution.
    But if constrained *to* occur just before Hebrew (and Arabic?)
    vowels (or at the end of a combining sequence), OK. (Which I have
    said before.)
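To illustrate that agreed-on Hebrew use (the particular letter here is my choice, purely for illustration): patah has combining class 17 and qamats class 18, so normalisation collapses the two possible vowel orders, while a CGJ between the vowels keeps the author's order distinct.

```python
import unicodedata as ud

LAMED, PATAH, QAMATS, CGJ = "\u05DC", "\u05B7", "\u05B8", "\u034F"
print(ud.combining(PATAH), ud.combining(QAMATS))   # 17 18

# Without CGJ the two vowel orders collapse under normalisation:
print(ud.normalize("NFD", LAMED + QAMATS + PATAH) ==
      ud.normalize("NFD", LAMED + PATAH + QAMATS))          # True

# With CGJ between the vowels, the original order is preserved:
print(ud.normalize("NFD", LAMED + QAMATS + CGJ + PATAH) ==
      ud.normalize("NFD", LAMED + PATAH + CGJ + QAMATS))    # False
```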

                    /kent k

    > --Ken



    This archive was generated by hypermail 2.1.5 : Thu Aug 07 2003 - 13:50:14 EDT