From: Kenneth Whistler (firstname.lastname@example.org)
Date: Wed Aug 06 2003 - 16:19:34 EDT
Kent Karlsson responded:
> > > I see no particular *technical* problem with using WJ, though. In
> > > contrast to the suggestion of using CGJ (re. another problem)
> > > anywhere else but at the end of a combining sequence. CGJ has
> > > combining class 0, despite being invisible and not ("visually")
> > > interfering with any other combining mark. Using CGJ at a
> > > non-final position in a combining sequence puts in doubt the
> > > entire idea with combining classes and normal forms.
> > Why?
> See above (I DID write the motivation!).
I guess that I did not (and still do not) see the motivation for
your final statement.
> Combining classes are generally
> assigned according to "typographic placement". Combining characters
> (except those that are really letters) that have the "same" placement,
> and "interfere typographically" are assigned the same combining class,
> while those that don't get different classes, and the relative order is
> then considered unimportant (canonically equivalent). How is then,
> e.g. <a, ring above, cgj, dot below> supposed to be different from
> <a, dot below, cgj, ring above> (supposing all involved characters
> are fully supported), when <a, ring above, dot below> is NOT
> supposed to be much different from <a, dot below, ring above>
> (them being canonically equivalent)? An invisible combining character
> does not interfere typographically with anything, it being invisible!
The same thing can be said about any inserted invisible character,
combining or not.
How is <a, ring above, null, dot below> supposed to be different from
<a, dot below, null, ring above>?
How is <a, ring above, LRM, dot below> supposed to be different from
<a, dot below, LRM, ring above>?
In display, they might not be distinct, unless you were doing some kind of
show-hidden display. Yet these sequences are not canonically
equivalent, and the presence of an embedded control character or an
embedded format control character would block canonical reordering.
Of course, they *might* be distinct in rendering, depending on
what assumptions the renderer makes about default ignorable
characters and their interaction with combining character sequences.
But you cannot depend on them being distinct in display -- the
standard doesn't mandate the particulars here.
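The blocking behavior described here can be seen directly with Python's stdlib `unicodedata` module, a minimal sketch (LRM, U+200E, stands in for any invisible ccc=0 character):

```python
import unicodedata

# Two marks with distinct non-zero combining classes:
# U+030A COMBINING RING ABOVE (ccc=230), U+0323 COMBINING DOT BELOW (ccc=220).
ring_then_dot = "a\u030A\u0323"
dot_then_ring = "a\u0323\u030A"

# Without an intervening character, canonical reordering makes these
# canonically equivalent: their NFD forms are identical.
assert unicodedata.normalize("NFD", ring_then_dot) == \
       unicodedata.normalize("NFD", dot_then_ring)

# Insert U+200E LEFT-TO-RIGHT MARK (invisible format control, ccc=0)
# between the marks. Reordering stops at ccc=0, so the two orders are
# no longer canonically equivalent.
with_lrm_1 = "a\u030A\u200E\u0323"
with_lrm_2 = "a\u0323\u200E\u030A"
assert unicodedata.normalize("NFD", with_lrm_1) != \
       unicodedata.normalize("NFD", with_lrm_2)
```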
Whether you think it is *reasonable* or not that there should be
non-canonically equivalent ways of representing the same
visual display, sequences such as those above, including sequences
with CGJ, are possible and allowed by the standard. They are:
a. well-formed sequences, conformantly interpretable, and
b. displayable by reasonable renderers, making reasonable
assumptions, as visually identical.
I have been pointing out that use of the CGJ, which *exists* as an
encoded character and which has a particular set of properties
defined, would result in the kinds of non-canonically-equivalent
ordering distinctions required in Hebrew, if inserted into vowel
sequences.
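As an illustration of the Hebrew case (a sketch, using patah and qamats as the vowel pair; the actual Biblical Hebrew issue involves various vowel sequences):

```python
import unicodedata

CGJ = "\u034F"     # COMBINING GRAPHEME JOINER, ccc=0
PATAH = "\u05B7"   # HEBREW POINT PATAH, ccc=17
QAMATS = "\u05B8"  # HEBREW POINT QAMATS, ccc=18
BET = "\u05D1"

# Without CGJ, canonical reordering sorts the vowels by combining
# class, so the original ordering distinction is lost under
# normalization:
assert unicodedata.normalize("NFD", BET + QAMATS + PATAH) == \
       BET + PATAH + QAMATS

# With CGJ inserted between the vowels, reordering is blocked and
# the original order survives normalization:
seq = BET + QAMATS + CGJ + PATAH
assert unicodedata.normalize("NFD", seq) == seq
```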
Those are facts about the current standard, as currently
defined. And unless you or someone else convinces the UTC to
establish cooccurrence constraints on CGJ or to change its
properties, they will continue to be current facts about the
standard.
> The other invisible (per se!) combining characters with combining
> class 0, the variation selectors, are ok, since their *conforming* use
> is very highly constrained. Maybe I've been wrong, but I have taken
> CGJ as similarly constrained as it was given a semantics only when
> followed by a base character (but now it seems to have no semantics
> at all).
There was no such constraint defined for CGJ. The current statement
about CGJ is merely that it should be ignored in language-sensitive
sorting and searching unless "it specifically occurs within
a tailored collation element mapping." There is no constraint
on what particular sequences involving CGJ could be tailored
that way, and hence no constraint on what particular sequences
CGJ might occur in, in Unicode plain text.
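The default (untailored) treatment described here, that CGJ carries no weight in language-sensitive comparison, can be modeled with a toy sketch (this deliberately ignores everything else real UCA collation does):

```python
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER

def default_collation_key(s: str) -> str:
    # In default, untailored collation CGJ is ignorable: it contributes
    # no collation weight, so strings differing only by CGJ compare
    # equal. This toy key models that by deleting CGJ before comparing.
    return s.replace(CGJ, "")

# Strings differing only by CGJ compare equal by default:
assert default_collation_key("se" + CGJ + "arch") == \
       default_collation_key("search")
```

A tailored collation could instead assign weight to specific sequences containing CGJ, which is exactly the escape hatch the quoted statement leaves open.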
> > A combining character sequence is a base character followed
> > by any number of combining characters. There is no constraint
> > in that definition that the combining characters have to
> > have non-zero combining class.
> Well, you cannot *conformantly* place a VS anywhere in a combining
> sequence! Only certain combinations of base+vs are allowed in
> any given version of Unicode. (Breaking that does not make the
> combining sequence ill-formed, or illegal, but would make it
> non-conformant, just like using an unassigned code point.)
Actually, it is not non-conformant in the way using an unassigned
code point would be. The latter is directly subject to conformance
clause C6: a process shall not interpret an unassigned code point
as an abstract character.
The case for variation sequences is subtly different. Suppose
I encounter a variation sequence <X, VS1>, where X could be
any Unicode character. X itself is conformantly interpretable.
VS1 itself is conformantly interpretable. The constraints are
on the interpretation of the variation sequence itself. And
they consist of:
"Only the variation sequences specifically defined in the
file StandardizedVariants.txt in the Unicode Character
Database are sanctioned for standard use; in all other
cases the variation selector cannot change the visual
appearance of the preceding base character from what it
would have had in the absence of the variation selector."
In other words, you can drop VS1's to your heart's content into
plain text, but a conformant implementation should ignore all
of them, unless a) it is interpreting variation selectors, and
b) it encounters a particular sequence defined in
StandardizedVariants.txt.
The cooccurrence constraints on VS1's are constraints on the
*encoding committees* regarding what sequences they will or will
not allow into StandardizedVariants.txt (for various reasons):
"The base character in a variation sequence is never a combining
character or a decomposable character."
That means the UTC will never make such a variation sequence
interpretable by putting it into StandardizedVariants.txt.
*But*, a text user who drops a VS1 into Unicode plain text
after a combining character doesn't "commit a foul" thereby --
he has just put a character into a position where no conformant
implementation will do anything other than ignore it on display.
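The conformant behavior just described can be sketched as a filter. The sanctioned set below is a tiny hypothetical stand-in for StandardizedVariants.txt, not its actual contents:

```python
VS1 = "\uFE00"  # VARIATION SELECTOR-1

# Hypothetical, tiny stand-in for StandardizedVariants.txt: a set of
# (base, variation selector) pairs that are sanctioned for use.
SANCTIONED = {("\u2229", VS1)}  # e.g. INTERSECTION + VS1

def effective_display_sequence(text: str) -> str:
    """Drop variation selectors that are not part of a sanctioned
    variation sequence; a conformant renderer ignores them on
    display, while sanctioned sequences are passed through intact."""
    out = []
    for ch in text:
        if ch == VS1 and (not out or (out[-1], ch) not in SANCTIONED):
            continue  # unsanctioned selector: ignore on display
        out.append(ch)
    return "".join(out)

# A VS1 dropped after a combining mark is simply ignored:
assert effective_display_sequence("a\u0301" + VS1) == "a\u0301"
# A sanctioned sequence is kept for the renderer to interpret:
assert effective_display_sequence("\u2229" + VS1) == "\u2229" + VS1
```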
> > Canonical reordering is scoped to stop at combining class = 0.
> (I know it is. But I confess I'm not sure why.)
Because God, er...., um... Mark Davis created it that way. ;-)
> > It doesn't say that it applies to combining character sequences
> > per se. It applies to *decomposed* character sequences
> > (meaning, effectively, any sequence which has had the recursive
> > application of the decomposition mappings done).
> Yes, for the definition of normalisation. But not necessarily for
> canonical equivalence. Your point?
Of course it is necessary for canonical equivalence:
D24 Canonical equivalent: Two character sequences are said to be
canonical equivalents if their full canonical decompositions
are identical.
D23 Canonical decomposition: The decomposition of a character that
results from recursively applying the canonical mappings found
in the names list of Section 16.1, Character Names List, and those
described in Section 3.12, Conjoining Jamo Behavior, until no
characters can be further decomposed, and then reordering
nonspacing marks according to Section 3.11, Canonical Ordering
Behavior.
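D23 and D24 together map directly onto NFD in an implementation; a minimal sketch using Python's `unicodedata`:

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    # D24: two sequences are canonical equivalents iff their full
    # canonical decompositions (which is what NFD computes) are
    # identical.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE canonically
# decomposes to <A, U+030A COMBINING RING ABOVE>:
assert unicodedata.normalize("NFD", "\u00C5") == "A\u030A"
assert canonically_equivalent("\u00C5", "A\u030A")

# U+212B ANGSTROM SIGN canonically decomposes to the same sequence,
# so all three representations are canonical equivalents:
assert canonically_equivalent("\u212B", "\u00C5")
```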
> > Take a Myanmar example: /kau/:
> > character sequence:   <1000, 1031, 102C, 1039, 200C>
> > combining?:             no    yes   yes   yes   no
> > combining classes:       0     0     0     9     0
> > comb char sequence:   ----------------------
> > canon reorder scope:  ----| ----| ---------| ----|
> > The combining character sequence here is: <1000, 1031, 102C, 1039>
> > The *syllable* consists of that plus the trailing ZWNJ.
> > But the relevant sequences for application of the
> > canonical reordering algorithm are each sequence starting
> > with combining class zero and continuing through any
> > sequence with combining class not zero.
> Formally, a character *pair* based definition is enough:
> xy -> yx, if 0 < cc(y) < cc(x) (and apply that repeatedly);
> no need to define any "canonical reordering scope", though
> that may be marginally more efficient in an implementation
> of normalisation (but this is getting beside the topic of this
> thread).
I'm talking about "scope" here generically. I realize that
the algorithm is based on pair-based swapping, and there is
no necessity to have a formally defined scope. The point,
however, as you recognize, is that any character with
cc=0 will limit the scope over which any sequence of
pair-swappings can operate.
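The pair-swap formulation can be written out directly; a sketch (a naive bubble-sort-style loop, not an efficient normalizer):

```python
import unicodedata

def canonical_order(s: str) -> str:
    """Canonical ordering by repeated pair swaps: exchange adjacent
    characters x, y whenever 0 < cc(y) < cc(x). Any ccc=0 character
    (base letter, CGJ, format control) blocks swaps across it."""
    chars = list(s)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            cx = unicodedata.combining(chars[i])
            cy = unicodedata.combining(chars[i + 1])
            if 0 < cy < cx:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                swapped = True
    return "".join(chars)

# ring above (ccc=230) and dot below (ccc=220) swap into class order...
assert canonical_order("a\u030A\u0323") == "a\u0323\u030A"
# ...but CGJ (ccc=0) between them limits the scope: no swap occurs.
assert canonical_order("a\u030A\u034F\u0323") == "a\u030A\u034F\u0323"
```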
> > I don't see how introduction of CGJ into such sequences calls
> > any of the definitions or algorithms into question.
> No, not the algorithm, but the basic idea and design. The algorithm
> as such has no "idea" how or why the combining class numbers
> were assigned. But we humans do, or might have.
> Again, why should not <a, ring above, cgj, dot below> be canonically
> equivalent to <a, dot below, cgj, ring above>, when <a, ring above,
> dot below> is canonically equivalent to <a, dot below, ring above>?
> And I want a design answer, not a formal answer! (The latter I already
> know, and is uninteresting.)
The formal answer is the true and interesting answer!
It shouldn't be canonically equivalent because it *isn't*
canonically equivalent.
But instead of obsessing about the particular case of the CGJ,
admit that the same shenanigans are possible with any number of
default ignorable characters, which will not result in visually
distinct renderings under normal assumptions about rendering.
I'm detecting a deeper concern here -- that such a situation
should not be allowed in the standard at all, as a matter
of design and architecture. But as a matter of practicality,
given the complexity of text representation needs in the
Unicode Standard, I don't think you can legislate these kinds
of edge cases away entirely.
> Since I think <a, ring above, cgj, dot below> should be canonically
> equivalent to <a, dot below, cgj, ring above>, but cannot be made
> so (now), the only ways out seem to be to either formally deprecate
> CGJ, or at least confine it to very specific uses. Other occurrences
> would not be ill-formed or illegal, but would then be non-conforming.
And I disagree with you, obviously. It should neither be
deprecated nor constrained from use where it may helpfully
solve a problem of text representation (in Biblical Hebrew).
This archive was generated by hypermail 2.1.5 : Wed Aug 06 2003 - 17:09:31 EDT