From: Kent Karlsson (firstname.lastname@example.org)
Date: Wed Aug 06 2003 - 06:38:03 EDT
Kenneth Whistler wrote:
> Kent Karlsson said:
> > I see no particular *technical* problem with using WJ, though. In contrast
> > to the suggestion of using CGJ (re. another problem) anywhere else but
> > at the end of a combining sequence. CGJ has combining class 0, despite
> > being invisible and not ("visually") interfering with any other combining
> > mark. Using CGJ at a non-final position in a combining sequence puts
> > in doubt the entire idea with combining classes and normal forms.
See above (I DID write the motivation!). Combining classes are generally
assigned according to "typographic placement". Combining characters
(except those that are really letters) that have the "same" placement,
and "interfere typographically" are assigned the same combining class,
while those that don't get different classes, and the relative order is
then considered unimportant (canonically equivalent). How, then, is
e.g. <a, ring above, cgj, dot below> supposed to be different from
<a, dot below, cgj, ring above> (supposing all involved characters
are fully supported), when <a, ring above, dot below> is NOT
supposed to be much different from <a, dot below, ring above>
(them being canonically equivalent)? An invisible combining character
does not interfere typographically with anything, it being invisible!
The other invisible (per se!) combining characters with combining
class 0, the variation selectors, are ok, since their *conforming* use
is very highly constrained. Maybe I've been wrong, but I have taken
CGJ as similarly constrained, as it was given a semantics only when
followed by a base character (but now it seems to have no semantics
> There are any number of combining characters with combining
> class 0, including the vast majority of Indic dependent vowels,
> for instance.
These are ok. They are not invisible, and the vowels should not
reorder amongst themselves in a single combining sequence (I know,
there is normally only one vowel per syllable, but as the Hebrew
discussion has shown, one should not generalise too much),
regardless of placement (before, above, below, after, before&after,
So at least they should have the same combining class, regardless
of typographic placement. (This should have been the case also
for the Hebrew vowels...) But class 0 (which is specially treated),
I'm not sure if that was ideal.
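
To make the contrast concrete, here is a small sketch in Python. It assumes the standard unicodedata module, whose data reflects whatever Unicode version the Python build ships. The sample characters are chosen here for illustration and are not taken from the discussion above.

```python
import unicodedata

# Sample characters picked for illustration only.
samples = {
    "DEVANAGARI VOWEL SIGN AA": "\u093E",
    "DEVANAGARI VOWEL SIGN I": "\u093F",
    "MYANMAR VOWEL SIGN E": "\u1031",
    "HEBREW POINT HIRIQ": "\u05B4",
    "HEBREW POINT PATAH": "\u05B7",
}

# Look up the canonical combining class of each sample.
classes = {name: unicodedata.combining(ch) for name, ch in samples.items()}
for name, cc in classes.items():
    print(f"{name}: cc={cc}")

# The Hebrew points carry distinct non-zero classes, so normalisation
# reorders them relative to each other; the class-0 vowels never move.
alef = "\u05D0"
hebrew_reordered = (
    unicodedata.normalize("NFD", alef + "\u05B7\u05B4") == alef + "\u05B4\u05B7"
)
print(hebrew_reordered)
```

The Indic and Myanmar vowel signs report class 0, while the two Hebrew points report different non-zero classes, which is why a <patah, hiriq> order does not survive normalisation.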
> A combining character sequence is a base character followed
> by any number of combining characters. There is no constraint
> in that definition that the combining characters have to
> have non-zero combining class.
Well, you cannot *conformantly* place a VS just anywhere in a combining
sequence! Only certain combinations of base+vs are allowed in
any given version of Unicode. (Breaking that does not make the
combining sequence ill-formed, or illegal, but would make it
non-conformant, just like using an unassigned code point.)
> Canonical reordering is scoped to stop at combining class = 0.
(I know it is. But I confess I'm not sure why.)
> It doesn't say that it applies to combining character sequences
> per se. It applies to *decomposed* character sequences
> (meaning, effectively, any sequence which has had the recursive
> application of the decomposition mappings done).
Yes, for the definition of normalisation. But that is not necessary for
canonical equivalence. Your point?
> Take a Myanmar example: /kau/:
> character sequence:   <1000, 1031, 102C, 1039, 200C>
> combining?:            no    yes   yes   yes   no
> combining classes:     0     0     0     9     0
> comb char sequence:    ----------------------
> canon reorder scope:   ---|  ---|  ---------|  ---|
> The combining character sequence here is: <1000, 1031, 102C, 1039>
> The *syllable* consists of that plus the trailing ZWNJ.
> But the relevant sequences for application of the
> canonical reordering algorithm are each sequence starting
> with combining class zero and continuing through any
> sequence with combining class not zero.
Formally, a character *pair* based definition is enough:
xy ≡ yx, if 0 < cc(y) < cc(x) (and apply that repeatedly);
no need to define any "canonical reordering scope", though
that may be marginally more efficient in an implementation
of normalisation (but this is getting beside the topic of this
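
The pair-based rule can be sketched as follows in Python. This is only an illustration, assuming the standard unicodedata module for the combining classes; the function name canonical_reorder is made up here, and the sketch covers only the reordering step, not decomposition.

```python
import unicodedata

def canonical_reorder(s: str) -> str:
    """Repeatedly swap adjacent x, y whenever 0 < cc(y) < cc(x)."""
    chars = list(s)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            x, y = chars[i], chars[i + 1]
            if 0 < unicodedata.combining(y) < unicodedata.combining(x):
                chars[i], chars[i + 1] = y, x
                swapped = True
    return "".join(chars)

# <a, ring above, dot below>: cc 230 followed by cc 220, so the marks swap.
print(canonical_reorder("a\u030A\u0323") == "a\u0323\u030A")

# The Myanmar sequence quoted above: no adjacent pair satisfies the
# condition, so the sequence comes back unchanged.
myanmar = "\u1000\u1031\u102C\u1039\u200C"
print(canonical_reorder(myanmar) == myanmar)
```

No "scope" appears anywhere in the code: a character with combining class 0 simply never satisfies the swap condition, on either side of the pair.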
> I don't see how introduction of CGJ into such sequences calls
> any of the definitions or algorithms into question.
No, not the algorithm, but the basic idea and design. The algorithm
as such has no "idea" how or why the combining class numbers
were assigned. But we humans do, or might have.
Again, why should not <a, ring above, cgj, dot below> be canonically
equivalent to <a, dot below, cgj, ring above>, when <a, ring above,
dot below> is canonically equivalent to <a, dot below, ring above>?
And I want a design answer, not a formal answer! (The latter I already
know, and it is uninteresting.)
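
For reference, the formal behaviour can be checked mechanically. The following is a minimal sketch in Python, assuming the standard unicodedata module; the helper name canonically_equivalent is hypothetical, and it simply compares NFD forms.

```python
import unicodedata

RING_ABOVE = "\u030A"  # COMBINING RING ABOVE
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW
CGJ = "\u034F"         # COMBINING GRAPHEME JOINER

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent iff their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Combining classes of the three marks involved.
print([unicodedata.combining(c) for c in (RING_ABOVE, DOT_BELOW, CGJ)])

# Without CGJ the two orders are canonically equivalent.
without_cgj = canonically_equivalent(
    "a" + RING_ABOVE + DOT_BELOW, "a" + DOT_BELOW + RING_ABOVE
)

# With CGJ (class 0) in between, reordering is blocked and they differ.
with_cgj = canonically_equivalent(
    "a" + RING_ABOVE + CGJ + DOT_BELOW, "a" + DOT_BELOW + CGJ + RING_ABOVE
)
print(without_cgj, with_cgj)
```

This only restates the formal answer: the invisible class-0 character is what separates the two cases, which is exactly the design point being questioned.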
Since I think <a, ring above, cgj, dot below> should be canonically
equivalent to <a, dot below, cgj, ring above>, but cannot be made
so (now), the only ways out seem to be to either formally deprecate
CGJ, or at least confine it to very specific uses. Other occurrences
would not be ill-formed or illegal, but would then be non-conforming.