From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Aug 06 2003 - 19:13:21 EDT
Philippe Verdy said:
> > The same thing can be said about any inserted invisible character,
> > combining or not.
> >
> > How is: <a, ring above, null, dot below> supposed to be different from
> > <a, dot below, null, ring above>
> >
> > How is: <a, ring above, LRM, dot below> supposed to be different from
> > <a, dot below, LRM, ring above>
> >
> > In display, they might not be distinct, unless you were doing some
> > kind of show-hidden display. Yet these sequences are not canonically
> > equivalent, and the presence of an embedded control character or an
> > embedded format control character would block canonical reordering.
>
>
> I disagree with you, using a LRM mark in the middle of a combining
> sequence is conforming to canonicalization rules but is clearly
> ill-formed,
It is not. TUS 4.0, p. 71:

    D17a Defective combining character sequence: A combining character
    sequence that does not start with a base character.

    * Defective combining character sequences occur when a sequence
      of combining characters appears at the start of a string or
      follows a control or format character. Such sequences are
      defective from the point of view of handling of combining
      marks, but are not ill-formed.
             ^^^^^^^^^^^^^^^^^^^^^^

> as well as using a NULL control in the middle, which
> breaks the combining sequence.
I'm not claiming it doesn't break the combining sequence. Of
course it does. It creates a defective combining character
sequence, and that poses a challenge for rendering, since it
departs from the usual expectations for normal combining
character sequences. The renderer has to split hairs between
the fact that it is dealing with a defective combining
character sequence and the fact that it is dealing with a
default ignorable character, which is supposed to be ignored
by text processes to which it is not immediately applicable.
But I challenge you to find anything in the standard that
*prohibits* such sequences from occurring.
And *if* they occur, they are not canonically equivalent, which
was the point I was making to Kent.
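
The non-equivalence claimed above can be checked mechanically. What follows is a minimal sketch, not part of the original message, assuming Python's standard unicodedata module and taking "canonically equivalent" to mean "identical after NFD normalization". The helper name canonically_equivalent is hypothetical and chosen only for illustration.

```python
import unicodedata

RING_ABOVE = "\u030A"  # COMBINING RING ABOVE, combining class 230
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW, combining class 220
LRM = "\u200E"         # LEFT-TO-RIGHT MARK, combining class 0
NULL = "\u0000"        # NULL control character, combining class 0


def canonically_equivalent(s, t):
    """Hypothetical helper: compare two strings after NFD normalization."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


# Adjacent marks with different non-zero classes are reordered by NFD,
# so the two orders of <a, ring above, dot below> normalize to one form.
plain = canonically_equivalent("a" + RING_ABOVE + DOT_BELOW,
                               "a" + DOT_BELOW + RING_ABOVE)

# A class-0 character between the marks blocks canonical reordering,
# so the two orders remain distinct after normalization.
with_lrm = canonically_equivalent("a" + RING_ABOVE + LRM + DOT_BELOW,
                                  "a" + DOT_BELOW + LRM + RING_ABOVE)
with_null = canonically_equivalent("a" + RING_ABOVE + NULL + DOT_BELOW,
                                   "a" + DOT_BELOW + NULL + RING_ABOVE)

print(plain, with_lrm, with_null)
```

The first comparison succeeds and the other two fail, which matches the point that an embedded control or format character blocks canonical reordering without making the sequence ill-formed.
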
> The proposal to use CGJ however is legal: it does not break the
> combining sequences and grapheme clusters, and thus the whole
> encoded sequence encoded with CGJ will be considered by
> rendering engines, where CGJ is a no-op for rendering but not for
> the canonical ordering ...
Well, yes, which is why I have been advocating it as the
solution to the Biblical Hebrew text representation problem.
I agree with you about that. But it need not be characterized
as "legal" in opposition to the other examples I cited above.
All of these sequences are "legal" and allowed by the
standard.
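
As an illustration of why CGJ helps with the Hebrew case, here is a minimal sketch, again assuming Python's standard unicodedata module. The particular sequence (lamed with patah followed by hiriq) is an assumption picked for the example and is not taken from this message; the variable names are likewise hypothetical.

```python
import unicodedata

LAMED = "\u05DC"  # HEBREW LETTER LAMED, base character
PATAH = "\u05B7"  # HEBREW POINT PATAH, combining class 17
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ, combining class 14
CGJ = "\u034F"    # COMBINING GRAPHEME JOINER, combining class 0


def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


without_cgj = LAMED + PATAH + HIRIQ
with_cgj = LAMED + PATAH + CGJ + HIRIQ

# Without CGJ, canonical ordering sorts the points by combining class,
# so hiriq (14) moves in front of patah (17).
reordered = nfd(without_cgj)

# With CGJ (class 0) between them, the two points are never compared
# with each other, so the order as typed survives normalization.
preserved = nfd(with_cgj)

print([f"U+{ord(c):04X}" for c in reordered])
print([f"U+{ord(c):04X}" for c in preserved])
```

The sketch only shows the effect on canonical ordering; how a renderer treats the CGJ is outside what unicodedata can demonstrate.
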
--Ken