From: Peter Kirk (email@example.com)
Date: Thu Nov 20 2003 - 08:03:42 EST
On 19/11/2003 17:44, Philippe Verdy wrote:
>>This trick doesn't work if any of the CC's are in combining class zero.
>Of course, but which combining character of combining class 0 needs to
>combine with NBSP in a way that affects renderers?
>Do you think about sequences like <NBSP,CGJ>?
>Or about issues when rendering <07A6;THAANA ABAFILI;Mn;0;NSM;;;;;N;;;;;>
>which of course would be handled only as <WJ,SP,WJ,THAANA ABAFILI> ?
>Or about: <0901;DEVANAGARI SIGN CANDRABINDU;Mn;0;NSM;;;;;N;;;;;> after
><NBSP>, rendered as if it was <WJ,SP,WJ,CANDRABINDU> ?
>Or about <0903;DEVANAGARI SIGN VISARGA;Mc;0;L;;;;;N;;;;;> after <NBSP>
>which is this time a "Mc" character ?
>Or about all the Indic vowels which do not seem to be really combining on
>NBSP but would be rendered as a space followed by a defective isolated form
>of the vowel (so without vowel glyphs reordering around the space) ?
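For anyone who wants to check the properties quoted above, here is a minimal sketch using Python's standard unicodedata module. The helper name describe is mine, and the values printed come from whatever Unicode version the local Python build ships with, so treat it as a quick lookup rather than a normative reference.

```python
import unicodedata

# Characters mentioned in the quoted text: NBSP, CGJ, THAANA ABAFILI,
# DEVANAGARI SIGN CANDRABINDU, DEVANAGARI SIGN VISARGA.
CODE_POINTS = [0x00A0, 0x034F, 0x07A6, 0x0901, 0x0903]

def describe(cp):
    """Return (code point label, name, general category, combining class)."""
    ch = chr(cp)
    return (
        f"U+{cp:04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )

rows = [describe(cp) for cp in CODE_POINTS]
for row in rows:
    print(*row)
```

All five have canonical combining class 0, which is the case where the reordering trick breaks down.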
I wasn't thinking of any specific combining character. But I was
thinking of the general principle that if one wants to display an
isolated diacritic glyph, which is possible in principle, at least in
paradigm lists (and code charts!), for any of the characters you list
above, the recommended way of doing so is to apply them to SP or NBSP.
Unfortunately there are many problems and undesirable side effects of
>If we just say that <NBSP> behaves in all cases in renderers as if it was
><WJ,SP,WJ> where WJ is reordered with a pseudo-combining class 256, it
>solves many problems with the interpretation of NBSP, and it looks as if
>NBSP was a space letter; however NBSP is not a "Lo" character but really a
>"Zs" whitespace and thus justifiable out of the end margin (also NBSP does
>not prohibit word breaks but only line breaks), so it is more as if it was
>in fact: <LJ,SP,LJ> where LJ is a line-joiner, distinct also from ZWJ
>(zero-width joiner) used to hint ligatures but which does not prohibit any
Well, WJ itself is actually LJ, because, astonishingly, it does not
prohibit word breaks (see UAX29). Similarly ZWNBSP, ZWJ, and ZWNJ. As
format characters these are ignored when finding word breaks. The
implication is that <A,B,WJ,C,D> is a single word, but
<A,B,WJ,SPACE,WJ,C,D> and <A,B,WJ,$,WJ,C,D> are both two words despite
the obvious attempt to use WJ to force these to be understood as one
word (and despite the existence of alphabets in which "$" is considered
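To make the word-break point concrete, here is a toy sketch in Python. It is not an implementation of UAX29: it only imitates the one behaviour discussed here, that format (Cf) characters are ignored, and then treats maximal runs of letters and digits as words. The function name toy_words and that simplified notion of a word are my own assumptions.

```python
import unicodedata

WJ = "\u2060"  # WORD JOINER, general category Cf

def toy_words(text):
    """Split text into words after ignoring format (Cf) characters."""
    # Step 1: drop Cf characters, as if they were invisible to the segmenter.
    visible = "".join(c for c in text if unicodedata.category(c) != "Cf")
    # Step 2: collect maximal runs of alphanumeric characters.
    words, current = [], ""
    for c in visible:
        if c.isalnum():
            current += c
        elif current:
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words

print(toy_words("AB" + WJ + "CD"))              # ['ABCD']
print(toy_words("AB" + WJ + " " + WJ + "CD"))   # ['AB', 'CD']
print(toy_words("AB" + WJ + "$" + WJ + "CD"))   # ['AB', 'CD']
```

Because WJ is dropped before any boundary is looked for, wrapping SPACE or "$" in WJ changes nothing: the break is still found at the space or the dollar sign.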
As for line breaking (UAX14), WJ explicitly prohibits this; ZWJ and ZWNJ
are not listed, and so, as Cf characters, are ignored in the line breaking
algorithm. I note also that the combining mark CGJ is listed as GL and
so is not CM. The descriptive text of rules LB7a-c implies that CM =
combining mark whereas this is not in fact true; some combining marks
are not CM and some CM are not combining marks. In rule LB7b the term
"combining character sequence" is used, contrary to its regular defined
meaning, for a sequence of CM characters and the preceding non-CM character.
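The mismatch between the CM line breaking class and combining marks can be sketched as follows. Python's unicodedata module does not expose line breaking classes, so the small table ASSUMED_LB_CLASS below is hand-copied and is an assumption on my part (CGJ as GL is as stated above; the other two entries should be checked against LineBreak.txt). Only the general categories come from the library.

```python
import unicodedata

# Hand-copied line breaking classes: an assumption, not read from the UCD.
ASSUMED_LB_CLASS = {
    0x0301: "CM",  # COMBINING ACUTE ACCENT
    0x034F: "GL",  # COMBINING GRAPHEME JOINER
    0x0007: "CM",  # a C0 control character
}

def is_combining_mark(cp):
    """True if the general category is Mn, Mc or Me."""
    return unicodedata.category(chr(cp)).startswith("M")

def mismatches(table):
    """Code points where 'class is CM' and 'is a combining mark' disagree."""
    return [cp for cp, lb in table.items()
            if (lb == "CM") != is_combining_mark(cp)]

for cp in mismatches(ASSUMED_LB_CLASS):
    print(f"U+{cp:04X}", unicodedata.category(chr(cp)), ASSUMED_LB_CLASS[cp])
```

Under those assumed values, CGJ shows up as a combining mark that is not CM, and the control character as a CM that is not a combining mark.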
-- Peter Kirk firstname.lastname@example.org (personal) email@example.com (work) http://www.qaya.org/