Re: BOM as WJ?

From: Peter Kirk
Date: Thu Nov 20 2003 - 08:03:42 EST


    On 19/11/2003 17:44, Philippe Verdy wrote:

    > ...
    >>This trick doesn't work if any of the CC's are in combining class zero.
    >Of course, but which combining character of combining class 0 needs to
    >combine with NBSP in a way that affects renderers?
    >Do you think about sequences like <NBSP,CGJ>?
    >Or about issues when rendering <07A6;THAANA ABAFILI;Mn;0;NSM;;;;;N;;;;;>
    >after <NBSP>
    >which of course would be handled only as <WJ,SP,WJ,THAANA ABAFILI> ?
    >Or about: <0901;DEVANAGARI SIGN CANDRABINDU;Mn;0;NSM;;;;;N;;;;;> after <NBSP>
    >rendered as if it was <WJ,SP,WJ,CANDRABINDU> ?
    >Or about <0903;DEVANAGARI SIGN VISARGA;Mc;0;L;;;;;N;;;;;> after <NBSP>
    >which is this time a "Mc" character ?
    >Or about all the Indic vowels which do not seem to be really combining on
    >NBSP but would be rendered as a space followed by a defective isolated form
    >of the vowel (so without vowel glyphs reordering around the space) ?
    >Just curious...
    I wasn't thinking of any specific combining character. But I was
    thinking of the general principle that if one wants to display an
    isolated diacritic glyph, which is possible in principle, at least in
    paradigm lists (and code charts!), for any of the characters you list
    above, the recommended way of doing so is to apply them to SP or NBSP.
    Unfortunately there are many problems and undesirable side effects of
    this recommendation.
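
As a rough illustration of the properties under discussion, the following Python sketch looks up the general category and canonical combining class of the characters named in the quoted text, using only the standard `unicodedata` module. The `CHARS` table and the `describe` helper are names made up for this example, and the values reported depend on the Unicode version bundled with the Python interpreter, which is newer than the one current when this message was written.

```python
import unicodedata

# Characters mentioned in the thread, written as escapes so the source stays ASCII.
# CHARS and describe() are hypothetical names used only for this illustration.
CHARS = {
    "NBSP": "\u00A0",
    "CGJ": "\u034F",
    "THAANA ABAFILI": "\u07A6",
    "DEVANAGARI SIGN CANDRABINDU": "\u0901",
    "DEVANAGARI SIGN VISARGA": "\u0903",
}


def describe(ch):
    """Return (general category, canonical combining class) for one character."""
    return unicodedata.category(ch), unicodedata.combining(ch)


for label, ch in CHARS.items():
    gc, ccc = describe(ch)
    print(f"U+{ord(ch):04X} {label}: gc={gc} ccc={ccc}")

# An isolated diacritic built the recommended way: the mark applied to NBSP.
isolated = CHARS["NBSP"] + CHARS["DEVANAGARI SIGN CANDRABINDU"]
print([f"U+{ord(c):04X}" for c in isolated])
```

All of the marks listed have combining class 0, so canonical reordering will never move them relative to whatever follows the NBSP.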

    >If we just say that <NBSP> behaves in all cases in renderers as if it was
    ><WJ,SP,WJ> where WJ is reordered with a pseudo-combining class 256, it
    >solves many problems with the interpretation of NBSP, and it looks as if
    >NBSP was a space letter; however NBSP is not a "Lo" character but really a
    >"Zs" whitespace and thus justifiable out of the end margin (also NBSP does
    >not prohibit word breaks but only line breaks), so it is more as if it was
    >in fact: <LJ,SP,LJ> where LJ is a line-joiner, distinct also from ZWJ
    >(zero-width joiner) used to hint ligatures but which does not prohibit any
    Well, WJ itself is actually LJ, because, astonishingly, it does not
    prohibit word breaks (see UAX29). Similarly ZWNBSP, ZWJ, and ZWNJ. As
    format characters these are ignored when finding word breaks. The
    implication is that <A,B,WJ,C,D> is a single word, but
    <A,B,WJ,SPACE,WJ,C,D> and <A,B,WJ,$,WJ,C,D> are both two words despite
    the obvious attempt to use WJ to force these to be understood as one
    word (and despite the existence of alphabets in which "$" is considered
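
The word counts claimed above can be reproduced with a deliberately simplified Python sketch. It is not the full UAX29 algorithm: it only drops characters of general category Cf (which includes WJ) and then counts maximal runs of letters. The helper names `is_format` and `count_words` are invented for this example.

```python
import unicodedata


def is_format(ch):
    """True for characters of general category Cf (WJ, ZWJ, ZWNJ, U+FEFF, ...)."""
    return unicodedata.category(ch) == "Cf"


def count_words(text):
    """Count maximal runs of letters after ignoring Cf characters.

    A simplification for illustration only, not the UAX29 rule set.
    """
    words = 0
    in_word = False
    for ch in text:
        if is_format(ch):
            continue  # format characters are ignored when finding boundaries
        if ch.isalpha():
            if not in_word:
                words += 1
            in_word = True
        else:
            in_word = False
    return words


WJ = "\u2060"
print(count_words("AB" + WJ + "CD"))             # 1
print(count_words("AB" + WJ + " " + WJ + "CD"))  # 2
print(count_words("AB" + WJ + "$" + WJ + "CD"))  # 2
```

Under this simplification the WJ characters around SPACE or "$" make no difference to the count, which is the behaviour being complained about.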

    As for line breaking (UAX14), WJ explicitly prohibits this; ZWJ and ZWNJ
    are not listed, and so as Cf characters are ignored in the line breaking
    algorithm. I note also that the combining mark CGJ is listed as GL and
    so is not CM. The descriptive text of rules LB7a-c implies that CM =
    combining mark whereas this is not in fact true; some combining marks
    are not CM and some CM are not combining marks. In rule LB7b the term
    "combining character sequence" is used, contrary to its regular defined
    meaning, for a sequence of CM characters and the preceding non-CM character.
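
The mismatch between "combining mark" and the line-breaking class CM can be made concrete with a small Python sketch. Python's standard library does not expose the line-breaking property, so the `LINE_BREAK_CLASS` table below is a hand-copied assumption holding only the two assignments stated in this message (CGJ as GL, WJ as WJ). The function names are likewise hypothetical.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER
WJ = "\u2060"   # WORD JOINER

# Assumption: a tiny hand-written table holding only the classes named in the
# message above. It is not read from any library or from the UAX14 data file.
LINE_BREAK_CLASS = {CGJ: "GL", WJ: "WJ"}


def is_combining_mark(ch):
    """True when the general category is Mn, Mc or Me."""
    return unicodedata.category(ch).startswith("M")


def line_break_class(ch):
    """Return the class from the hand-written table, or None if not listed."""
    return LINE_BREAK_CLASS.get(ch)


for name, ch in (("CGJ", CGJ), ("WJ", WJ)):
    print(name, unicodedata.category(ch), is_combining_mark(ch), line_break_class(ch))
```

Here CGJ is a combining mark by general category and yet is not class CM in the table, which is one direction of the mismatch described above.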

    Peter Kirk
