Re: ZWJ, ZWNJ, CGJ and combination

From: Mark Davis (mark.davis@jtcsv.com)
Date: Mon Nov 10 2003 - 10:32:39 EST

    This is unpleasant; I wish I had taken a closer look at the structure for Khmer
    before it went in, because it is very problematic. At this point the UTC will
    have to take up this topic and figure out what to do.

    Mark
    __________________________________
    http://www.macchiato.com
    ► शिष्यादिच्छेत्पराजयम् ◄

    ----- Original Message -----
    From: "Kent Karlsson" <kentk@cs.chalmers.se>
    To: "'Peter Kirk'" <peterkirk@qaya.org>; "'Mark Davis'" <mark.davis@jtcsv.com>
    Cc: "'Unicode List'" <unicode@unicode.org>; "'Roozbeh Pournader'"
    <roozbeh@sharif.edu>
    Sent: Mon, 2003 Nov 10 03:01
    Subject: RE: ZWJ, ZWNJ, CGJ and combination

    ...
    >
    > I would see this use of ZWJ and ZWNJ as a mistake. But the publication
    > of this use made me propose to make ZWJ and ZWNJ into combining
    > characters. However, that was not accepted since that would interfere
    > with the Bidi algorithm. I'm not sure how bad that would be though.
    > (I wouldn't be surprised if it even would be beneficial, though it would
    > be a break in method compared to the current specification.)
    >
    > /kent k
    >

    ----- Original Message -----
    From: "Peter Kirk" <peterkirk@qaya.org>
    To: "Mark Davis" <mark.davis@jtcsv.com>
    Cc: "Unicode List" <unicode@unicode.org>
    Sent: Sun, 2003 Nov 09 13:13
    Subject: Re: ZWJ, ZWNJ, CGJ and combination

    ...
    > >
    > But does the Khmer script follow this rule? Please bear in mind that I
    > know nothing about this script. But in TUS v4.0 10.4 p.281 I read:
    >
    > > Ordering of Syllable Components. The standard order of components in
    > > an orthographic syllable as expressed in BNF is
    > > B {R | C} {S {R}}* {{Z} V} {O} {S}
    > > where
    > > B is a base character (consonant character, independent vowel
    > > character, and so on)
    > > R is a robat
    > > C is a consonant shifter
    > > S is a subscript consonant or independent vowel sign
    > > V is a dependent vowel sign
    > > Z is the zero width non-joiner
    > > O is any other sign
    >
    >
    > The first example given using ZWNJ, on p.282, starts with ba + ZWNJ +
    > triisap + ii, i.e. <1794, ZWNJ, 17CA, 17B8>. 1794 is a base character
    > (Lo), but 17CA and 17B8 are class 0 combining characters (Mn). The
    > syntax implies that other Mn characters, e.g. robat, 17CC, may occur
    > between the base character and the ZWNJ. So here is a case in natural
    > language where ZWNJ may be both preceded and followed by combining
    > characters, giving a technically defective combining sequence. Or have I
    > misunderstood things here?
    >
    > Note that I am not proposing a change to Khmer, but just a clarification
    > of definitions and the consistency of their application, and a good
    > reason why what is allowed in Khmer would not be allowed in Hebrew.
    >
    > --
    > Peter Kirk
    > peter@qaya.org (personal)
    > peterkirk@qaya.org (work)
    > http://www.qaya.org/
    >
    >
    >
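
The character properties cited for the sequence <1794, ZWNJ, 17CA, 17B8> can be checked against whatever Unicode data a given runtime ships. The following is a minimal sketch using Python's standard unicodedata module; the names SEQUENCE and describe are illustrative only, and the values reflect the Unicode version bundled with the interpreter rather than the 4.0 text under discussion.

```python
import unicodedata

# The sequence cited from TUS 4.0 p.282: ba + ZWNJ + triisap + ii.
SEQUENCE = "\u1794\u200C\u17CA\u17B8"


def describe(text):
    """Return (code point, general category, combining class) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.combining(ch))
        for ch in text
    ]


for row in describe(SEQUENCE):
    print(row)
```

The base character reports Lo, the ZWNJ reports Cf, and the two signs report Mn with canonical combining class 0, which matches the description in the message above.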

    ----- Original Message -----
    From: "Mark Davis" <mark.davis@jtcsv.com>
    To: "Peter Kirk" <peterkirk@qaya.org>
    Cc: "Unicode List" <unicode@unicode.org>
    Sent: Sun, 2003 Nov 09 11:11
    Subject: Re: ZWJ, ZWNJ, CGJ and combination

    > Let's try to be clear on the terms.
    >
    > Look at the definition of combining sequences:
    > D17 Combining character sequence: A character sequence consisting of either a
    > base character followed by a sequence of one or more combining characters, or
    > a sequence of one or more combining characters.
    >
    > Thus a combining character sequence *cannot* contain a ZWJ or any other Cf.
    >
    > Any use of a ZWJ before a combining mark produces a *defective* combining
    > character sequence (D17a), which isolates the combining mark from any
    > preceding base character.
    >
    > And as I said earlier:
    >
    > > - *Default* grapheme clusters do not include ZWJ; as a matter of fact,
    > > default grapheme clusters, except for Hangul Jamo Syllables and a few
    > > exceptional cases, are identical with combining sequences.
    > > http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
    >
    > > - *Tailored* grapheme clusters may include longer sequences, but it is not
    > > at all obvious whether they would ever contain ZWJ or ZWNJ.
    >
    > I'll expand on the latter. What constitutes a tailored grapheme cluster is up
    > to a particular process, and so one could contain a ZWJ. However, any combining
    > mark after a ZWJ does *not* apply to a previous base character within that
    > tailored grapheme cluster, so the use of a ZWJ would isolate that combining
    > mark. Such a sequence would not correspond to anything used in a natural
    > language.
    >
    > Mark
    > __________________________________
    > http://www.macchiato.com
    > ► शिष्यादिच्छेत्पराजयम् ◄
    >
    > ----- Original Message -----
    > From: "Peter Kirk" <peterkirk@qaya.org>
    > To: "Mark Davis" <mark.davis@jtcsv.com>
    > Cc: "Unicode List" <unicode@unicode.org>
    > Sent: Sun, 2003 Nov 09 09:19
    > Subject: Re: ZWJ, ZWNJ, CGJ and combination
    >
    >
    > > On 08/11/2003 17:09, Mark Davis wrote:
    > >
    > > >I agree with the first part of your analysis. By the phrase "requesting
    > > >ligation of combining characters" it is unclear to me what you mean, and
    > > >whether that is the right solution to whatever problem you are referring to.
    > > >
    > > >Mark
    > > >__________________________________
    > > >http://www.macchiato.com
    > > >► शिष्यादिच्छेत्पराजयम् ◄
    > > >
    > > >
    > > >
    > > A further reply to this one:
    > >
    > > On the bidi list Paul Nelson pointed out that in Khmer ZWJ and ZWNJ do
    > > not break combining sequences; or at least they do not break grapheme
    > > clusters, which is not quite the same thing. And the same may be true of
    > > Indic scripts, although in the examples I found ZWJ/ZWNJ is always at
    > > the end of a combining sequence. Are ZWJ and ZWNJ actually used within
    > > combining character sequences (or what would be such sequences if not
    > > technically broken)? Is there some tension here with the general
    > > definition of combining character sequences?
    > >
    > > If Khmer really does do this, and unless there are any real objections
    > > to this practice, perhaps the best way ahead, rather than defining a new
    > > COMBINING CHARACTER JOINER and changing the Khmer encoding, is to adjust
    > > the definition of combining character sequences to allow ZWJ, ZWNJ and
    > > perhaps some other suitable layout control characters to be included
    > > within such sequences. This would allow the Hebrew issue to be solved in
    > > a way analogous to the Khmer issue.
    > >
    > > --
    > > Peter Kirk
    > > peter@qaya.org (personal)
    > > peterkirk@qaya.org (work)
    > > http://www.qaya.org/
    > >
    > >
    > >
    >
    >
    >
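
The D17 definition quoted in the 9 November message can be illustrated with a rough segmentation sketch. It assumes that general categories Mn, Mc and Me stand in for "combining character", that Cf characters such as ZWJ and ZWNJ belong to no combining character sequence, and that every other character counts as a base (a simplification, since controls and separators are not base characters in the standard). The function name split_sequences and the kind labels are hypothetical and are not taken from the standard.

```python
import unicodedata


def is_combining(ch):
    # Assumption: categories Mn, Mc, Me approximate "combining character".
    return unicodedata.category(ch).startswith("M")


def is_format(ch):
    # Cf covers ZWJ (U+200D), ZWNJ (U+200C) and other format controls.
    return unicodedata.category(ch) == "Cf"


def split_sequences(text):
    """Split text into (kind, chunk) pairs following the quoted D17 wording.

    kind is "combining" for base + marks, "base-only" for a bare base,
    "defective" for marks with no base, and "format" for a Cf character.
    """
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        j = i + 1
        if is_combining(ch):
            # Marks with no preceding base: a defective sequence.
            while j < n and is_combining(text[j]):
                j += 1
            out.append(("defective", text[i:j]))
        elif is_format(ch):
            out.append(("format", ch))
        else:
            # Simplification: any other character is treated as a base.
            while j < n and is_combining(text[j]):
                j += 1
            out.append(("combining" if j > i + 1 else "base-only", text[i:j]))
        i = j
    return out


# ba + ZWNJ + triisap + ii, the example discussed in the thread.
print(split_sequences("\u1794\u200C\u17CA\u17B8"))
```

Under these assumptions the ZWNJ leaves the base on its own and the two following marks form a defective sequence, whereas the same marks placed directly after the base yield a single combining character sequence.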



    This archive was generated by hypermail 2.1.5 : Mon Nov 10 2003 - 11:08:57 EST