L2/03-430
From: Mark Davis
Subject: Problem with Khmer / ZWJ / ZWNJ
Date: November 10, 2003

Please make this a document for the next meeting.

===========================

It appears that we have an inconsistency in the description of Khmer structure and the characteristics of ZWJ and ZWNJ (see the email trail below). The UTC should take up this topic at the next meeting and figure out what needs to be done.

Mark

----- Original Message -----
From: "Kent Karlsson"
To: "'Peter Kirk'"; "'Mark Davis'"
Cc: "'Unicode List'"; "'Roozbeh Pournader'"
Sent: Mon, 2003 Nov 10 03:01
Subject: RE: ZWJ, ZWNJ, CGJ and combination

...
>
> I would see this use of ZWJ and ZWNJ as a mistake. But the publication
> of this use made me propose to make ZWJ and ZWNJ into combining
> characters. However, that was not accepted since that would interfere
> with the Bidi algorithm. I'm not sure how bad that would be though.
> (I wouldn't be surprised if it even would be beneficial, though it would
> be a break in method compared to the current specification.)
>
> /kent k
>

----- Original Message -----
From: "Peter Kirk"
To: "Mark Davis"
Cc: "Unicode List"
Sent: Sun, 2003 Nov 09 13:13
Subject: Re: ZWJ, ZWNJ, CGJ and combination

...
>
> But does the Khmer script follow this rule? Please bear in mind that I
> know nothing about this script. But in TUS v4.0 10.4 p.281 I read:
>
> > Ordering of Syllable Components. The standard order of components in
> > an orthographic syllable as expressed in BNF is
> >
> >   B {R | C} {S {R}}* {{Z} V} {O} {S}
> >
> > where
> >   B is a base character (consonant character, independent vowel
> >     character, and so on)
> >   R is a robat
> >   C is a consonant shifter
> >   S is a subscript consonant or independent vowel sign
> >   V is a dependent vowel sign
> >   Z is the zero width non-joiner
> >   O is any other sign
>
> The first example given using ZWNJ, on p.282, starts with ba + ZWNJ +
> triisap + ii, i.e. <1794, ZWNJ, 17CA, 17B8>.
> 1794 is a base character (Lo), but 17CA and 17B8 are class 0 combining
> characters (Mn). The syntax implies that other Mn characters, e.g.
> robat, 17CC, may occur between the base character and the ZWNJ. So here
> is a case in natural language where ZWNJ may be both preceded and
> followed by combining characters, giving a technically defective
> combining sequence. Or have I misunderstood things here?
>
> Note that I am not proposing a change to Khmer, but just a clarification
> of definitions and the consistency of their application, and a good
> reason why what is allowed in Khmer would not be allowed in Hebrew.
>
> --
> Peter Kirk
> peter@qaya.org (personal)
> peterkirk@qaya.org (work)
> http://www.qaya.org/

----- Original Message -----
From: "Mark Davis"
To: "Peter Kirk"
Cc: "Unicode List"
Sent: Sun, 2003 Nov 09 11:11
Subject: Re: ZWJ, ZWNJ, CGJ and combination

> Let's try to be clear on the terms.
>
> Look at the definition of combining sequences:
>
> D17 Combining character sequence: A character sequence consisting of
> either a base character followed by a sequence of one or more combining
> characters, or a sequence of one or more combining characters.
>
> Thus a combining character sequence *cannot* contain a ZWJ or any other Cf.
>
> Any use of a ZWJ before a combining mark produces a *defective* combining
> character sequence (D17a), which isolates the combining mark from any
> preceding base character.
>
> And as I said earlier:
>
> > - *Default* grapheme clusters do not include ZWJ; as a matter of fact,
> > default grapheme clusters, except for Hangul Jamo Syllables and a few
> > exceptional cases, are identical with combining sequences.
> > http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
> >
> > - *Tailored* grapheme clusters may include longer sequences, but it is
> > not at all obvious whether they would ever contain ZWJ or ZWNJ.
>
> I'll expand on the latter.
> What constitutes a tailored grapheme cluster is up to a particular
> process, and so one could contain a ZWJ. However, any combining mark
> after a ZWJ does *not* apply to a previous base character within that
> tailored grapheme cluster, so the use of a ZWJ would isolate that
> combining mark. Such a sequence would not correspond to anything used
> in a natural language.
>
> Mark
>
> ----- Original Message -----
> From: "Peter Kirk"
> To: "Mark Davis"
> Cc: "Unicode List"
> Sent: Sun, 2003 Nov 09 09:19
> Subject: Re: ZWJ, ZWNJ, CGJ and combination
>
> > On 08/11/2003 17:09, Mark Davis wrote:
> >
> > >I agree with the first part of your analysis. By the phrase
> > >"requesting ligation of combining characters" it is unclear to me
> > >what you mean, and whether that is the right solution to whatever
> > >problem you are referring to.
> > >
> > >Mark
> >
> > A further reply to this one:
> >
> > On the bidi list Paul Nelson pointed out that in Khmer ZWJ and ZWNJ do
> > not break combining sequences; or at least they do not break grapheme
> > clusters, which is not quite the same thing. And the same may be true of
> > Indic scripts, although in the examples I found ZWJ/ZWNJ is always at
> > the end of a combining sequence. Are ZWJ and ZWNJ actually used within
> > combining character sequences (or what would be such sequences if not
> > technically broken)? Is there some tension here with the general
> > definition of combining character sequences?
> >
> > If Khmer really does do this, and unless there are any real objections
> > to this practice, perhaps the best way ahead, rather than defining a new
> > COMBINING CHARACTER JOINER and changing the Khmer encoding, is to adjust
> > the definition of combining character sequences to allow ZWJ, ZWNJ and
> > perhaps some other suitable layout control characters to be included
> > within such sequences. This would allow the Hebrew issue to be solved in
> > a way analogous to the Khmer issue.
> >
> > --
> > Peter Kirk
> > peter@qaya.org (personal)
> > peterkirk@qaya.org (work)
> > http://www.qaya.org/
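
===========================

Illustration (not part of the original email trail). The D17 reading discussed in the thread can be checked mechanically. The following is a minimal Python sketch, assuming that "base character" means any graphic character whose General Category is not a combining mark, and that format characters (Cf) such as ZWJ and ZWNJ are neither base nor combining. The function name d17_segments and the segment labels are hypothetical, chosen only for this example; this is not an implementation of any Unicode algorithm.

```python
import unicodedata


def is_mark(ch):
    # Combining marks have General Category Mn, Mc, or Me.
    return unicodedata.category(ch).startswith("M")


def d17_segments(text):
    """Split text into runs under a literal reading of D17.

    Labels (hypothetical, for illustration only):
      "complete"  - base character followed by one or more combining marks
      "lone base" - base character with no combining marks after it
      "defective" - one or more combining marks with no base before them
      "other"     - anything else, e.g. Cf characters such as ZWJ / ZWNJ
    """
    out = []
    i, n = 0, len(text)
    while i < n:
        cat = unicodedata.category(text[i])
        j = i + 1
        if cat[0] == "M":
            # A run of marks that does not start with a base character.
            while j < n and is_mark(text[j]):
                j += 1
            kind = "defective"
        elif cat[0] in "LNPS" or cat == "Zs":
            # Graphic, non-mark character: treat it as a base and
            # attach any marks that follow it directly.
            while j < n and is_mark(text[j]):
                j += 1
            kind = "complete" if j > i + 1 else "lone base"
        else:
            # Cf, Cc, and similar: neither base nor combining here.
            kind = "other"
        out.append((kind, text[i:j]))
        i = j
    return out


# The p.282 example from the thread: ba + ZWNJ + triisap + ii.
for kind, chars in d17_segments("\u1794\u200C\u17CA\u17B8"):
    print(kind, [f"U+{ord(c):04X}" for c in chars])
```

Under these assumptions the sequence <1794, ZWNJ, 17CA, 17B8> splits into a lone base (1794), the ZWNJ on its own, and a defective run (17CA, 17B8), which is the situation Peter Kirk describes. Without the ZWNJ, the same three Khmer characters form a single complete combining character sequence.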