L2/00-023

From: Kenneth Whistler [kenw@sybase.com]
Sent: Monday, January 24, 2000 7:41 PM
To: Multiple Recipients of Unicore
Cc: kenw@sybase.com
Subject: Re: ZWJ contradictions; ZWL

Lloyd et al.,

Mark begged off answering in detail until the discussion at the UTC. I did, however, struggle through all the way to the end of this communiqué, and want to answer a few of the questions of historical interest here.

--Ken

To place this in historical context, here is the summary of the relevant points regarding ZWJ in Unicode 1.0, 1.1, 2.0, and 3.0.

****************************************************************

Unicode 1.0

p. 77.

" U+200C ZWNJ is used to request that characters be rendered separately, when they would otherwise normally combine in some manner. For example, a ZWNJ between an 'f' and an 'i' will prevent an 'fi' ligature from being displayed; a ZWNJ between an Arabic NOON and MEEM will prevent the normal cursive connection from being rendered, and a ZWNJ between an 'a' and a NON-SPACING ACUTE will cause the NON-SPACING ACUTE to be displayed as a spacing character... The ZWNJ is also used in script-dependent ways; in Indic scripts, for example, to show the virama explicitly.

" ... The ZWJ can be used to indicate a tighter cursive connection between characters or to form a ligature (if available) when the default would be not to form one. On the other hand, the ZWJ can be placed between already cursively-connected text with no effect: thus Arabic baa-joiner-meem will have the same appearance as baa meem. The ZWJ also has other uses in some scripts, such as Tibetan."

p. 55. (Devanagari)

" ... The ZWNJ can be used to prevent the formation of conjuncts, if desired. For this purpose, the non-joiner must be placed after the virama in the code stream."

p. 72. (Tibetan)

" ...
In some rare cases, conjoining occurs in written Tibetan without the normal shape changes (non-morphological conjunction); such cases may be encoded in plain text by using a ZWJ instead of the virama between the letters to be conjoined."

****************************************************************

Unicode 1.1

p. 5

"3.2 Zero-Width Joining

"... In the merger with ISO/IEC 10646-1, the semantics of these two characters have been given a narrow interpretation. This brings added precision to the explanation given in Volume 1, p. 77.

"The intent of these characters is to address cursive graphical connections between the glyphs of a script, e.g. in scripts like Arabic whose printed form emulated handwriting. ZWNJ and ZWJ are best thought of as behaving like tiny letters that neighboring glyphs may connect to (ZWJ) or avoid connecting to (ZWNJ). They are thus processed as ordinary cursive letters rather than as control characters.

"ZWNJ and ZWJ affect how the two neighboring glyphs connect to *them*, not to *each other*. As such, they have no direct relationship with ligature formation; in particular, ZWJ does not in any way request that its two neighbors be ligatures to each other. Indeed, both ZWNJ and ZWJ may break up ligatures by interrupting the character sequence required to form the ligature.

"The precise relationship between cursive appearance and ligated appearance may differ from script to script, and therefore the precise usage of these characters is script-dependent. In the case of Latin typography, cursiveness (handwriting emulation) and ligatures are independent. Thus the text on Volume 1, p. 77, may be clarified as follows:

"f + ZWJ + i will not form the ligature fi. Instead, if cursive versions of the f and i are available in the font, each will independently connect to the ZWJ on the appropriate side (having the same appearance as f + i).

"Usage of optional ligatures such as fi is not currently controlled by any codes within the Unicode standard, but is determined by protocols or resources external to the text sequence."

[[ fish example seen in Unicode 2.0, page 6-71 introduced here ]]

"With regard to the Arabic script, the statements in Volume 1, p. 77, remain correct. In Volume 2, p. 390, according to Arabic rules L2 and L3 the ZWJ can be used to get the appearance in parentheses.

"With regard to conjuncts in Indic scripts, the statements in Volume 1, pp. 53-56, and Volume 2, pp. 399-414, remain correct. However, for clarity, the term ligature should be replaced by the term conjunct throughout pp. 399-414."

****************************************************************

Unicode 2.0

I won't bother quoting the text here, as it is easily available to those concerned with this issue.

The main discussion is under Layout Controls, pp. 6-70 .. 6-71.

The Arabic discussion is at pp. 6-22 .. 6-23. (The discussion about ZWJ and ZWNJ was newly added in Unicode 2.0; the majority of the Arabic shaping discussion was derived from Unicode 1.0, Volume 2.)

The Devanagari discussion is at pp. 6-36 .. 6-38. (This discussion is derived from Unicode 1.0, Volume 2, with the notable difference that the order of the ZWJ with respect to the virama for the encoding of explicit half-consonants was reversed.)

The Tibetan text model in Unicode 2.0 is entirely different from the abortive Unicode 1.0 encoding, and no mention of ZWJ occurs in the Tibetan script description.

****************************************************************

Unicode 3.0

The main discussion is moved to a section called Cursive Connection, on pp. 314..316, in Section 13.2, Layout Controls. Except for very minor copy edit changes, and removal of the anachronistic reference to Tibetan usage of ZWJ, the text is effectively identical to that of Unicode 2.0.

The Arabic discussion is moved to Section 8.2 Arabic, pp.
187..188, and is effectively identical to that of Unicode 2.0.

The Devanagari discussion is moved to Section 9.1 Devanagari, pp. 212..214, and is effectively identical to that of Unicode 2.0.

****************************************************************

The upshot of this progression of the text of the standard is as follows:

1. A major clarification of ZWNJ and ZWJ was attempted in Unicode 1.1, including a restriction of their semantics to not include ligature formation. Unicode 1.0 *did* suggest that ZWJ, in particular, could be used in the way that is being requested for the proposed ZWL. That usage was ruled out explicitly in the rewording for Unicode 1.1. Furthermore, the anachronistic suggestion that ZWNJ could be used to break up a combining character sequence was dropped in Unicode 1.1.

2. The semantics of ZWNJ and ZWJ has subsequently been inherited without change from Unicode 1.1, through 2.0, and 3.0. Unicode 2.0 consolidated the text and intent from Unicode 1.1. Nothing that happened in Unicode 3.0 has touched that intent in any way.

O.k. Now, are we all on the same page so far?

On to the questions raised by Lloyd.

> ***************************************************
>
> 2. There is indeed a contradiction in the wording of Unicode 2.0
> concerning the ZWJ.
> I had missed the wording which Mark quoted on Jan. 14th from page 6-71:
>
> "although ZWJ and ZWNJ should not affect ligating behavior,
> in some systems
> they may break up ligatures by interrupting the character flow."
>
> The example "fish" at the bottom of 6-71 does indeed show the
> rendering of the sequence [f + ZWJ + i + ...] as like that of [f + i + ...],
> thus reinforcing the interpretation Mark prefers.
> This wording is I believe an addition since Unicode 1.0?

It was added in Unicode 1.1.
> This is in a context in which ligating behavior and cursive connection
> are distinguished (just above this on page 6-71):
>
> "Adding a zero width joiner between characters that are already
> cursively connected will have no effect."
>
> The wording is not explicit about whether ligated characters
> are *also* to be considered cursively joined in scripts which are cursively
> connected and which also have ligatures for some sequences.

The wording in Unicode 1.1 should be clear -- the intent was that ligation and cursive connection were to be considered two completely distinct issues, and that the semantics of ZWNJ and ZWJ per se had no implication for ligation. The "fish" example was to show, however, that the mere presence of either character *could* result in a non-ligation, by breaking up the sequence expected by "protocols or resources [[i.e. fonts]] external to the text sequence" for a ligature to be formed.

> Therefore, the applicability of some wording is not totally explicit,
> other than the "fish" example and the first quote just above.
>
> ***
>
> The contradiction arises in contrast with the following more basic wording,
> which I quoted part of in my earlier message on this subject
> (Unicode 2.0 page 6-70)
>
> "Logically, these characters do not modify the contextual selection
> process itself, but rather they change the context of a particular
> character occurrence. By providing a non-joining neighbor character
> where otherwise the neighbor would be joining, or vice-versa, they
> deceive the rendering process into selecting a different joining glyph."
>
> Mark was thus incorrect in stating that this older interpretation had
> been rejected. It is still the basic wording, placed first, and must govern
> other interpretations until changed.

The "older interpretation" that Mark was stating had been rejected was that of Unicode 1.0, in which the ZWJ could be conceived of as a join-requester, including a request of a ligature.
Under the new interpretation introduced (by Mark, primarily) in Unicode 1.1, I do not see a contradiction.

>
> In a cursively linkable Latin font, however, it could be used, consistent
> with the wording above, as a means of blocking the ligature while still
> permitting the cursive linking
> (under the basic default that it is merely a linkable "neighbor" character).

No, it could not be used "as a means of blocking the ligature..." A ZWJ in such a context might, on the other hand, have the (unintended) side-effect of blocking a ligation.

> So the wording "should not affect ligating behavior" on page 6-71
> is in contradiction with the more basic wording at the bottom of page 6-70
> "do not modify the contextual selection process itself".
> If ZWJ does not modify the contextual selection process,
> because it is merely another neighbor character,
> then it must affect ligating behavior when it interrupts sequences.

No. The actual effect depends on the implementation. The ZWJ is not *intended* to interrupt ligating sequences, but processes that are unaware of this may do the wrong thing. As you pointed out below, for the purposes of ligation, a ZWJ in the midst of an Arabic sequence, for example, should be handled effectively like an Arabic voweling -- it should not disrupt the choices of the basic consonant outline (ligated or not) from the font.

>
> ***************************************************
>
> 3. Implementations do not "break" if they cannot handle incorrect
> spellings, spellings which have no function and should not occur.
>
> Here is Mark's statement from Jan. 14th (referring also to ZWL)
>
> "fonts that don't fully support ZWL would actually cause it to
> break ligatures, just as current fonts that don't fully support ZWJ
> cause it to break ligatures. This is not part of the semantics of the
> character, it is just what happens without full implementation."
>
> There are two aspects to this.
>
> ***
>
> (a) In the last part, saying that the breaking
> of ligatures is not part of the semantics of the character, it is just
> what happens without full implementation.
> Quite on the contrary,
> it is what happens with FULL implementation, under the basic
> definition given on page 6-70.

This interpretation is contrary to the explicit intent of ZWNJ and ZWJ as described in Unicode 1.1 et seq. Mark's statement as it stands is correct, I believe. Your interpretation of the text on page 6-70 is incomplete.

>
> Mark's statement can of course simply be reworded as the following,
> which would be a very useful addition to the standard to resolve the
> contradiction. It merely makes the wording of the bottom of page 6-70
> more explicit, and the following wording, if substituted for that at the
> bottom of 6-71, would remove all contradictions from the standard
> on this issue. Any other changes would be much more elaborate,
> and would entail additional contradictions. Please see sections 4. and 5.
> also.
>
> "ZWJ should not normally be introduced between characters which
> form a ligature in fonts which are not cursively linking. ZWJ has
> no legitimate function there. Its introduction there is a spelling
> error
> and will usually produce exactly the opposite effect from that intended,
> by breaking the sequence of characters and preventing their rendering
> as a ligature"

I concur that something similar to this might be usefully added -- although I don't think it is resolving a contradiction. Any number of other invisible format controls -- if not properly ignored by a rendering process -- could have the same unintended and user-inexplicable effect. Introduction of a Left-to-right-mark or a ZWNBSP (U+FEFF) or an INHIBIT SYMMETRIC SWAPPING (U+206A) in such a context could also fool a rendering process that did not have the level abstractions or other means to distinguish and ignore such "ignorables" on display.
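
To make the point about "ignorables" concrete, here is a minimal Python sketch, not taken from any real renderer. The ligature table, the set of ignorable controls, and both function names are assumptions made up for illustration. It only models ligature lookup for non-cursive Latin text; a real process would still need the ZWJ for cursive joining decisions elsewhere.

```python
# Hypothetical ligature table a font might expose (illustration only).
LIGATURES = {("f", "i"): "\uFB01", ("f", "l"): "\uFB02"}

# Invisible format controls mentioned in this message (illustrative subset).
IGNORABLES = {
    "\u200D",  # ZERO WIDTH JOINER
    "\u200E",  # LEFT-TO-RIGHT MARK
    "\u206A",  # INHIBIT SYMMETRIC SWAPPING
    "\uFEFF",  # ZERO WIDTH NO-BREAK SPACE
}


def ligate_naive(text):
    """Match ligatures on the raw character sequence.

    Any intervening character, visible or not, interrupts the match.
    """
    out = []
    i = 0
    while i < len(text):
        pair = tuple(text[i:i + 2])
        if pair in LIGATURES:
            out.append(LIGATURES[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def ligate_skipping_ignorables(text):
    """Drop the invisible controls first, then match ligatures."""
    visible = "".join(ch for ch in text if ch not in IGNORABLES)
    return ligate_naive(visible)


if __name__ == "__main__":
    sample = "f\u200Dish"  # f + ZWJ + i + s + h
    print(ligate_naive(sample) == sample)                     # ligature is broken up
    print(ligate_skipping_ignorables(sample) == "\uFB01sh")   # ligature still forms
```

The naive matcher shows the unintended side-effect: the ZWJ is simply one more character in the way. The second matcher shows one possible way for a process to distinguish and ignore such characters for the purposes of ligation.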
>
> (b) The other aspect of Mark's statement quoted above is this:
>
> "just as current fonts that don't fully support ZWJ
> cause it to break ligatures."
>
> This is incorrect. Current fonts which fully support ZWJ
> *do* cause it to break ligatures, that is in fact a usage made
> entirely explicit on Unicode 2.0 page 6-37 for Devanagari:

This is entirely different from Mark's intent in this statement.

>
> KA + Virama + SSHA yields
> KSSHA (the conjunct form)
>
> KA + Virama + ZWJ + SSA yields
> half-KA + SSHA (the linked but not conjunct form).
>
> If the same change were introduced for Devanagari which is
> claimed in the contradictory "should" of page 6-71 for the "fish"
> example, then the introduction of ZWJ in Devanagari would
> not block the ligature formation.
>
> I understand why Mark believes there is a contradiction
> and points to Devanagari.

No, I don't think you do.

>
> Somehow there evolved for Mark and I assume some others,
> though clearly only some, an understanding of ZWJ that was
> in contradiction to the basic statement of Unicode 2.0 page 6-70,
> that was extended by the examples at the bottom of page 6-71,
> and then in that light the Devanagari appears odd,
> because its current wording keeps to the original and basic
> wording of page 6-70. There is currently nothing special
> about the use of ZWJ for Devanagari. It has the basic
> interpretation of a linkable neighbor character for which
> conjunct combinations are not defined by fonts, exactly as
> in the basic wording of page 6-70.

This latter statement I agree with. In Devanagari, the use of the ZWJ creates the context for the explicit half-form, which is a "right-linking" form of the consonant. It is then the presence of the right-linking form of the consonant that blocks the (otherwise automatic) conjunct formation (if the font supports it). In this sense, the use of a ZWJ in Devanagari can have the indirect effect of breaking a conjunct (i.e.
ligature), and such usage is intentional in Devanagari. But the breaking of the conjunct is secondary -- and not the direct implication of a ZWJ requesting a ligature blocking.

> ***************************************************
>
> 5. How did the contradiction in Unicode 2.0 arise?
> How did the irregularity or special exception for Latin cursive ZWJ
> arise?
>
> I cannot of course answer this with full assurances, since I do not know
> all of the history,

I think it would have behooved you to at least recover the published textual history of this -- as I have quoted above -- before speculating about these issues.

> nor what went on in people's minds, but to the best of
> my ability, I believe this:
>
> The basic definition of ZWJ and ZWNJ as merely neighbor characters,
> which "trick" the unchanged, unmodified rendering process to producing
> different forms (note these are under "layout controls" for exceptional cases)
> always worked perfectly well. But whether for inherent conceptual reasons,
> or because the *agentive* names "joiner" and "non-joiner" were used
> instead of the *object* names "joinable" and "non-joinable",
> or perhaps for both reasons,
> people tended to interpret the ZWJ as a ligature request.

No. This was explicitly allowed as part of the semantics of ZWJ in Unicode 1.0. It was explicitly defined out of the semantics of ZWJ in Unicode 1.1. Read the text.

>
> I had to be instructed at least twice that I was misinterpreting it,
> that I had not understood the original definitions as mere neighbor
> characters. I should not assume that others had the same problem,
> but I believe they may have. Certainly outsiders not familiar with
> the character standard have had this problem.

This may well be the case. People will infer a lot from a name, without "reading the manual".
>
> Perhaps a review of the history of adoption of the wording towards
> the bottom of page 6-71, the wording which contradicts the basic
> wording on page 6-70, would reveal some other reason or reasons
> why this special exception was adopted. I do not recall having
> seen any other reasons given.

I have provided the historical context above.

>
> As to Devanagari, for which the position of the ZWJ was
> shifted somewhere between Unicode 1.0 and 2.0
> (I have not gone back to the details, it does not matter here),

Again, the historical text is available. This occurred between Unicode 1.0 and Unicode 2.0. Its status in Unicode 1.1 was unchanged, since Unicode 1.1 did not rewrite the Devanagari script section from Volume 2 of Unicode 1.0.

> I think the ZWJ was simply used at first as a wildcard,
> forgetting about the basic definition of it as merely a neighbor
> character, and the sequence
> KA + ZWJ + Virama was proposed quite irregularly
> in order to render a half-KA.

This speculation completely ignores the actual history. The text from Unicode 1.0, Volume 2, p. 403 reads:

" ... The ZWJ can be used to request that the virama be absorbed into the half consonant form, and prevent any further automatic formation of a ligature."

That text was written and published *before* the semantic clarification and restrictions for ZWJ in Unicode 1.1. When the Devanagari section was edited and rewritten for consistency and incorporation into Unicode 2.0, it was noted that the text in Unicode 1.0, Volume 2 regarding ZWJ used this way was inconsistent with the new, restricted interpretation; ZWJ could not be used to cause a "virama [to] be absorbed into the half consonant form". Instead, the more rigorous model of the C + virama --> Cd and Cd + ZWJ --> Ch for Devanagari was introduced.

>
> Later this was corrected for Unicode 2.0,

Yes.
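
As a rough illustration of that Devanagari model (C + virama --> Cd, Cd + ZWJ --> Ch), here is a minimal Python sketch. The glyph names, the two-consonant inventory, and the conjunct table are assumptions invented for the example, and the fallback for a dead consonant with no conjunct is simplified to a plain dead form. It is not the shaping logic of any actual font or rendering engine.

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER
KA = "\u0915"      # DEVANAGARI LETTER KA
SSA = "\u0937"     # DEVANAGARI LETTER SSA

# Tiny illustrative consonant inventory and a hypothetical conjunct table.
CONSONANTS = {KA: "KA", SSA: "SSA"}
CONJUNCTS = {("KA", "SSA"): "K.SSA"}


def shape(text):
    """Map a character sequence to symbolic glyph names (names are made up)."""
    glyphs = []
    i = 0
    while i < len(text):
        ch = text[i]
        # C + virama --> Cd (dead consonant)
        if ch in CONSONANTS and text[i + 1:i + 2] == VIRAMA:
            name = CONSONANTS[ch]
            following = text[i + 2:i + 3]
            if following == ZWJ:
                # Cd + ZWJ --> Ch: the ZWJ is a neighbor that the dead
                # consonant joins to, which selects the half-form.
                glyphs.append(name + "_half")
                i += 3
                continue
            if following in CONSONANTS and (name, CONSONANTS[following]) in CONJUNCTS:
                # Cd + C --> conjunct, when the (hypothetical) font has one.
                glyphs.append(CONJUNCTS[(name, CONSONANTS[following])])
                i += 3
                continue
            # Simplified fallback: show the dead consonant as such.
            glyphs.append(name + "_dead")
            i += 2
            continue
        glyphs.append(CONSONANTS.get(ch, "U+%04X" % ord(ch)))
        i += 1
    return glyphs


if __name__ == "__main__":
    print(shape(KA + VIRAMA + SSA))        # conjunct
    print(shape(KA + VIRAMA + ZWJ + SSA))  # half-form followed by full SSA
```

In this sketch the conjunct fails to form in the second sequence only because the half-form has already been selected; nothing in the code treats the ZWJ as a "block ligature" instruction.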
> but the wording now on page 6-71 treats this as if it were exceptional:
>
> "The function of the ZWJ may also have a particular interpretation
> in specific scripts. For example, in Indic scripts it provides an
> invisible neighbor to which a dead consonant may join in order to
> induce a half-consonant form. ..."
>
> This is in fact no different than the Arabic case, and both are
> completely consistent with the basic wording, Unicode 2.0 p.70.

This is incidental. The text of page 6-71 points to a particular, script-specific usage. Yes, it is consistent with the generic sense of ZWJ, but in the context of Devanagari, the specific rules regarding half-consonant formation are invoked. Those *are* script-specific.

>
> ***************************************************
>
> 6. Can ZWJ be used as a ZWLigator?
>
> Mark Davis has proposed using ZWJ in the function of a ligature-request.
> Can this be done?

I think Mark's contribution on this was sufficiently clear that it *can* be done.

>
> Why would we choose that route?
>
> I must assume because Mark believes it is somehow easier
> to do that than to introduce a new ZWL character.

Not an assumption. That was explicit in Mark's contribution. The ZWJ and ZWNJ are already implemented (at least in part) fairly widely now. It was his judgement that tweaking the semantics of ZWJ in a way that would not foul up current implementations but which could provide the ligation request function for extended implementations, would be a less disruptive transition to the functionality that people are looking for than the introduction of a distinct, new ZWL. This assessment is, of course, the main issue, and is precisely what will be debated by the UTC.

>
> ***
>
> What if we do introduce a ZWL character?
>
> If we did introduce a ZWL character, it need have no new or unique
> character properties at all.

Incorrect. The most important issue is precisely that it *does* introduce a new character property: ligation request.
That property is new, because no character currently has it. That is the whole reason for requesting the encoding of such a character in the first place.

> It should be expected to work exactly like an
> Arabic floating vowel in not interrupting cursive linking or ligatures.
>
> Such a ZWL would be distinctively special *only* to the extent that fonts
> could use it by treating it as a dummy character with no other uses
> than to be part of triples of the type
> Hungarian Runes [d + ZWL + d] to be rendered as a ligature.

This is how a font might implement the ligatures involving ZWL, but that is not the end of the story. The software itself has to be cognizant of the property in some way: that is needed in order for inputting, by whatever means, to work correctly. Also, the software will need to have hierarchies of interaction between global ligation settings and local ligation requests (or blockages), so that the appropriate thing(s) can be done to ensure that local preferences correctly override global settings, and so on.

> ***
>
> What if we use ZWJ to do double duty as ZWL?
>
> The thing to watch out for here is overloading a character
> with non-analogous uses, to the extent that a contradiction might arise.

This concern should be addressed as Mark has -- with an explicit listing of all the contrastive possibilities, matched up against the expected outcomes. If there are more expected outcomes than can reasonably be handled by judicious expanding of the semantics of ZWJ, then perhaps an independent ZWL is warranted. If not, then not.

>
> Therefore I remain convinced that the first alternative is better
> than any that has been proposed, add a ZWL as described above.
>
> The functionality is needed.
> I very much hope Unicode will proceed with it,
> and avoid the risk of mixing it with ZWJ.
>

Debate to ensue, based on the explicit listing we need to determine which approach is best.

--Ken

> Lloyd Anderson
> Ecological Linguistics
>
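
On the point above about hierarchies of interaction between global ligation settings and local ligation requests (or blockages), a minimal Python sketch of one possible precedence order follows. The names `Local` and `should_ligate`, and the specific ordering (font capability first, then local preference, then global setting), are assumptions for illustration only. Nothing here is specified by the standard or by any proposal under discussion.

```python
from enum import Enum


class Local(Enum):
    """Hypothetical per-sequence preference carried in the text."""
    NONE = 0     # no local indication; defer to the global setting
    REQUEST = 1  # local ligation request (the function proposed for ZWL)
    BLOCK = 2    # local ligation blockage


def should_ligate(font_has_ligature, global_ligatures_on, local=Local.NONE):
    """Decide whether to ligate one candidate sequence.

    Assumed precedence: a ligature the font lacks can never form;
    otherwise a local preference overrides the global setting.
    """
    if not font_has_ligature:
        return False
    if local is Local.REQUEST:
        return True
    if local is Local.BLOCK:
        return False
    return global_ligatures_on


if __name__ == "__main__":
    # Global ligatures off, but the text asks for one locally.
    print(should_ligate(True, False, Local.REQUEST))
    # Global ligatures on, but the text blocks one locally.
    print(should_ligate(True, True, Local.BLOCK))
```

Whichever character ends up carrying the request, software would need some decision of this kind in addition to whatever the font does with the sequence.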