L2/04-359

Date: October 5, 2004
Title: Combining Classes & Typographical Interaction
Author: Ken Whistler
Status: Proposal for consideration by the UTC

Background

In reviewing the text of Unicode 4.0 and the various UTC decisions for possible updates for Unicode 5.0, I have come across a particularly tough nut that I think requires some UTC discussion and an explicit decision.

The issue is scattered in several places in the text, but the essential core of the problem is represented by the subsection entitled "Combining Classes", pp. 83-84 in TUS 4.0.

Effectively the problem is that the standard uses combining classes for the canonical ordering algorithm, to create equivalence classes for normalization, and defines:

"...sequences of nonspacing marks as equivalent if they do not typographically interact."

However, the standard does not actually define what "typographically interact" means. It goes on to state:

"Characters have the same class if they interact typographically, and different classes if they do not."

However, this assertion is now viewed by many implementers of the standard as either tautologous or erroneous (or both), because there are, in fact, combining marks which have different combining classes in the standard and which *do* interact typographically, at least by most typographers' and script implementers' definitions of what "interaction" could mean in that context.

This has particularly been a problem for Hebrew and Arabic. And it has become a *political* problem for the Consortium, because the stability policy constraints for the standard have now put the UTC in the position of being unable to adjust combining classes for Hebrew or Arabic combining marks, even in clear instances where the current assignments are not optimal and where our assertion that characters have the same combining class if they interact typographically is flat wrong.

I propose that, within the constraints of what we can accomplish at this point, we fix this problem as follows:

1. Define the combining classes formally as simply positional classes that neither imply nor prohibit typographical interaction.

2. Explain (but not normatively) that the *intent* of the design is to minimize the number of instances where alternative sequences of multiple combining marks will result in identical visual sequences while not being considered canonical equivalents, and relate that intent specifically to the behavior of nonspacing marks used as accents for Latin, Greek, etc., where the normalization problem is particularly acute. (pun intended ;-) )

3. Stop pretending that we are ever going to be able to shift around the absolute values of combining classes at this point, and simply nail them all down normatively. The standard introduced numerical combining classes in 1996, and in 8 years we have *never* moved the value of a class. The *only* change made to particular values was to move a whole bunch of characters from non-zero combining class values to combining class zero in the Unicode 3.0 timeframe, in preparation for normalization. At this point, increasing numbers of applications, and even the standard itself, are referring to particular combining classes with expressions like {cc=230}, and we are *never* going to be able to change that.

4. If we agree to item 3, normatively define the "fixed position classes" that we mention on occasion but have never fully defined, because the range itself was in principle not stable.

5. Based on item 4, introduce explanatory text in the standard as to why the fixed position combining classes were introduced in the first place, the problems they pose for the scripts that have combining marks assigned these classes (Hebrew, Arabic, Thai, Lao, Tibetan; I don't think the other few instances cause problems), and the countervailing problem in some scripts which have typographic interactions that result in visibly identical forms with non-equivalent sequences (e.g. Khmer, Myanmar).

6. Finally, based on item 5, I could appropriately introduce the text that I was tasked to add to the standard, explaining the potential use of CGJ to provide a partial solution to some of the problems in the first case, and in particular for the ordering of Hebrew points and accents.
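
The behavior discussed in the items above can be observed with a short Python sketch. It is illustrative only and is not part of the proposal text. It assumes Python's standard `unicodedata` module as the source of combining class values and normalization; the helper names `classes` and `canonically_equivalent` are hypothetical, and the particular characters are chosen just as examples. It shows marks with different combining classes being reordered into a single canonical form, marks with the same class keeping their typed order, and CGJ (combining class 0) blocking the reordering of two Hebrew points that carry fixed position classes.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0


def classes(s: str) -> list:
    """Hypothetical helper: canonical combining class of each code point."""
    return [unicodedata.combining(ch) for ch in s]


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings by their NFD forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Latin: acute (cc=230) and dot below (cc=220) have different classes,
# so both typing orders normalize to the same sequence.
acute_then_dot = "a\u0301\u0323"
dot_then_acute = "a\u0323\u0301"
print(classes(acute_then_dot))                                 # [0, 230, 220]
print(canonically_equivalent(acute_then_dot, dot_then_acute))  # True

# Latin: acute and diaeresis are both cc=230, so their order is
# preserved and the two sequences are NOT canonically equivalent.
print(canonically_equivalent("a\u0301\u0308", "a\u0308\u0301"))  # False

# Hebrew: patah (cc=17) and hiriq (cc=14) have fixed position classes.
# Typed as lamed + patah + hiriq, normalization swaps the two points.
typed = "\u05DC\u05B7\u05B4"
print(classes(typed))                                          # [0, 17, 14]
print(unicodedata.normalize("NFD", typed) == "\u05DC\u05B4\u05B7")  # True

# Inserting CGJ between the points keeps the typed order, because a
# character with combining class 0 stops canonical reordering.
with_cgj = "\u05DC\u05B7" + CGJ + "\u05B4"
print(unicodedata.normalize("NFD", with_cgj) == with_cgj)      # True
```

The values reported depend on the Unicode data shipped with the Python build in use, although the combining classes shown here have not changed since they were assigned.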