L2/04-359

Date: October 5, 2004
Title: Combining Classes & Typographical Interaction
Author: Ken Whistler
Status: Proposal for consideration by the UTC


Background

In reviewing the text of Unicode 4.0 and the various UTC decisions
for possible updates for Unicode 5.0, I have come across a particularly
tough nut that I think requires some UTC discussion and explicit
decision.

The issue is scattered in several places in the text, but the
essential core of the problem is represented by the subsection
entitled "Combining Classes", pp. 83-84 in TUS 4.0.

Effectively the problem is that the standard uses combining classes
for the canonical ordering algorithm, to create equivalence classes
for normalization, and defines:

  "...sequences of nonspacing marks as equivalent if they do
   not typographically interact."

However, the standard does not actually define what "typographically
interact" means. It goes on to state:

  "Characters have the same class if they interact typographically,
   and different classes if they do not."

However, this assertion is now viewed by many implementers of the
standard as either tautologous or erroneous (or both), because there
are, in fact, combining marks which have different combining classes
in the standard which *do* interact typographically, at least by
most typographers' and script implementers' definitions of what
"interaction" could mean in that context. This has particularly been
a problem for Hebrew and Arabic. And it has become a *political*
problem for the Consortium, because the stability policy constraints
for the standard have put the UTC in the position now of being
unable to adjust combining classes for Hebrew or Arabic combining
marks, even in clear instances where the current assignments are
not optimal and where our assertion that characters have the
same combining class if they interact typographically is flat wrong.
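
As an illustration only (not proposed text for the standard), the
mismatch can be sketched in Python with the standard unicodedata
module. The helper names ccc and nfd and the choice of sample
characters are my own assumptions. The Latin case shows the design
working as intended; the Hebrew case shows two points under one
letter, in different classes, being reordered by normalization.

```python
import unicodedata

def ccc(ch):
    """Canonical combining class of a single character."""
    return unicodedata.combining(ch)

def nfd(s):
    """Canonical decomposition, which applies canonical ordering."""
    return unicodedata.normalize("NFD", s)

# Latin: acute (cc=230, above) and dot below (cc=220, below).
ACUTE, DOT_BELOW, DIAERESIS = "\u0301", "\u0323", "\u0308"

# Different classes: both input orders normalize to the same
# sequence, so the two spellings are canonically equivalent.
latin_equivalent = nfd("a" + ACUTE + DOT_BELOW) == nfd("a" + DOT_BELOW + ACUTE)

# Same class (both cc=230): the order is significant and preserved,
# so the two spellings are not equivalent.
same_class_equivalent = nfd("a" + ACUTE + DIAERESIS) == nfd("a" + DIAERESIS + ACUTE)

# Hebrew: patah (cc=17) and hiriq (cc=14) both sit below the letter,
# yet they have different classes, so canonical ordering puts hiriq
# first regardless of the order the author entered.
LAMED, PATAH, HIRIQ = "\u05DC", "\u05B7", "\u05B4"
hebrew_input = LAMED + PATAH + HIRIQ
hebrew_normalized = nfd(hebrew_input)

print(latin_equivalent, same_class_equivalent)
print([f"U+{ord(c):04X} cc={ccc(c)}" for c in hebrew_normalized])
```

The class values used here are ones the stability policy already
keeps fixed, so the result should not depend on which version of the
Unicode data a given Python release ships.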

I propose that, within the constraints of what we can accomplish
at this point, we fix this problem as follows:

1. Define the combining classes formally as simply positional
   classes that neither imply nor prohibit typographical interaction.

2. Explain (but not normatively) that the *intent* of the design
   is to minimize the number of instances where alternative sequences
   of multiple combining marks will result in identical visual
   sequences while not being considered canonical equivalents,
   and relate that intent specifically to the behavior of nonspacing
   marks used as accents for Latin, Greek, etc., where the
   normalization problem is particularly acute. (pun intended ;-) )

3. Stop pretending that we are ever going to be able to shift around
   the absolute values of combining classes at this point, and simply
   nail them all down normatively. The standard introduced numerical
   combining classes in 1996, and in 8 years we have *never* moved
   the value of a class. And the *only* change made to particular
   values was to move a whole bunch of characters from having
   non-zero combining class values to zero combining class in the
   Unicode 3.0 timeframe, in preparation for normalization. At this
   point, increasing numbers of applications, and even the standard
   itself, are referring to particular combining classes with
   expressions like {cc=230}, and we are *never* going to be able
   to change that.

4. If we agree to item 3, then we can normatively define the "fixed
   position classes" that we mention on occasion, but have never
   fully defined, because the range itself was in principle not
   stable.

5. Based on item 4, then introduce explanatory text in the standard
   as to why the fixed position combining classes were introduced in
   the first place, the problems they pose for the scripts that
   have combining marks assigned these classes (Hebrew, Arabic,
   Thai, Lao, Tibetan -- I don't think the other few instances cause
   problems) -- and the countervailing problem in some scripts which
   have typographic interactions that result in visibly identical
   forms with non-equivalent sequences (e.g. Khmer, Myanmar).

6. Finally, based on item 5, I could appropriately introduce the text that
   I was tasked to add to the standard, explaining the potential use
   of CGJ to provide a partial solution to some of the problems in
   the first case, and in particular for the ordering of Hebrew points
   and accents.
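
The kind of partial solution meant in item 6 can be sketched as
follows, again in Python with the standard unicodedata module. This is
an illustration under my own assumptions (the variable names and the
sample sequence are not from any proposed text): because CGJ (U+034F)
has combining class zero, canonical ordering does not move marks
across it, so an author's intended order of Hebrew points survives
normalization.

```python
import unicodedata

LAMED, PATAH, HIRIQ = "\u05DC", "\u05B7", "\u05B4"
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0

def nfd(s):
    """Canonical decomposition, which applies canonical ordering."""
    return unicodedata.normalize("NFD", s)

# Without CGJ: patah (cc=17) followed by hiriq (cc=14) is reordered.
plain = LAMED + PATAH + HIRIQ
plain_normalized = nfd(plain)

# With CGJ between the points: the class-0 character separates them,
# so neither mark is reordered past the other.
with_cgj = LAMED + PATAH + CGJ + HIRIQ
with_cgj_normalized = nfd(with_cgj)

print(plain_normalized == plain)        # order was changed
print(with_cgj_normalized == with_cgj)  # order was kept
```

The cost is that the sequence with CGJ is a distinct, non-equivalent
spelling, which is why this is only a partial solution.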