L2/05-127

Title: Proposed Revision of Text Regarding Combining Characters in Chapter 3 of the Unicode Standard
Date: May 4, 2005
Author: Ken Whistler

Background

I was charged in a series of action items with coming up with various clarifications regarding issues of combining mark interaction, combining classes, and canonical ordering. This arose as a result of a number of discussions focussed on the interaction of combining marks in Hebrew, Arabic, and Southeast Asian scripts, and various misunderstandings and arguments that have arisen based on incompatible interpretations of existing text and examples regarding these issues.

The draft text I propose below attempts to address these issues, and is provided for discussion, in the hope that the UTC can reach a general consensus regarding the direction I am proposing that the language be taken, so that it can then be turned back over to the editorial committee for further wordsmithing for addition to the Unicode 5.0 draft text in preparation.

The basic innovation here is to attempt to cut through the Gordian knot by sharply distinguishing formally between a "combining character sequence" and a "grapheme cluster", and between the notion of "dependence" of a combining mark on its base and "application" of a nonspacing mark on its grapheme base. I then restate most of the existing discussion in Section 3.11 regarding application of combining marks using the revised terminology, to eliminate a lot of the current waffling and confusion in that section.

"Combining class" is defined precisely in terms of the *property* -- which may seem tautological. But what that accomplishes is to remove the concept of typographical interaction from the definition per se. That was where we ran into most of the problems in the concept. It also means that typographical interaction can be independently defined, and we can then determine how it does (and does not) line up with the treatment of combining class.
The approach I have taken also makes it possible to distinguish between the general principles of graphical application of combining marks and the formal definition of canonical ordering. The latter is an algorithm based purely on combining character sequences and combining class values.

================== draft text, Section 3.6 additions =============

[[ To make sense, the rewrite of Section 3.11 requires the prior definition of grapheme cluster and related terms. As it stands currently, we are trying to talk about this without having terms defined, and then point to UAX #29, where the terms *also* aren't defined, but where there is a rule for finding boundaries, instead. ]]

[[ First, rewrite D15 to make it more precise: ]]

D15 Nonspacing mark: A combining character with the property [General_Category = Mn] or [General_Category = Me].

* The position of a nonspacing mark in presentation is dependent on its base character. It generally does not consume space along the visual baseline in and of itself.

[[ Retain all text from the existing bullet for D15 ]]

D15a Enclosing mark: A nonspacing mark with the property [General_Category = Me].

* Enclosing marks are a subclass of nonspacing marks which surround a base character, rather than merely being placed over, under, or through it.

[[ Retain all the text of D17 and D17a, and their bullets. ]]

[[ Next, add the following definitions: ]]

D17b Standard Korean syllable block: A sequence of one or more conjoining jamos and/or Hangul syllables which conforms to the specification of Section 3.12, "Conjoining Jamo Behavior".

* A standard Korean syllable block consists either of a precomposed Hangul syllable, its equivalent using conjoining jamos, or various extensions using conjoining jamos to form allowable Old Korean syllable blocks.

D17c Grapheme base: A character with the property [Grapheme_Base = True], or any standard Korean syllable block.
* Characters with the property [Grapheme_Base = True] include all base characters plus most spacing marks.

* The concept of a grapheme base is introduced to simplify discussion of the graphical application of nonspacing marks to other elements of text. Note that a grapheme base may consist of a spacing (combining) mark, which distinguishes it from a base character, per se. A grapheme base may also itself consist of a sequence of characters, in the case of the standard Korean syllable block.

D17d Grapheme extender: A character with the property [Grapheme_Extend = True].

* Grapheme extender characters consist of all nonspacing marks, ZERO WIDTH JOINER, ZERO WIDTH NON-JOINER, and a small number of spacing marks.

* A grapheme extender can be conceived of primarily as the kind of nonspacing graphical mark which gets applied above or below another spacing character.

* ZERO WIDTH JOINER and ZERO WIDTH NON-JOINER are formally defined to be grapheme extenders so that their presence does not break up a sequence of other grapheme extenders.

* The small number of spacing marks which have the property [Grapheme_Extend = True] are all the second parts of a two-part combining mark.

D17e Grapheme cluster: A maximal character sequence consisting of a grapheme base followed by zero or more grapheme extenders.

* The grapheme cluster represents a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it.

* A grapheme cluster is similar to, but not identical to, a combining character sequence. A combining character sequence starts with a base character, and extends across any subsequent sequence of combining marks, nonspacing or spacing. A combining character sequence is most directly relevant to processing issues related to normalization, comparison, and searching.
* A grapheme cluster starts with a grapheme base, and extends across any subsequent sequence of nonspacing marks. A grapheme cluster is most directly relevant to text rendering and such processes as cursor placement and text selection in editing. ======================= draft text, Section 3.11 ====================== 3.11 Canonical Ordering Behavior This section provides a formal statement of canonical ordering behavior, which determines, for the purposes of interpretation, which combining character sequences are to be considered equivalent. A precise definition of equivalence is required, so that text containing combining character sequences can be created and interchanged in a predictable way. When combining sequences contain multiple combining characters, different sequences can contain the same characters, but in a different order. Under certain circumstances two such sequences may be equivalent, even though they differ in the order of the combining characters. Canonical ordering is a process of specifying a defined order for sequences of combining marks, whereby it is possible to determine definitively which sequences are equivalent and which are not. Canonical ordering behavior, and more specifically, canonical ordering, is a required part of the normative specification of normalization for the Unicode Standard. See Unicode Standard Annex #15, "Unicode Normalization Forms." Canonical ordering is also a required part of the separate standard, Unicode Technical Standard #10, "Unicode Collation Algorithm." This section is structured in the following way. First, a set of normative principles regarding the application of combining characters are presented. Second, definitions are given for combining class and several related concepts. Finally, the Unicode algorithm for canonical ordering itself is specified. [[ The text draft to this point is to replace the first paragraph of the existing Section 3.11. ]]
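As an illustration of the point above that sequences differing only in the order of their combining characters may or may not be equivalent, here is a minimal sketch in Python. It assumes the standard library's unicodedata module as the source of normalization, and the helper name canonically_equivalent is hypothetical, chosen only for this example. It is not part of the proposed text.

```python
import unicodedata

def canonically_equivalent(s: str, t: str) -> bool:
    """Compare two strings after canonical decomposition and ordering (NFD)."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)

# One mark below and one mark above: the two orders are equivalent.
dot_below_first = "a\u0323\u0308"   # a, COMBINING DOT BELOW, COMBINING DIAERESIS
diaeresis_first = "a\u0308\u0323"   # a, COMBINING DIAERESIS, COMBINING DOT BELOW

# Two marks above: the two orders are NOT equivalent.
diaeresis_macron = "a\u0308\u0304"  # a, COMBINING DIAERESIS, COMBINING MACRON
macron_diaeresis = "a\u0304\u0308"  # a, COMBINING MACRON, COMBINING DIAERESIS

print(canonically_equivalent(dot_below_first, diaeresis_first))    # True
print(canonically_equivalent(diaeresis_macron, macron_diaeresis))  # False
```

The first pair differs only in the order of marks that sit in different positions, so canonical ordering maps both to the same sequence. The second pair stacks two marks in the same position, where the order is distinctive.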
Application of Combining Marks
There are a number of principles in the Unicode Standard regarding the application of combining marks. These principles are listed in this section, with an indication of which are considered to be normative and which are considered to be guidelines.

In particular, guidelines for rendering of combining marks in conjunction with other characters should be considered as appropriate for defining default rendering behavior, in the absence of more specific information about rendering. It is often the case that combining marks in complex scripts, or even particular, general-use nonspacing marks, will have rendering requirements that depart significantly from the general guidelines. Rendering processes should, as appropriate, make use of available information about specific typographic practices and conventions, in order to produce the best rendering of text.

To help in the clarification of the principles regarding the application of combining marks, a distinction is made between notional dependence and graphical application.

D46a Notional dependence: A combining mark is said to depend on its associated base character.

* The associated base character is the base character in the combining character sequence that a combining mark is part of.

* A combining mark in a defective combining character sequence has no associated base character, and thus cannot be said to depend on any particular base character. This is one of the reasons why fallback processing is required for defective combining character sequences.

* Notional dependence concerns all combining marks, including spacing marks and combining marks that have no visible display.

D46b Graphical application: A nonspacing mark is said to apply to its associated grapheme base.

* The associated grapheme base is the grapheme base in the grapheme cluster that a nonspacing mark is part of.
* A nonspacing mark in a defective combining character sequence is not part of a grapheme cluster, and is subject to the same kinds of fallback processing as for any defective combining character sequence. * Graphic application concerns visual rendering issues, and thus is an issue for nonspacing marks that have visible glyphs. Those glyphs interact, in rendering, with their grapheme base. Throughout the text of the standard, whenever the situation is clear, discussion of combining marks often simply talks about combining marks "applying" to their base. In the prototypical case, often illustrated, of a nonspacing accent mark applying to a single base character letter, this simplification is not problematical, because the nonspacing mark both depends (notionally) on its base character and simultaneously applies (graphically) to its grapheme base, affecting its display. The finer distinctions are needed when dealing with the edge cases, such as combining marks that have no display glyph, graphical application of nonspacing marks to Korean syllables, and the behavior of spacing combining marks. The distinction made here between notional dependence and graphical application does not preclude spacing marks or even sequences of base characters from having effects on neighboring characters in rendering. Thus, spacing forms of dependent vowels (matras) in Indic scripts, may trigger particular kinds of conjunct formation, or may be repositioned in ways that influence the rendering of other characters. (See Chapter 9, South Asian Script-I, for many examples.) Similarly, sequences of base characters may also form ligatures in rendering. (See "Cursive Connection and Ligatures" in Section 16.2, Layout Controls.) The following listing specifies the principles regarding application of combining marks. P1 [Normative] Combining character order: Combining characters follow the base character on which they depend. 
* This principle follows from the definition of a combining character sequence.

[[ Keep the following text from the existing bullet: ]]

* Thus the character sequence <U+0061, U+0308, U+0075> is unambiguously interpreted (and displayed) as "äu", not "aü".

P2 [Guideline] Inside-out application. Nonspacing marks with the same combining class are generally positioned graphically outward from the grapheme base to which they apply.

* The most numerous and important instances of this principle involve nonspacing marks applied either directly above or below a grapheme base.

* In a sequence of two nonspacing marks above a grapheme base, the first nonspacing mark is placed directly above the grapheme base, and the second is then placed above the first nonspacing mark.

* In a sequence of two nonspacing marks below a grapheme base, the first nonspacing mark is placed directly below the grapheme base, and the second is then placed below the first nonspacing mark.

* This rendering behavior for nonspacing marks can be generalized to sequences of any length, although practical considerations usually limit such sequences to no more than two or three marks above and/or below a grapheme base.

* The principle of inside-out application is also referred to as default stacking behavior for nonspacing marks.

P3 [Guideline] Side-by-side application. Notwithstanding the principle of inside-out application, some specific nonspacing marks may override the default stacking behavior and are positioned side-by-side over (or under) a grapheme base, rather than stacking vertically.

* Such side-by-side positioning may reflect language-specific orthographic rules, such as for Vietnamese diacritics and tone marks, or for polytonic Greek breathing and accent marks. For examples, see Section 2.10, Combining Characters.

* When positioned side-by-side, the visual rendering order of a sequence of nonspacing marks reflects the dominant order of the script with which they are used.
Thus in Greek, the first nonspacing mark in such a sequence will be positioned to the left side above a grapheme base, and the second to the right side above the grapheme base. In Hebrew, the opposite positioning is used for side-by-side placement.

P4 [Normative] Nondistinct order. Nonspacing marks with different, non-zero combining classes may occur in different orders without affecting either the visual display of a combining character sequence or the interpretation of that sequence.

* For example, if one nonspacing mark occurs above a grapheme base and another nonspacing mark occurs below, they will have distinct combining classes, and the order in which they occur in the combining character sequence does not matter for the display or interpretation of the resulting grapheme cluster.

* The introduction of the combining class for characters and its use in canonical ordering in the standard is to precisely define canonical equivalence, and thereby to clarify exactly which such alternate sequences must be considered as identical for display and interpretation.

P5 [Guideline] Enclosing marks surround their grapheme base and any intervening nonspacing marks.

* This implies that enclosing marks successively surround previous enclosing marks. See Figure 3-1.

[[ Retain Figure 3-1 here. ]]

* Dynamic application of enclosing marks, particularly sequences of enclosing marks, is beyond the capability of most fonts and simple rendering processes, so it is not unexpected to find fallback rendering in cases such as that illustrated in Figure 3-1.

P6 [Guideline] Double diacritic nonspacing marks, such as U+0360 COMBINING DOUBLE TILDE, apply to their grapheme base, but are intended to be rendered with glyphs that encompass a following grapheme base as well. See Figure 7-7 for an example.
* Because such double diacritic display spans combinations of elements which would otherwise be considered grapheme clusters, the support of double diacritics in rendering may involve special handling for cursor placement and text selection. P7 [Guideline] When double diacritic nonspacing marks interact with normal nonspacing marks in a grapheme cluster, they "float" to the outermost layer of the stack of rendered marks (either above or below). See Figure 7-8 for an example. * This behavior can be conceived of as a kind of looser binding of such double diacritics to their bases. In effect, all other nonspacing marks are applied first, and then the double diacritic will span the resulting stacks. * Double diacritic nonspacing marks are also given a very high combining class, so that in canonical order they appear at or near the end of any combining character sequence. * The interaction of enclosing marks and double diacritics is not well-defined graphically. It is unlikely that most fonts or rendering processes could handle combinations of these felicitously. It is not recommended to use combinations of these together in the same grapheme cluster. Combining Marks and Korean Syllables [[ Keep the current text from the Application of Combining Marks section on p. 85 of the 13 Jan 05 draft, from the paragraph starting "When a grapheme cluster comprises a Korean syllable..." to the paragraph ending "...that implementations do not follow it." ]] For more information on the recommended use of the combining grapheme joiner, see the subsection "Combining Grapheme Joiner" in Section 16.2, Layout Controls. For more discussion regarding the application of combining marks in general, see Section 7.9, Combining Marks.
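The distinction drawn in this section between a combining character sequence and a grapheme cluster can be made concrete with a small sketch. The code below is a simplification under stated assumptions, not the boundary rule of UAX #29: Python's unicodedata module does not expose the Grapheme_Extend property, so the hypothetical helper is_extender approximates it as [gc = Mn], [gc = Me], ZERO WIDTH JOINER, and ZERO WIDTH NON-JOINER. It ignores the small number of spacing marks that are grapheme extenders, and it does not group standard Korean syllable blocks.

```python
import unicodedata
from typing import List

ZWJ, ZWNJ = "\u200d", "\u200c"

def is_extender(ch: str) -> bool:
    """Approximate Grapheme_Extend (assumption: Mn, Me, ZWJ, ZWNJ only)."""
    return unicodedata.category(ch) in ("Mn", "Me") or ch in (ZWJ, ZWNJ)

def simple_grapheme_clusters(text: str) -> List[str]:
    """Split text into a grapheme base followed by zero or more extenders.

    A leading extender with no preceding base starts its own (defective)
    cluster, which a real implementation would hand to fallback processing.
    """
    clusters: List[str] = []
    for ch in text:
        if clusters and is_extender(ch):
            clusters[-1] += ch          # extend the current cluster
        else:
            clusters.append(ch)         # a new grapheme base starts a cluster
    return clusters

# Nonspacing marks stay with their grapheme base.
latin = simple_grapheme_clusters("e\u0301x\u0323\u0308y")

# A spacing mark (gc = Mc) is itself a grapheme base under these definitions,
# even though it belongs to the combining character sequence of the consonant.
devanagari = simple_grapheme_clusters("\u0915\u093e")  # KA, VOWEL SIGN AA

print(len(latin), len(devanagari))
```

In the Devanagari case the combining character sequence is the whole string, while the sketch yields two clusters. That is the kind of edge case for which the finer terminology is needed.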
Combining Classes
Each character in the Unicode Standard has a combining class associated with it. The combining class is a numerical value used by the canonical ordering algorithm to determine which sequences of combining marks are to be considered canonically equivalent and which are not. Canonical equivalence is the criterion for whether two alternate sequences are considered identical for interpretation. D46 Combining class: A numeric value in the range 0..255 given to each Unicode code point, formally defined as the property Canonical_Combining_Class. * The combining class for each encoded character in the standard is specified in the file UnicodeData.txt in the Unicode Character Database. Any code point not listed in that data file defaults to [Canonical_Combining_Class = 0] ( or [ccc = 0] for short). * An extracted listing of combining classes, sorted by numeric value, is provided in the file DerivedCombiningClass.txt in the Unicode Character Database. * Only combining marks have a combining class other than zero. Almost all combining marks with a class other than zero are also nonspacing marks, but there are a few exceptions. And not all nonspacing marks have a non-zero combining class. So while the correlation between ~[ccc = 0] and [gc = Mn] is close, it is not exact, and implementations should not depend on the two concepts being identical. D46c Fixed position class: A subset of the range of numeric values for combining classes, specifically any value in the range 10..199. * Fixed position classes are assigned to a small number of Hebrew, Arabic, Syriac, Telugu, Thai, Lao, and Tibetan combining marks whose position was conceived of as occurring in a fixed position with respect to their grapheme base, regardless of any other combining mark which might also apply to that grapheme base. * Not all Arabic vowel points or Indic matras are given fixed position classes. 
The existence of fixed position classes in the standard is an historical artifact of an earlier stage in its development, prior to the formal standardization of the Unicode Normalization Forms. D46d Typographic interaction: Graphical application of one nonspacing mark in a position relative to a grapheme base that is already occupied by another nonspacing mark, so that some rendering adjustment must be done (such as default stacking or side-by-side placement) to avoid illegible overprinting or crashing of glyphs. The assignment of combining class values for Unicode characters was originally done with the goal in mind of defining distinct numeric values for each group of nonspacing marks that would typographically interact. Thus all generic nonspacing marks above are given the value [ccc = 230], while all generic nonspacing marks below are given the value [ccc = 220]. Smaller numbers of nonspacing marks which tend to sit on one "shoulder" or another of a grapheme base, or which may actually be attached to the grapheme base itself when applied, have their own combining classes. When assigned this way, canonical ordering assures that, in general, alternate sequences of combining characters that typographically interact will not be canonically equivalent, whereas alternate sequences of combining characters that do not typographically interact will be canonically equivalent. This is roughly correct for the normal cases of detached, generic nonspacing marks placed above and below base letters. However, the ramifications of complex rendering for many scripts ensure that there are always some edge cases where there may be typographic interaction between combining marks of distinct combining classes. This has turned out to be particularly true for some of the fixed position classes for Hebrew and Arabic, for which a distinct combining class is no guarantee that there will be no typographic interaction for rendering. 
Because of these considerations, particular combining class values should only be taken as a guideline regarding issues of typographic interaction of combining marks. The only normative use of combining class values is as input to the canonical ordering algorithm, where they are used to normatively distinguish between sequences of combining marks that are canonically equivalent and those which are not. ============================================================ [[ And then finally, the subsection on canonical ordering and collation needs a rewrite to basically say that the Unicode Standard per se places no requirements, other than honoring canonical equivalence, and that further specifications are made in the UCA. ]] [[ We also need to further emphasize the difference between canonical order and such concepts as linguistic order or preferred order for ease of implementation in fonts, etc., and point to CGJ as a mechanism for interrupting canonical re-ordering in special cases. ]]
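To illustrate how combining class values feed canonical ordering, and how a character with [ccc = 0] such as the combining grapheme joiner interrupts reordering, the following sketch may be useful for discussion. It assumes Python's unicodedata.combining() as the source of Canonical_Combining_Class values, assumes its input is already canonically decomposed, and uses a hypothetical function name, canonical_reorder. It is an illustration only, not the normative statement of the algorithm.

```python
import unicodedata

def canonical_reorder(s: str) -> str:
    """Stable-sort each maximal run of characters with non-zero combining
    class by that class. Characters with ccc = 0 are never moved and act
    as barriers between runs. Input is assumed to be already decomposed."""
    out = []
    run = []
    for ch in s:
        if unicodedata.combining(ch) != 0:
            run.append(ch)
        else:
            # sorted() is stable, so marks of equal class keep their order.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

# Class values referred to in the text above.
print(unicodedata.combining("\u0308"))  # COMBINING DIAERESIS (generic above)
print(unicodedata.combining("\u0323"))  # COMBINING DOT BELOW (generic below)
print(unicodedata.combining("\u05b0"))  # HEBREW POINT SHEVA (fixed position)
print(unicodedata.combining("\u0360"))  # COMBINING DOUBLE TILDE

above_then_below = "a\u0308\u0323"
with_cgj = "a\u0308\u034f\u0323"        # U+034F COMBINING GRAPHEME JOINER

print(canonical_reorder(above_then_below) == "a\u0323\u0308")
print(canonical_reorder(with_cgj) == with_cgj)
```

Because the sort is stable and restricted to runs of non-zero classes, marks of the same class never change relative order, and the intervening CGJ keeps the diaeresis and the dot below from being exchanged.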