L2/05-127
Title: Proposed Revision of Text Regarding Combining Characters
in Chapter 3 of the Unicode Standard
Date: May 4, 2005
Author: Ken Whistler
Background
I was charged in a series of action items with coming up with
various clarifications regarding issues of combining mark
interaction, combining classes, and canonical ordering.
This arose as a result of a number of discussions focussed
on the interaction of combining marks in Hebrew, Arabic, and
Southeast Asian scripts, and various misunderstandings and
arguments that have arisen based on incompatible interpretations
of existing text and examples regarding these issues.
The draft text I propose below attempts to address these
issues, and is provided for discussion, in the hope that the
UTC can reach a general consensus regarding the direction
I am proposing that the language be taken, so that it can
then be turned back over to the editorial committee for
further wordsmithing for addition to the Unicode 5.0
draft text in preparation.
The basic innovation here is to attempt to cut through
the Gordian knot by sharply distinguishing formally
between a "combining character sequence" and a "grapheme
cluster", and between the notion of "dependence" of
a combining mark on its base and "application" of a
nonspacing mark on its grapheme base.
I then restate most of the existing discussion in Section
3.11 regarding application of combining marks using
the revised terminology, to eliminate a lot of the
current waffling and confusion in that section.
"Combining class" is defined precisely in terms of the
*property* -- which may seem tautological. But what that
accomplishes is to remove the concept of typographical
interaction from the definition per se. That was where
we ran into most of the problems in the concept. It also
means that typographical interaction can be independently
defined, and we can then determine how it does (and does
not) line up with the treatment of combining class.
The approach I have taken also makes it possible to
distinguish between the general principles of graphical
application of combining marks and the formal definition
of canonical ordering. The latter is an algorithm based
purely on combining character sequences and combining
class values.
================== draft text, Section 3.6 additions =============
[[ To make sense, the rewrite of Section 3.11 requires the
prior definition of grapheme cluster and related terms. As
it stands currently, we are trying to talk about this without
having terms defined, and then point to UAX #29, where
the terms *also* aren't defined, but where there is
a rule for finding boundaries, instead. ]]
[[ First, rewrite D15 to make it more precise: ]]
D15 Nonspacing mark: A combining character with the property
[General_Category = Mn] or [General_Category = Me].
* The position of a nonspacing mark in presentation is dependent
on its base character. It generally does not consume space along
the visual baseline in and of itself.
[[ Retain all text from the existing bullet for D15 ]]
D15a Enclosing mark: A nonspacing mark with the property
[General_Category = Me].
* Enclosing marks are a subclass of nonspacing marks which
surround a base character, rather than merely being placed
over, under, or through it.
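As an illustration only (not part of the proposed text), the property tests in D15 and D15a can be sketched in Python with the standard unicodedata module. The function names are hypothetical, and the results depend on the version of the Unicode Character Database bundled with the Python runtime.

```python
import unicodedata

def is_nonspacing_mark(ch: str) -> bool:
    """D15 (as drafted above): General_Category is Mn or Me."""
    return unicodedata.category(ch) in ("Mn", "Me")

def is_enclosing_mark(ch: str) -> bool:
    """D15a (as drafted above): General_Category is Me."""
    return unicodedata.category(ch) == "Me"

# U+0301 COMBINING ACUTE ACCENT is Mn, U+20DD COMBINING ENCLOSING CIRCLE
# is Me, and U+093E DEVANAGARI VOWEL SIGN AA is Mc (a spacing mark).
for ch in ("\u0301", "\u20DD", "\u093E"):
    print(f"U+{ord(ch):04X}", is_nonspacing_mark(ch), is_enclosing_mark(ch))
```

Under these definitions every enclosing mark is also a nonspacing mark, while a spacing mark such as U+093E is neither.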
[[ Retain all the text of D17 and D17a, and their bullets. ]]
[[ Next, add the following definitions: ]]
D17b Standard Korean syllable block: A sequence of one or more
conjoining jamos and/or Hangul syllables which conforms to the
specification of Section 3.12, "Conjoining Jamo Behavior".
* A standard Korean syllable block consists either of a precomposed
Hangul syllable, its equivalent using conjoining jamos, or various
extensions using conjoining jamos to form allowable Old Korean
syllable blocks.
D17c Grapheme base: A character with the property [Grapheme_Base = True],
or any standard Korean syllable block.
* Characters with the property [Grapheme_Base = True] include all
base characters plus most spacing marks.
* The concept of a grapheme base is introduced to simplify discussion
of the graphical application of nonspacing marks to other elements
of text. Note that a grapheme base may consist of a spacing
(combining) mark, which distinguishes it from a base character,
per se. A grapheme base may also itself consist of a sequence
of characters, in the case of the standard Korean syllable block.
D17d Grapheme extender: A character with the property [Grapheme_Extend
= True].
* Grapheme extender characters consist of all nonspacing marks,
ZERO WIDTH JOINER, ZERO WIDTH NON-JOINER, and a small number of
spacing marks.
* A grapheme extender can be conceived of primarily as the kind
of nonspacing graphical mark which gets applied above or below
another spacing character.
* ZERO WIDTH JOINER and ZERO WIDTH NON-JOINER are formally defined
to be grapheme extenders so that their presence does not break
up a sequence of other grapheme extenders.
* The small number of spacing marks which have the property
[Grapheme_Extend = True] are all the second parts of a
two-part combining mark.
D17e Grapheme cluster: A maximal character sequence consisting of a
grapheme base followed by zero or more grapheme extenders.
* The grapheme cluster represents a horizontally segmentable
unit of text, consisting of some grapheme base (which may
consist of a Korean syllable) together with any number of
nonspacing marks applied to it.
* A grapheme cluster is similar to, but not identical to, a combining
character sequence. A combining character sequence starts with
a base character, and extends across any subsequent
sequence of combining marks, nonspacing or spacing.
A combining character sequence is most directly relevant to
processing issues related to normalization, comparison, and searching.
* A grapheme cluster starts with a grapheme base,
and extends across any subsequent sequence of nonspacing
marks. A grapheme cluster is most directly relevant to text
rendering and such processes as cursor placement and text
selection in editing.
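As an illustration only, the contrast between a combining character sequence and a grapheme cluster can be sketched in Python. This is a rough approximation under stated assumptions: the standard library does not expose Grapheme_Base or Grapheme_Extend, so the sketch treats nonspacing marks plus ZWJ and ZWNJ as grapheme extenders, ignores the few spacing marks that also carry that property, ignores Korean syllable blocks, and treats every non-mark character as able to start a sequence. All function names are hypothetical.

```python
import unicodedata

ZWJ, ZWNJ = "\u200D", "\u200C"

def is_combining_mark(ch: str) -> bool:
    # Any combining mark, spacing or nonspacing (gc = Mn, Mc, or Me).
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")

def is_grapheme_extender_approx(ch: str) -> bool:
    # ASSUMPTION: approximates Grapheme_Extend as nonspacing marks plus
    # ZWJ/ZWNJ; the few spacing marks with the property are omitted.
    return unicodedata.category(ch) in ("Mn", "Me") or ch in (ZWJ, ZWNJ)

def segment(text: str, continues) -> list:
    """Split text into units; a character for which continues() is true
    is attached to the preceding unit, if there is one."""
    units = []
    for ch in text:
        if units and continues(ch):
            units[-1] += ch
        else:
            units.append(ch)
    return units

# "e" + COMBINING ACUTE ACCENT, then DEVANAGARI LETTER KA + VOWEL SIGN AA.
text = "e\u0301\u0915\u093E"
sequences = segment(text, is_combining_mark)           # 2 units
clusters = segment(text, is_grapheme_extender_approx)  # 3 units
print(len(sequences), len(clusters))
```

The spacing vowel sign U+093E stays inside the combining character sequence that starts with U+0915, but under the definitions drafted above it acts as a grapheme base of its own and so begins a new grapheme cluster.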
======================= draft text, Section 3.11 ======================
3.11 Canonical Ordering Behavior
This section provides a formal statement of canonical ordering
behavior, which determines, for the purposes of interpretation,
which combining character sequences are to be considered
equivalent. A precise definition of equivalence is required,
so that text containing combining character sequences can be
created and interchanged in a predictable way.
When combining sequences contain multiple combining characters,
different sequences can contain the same characters, but in a
different order. Under certain circumstances two such sequences may
be equivalent, even though they differ in the order of the combining
characters.
Canonical ordering is a process of specifying a defined order
for sequences of combining marks, whereby it is possible to
determine definitively which sequences are equivalent and
which are not.
Canonical ordering behavior, and more specifically, canonical
ordering, is a required part of the normative
specification of normalization for the Unicode Standard. See
Unicode Standard Annex #15, "Unicode Normalization Forms."
Canonical ordering is also a required part of the separate
standard, Unicode Technical Standard #10, "Unicode Collation
Algorithm."
This section is structured in the following way. First, a set of
normative principles regarding the application of combining
characters is presented. Second, definitions are given for
combining class and several related concepts. Finally,
the Unicode algorithm for canonical ordering itself is specified.
[[ The text draft to this point is to replace the first paragraph
of the existing Section 3.11. ]]
Application of Combining Marks
There are a number of principles in the Unicode Standard
regarding the application of combining marks. These
principles are listed in this section, with an indication
of which are considered to be normative and which are
considered to be guidelines.
In particular, guidelines for rendering of combining
marks in conjunction with other characters should be
considered as appropriate for defining default rendering
behavior, in the absence of more specific information
about rendering. It is often the case that combining
marks in complex scripts, or even particular general-use
nonspacing marks, will have rendering requirements that
depart significantly from the general guidelines.
Rendering processes should, as appropriate, make use of
available information about specific typographic practices
and conventions, in order to produce best rendering of
text.
To help in the clarification of the principles regarding
the application of combining marks, a distinction is
made between notional dependence and graphical
application.
D46a Notional dependence: A combining mark is said to
depend on its associated base character.
* The associated base character is the base character
in the combining character sequence that a combining
mark is part of.
* A combining mark in a defective combining character
sequence has no associated base character, and thus
cannot be said to depend on any particular base
character. This is one of the reasons why fallback
processing is required for defective combining character
sequences.
* Notional dependence concerns all combining
marks, including spacing marks and combining marks that
have no visible display.
D46b Graphical application: A nonspacing mark is said to
apply to its associated grapheme base.
* The associated grapheme base is the grapheme base in
the grapheme cluster that a nonspacing mark is part of.
* A nonspacing mark in a defective combining character
sequence is not part of a grapheme cluster, and is
subject to the same kinds of fallback processing as
for any defective combining character sequence.
* Graphical application concerns visual rendering issues,
and thus is an issue for nonspacing marks that have
visible glyphs. Those glyphs interact, in rendering,
with their grapheme base.
Throughout the text of the standard, whenever the situation
is clear, discussion of combining marks often simply
talks about combining marks "applying" to their base.
In the prototypical case, often illustrated, of a nonspacing
accent mark applying to a single base character letter, this
simplification is not problematical, because the nonspacing
mark both depends (notionally) on its base character and
simultaneously applies (graphically) to its grapheme base,
affecting its display. The finer distinctions are needed
when dealing with the edge cases, such as combining marks
that have no display glyph, graphical application of nonspacing
marks to Korean syllables, and the behavior of spacing
combining marks.
The distinction made here between notional dependence and
graphical application does not preclude spacing marks or
even sequences of base characters from having effects on
neighboring characters in rendering. Thus, spacing forms
of dependent vowels (matras) in Indic scripts
may trigger particular kinds of conjunct formation, or
may be repositioned in ways that influence the rendering
of other characters. (See Chapter 9, South Asian Scripts-I,
for many examples.) Similarly, sequences of base characters may
also form ligatures in rendering. (See "Cursive Connection
and Ligatures" in Section 16.2, Layout Controls.)
The following listing specifies the principles
regarding application of combining marks.
P1 [Normative] Combining character order: Combining characters
follow the base character on which they depend.
* This principle follows from the definition of a combining
character sequence.
[[ Keep the following text from the existing bullet: ]]
* Thus the character sequence is unambiguously interpreted (and displayed)
as "äu", not "aü".
P2 [Guideline] Inside-out application: Nonspacing marks with
the same combining class are generally positioned graphically
outward from the grapheme base to which they apply.
* The most numerous and important instances of this principle
involve nonspacing marks applied either directly above or below a
grapheme base.
* In a sequence of two nonspacing marks above a grapheme base,
the first nonspacing mark is placed directly above the
grapheme base, and the second is then placed above the
first nonspacing mark.
* In a sequence of two nonspacing marks below a grapheme base,
the first nonspacing mark is placed directly below the
grapheme base, and the second is then placed below the
first nonspacing mark.
* This rendering behavior for nonspacing marks can be generalized
to sequences of any length, although practical considerations
usually limit such sequences to no more than two or three
marks above and/or below a grapheme base.
* The principle of inside-out application is also referred to
as default stacking behavior for nonspacing marks.
P3 [Guideline] Side-by-side application: Notwithstanding the
principle of inside-out application, some specific nonspacing
marks may override the default stacking behavior and are
positioned side-by-side over (or under) a grapheme base,
rather than stacking vertically.
* Such side-by-side positioning may reflect language-specific
orthographic rules, such as for Vietnamese diacritics and
tone marks, or for polytonic Greek breathing and accent marks.
For examples, see Section 2.10, Combining Characters.
* When positioned side-by-side, the visual rendering order of
a sequence of nonspacing marks reflects the dominant order
of the script with which they are used. Thus in Greek, the
first nonspacing mark in such a sequence will be positioned
to the left side above a grapheme base, and the second to
the right side above the grapheme base. In Hebrew, the
opposite positioning is used for side-by-side placement.
P4 [Normative] Nondistinct order: Nonspacing marks with different,
non-zero combining classes may occur in different orders
without affecting either the visual display of a combining
character sequence or the interpretation of that sequence.
* For example, if one nonspacing mark occurs above a grapheme
base and another nonspacing mark occurs below, they will
have distinct combining classes, and the order in which
they occur in the combining character sequence does not
matter for the display or interpretation of the resulting
grapheme cluster.
* The combining class for characters was introduced, and is used
in canonical ordering in the standard, to precisely define
canonical equivalence, and thereby to clarify exactly which
such alternate sequences must be considered as identical for
display and interpretation.
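As an illustration only, principle P4 can be observed through the normalization support in Python's standard unicodedata module. The helper name is hypothetical; comparing NFD forms is one common way to test canonical equivalence, not the only one.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    # Two strings are canonically equivalent if their NFD forms match.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

ACUTE = "\u0301"      # nonspacing mark above, ccc = 230
DIAERESIS = "\u0308"  # nonspacing mark above, ccc = 230
DOT_BELOW = "\u0323"  # nonspacing mark below, ccc = 220

# Different non-zero combining classes: the order is not distinctive.
print(canonically_equivalent("a" + ACUTE + DOT_BELOW,
                             "a" + DOT_BELOW + ACUTE))

# Same combining class: the order is distinctive.
print(canonically_equivalent("a" + ACUTE + DIAERESIS,
                             "a" + DIAERESIS + ACUTE))
```

The first comparison reports equivalence because the two marks have distinct combining classes; the second does not, because marks of the same class are never reordered.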
P5 [Guideline] Enclosing marks surround their grapheme base
and any intervening nonspacing marks.
* This implies that enclosing marks successively surround
previous enclosing marks. See Figure 3-1.
[[ Retain Figure 3-1 here. ]]
* Dynamic application of enclosing marks, particularly
sequences of enclosing marks, is beyond the capability
of most fonts and simple rendering processes, so it is
not unexpected to find fallback rendering in cases such
as that illustrated in Figure 3-1.
P6 [Guideline] Double diacritic nonspacing marks, such as
U+0360 COMBINING DOUBLE TILDE, apply to their grapheme base,
but are intended to be rendered with glyphs that encompass
a following grapheme base as well. See Figure 7-7 for an
example.
* Because such double diacritic display spans combinations
of elements which would otherwise be considered grapheme
clusters, the support of double diacritics in rendering
may involve special handling for cursor placement and
text selection.
P7 [Guideline] When double diacritic nonspacing marks interact
with normal nonspacing marks in a grapheme cluster, they
"float" to the outermost layer of the stack of rendered
marks (either above or below). See Figure 7-8 for an example.
* This behavior can be conceived of as a kind of looser binding
of such double diacritics to their bases. In effect, all
other nonspacing marks are applied first, and then the
double diacritic will span the resulting stacks.
* Double diacritic nonspacing marks are also given a very
high combining class, so that in canonical order they appear
at or near the end of any combining character sequence.
* The interaction of enclosing marks and double diacritics
is not well-defined graphically. It is unlikely that most
fonts or rendering processes could handle combinations of
these felicitously. It is not recommended to use combinations
of these together in the same grapheme cluster.
Combining Marks and Korean Syllables
[[ Keep the current text from the Application of Combining Marks
section on p. 85 of the 13 Jan 05 draft, from the paragraph
starting "When a grapheme cluster comprises a Korean syllable..."
to the paragraph ending "...that implementations do not follow it." ]]
For more information on the recommended use of the
combining grapheme joiner, see the subsection "Combining
Grapheme Joiner" in Section 16.2, Layout Controls.
For more discussion regarding the application of combining
marks in general, see Section 7.9, Combining Marks.
Each character in the Unicode Standard has a combining class
associated with it. The combining class is a numerical value
used by the canonical ordering algorithm to determine which
sequences of combining marks are to be considered canonically
equivalent and which are not. Canonical equivalence is
the criterion for whether two alternate sequences are considered
identical for interpretation.
D46 Combining class: A numeric value in the range 0..255 given
to each Unicode code point, formally defined as the
property Canonical_Combining_Class.
* The combining class for each encoded character in the standard
is specified in the file UnicodeData.txt in the Unicode
Character Database. Any code point not listed in that data
file defaults to [Canonical_Combining_Class = 0] (or [ccc = 0]
for short).
* An extracted listing of combining classes, sorted by numeric
value, is provided in the file DerivedCombiningClass.txt in
the Unicode Character Database.
* Only combining marks have a combining class other than zero.
Almost all combining marks with a class other than zero are
also nonspacing marks, but there are a few exceptions. And
not all nonspacing marks have a non-zero combining class.
So while the correlation between ~[ccc = 0] and [gc = Mn]
is close, it is not exact, and implementations should not
depend on the two concepts being identical.
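As an illustration only, combining class values can be inspected in Python with unicodedata.combining(), which returns the Canonical_Combining_Class of a character. The sample characters below are chosen here to show that a non-zero class and [gc = Mn] do not always coincide; the helper name is hypothetical, and the values reflect the Unicode Character Database version bundled with the Python runtime.

```python
import unicodedata

def describe(ch: str) -> tuple:
    """Return (code point, General_Category, Canonical_Combining_Class)."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch),
            unicodedata.combining(ch))

samples = [
    "a",           # base letter: ccc = 0
    "\u0301",      # COMBINING ACUTE ACCENT: Mn, ccc = 230
    "\u0323",      # COMBINING DOT BELOW: Mn, ccc = 220
    "\u0941",      # DEVANAGARI VOWEL SIGN U: Mn, but ccc = 0
    "\U0001D165",  # MUSICAL SYMBOL COMBINING STEM: Mc, but ccc = 216
]
for ch in samples:
    print(*describe(ch))
```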
D46c Fixed position class: A subset of the range of numeric values
for combining classes, specifically any value in the range
10..199.
* Fixed position classes are assigned to a small number of
Hebrew, Arabic, Syriac, Telugu, Thai, Lao, and Tibetan
combining marks whose position was conceived of as occurring
in a fixed position with respect to their grapheme base,
regardless of any other combining mark which might also apply
to that grapheme base.
* Not all Arabic vowel points or Indic matras are given fixed
position classes. The existence of fixed position classes
in the standard is an historical artifact of an earlier stage
in its development, prior to the formal standardization of
the Unicode Normalization Forms.
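As an illustration only, the range test in D46c can be sketched in Python. The function name is hypothetical, and the sample characters are chosen here for illustration rather than taken from the proposal.

```python
import unicodedata

def has_fixed_position_class(ch: str) -> bool:
    """D46c (as drafted above): combining class in the range 10..199."""
    return 10 <= unicodedata.combining(ch) <= 199

samples = [
    "\u05B0",  # HEBREW POINT SHEVA, ccc = 10
    "\u064B",  # ARABIC FATHATAN, ccc = 27
    "\u0E38",  # THAI CHARACTER SARA U, ccc = 103
    "\u0F71",  # TIBETAN VOWEL SIGN AA, ccc = 129
    "\u094D",  # DEVANAGARI SIGN VIRAMA, ccc = 9 (outside the range)
    "\u0301",  # COMBINING ACUTE ACCENT, ccc = 230 (outside the range)
]
for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.combining(ch),
          has_fixed_position_class(ch))
```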
D46d Typographic interaction: Graphical application of one
nonspacing mark in a position relative to a grapheme base
that is already occupied by another nonspacing mark, so
that some rendering adjustment must be done (such as
default stacking or side-by-side placement) to avoid
illegible overprinting or crashing of glyphs.
The assignment of combining class values for Unicode characters was
originally done with the goal in mind of defining distinct numeric
values for each group of nonspacing marks that would typographically
interact. Thus all generic nonspacing marks above are given the
value [ccc = 230], while all generic nonspacing marks below are
given the value [ccc = 220]. Smaller numbers of nonspacing marks
which tend to sit on one "shoulder" or another of a grapheme base,
or which may actually be attached to the grapheme base itself when
applied, have their own combining classes.
When assigned this way, canonical ordering assures that, in
general, alternate sequences of combining characters that
typographically interact will not be canonically equivalent,
whereas alternate sequences of combining characters that do not
typographically interact will be canonically equivalent.
This is roughly correct for the normal cases of detached, generic
nonspacing marks placed above and below base letters. However,
the ramifications of complex rendering for many scripts ensure
that there are always some edge cases where there may be
typographic interaction between combining marks of distinct
combining classes. This has turned out to be particularly true
for some of the fixed position classes for Hebrew and Arabic,
for which a distinct combining class is no guarantee that there
will be no typographic interaction for rendering.
Because of these considerations, particular combining class
values should only be taken as a guideline regarding issues
of typographic interaction of combining marks.
The only normative use of combining class values is
as input to the canonical ordering algorithm, where they are
used to normatively distinguish between sequences of combining
marks that are canonically equivalent and those which are not.
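As an illustration only, the way combining class values feed canonical ordering can be sketched in Python. The algorithm text itself is not part of this draft; the sketch follows the commonly described reordering step, in which adjacent characters A, B are exchanged whenever ccc(A) > ccc(B) > 0, implemented here as a stable sort over each run of characters with non-zero combining class. It assumes the input is already fully decomposed, and the function name is hypothetical.

```python
import unicodedata

def canonical_reorder(text: str) -> str:
    """Stable-sort each run of characters with ccc != 0 by combining class.

    ASSUMPTION: the input is already canonically decomposed; this sketch
    performs only the reordering step, not decomposition.
    """
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A character with ccc = 0 ends the current run of marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    # Flush any marks remaining at the end of the text.
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

# Acute (ccc = 230) followed by dot below (ccc = 220) is reordered.
s = "a\u0301\u0323"
print([f"U+{ord(c):04X}" for c in canonical_reorder(s)])
```

Because Python's sort is stable, marks with the same combining class keep their relative order, which is what makes their order distinctive for canonical equivalence.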
============================================================
[[ And then finally, the subsection on canonical ordering and collation
needs a rewrite to basically say that the Unicode Standard per
se places no requirements, other than honoring canonical equivalence,
and that further specifications are made in the UCA. ]]
[[ We also need to further emphasize the difference between canonical
order and such concepts as linguistic order or preferred order for
ease of implementation in fonts, etc., and point to CGJ as a
mechanism for interrupting canonical re-ordering in special
cases. ]]