L2/08-047

Title: Update Conformance Clauses for Definition of ECCS
Date: January 25, 2008
Source: Ken Whistler
Action: For discussion by joint UTC/L2 meeting
References: L2/07-389, UAX #29

Background

At the last UTC meeting, during the discussion of the Proposed Update for UAX #29 for Unicode 5.1, we came to a certain degree of consensus on the issues raised by Mark Davis in L2/07-389 about generalizing the notions of combining character sequence (CCS) and default grapheme cluster (DGC), as used in UAX #29.

The original thrust of Mark's proposal was to extend the definition of DGC to include spacing combining marks as well as nonspacing combining marks, which would align the concept better with CCS and with certain Indic syllable parts, and then to go beyond that to take Thai/Lao visual order vowels into account as well.

I objected to some of the terminological suggestions, and also made the point that we needed to focus first on a problem apparent in the UAX #29 tables: we were having to repeat the construction of combining character sequences *and* Hangul jamo sequences, because we don't have concepts or terms to deal with the fact that we treat Hangul syllables (even if composed of conjoining jamos) as unsegmentable bases for combining sequences. I suggested terms such as "Extended Base" and "Extended Combining Character Sequence (XCCS)" to account for this.

And in the end, Mark got an action item:

113-A020, Mark Davis & editorial committee, Add informative material re use of XCCS and default grapheme clusters in PU UAX #29.

When attempting to carry this out, however, the editorial committee ran into some rocky ground. We were still in a bit of a terminological mess in UAX #29, in part because UAX #29 is about segmentation algorithms, and defining what results between boundaries defined by a segmentation algorithm, including all the edge case conditions, is a rather different task than defining a combining character sequence or an extended combining character sequence per se.

Mark has updated the Proposed Update draft for UAX #29, incorporating some of the clarifications involved, and some new terminology about extended grapheme clusters. He incorporated review notes on the terminology issues involving XCCS, which still need UTC input.

One of the things that Mark and I concluded, when working on this, is that the formal definitions of XCCS and Extended Base really belong in Chapter 3 of the standard, along with Combining Character Sequence itself, rather than in UAX #29. This makes for a much cleaner terminological relationship between CCS and XCCS, and separates those definitions from the complications of defining the boundary determination rules for the segmentation algorithms in UAX #29.

So this document is my contribution towards specifying the exact definitions of XCCS and Extended Base needed for Chapter 3, to complement the text needed in the Proposed Update for UAX #29.

Conceptual Framework

Before proposing the actual language of the definitions in Chapter 3, here is a brief summary of the framework for these terms, comparing the concepts of CCS and XCCS (for Chapter 3) with the concepts of Grapheme Cluster and Extended Grapheme Cluster (for UAX #29).

Combining Character Sequence (CCS)

Conceptually, this is a base character, followed by any number of combining marks. Actually, we've extended this concept a little now, to also allow for joiners. So the formal definition is:

CCS: Base? (M | ZWJ | ZWNJ)+

Note that a single base character not followed by any combining marks or a joiner is not a CCS -- a situation that causes some difficulty for use of the CCS concept in certain types of algorithms. Also, we have a special term, "Defective CCS", for a sequence of combining marks without a base character.

Note also that while we have defined canonical equivalence between Hangul syllables and sequences of conjoining jamo characters, the conjoining jamos are not formally combining marks (gc=M), but letters (gc=Lo). And that means that a sequence of conjoining jamos is just a sequence of two (or three) base characters, even though it is canonically equivalent to a Hangul syllable, which itself is a single base character.
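As a rough illustration of the CCS expression above, here is a minimal Python sketch. It is not normative text. It assumes that "Base" means any graphic character (gc=L, M, N, P, S, or Zs) whose General Category is not M, and it tests whether a whole string matches the expression rather than finding maximal sequences inside running text. The function names are hypothetical.

```python
import unicodedata

ZWJ, ZWNJ = "\u200D", "\u200C"

def is_base(ch):
    # Assumption: a base is a graphic character (gc L, M, N, P, S, Zs)
    # that is not a combining mark (gc=M).
    gc = unicodedata.category(ch)
    return gc[0] in "LNPS" or gc == "Zs"

def is_mark_or_joiner(ch):
    # gc=M covers Mn, Mc and Me; the two joiners are listed explicitly.
    return unicodedata.category(ch).startswith("M") or ch in (ZWJ, ZWNJ)

def is_ccs(s):
    """True if the whole string s matches Base? (M | ZWJ | ZWNJ)+."""
    rest = s[1:] if s and is_base(s[0]) else s
    return len(rest) > 0 and all(is_mark_or_joiner(c) for c in rest)

def is_defective_ccs(s):
    """True if s is a CCS that does not start with a base character."""
    return is_ccs(s) and not is_base(s[0])
```

Under these assumptions, "a" followed by U+0301 is a CCS, a lone "a" is not, a lone U+0301 is a defective CCS, and the jamo pair U+1100 U+1161 is not a CCS because both characters are gc=Lo.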

Extended Combining Character Sequence (ECCS)

Conceptually, this just extends the concept of the CCS to incorporate all Hangul syllables as extended bases. So the formal definitions would be:

Extended_Base: (Base | Standard_Korean_Syllable_Block)

Where Standard_Korean_Syllable_Block is already defined in D119. Then for ECCS itself, you get:

ECCS: Extended_Base? (M | ZWJ | ZWNJ)+

(I now suggest "ECCS" instead of "XCCS" to avoid mixup with the Xerox Coded Character Set, also used in Unicode discussions, and so the "E" in ECCS is parallel to the "E" in EGC below.)
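The following Python sketch illustrates how the ECCS expression differs from CCS once Hangul is treated as an extended base. It rests on several assumptions that are not part of this proposal: a standard Korean syllable block is approximated as L+ V+ T* after canonically decomposing precomposed syllables; the L, V and T ranges are taken as fixed partitions of the Hangul Jamo block U+1100..U+11FF; and "Base" is any graphic character that is not gc=M. All function names are hypothetical.

```python
import re
import unicodedata

ZWJ, ZWNJ = "\u200D", "\u200C"

# Assumed partition of the Hangul Jamo block into L, V and T ranges.
L, V, T = "\u1100-\u115F", "\u1160-\u11A7", "\u11A8-\u11FF"
SYLLABLE_BLOCK = re.compile(f"[{L}]+[{V}]+[{T}]*")

def is_base(ch):
    # Assumption: graphic character (gc L, N, P, S, Zs) that is not gc=M.
    gc = unicodedata.category(ch)
    return gc[0] in "LNPS" or gc == "Zs"

def is_mark_or_joiner(ch):
    return unicodedata.category(ch).startswith("M") or ch in (ZWJ, ZWNJ)

def expand_hangul(s):
    """Replace each precomposed Hangul syllable by its canonical jamo sequence."""
    return "".join(
        unicodedata.normalize("NFD", ch) if "\uAC00" <= ch <= "\uD7A3" else ch
        for ch in s
    )

def split_extended_base(s):
    """Return (extended_base, rest) in jamo-expanded form; base is '' if absent."""
    jamo = expand_hangul(s)
    m = SYLLABLE_BLOCK.match(jamo)
    if m:
        return jamo[:m.end()], jamo[m.end():]
    if jamo and is_base(jamo[0]):
        return jamo[0], jamo[1:]
    return "", jamo

def is_eccs(s):
    """True if the whole string s matches Extended_Base? (M | ZWJ | ZWNJ)+."""
    _, rest = split_extended_base(s)
    return len(rest) > 0 and all(is_mark_or_joiner(c) for c in rest)
```

With these assumptions, the sequence U+1100 U+1161 U+0301 is an ECCS, since the two jamos form one extended base, whereas the same sequence is not a CCS. A bare syllable block with nothing following it is still not an ECCS, just as a bare base character is not a CCS.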

Turning then to UAX #29, what we need there are rules for "character" segmentation which produce breaks reasonably consistent with what end users perceive as "characters" (User-perceived characters, or UPC).

The existing rules for this in UAX #29 make use of the Grapheme_Cluster_Break property (defined in GraphemeBreakProperty.txt). Some of the values of that property derive from other properties (Grapheme_Extend in PropList.txt, gc=Mc in UnicodeData.txt, and L, V, T, etc. values related to HangulSyllableType.txt). A Grapheme Cluster is then defined as:

GC: (CRLF | (Hangul-Syllable | !Control) Extend* | . )

And to extend that notion to include all spacing combining marks, as well as the existing gcb=Extend characters, you get, for Extended Grapheme Cluster:

EGC: (CRLF | (Hangul-Syllable | !Control) (Extend | SpacingMark)* | . )

Note that these expressions are designed to produce meaningful breaks for a character segmentation algorithm, and are rather different from the definitions of CCS and ECCS. In particular, the definition of "!Control" is rather broader than Base character, and the rules will produce breaks around various edge case characters, for completeness, that would have no relationship to the definitions of combining character sequences. For example, these rules will break around a single, invisible, combining, default ignorable code point like U+FE00 VARIATION SELECTOR-1. U+FE00 is not a base character, is not a graphic character, and is not a (well-formed) CCS.
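To make the contrast between the GC and EGC expressions concrete, here is a simplified Python sketch that maps each code point to a one-letter class and then applies the two expressions as regular expressions over the class string. The property assignments are loud approximations, not the data in GraphemeBreakProperty.txt: Extend is approximated as gc=Mn, gc=Me, ZWJ and ZWNJ; SpacingMark as gc=Mc; Control as the remaining gc=Cc, Cf, Zl, Zp; and the jamo ranges are an assumed partition of U+1100..U+11FF. The Hangul-Syllable sub-expression and all names are likewise assumptions for illustration.

```python
import re
import unicodedata

def gcb_class(ch):
    """Approximate Grapheme_Cluster_Break value as a single letter."""
    cp, gc = ord(ch), unicodedata.category(ch)
    if ch == "\r": return "R"
    if ch == "\n": return "N"
    if ch in "\u200C\u200D": return "E"          # assumption: joiners extend
    if gc in ("Cc", "Cf", "Zl", "Zp"): return "C"
    if gc in ("Mn", "Me"): return "E"            # approximates Grapheme_Extend
    if gc == "Mc": return "S"                    # approximates SpacingMark
    if 0x1100 <= cp <= 0x115F: return "l"
    if 0x1160 <= cp <= 0x11A7: return "v"
    if 0x11A8 <= cp <= 0x11FF: return "t"
    if 0xAC00 <= cp <= 0xD7A3:                   # precomposed LV or LVT
        return "a" if (cp - 0xAC00) % 28 == 0 else "b"
    return "O"

# Assumed shape of Hangul-Syllable over the class letters above.
HANGUL = r"l*(?:v+|av*|b)t*|l+|t+"

def clusters(text, extended=False):
    """Split text using the GC expression, or the EGC one if extended=True."""
    classes = "".join(gcb_class(c) for c in text)
    tail = "[ES]*" if extended else "E*"
    pattern = re.compile(rf"RN|(?:{HANGUL}|[^RNC]){tail}|.")
    # One class letter per code point, so match offsets index text directly.
    return [text[m.start():m.end()] for m in pattern.finditer(classes)]
```

In this sketch, U+0915 followed by the spacing mark U+093F yields two clusters under the GC expression and one under the EGC expression, and a lone U+FE00 comes out as a cluster by itself even though it is neither a base character nor a well-formed CCS.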

Hence my concern with keeping these terminological extensions clear and distinct, between Chapter 3 and UAX #29. The Chapter 3 terms are *definitional*, and need not be expected to produce complete results for a segmentation algorithm. The UAX #29 terms are *operational*, and need to produce complete results for a segmentation algorithm, but the entity that ends up between boundaries need not be meaningful as a definitional term for the standard.

Mark and I still differ in opinion somewhat about the exact terminology to use in UAX #29 for the character segmentation rules. My suggestion is to use "GC" and "EGC" as specced above, in particular because we already have such terms also in Chapter 3, and then to use the term "Default Grapheme Cluster (DGC)" to mean the results you get when applying the GC boundary rules with no tailoring. "Default Extended Grapheme Cluster (DEGC)" would mean the results you get when applying the EGC boundary rules with no tailoring. And if you tailor the rules, you get tailored grapheme clusters, and so on.

Suggested Text for Chapter 3 Definitions

O.k., now here is the actual text I suggest be added to Chapter 3 of the Unicode Standard, for the new definitions of Extended Base and Extended Combining Character Sequence.


D51a Extended base: Any base character, or any standard Korean
     syllable block.
     
     * This term is defined to take into account the fact
       that sequences of Korean conjoining jamo characters
       behave as if they were a single Hangul syllable
       character, so that the entire sequence of jamos
       itself constitutes a base. 
     
     * For the definition of standard Korean syllable block,
       see D117 in Section 3.12, Conjoining Jamo Behavior.

D56a Extended combining character sequence: A maximal
     character sequence consisting of either an extended
     base followed by a sequence of one or more characters
     where each is a combining character, ZERO WIDTH JOINER,
     or ZERO WIDTH NON-JOINER; or a sequence of one or
     more characters where each is a combining character,
     ZERO WIDTH JOINER, or ZERO WIDTH NON-JOINER.
     
     * Combining character sequence is commonly abbreviated
       as CCS, and extended combining character sequence
       is commonly abbreviated as ECCS.

Then I think the current definitions of Grapheme cluster (D60) and Extended grapheme cluster (D61) need a little work, as well. To minimize the changes required, I would suggest merely:


D60 Grapheme cluster: The text between grapheme cluster boundaries
    as specified by Unicode Standard Annex #29, "Text Boundaries."
    
D61 Extended grapheme cluster: The text between grapheme cluster boundaries
    as specified by Unicode Standard Annex #29, "Text Boundaries,"
    extended to include spacing combining marks.

And then updating the bullet items under D60 and D61 to bring them better into line with the current text of UAX #29, as amended in the Proposed Update.

--Ken