From: karl williamson (public@khwilliamson.com)
Date: Tue Nov 24 2009 - 16:48:34 CST
Thanks for your reply. I'm afraid I'm still confused.
The sentence before Table 1b is the first mention in this document of
combining character sequences; it would be nice if it explained what
they are and why they are mentioned at all. In the past, I just
presumed they were an earlier concept that was superseded by grapheme
clusters.
They are discussed briefly in section 3.6 of the actual standard, but
there seem to me to be contradictions there:
"• A grapheme cluster is similar, but not identical to a combining
character sequence. A combining character sequence starts with a base
character and extends across any subsequent sequence of combining marks,
nonspacing or spacing. A combining character sequence is most directly
relevant to processing issues related to normalization, comparison, and
searching.
• A grapheme cluster starts with a grapheme base and extends across any
subsequent sequence of nonspacing marks. A grapheme cluster is most
directly relevant to text rendering and such processes as cursor
placement and text selection in editing."
This seems to me to imply that a base character is always the first item
of a combining character sequence, and the word 'any' seems to me to
imply 0 or more marks following it. The formal definition earlier in
the section, however, does match the one in the table we're
discussing. I do see why a mark in isolation could be coerced into
being considered as a base character in that context.
And this doesn't help me understand why there is the concept of a
combining character sequence and why that is more relevant than a
grapheme cluster to normalization, comparison, and searching.
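To make the two readings above concrete, here is a minimal Python sketch. It is not taken from the standard or from tr29: it assumes "Mark" means General_Category M*, approximates "base" as any letter, number, punctuation, symbol, or space separator that is not a mark, and the function names are hypothetical.

```python
import unicodedata

ZWJ, ZWNJ = "\u200d", "\u200c"

def is_mark(ch):
    # Assumption: "Mark" means General_Category Mn, Mc, or Me.
    return unicodedata.category(ch).startswith("M")

def is_base(ch):
    # Assumption: a base is a graphic character that is not a mark,
    # approximated here as categories L*, N*, P*, S*, and Zs.
    cat = unicodedata.category(ch)
    return cat[0] in "LNPS" or cat == "Zs"

def is_ccs_table(s):
    # Table reading: base? ( Mark | ZWJ | ZWNJ )+
    rest = s[1:] if s and is_base(s[0]) else s
    return len(rest) > 0 and all(is_mark(c) or c in (ZWJ, ZWNJ) for c in rest)

def is_ccs_bullet(s):
    # Reading of the section 3.6 bullet: a base followed by zero or more marks.
    return len(s) > 0 and is_base(s[0]) and all(is_mark(c) for c in s[1:])

for text in ("A", "A\u0301", "\u0301"):
    print(repr(text), is_ccs_table(text), is_ccs_bullet(text))
```

Under these assumptions the two readings disagree on exactly the cases in question: a lone 'A' satisfies only the bullet reading, a lone U+0301 satisfies only the table reading, and 'A' followed by U+0301 satisfies both.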
verdy_p wrote:
> The definition is correct, and explained in the table which says "A single base character is **not** a combining
> character sequence."
>
> The table makes distinctions between the four cases, defined without overlaps, that can make (when joined
> **together** in a union) a single grapheme cluster.
I don't understand your statement above.
>
> Your conclusion is wrong, because a single letter 'A' is defined as a "legacy grapheme cluster" and a "legacy
> grapheme cluster ***is*** a grapheme cluster:
>
> ( CRLF
> | ( Hangul-syllable | !Control )
> Grapheme_Extend*
> | . )
>
> because it matches "!Control". The same row in the table says that "A single base character is a grapheme cluster".
>
> And this is also said earlier, in section 3, just below Table 1a:
> "A legacy grapheme cluster is defined as a base (such as A or カ) followed by zero or more continuing characters."
>
> The "legacy grapheme clusters" are the simplest and most common forms of grapheme clusters, recognized in almost all
> applications. Don't interpret "legacy" as meaning "included just for compatibility" or "still supported but
> not recommended"; it just means the most restrictive definition, used in most legacy applications that don't recognize
> the other forms.
Earlier in tr29 it says, "The extended grapheme cluster boundaries are
recommended for general processing, while the legacy grapheme cluster
boundaries are maintained for backwards compatibility with earlier
versions of this specification."
>
> The same can be said about the extended grapheme clusters that **are** also grapheme clusters.
>
> Philippe.
>
>> Message of 24/11/09 03:05
>> From: "karl williamson"
>
>> To: "unicode@unicode.org"
>> Cc:
>> Subject: ? Wrong definitions for combining character sequence in tr 29
>>
>>
>> It is defined as
>> base? ( Mark | ZWJ | ZWNJ )+
>>
>> That means that a mark is required. So the letter 'A' is not a grapheme
>> cluster.
>>
>> Similarly for the definition for the extended
>>
>>
>>
>
>
Thanks again