Re: ? Wrong definitions for combining character sequence in tr 29

From: karl williamson (public@khwilliamson.com)
Date: Tue Nov 24 2009 - 16:48:34 CST

    Thanks for your reply. I'm afraid I'm still confused.

    The sentence before Table 1b is the first mention in this document of
    combining character sequences; it would be nice if it explained what
    they are and why they are mentioned at all. In the past, I just
    presumed they were an earlier concept that was superseded by grapheme
    clusters.

    They are discussed some in 3.6 of the actual standard, and here there
    seem to me to be contradictions:

    "• A grapheme cluster is similar, but not identical to a combining
    character sequence. A combining character sequence starts with a base
    character and extends across any subsequent sequence of combining marks,
    nonspacing or spacing. A combining character sequence is most directly
    relevant to processing issues related to normalization, comparison, and
    searching.
    • A grapheme cluster starts with a grapheme base and extends across any
    subsequent sequence of nonspacing marks. A grapheme cluster is most
    directly relevant to text rendering and such processes as cursor
    placement and text selection in editing."

    This seems to me to imply that a base character is always the first item
    of a combining character sequence, and the word 'any' seems to me to
    imply 0 or more marks following it. The definition earlier in the
    section, however, does give the definition in the table we're
    discussing. I do see why a mark in isolation could be coerced into
    being considered as a base character in that context.

    And this doesn't help me understand why there is the concept of a
    combining character sequence and why that is more relevant than a
    grapheme cluster to normalization, comparison, and searching.
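
    For anyone following along, here is a rough Python sketch of the two
    definitions under discussion: the Table 1b pattern quoted at the bottom
    of this message, base? ( Mark | ZWJ | ZWNJ )+, and the legacy grapheme
    cluster pattern from the reply below. The function names are made up
    for illustration, the character classes are approximated from
    General_Category rather than taken from the real property data, and
    Hangul syllable sequences are left out entirely, so treat it as an
    illustration of my reading, not as an implementation of tr29.

```python
import unicodedata

ZWJ, ZWNJ = "\u200d", "\u200c"


def is_mark(ch):
    # Assumption: Mark = General_Category Mn, Mc, or Me.
    return unicodedata.category(ch).startswith("M")


def is_base(ch):
    # Assumption: base = graphic character that is not a combining mark
    # (letters, numbers, punctuation, symbols, space separators).
    cat = unicodedata.category(ch)
    return cat[0] in "LNPS" or cat == "Zs"


def is_control(ch):
    # Rough stand-in for the Control class; ZWJ/ZWNJ are treated as extenders.
    if ch in (ZWJ, ZWNJ):
        return False
    return unicodedata.category(ch) in ("Cc", "Cf", "Zl", "Zp")


def is_grapheme_extend(ch):
    # Rough stand-in for Grapheme_Extend: nonspacing/enclosing marks, ZWJ, ZWNJ.
    return unicodedata.category(ch) in ("Mn", "Me") or ch in (ZWJ, ZWNJ)


def is_combining_character_sequence(s):
    # base? ( Mark | ZWJ | ZWNJ )+
    rest = s[1:] if s and is_base(s[0]) else s
    return len(rest) > 0 and all(is_mark(c) or c in (ZWJ, ZWNJ) for c in rest)


def is_legacy_grapheme_cluster(s):
    # ( CRLF | ( Hangul-syllable | !Control ) Grapheme_Extend* | . )
    # Hangul syllable sequences are deliberately not handled here.
    if not s:
        return False
    if s == "\r\n" or len(s) == 1:
        return True  # CRLF, or any single character (the "." branch)
    return not is_control(s[0]) and all(is_grapheme_extend(c) for c in s[1:])


for text in ("A", "A\u0301", "\u0301", "AB"):
    print(repr(text),
          is_combining_character_sequence(text),
          is_legacy_grapheme_cluster(text))
```

    Under these assumptions a lone 'A' fails the first pattern (the +
    demands at least one mark) but passes the second, while 'A' followed
    by U+0301 passes both, and U+0301 in isolation also passes both.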

    verdy_p wrote:
    > The definition is correct, and explained in the table which says "A single base character is **not** a combining
    > character sequence."
    >
    > The table makes distinctions between the four cases, defined without overlaps, that can make (when joined
    > **together** in a union) a single grapheme cluster.

    I don't understand your statement above.

    >
    > Your conclusion is wrong, because a single letter 'A' is defined as a "legacy grapheme cluster" and a "legacy
    > grapheme cluster" ***is*** a grapheme cluster:
    >
    > ( CRLF
    > | ( Hangul-syllable | !Control )
    > Grapheme_Extend*
    > | . )
    >
    > because it matches "!Control". The same row in the table says that "A single base character is a grapheme cluster".
    >
    > And this is also stated earlier, in section 3, just below Table 1a:
    > "A legacy grapheme cluster is defined as a base (such as A or カ) followed by zero or more continuing characters."
    >
    > The "legacy grapheme clusters" are the simplest and most common forms of grapheme clusters, recognized in almost all
    > applications. Don't interpret "legacy" as meaning "included just for compatibility" or "still supported but
    > not recommended"; it just means the most restrictive definition, used in most legacy applications that don't recognize
    > the other forms.

    Earlier in tr29 it says, "The extended grapheme cluster boundaries are
    recommended for general processing, while the legacy grapheme cluster
    boundaries are maintained for backwards compatibility with earlier
    versions of this specification."
    >
    > The same can be said about the extended grapheme clusters that **are** also grapheme clusters.
    >
    > Philippe.
    >
    >> Message of 24/11/09 03:05
    >> From: "karl williamson"
    >
    >> To: "unicode@unicode.org"
    >> Cc:
    >> Subject: ? Wrong definitions for combining character sequence in tr 29
    >>
    >>
    >> It is defined as
    >> base? ( Mark | ZWJ | ZWNJ )+
    >>
    >> That means that a mark is required. So the letter 'A' is not a grapheme
    >> cluster.
    >>
    >> Similarly for the definition for the extended
    >>
    >>
    >>
    >
    >
    Thanks again


