Re: Proposed Updates to Unicode Standard Annexes for Unicode 6.1

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Tue, 26 Jul 2011 06:28:12 +0200

2011/7/26 <announcements_at_unicode.org> wrote:
> The proposed update documents for some Unicode Standard Annexes have been
> updated. These updates include:
>
> UAX #29: Updated the discussion of legacy grapheme clusters for Thai. Moved
> the section on Hangul syllable boundary determination to a new section in
> this UAX, from Chapter 3 of the Core Specification. Made other small
> editorial fixes.

It looks like this almost reverses the recommendation in favor of
extended grapheme clusters, just for the needs of Thai, Lao and
Tai Viet (which are the only scripts encoded with a logical order
exception).

But unfortunately, "legacy grapheme clusters" do not extend to
most "SpacingMark" characters.

Maybe it would be more convenient to split the "SpacingMark" category
in two parts:

- (1) create an "Append" category for the listed Thai and Lao appended vowels:
U+0E30 ( ะ ) THAI CHARACTER SARA A
U+0E32 ( า ) THAI CHARACTER SARA AA
U+0E33 ( ำ ) THAI CHARACTER SARA AM
U+0E45 ( ๅ ) THAI CHARACTER LAKKHANGYAO
U+0EB0 ( ະ ) LAO VOWEL SIGN A
U+0EB2 ( າ ) LAO VOWEL SIGN AA
U+0EB3 ( ຳ ) LAO VOWEL SIGN AM

- (2) Exclude the "Append" category from the definition of
"SpacingMark" (remove the list above), which leaves:
Grapheme_Cluster_Break ≠ Extend, and
Grapheme_Cluster_Break ≠ Append, and
General_Category = Spacing Mark
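
As a rough illustration of the split proposed above, here is a minimal
Python sketch. It is not part of UAX #29: the names APPEND, is_extend
and break_class are hypothetical, and Grapheme_Cluster_Break = Extend
is only approximated by General_Category Mn/Me, because the Python
standard library does not expose that property.

```python
import unicodedata

# The seven code points listed above for the proposed "Append" class.
APPEND = frozenset("\u0E30\u0E32\u0E33\u0E45\u0EB0\u0EB2\u0EB3")

def is_extend(ch):
    # Assumption: approximates Grapheme_Cluster_Break = Extend with
    # General_Category Mn/Me; the real property covers more characters.
    return unicodedata.category(ch) in ("Mn", "Me")

def break_class(ch):
    """Return a (hypothetical) break class name for one character."""
    if ch in APPEND:
        return "Append"
    if is_extend(ch):
        return "Extend"
    # Reduced SpacingMark: General_Category = Mc, not Extend, not Append.
    if unicodedata.category(ch) == "Mc":
        return "SpacingMark"
    return "Other"
```

With this sketch, U+0E32 THAI CHARACTER SARA AA falls in "Append",
while U+093F DEVANAGARI VOWEL SIGN I stays in "SpacingMark".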

Then deprecate both the "legacy grapheme cluster boundaries" and the
"extended grapheme cluster boundaries", and create an intermediate
set, the "default grapheme cluster boundaries".

The "default grapheme clusters" will extend the "legacy grapheme
clusters" only to the reduced "SpacingMark" category (but not to the
existing "Prepend" category or the new "Append" category):

default_grapheme_cluster ::=
  ( CRLF
  | ( Hangul-syllable | !Control )
    ( Grapheme_Extend | Spacing_Mark)*
  | . )

For compatibility, the existing "extended grapheme clusters" (not
recommended) will be redefined as the new "default grapheme clusters",
extended to also include the existing "Prepend" category and the new
"Append" category. This won't change the boundaries they generate:

extended_grapheme_cluster ::=
  ( CRLF
  | Prepend* ( Hangul-syllable | !Control )
    ( Grapheme_Extend | Spacing_Mark | Append)*
  | . )
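
The two grammars above can be sketched as a single segmenter with a
switch. This is only an illustration under stated assumptions, not a
conformant UAX #29 implementation: the names are hypothetical, Hangul
jamo sequences are not handled, Control and Extend are approximated
from General_Category, and the "Prepend" set must be supplied by the
caller because its membership is not listed in this message.

```python
import unicodedata

# Proposed "Append" members, copied from the list in the message.
APPEND = frozenset("\u0E30\u0E32\u0E33\u0E45\u0EB0\u0EB2\u0EB3")

def is_control(ch):
    # Assumption: rough stand-in for Control, CR and LF.
    return unicodedata.category(ch) in ("Cc", "Cf", "Zl", "Zp")

def is_extend(ch):
    # Assumption: approximates Extend with nonspacing/enclosing marks.
    return unicodedata.category(ch) in ("Mn", "Me")

def is_spacing_mark(ch):
    # Reduced SpacingMark: Mc, and neither Extend nor Append.
    return (unicodedata.category(ch) == "Mc"
            and not is_extend(ch) and ch not in APPEND)

def clusters(text, extended=False, prepend=frozenset()):
    """Split text into default clusters, or extended ones if requested."""
    out = []
    i, n = 0, len(text)
    while i < n:
        start = i
        if text.startswith("\r\n", i):
            i += 2  # CRLF is a single cluster
        else:
            if extended:
                # Prepend*: skip leading Prepend characters.
                j = i
                while j < n and text[j] in prepend:
                    j += 1
                if j == n or is_control(text[j]):
                    # No base follows: the last Prepend acts as the base.
                    j = max(i, j - 1)
                i = j
            if is_control(text[i]):
                i += 1  # the '.' alternative: one character alone
            else:
                i += 1  # the base character
                while i < n and (is_extend(text[i])
                                 or is_spacing_mark(text[i])
                                 or (extended and text[i] in APPEND)):
                    i += 1
        out.append(text[start:i])
    return out
```

For example, clusters("\u0E01\u0E32") keeps THAI CHARACTER KO KAI and
SARA AA as two clusters, whereas clusters("\u0E01\u0E32",
extended=True) returns them as one. A Devanagari consonant followed by
U+093F stays in a single cluster in both modes.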

This way, Thai, Lao and Tai Viet will be correctly handled in the
preferred way using the new "default grapheme cluster boundaries",
which can also be recommended for all other scripts.

-- Philippe.
Received on Mon Jul 25 2011 - 23:33:18 CDT
