Re: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues from Mark Davis ☕️ via Unicode on 2017-12-09 (Unicode Mail List Archive)

From: Mark Davis ☕️ via Unicode <unicode_at_unicode.org>
Date: Sat, 9 Dec 2017 16:16:44 +0100

1. You make a good point about GB9c. It should probably instead be
something like:

GB9c: (Virama | ZWJ ) × Extend* LinkingConsonant
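
A minimal sketch of the difference, under stated assumptions: the character
classes below are hypothetical stand-ins (Devanagari only, with nukta as the
sole Extend member), not the real UCD property values, and the function names
are made up for illustration. It reads the intent of the revised rule as "no
break before a linking consonant that is preceded by Virama or ZWJ plus any
number of Extend characters", which is one interpretation rather than a
reference implementation of UAX #29.

```python
# Hypothetical, heavily simplified character classes (Devanagari only).
ZWJ = "\u200D"
VIRAMA = {"\u094D"}   # DEVANAGARI SIGN VIRAMA
EXTEND = {"\u093C"}   # DEVANAGARI SIGN NUKTA, standing in for Extend


def is_linking_consonant(ch):
    # Approximation: the Devanagari consonant letters KA..HA.
    return "\u0915" <= ch <= "\u0939"


def joins_original(text, i):
    """Original GB9c: (Virama | ZWJ) x LinkingConsonant, checked before text[i]."""
    if i == 0 or not is_linking_consonant(text[i]):
        return False
    prev = text[i - 1]
    return prev in VIRAMA or prev == ZWJ


def joins_revised(text, i):
    """Revised GB9c as read here: skip Extend* between the linker and text[i]."""
    if i == 0 or not is_linking_consonant(text[i]):
        return False
    j = i - 1
    while j >= 0 and text[j] in EXTEND:
        j -= 1
    return j >= 0 and (text[j] in VIRAMA or text[j] == ZWJ)


nukta_first = "\u0915\u093C\u094D\u0915"   # <consonant, nukta, virama, consonant>
virama_first = "\u0915\u094D\u093C\u0915"  # <consonant, virama, nukta, consonant>

for label, s in (("nukta first", nukta_first), ("virama first", virama_first)):
    print(label, joins_original(s, 3), joins_revised(s, 3))
```

With these toy classes, the original rule joins only the nukta-first order,
while the revised reading joins both canonically equivalent orders.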

Extend is broader than necessary, and there are a few items that have
ccc!=0 but not gcb=extend. But all of those look to be degenerate cases.

https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\p{ccc!=0}-\p{gcb=extend}]&g=ccc+indicsyllabiccategory
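
For context on why the Extend* matters, the reordering described in the quoted
message can be reproduced with Python's standard unicodedata module. This is
only an illustration using Devanagari KA, nukta and virama; the variable and
function names are arbitrary, and the result reflects whatever Unicode version
the Python build ships with.

```python
import unicodedata

KA, NUKTA, VIRAMA = "\u0915", "\u093C", "\u094D"


def ccc(ch):
    # Canonical combining class of a single character.
    return unicodedata.combining(ch)


# Typed as <consonant, virama, nukta, consonant>.
typed = KA + VIRAMA + NUKTA + KA

# Canonical reordering sorts adjacent marks by ccc, so nukta (7)
# moves in front of virama (9).
normalized = unicodedata.normalize("NFD", typed)

print(ccc(NUKTA), ccc(VIRAMA))
print([f"U+{ord(c):04X}" for c in typed])
print([f"U+{ord(c):04X}" for c in normalized])
```

The two strings are canonically equivalent, so a segmentation rule that
depends on which of the two marks is adjacent to the following consonant will
give different boundaries before and after normalisation.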

Mark <https://twitter.com/mark_e_davis>

On Fri, Dec 8, 2017 at 11:06 PM, Richard Wordingham via Unicode <
unicode_at_unicode.org> wrote:

> Apart from the likely but unmandated consequence of making editing
> Indic text more difficult (possibly contrary to the UK's Equality Act
> 2010), there is another difficulty that will follow directly from the
> currently proposed expansion of grapheme clusters
> (https://www.unicode.org/reports/tr29/proposed.html).
>
> Unless I am missing something, text boundaries have hitherto been
> cunningly crafted so that they are not changed by normalisation.
> Have I missed something, or has there been a change in policy?
>
> For extended grapheme clusters, the relevant rules are proposed as:
>
> GB9: × (Extend | ZWJ | Virama)
>
> GB9c: (Virama | ZWJ ) × LinkingConsonant
>
> Most of the Indian scripts have both nukta (ccc=7) and virama (ccc=9).
> This would lead canonically equivalent text to have strikingly
> different divisions:
>
> <consonant, nukta, virama, consonant> (no break)
>
> but
>
> <consonant, virama, nukta | consonant>
>
> There are other variations on this theme. In Tai Tham, we have the
> following conflict:
>
> natural order, no break:
>
> <consonant, non-spacing-vowel, tone-mark, sakot, consonant>
>
> but normalised, there would be a break:
>
> <consonant, non-spacing-vowel, sakot, tone-mark | consonant>
>
> From reading the text, it seems that it is expected that the presence
> or absence of a break should be fine-tuned by CLDR language-specific
> rules. How is this expected to work, e.g. for Saurashtra in Tamil
> script? (There's no Saurashtra data in Version 32 of CLDR.) Would the
> root locale now specify the default segmentation rule, rather than
> UAX#29 plus the Unicode Character Database?
>
> Richard.
>
>
Received on Sat Dec 09 2017 - 09:16:44 CST
