Discontiguous Collation Grapheme Clusters

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Mon, 28 May 2012 03:14:42 +0100

I'm currently reviewing the definition of the Unicode
Collation Algorithm (as opposed to just trying to comply with it),
and I came across the concept of collation grapheme clusters, defined in
UTS#18 'Unicode Regular Expressions'.

For what types of strings are they supposed to be defined? Any? NFC?
NFD? FCD? ASCII?

In the English locale (CLDR), what collation clusters does the Tibetan
script NFC & NFD string <U+0F40 KA, U+0F71 AA, U+0F7A E, U+0F74 U,
U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG> consist of? Assuming that
the variable weight setting of IgnoreSP applies, the definition given
in UTS#10 Version 6.1.0 Section 6.9.1 yields the 2 clusters <U+0F40>,
<U+0F71, U+0F7A, U+0F74, U+0F0B>. Applying the sample code given in
UTS#18 Revision 13 Annex B iteratively instead, I get the 4 clusters
<U+0F40>, <U+0F71>, <U+0F7A>, <U+0F74, U+0F0B>. The
collation look-ups contributing to the collation of the string are for
<U+0F40>, <U+0F71, U+0F74>, <U+0F7A>, <U+0F0B>.

If I apply the algorithms to the canonically equivalent <U+0F40,
U+0F71, U+0F74, U+0F7A, U+0F0B>, both definitions yield the 3 clusters
<U+0F40>, <U+0F71, U+0F74>, <U+0F7A, U+0F0B>, which, apart from TSHEG
not being in a collation cluster of its own, makes sense.
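
For this contiguous case, the grouping can be sketched in a few lines of
Python. This is only an illustration of the textual rule (a look-up whose
collation element has a primary weight starts a new cluster; anything
else is attached to the preceding cluster). The has_primary flags are
assumptions read off the description above, not values taken from DUCET
or CLDR, and group_clusters is a hypothetical helper, not the sample code
from UTS#18 Annex B. It also assumes that each look-up yields a single
collation element.

```python
# Each entry is one collation look-up: (code points matched, has_primary).
# The flags are assumptions for illustration, not real DUCET/CLDR weights.
LOOKUPS = [
    ((0x0F40,), True),          # KA
    ((0x0F71, 0x0F74), True),   # AA + U, matched together as a contraction
    ((0x0F7A,), True),          # E
    ((0x0F0B,), False),         # TSHEG: variable, assumed to lose its primary
]


def group_clusters(lookups):
    """Start a new cluster at each look-up that has a primary weight."""
    clusters = []
    for code_points, has_primary in lookups:
        if has_primary or not clusters:
            clusters.append(list(code_points))
        else:
            # No primary weight: attach to the preceding cluster.
            clusters[-1].extend(code_points)
    return clusters


clusters = group_clusters(LOOKUPS)
for cluster in clusters:
    print(" ".join(f"{cp:04X}" for cp in cluster))
```

Under those assumptions the sketch reproduces the 3 clusters given above.
It has nothing to say about the discontiguous case, where the look-ups do
not cover the string in order.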

If I apply the algorithms to the FCD string <U+0F40 KA, U+0FB2
SUBJOINED-RA, U+0F75 UU, U+0F0B> in the English locale (CLDR based on
UCA Version 6.1.0), I don't know what to expect from a *compliant*
implementation, as collation elements should be formed from U+0FB2 and
*part* of U+0F75. If I turn to the textual definition in UTS#18 ('A
collation character is the longest sequence of characters that maps to
a sequence of one or more collation elements where the first collation
element has a primary weight and subsequent elements do not, and no
completely ignorable characters are included.'), I get 3 clusters,
<U+0F40>, <U+0FB2>, <U+0F75, U+0F0B>, which is reasonable
linguistically.

If I apply the algorithms to the canonically equivalent NFC & NFD
string <U+0F40, U+0FB2, U+0F71, U+0F74, U+0F0B>, I currently get 3
collation clusters <U+0F40>, <U+0FB2, U+0F71>, <U+0F74, U+0F0B>.
However, the second cluster has two collation elements, both with
primary weights, so by the textual specification I get 3
collation clusters <U+0F40>, <U+0FB2>, <U+0F71, U+0F74, U+0F0B>, which
is reasonable linguistically, but is only a reasonable result because
the contraction <U+0FB2,U+0F71> (not yet in DUCET) is artificial.

The textual definition does not explain how to handle completely
ignorable characters and also appears to be unable to find a collation
cluster in <U+2122 TRADE MARK SIGN>, which yields two collation
elements with primary weights. Are there two clusters here, one for
the 'T' and one for the 'M'?
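
The two primary-weighted collation elements presumably come from the
compatibility decomposition of U+2122. A minimal check with Python's
standard unicodedata module (which reflects whatever Unicode version that
Python build ships, not necessarily 6.1.0) shows that the character has no
canonical decomposition but a two-character compatibility one:

```python
import unicodedata

tm = "\u2122"  # TRADE MARK SIGN

# Canonical decomposition: expected to leave the character unchanged.
canonical = unicodedata.normalize("NFD", tm)
# Compatibility decomposition: expected to give the two letters.
compat = unicodedata.normalize("NFKD", tm)

print([f"{ord(c):04X}" for c in canonical])
print([f"{ord(c):04X}" for c in compat])
```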

So, what collation clusters are these strings composed of? Does anyone
have a software implementation that yields them?

The strings were:

0F40 0F71 0F7A 0F74 0F0B
0F40 0F71 0F74 0F7A 0F0B
0F40 0FB2 0F75 0F0B
0F40 0FB2 0F71 0F74 0F0B
2122
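
The normalization claims made for these strings can be checked with a
short Python sketch using the standard unicodedata module. The helper
names (parse, is_fcd, report) are hypothetical, the FCD test is a
straightforward reading of the usual definition rather than code from any
collation library, and the results depend on the Unicode version bundled
with the Python build in use.

```python
import unicodedata

STRINGS = [
    "0F40 0F71 0F7A 0F74 0F0B",
    "0F40 0F71 0F74 0F7A 0F0B",
    "0F40 0FB2 0F75 0F0B",
    "0F40 0FB2 0F71 0F74 0F0B",
    "2122",
]


def parse(hex_string):
    """Turn a space-separated list of hex code points into a str."""
    return "".join(chr(int(h, 16)) for h in hex_string.split())


def is_fcd(text):
    """True if decomposing each character in place leaves canonical order."""
    prev_trail = 0
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(decomposed[0])
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = unicodedata.combining(decomposed[-1])
    return True


def report(hex_string):
    text = parse(hex_string)
    nfd = unicodedata.normalize("NFD", text)
    return {
        "nfc": unicodedata.normalize("NFC", text) == text,
        "nfd": nfd == text,
        "fcd": is_fcd(text),
        "nfd_form": " ".join(f"{ord(c):04X}" for c in nfd),
    }


reports = {s: report(s) for s in STRINGS}
for s, r in reports.items():
    print(s, r)
```

Comparing the nfd_form values shows which strings are canonically
equivalent: two strings are equivalent exactly when their NFD forms match.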

Another little gem is that when the Hebrew accent 'METEG' is coded
between the consonants and the vowel (as in the second word of Exodus
20:4 in the Leningrad codex), one gets one collation cluster for the
consonant, one for the METEG, one for the CGJ, and the lonely vowel is
shunted off into a collation cluster with the next vowel. (See
http://scripts.sil.org/cms/scripts/page.php?item_id=Meteg_intheBHS if
you don't have the BHS to hand.)

Bemusedly,

Richard.
Received on Sun May 27 2012 - 21:17:53 CDT
