Wild Card Collation Matches

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Mon, 2 Jun 2014 02:36:14 +0100

In a fairly wild environment
(http://www.thaivisa.com/forum/topic/730564-new-front-end-to-ri-dictionary-alpha),
I encountered the following question:

"If you search for ก* do you expect to return words such as เก่ง and
ไก่?"

Now, as a regular expression, in UTS#18 'Unicode Regular Expressions'
Version 13 (dated 2008, superseded in 2012), RL3.5 comes pretty close
to this with ranges tailored for collation. The pattern
[\u0E01-\u0E02]* would match both those words. To be precise, one
would use a search for [ก-ไก]*. RL3.5 has been withdrawn because
of difficulties, though I can't say that I see it as a major difficulty
that at least one of [A-z] and [a-Z] is empty. Even POSIX is aware of
that little issue.
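
For contrast, here is a minimal sketch of what an engine without
collation-tailored ranges does. It assumes Python's standard re module,
which orders range endpoints by code point; the helper names
starts_in_range and is_valid_range are hypothetical, introduced only for
this illustration.

```python
import re

# Python's re compares range endpoints by code point, not by collation
# order, so this is NOT the tailored range that RL3.5 describes.
KO_KAI_RANGE = re.compile("[\u0E01-\u0E02]")


def starts_in_range(word: str) -> bool:
    """True if the first code point of word lies in U+0E01..U+0E02."""
    return KO_KAI_RANGE.match(word) is not None


def is_valid_range(pattern: str) -> bool:
    """True if re accepts the pattern; a reversed range raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


KO_KAI = "\u0E01"                 # U+0E01 alone
KENG = "\u0E40\u0E01\u0E48\u0E07"  # <U+0E40, U+0E01, U+0E48, U+0E07>
KAI = "\u0E44\u0E01\u0E48"         # <U+0E44, U+0E01, U+0E48>

for word in (KO_KAI, KENG, KAI):
    print(ascii(word), starts_in_range(word))

print(is_valid_range("[A-z]"), is_valid_range("[a-Z]"))
```

In code point order both words begin with a preposed vowel outside the
range, so neither matches at its start, and [a-Z] is rejected outright as
a reversed range.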

Turning to a fully collation-based definition of searching, UTS#10
Unicode Collation Algorithm's definition DS2 comes closest to answering
the question for the UTC. DS2 reads:

DS2. The pattern string P has a match at Q[s,e] according to collation
C if C generates the same sort key for P as for Q[s,e], and the offsets
s and e meet the boundary condition B. One can also say P has a match
in Q according to C.
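
A minimal sketch of how DS2 reads as a procedure, under stated
assumptions: toy_sort_key is a hypothetical stand-in for the collation C
(it merely ignores case and U+034F), and always_ok is a placeholder for
the boundary condition B. Neither is the UCA; only the shape of the
definition is being illustrated.

```python
from typing import Callable, List, Tuple

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER


def toy_sort_key(s: str) -> Tuple[str, ...]:
    """Hypothetical collation C: case-insensitive, CGJ ignored."""
    return tuple(ch for ch in s.casefold() if ch != CGJ)


def always_ok(q: str, s: int, e: int) -> bool:
    """Placeholder boundary condition B that accepts every offset pair."""
    return True


def ds2_matches(
    p: str,
    q: str,
    sort_key: Callable[[str], Tuple[str, ...]] = toy_sort_key,
    boundary: Callable[[str, int, int], bool] = always_ok,
) -> List[Tuple[int, int]]:
    """All (s, e) where Q[s,e] has the same sort key as P and B holds."""
    target = sort_key(p)
    return [
        (s, e)
        for s in range(len(q) + 1)
        for e in range(s, len(q) + 1)
        if boundary(q, s, e) and sort_key(q[s:e]) == target
    ]


# Q is <x, A, U+034F, B, y>; only Q[1,4] has the same toy key as "ab".
print(ds2_matches("ab", "xA\u034FBy"))
```

The brute-force scan over every (s, e) pair is quadratic and is chosen
only because it mirrors the wording of the definition.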

It's a simple job to create sequences of codepoints P starting with
U+0E01 THAI CHARACTER KO KAI that are tertiary matches for เก่ง and
ไก่ under both DUCET and the CLDR collations for Thai. Can I therefore
say that the two strings match the pattern ก* according to these
collations? (A pattern P for ไก่ <U+0E44 THAI CHARACTER SARA AI
MAIMALAI, U+0E01 THAI CHARACTER KO KAI, U+0E48 THAI CHARACTER MAI EK> is
P = <U+0E01, U+034F COMBINING GRAPHEME JOINER, U+0E44, U+0E48>.)
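
The equality of sort keys can be checked with a toy model. This is a
sketch under loud assumptions, not DUCET or the CLDR Thai tailoring:
toy_thai_key is a hypothetical function that swaps a preposed vowel
(U+0E40..U+0E44) with a directly following consonant, then drops U+034F,
and puts tone marks on a second level. The pattern for เก่ง is my own
guess by analogy with the one given above for ไก่.

```python
from typing import Tuple

CGJ = "\u034F"
PREVOWELS = {chr(c) for c in range(0x0E40, 0x0E45)}   # U+0E40..U+0E44
CONSONANTS = {chr(c) for c in range(0x0E01, 0x0E2F)}  # U+0E01..U+0E2E
TONE_MARKS = {chr(c) for c in range(0x0E48, 0x0E4C)}  # U+0E48..U+0E4B

Key = Tuple[Tuple[str, ...], Tuple[str, ...]]


def toy_thai_key(s: str) -> Key:
    """Toy two-level key; an illustration only, not the UCA."""
    reordered = []
    i = 0
    while i < len(s):
        # Swap <prevowel, consonant> so the consonant is weighted first.
        if s[i] in PREVOWELS and i + 1 < len(s) and s[i + 1] in CONSONANTS:
            reordered += [s[i + 1], s[i]]
            i += 2
        else:
            reordered.append(s[i])
            i += 1
    # CGJ is dropped only after the swap, so it can still block one.
    kept = [ch for ch in reordered if ch != CGJ]
    primary = tuple(ch for ch in kept if ch not in TONE_MARKS)
    # One entry per kept element: the tone mark itself, or "" for none.
    secondary = tuple(ch if ch in TONE_MARKS else "" for ch in kept)
    return primary, secondary


KAI = "\u0E44\u0E01\u0E48"              # the word, <U+0E44, U+0E01, U+0E48>
P_KAI = "\u0E01\u034F\u0E44\u0E48"      # the pattern P given above
KENG = "\u0E40\u0E01\u0E48\u0E07"       # <U+0E40, U+0E01, U+0E48, U+0E07>
P_KENG = "\u0E01\u0E40\u0E48\u0E07"     # assumed analogous pattern

print(toy_thai_key(P_KAI) == toy_thai_key(KAI))
print(toy_thai_key(P_KENG) == toy_thai_key(KENG))
```

Both patterns start with U+0E01 and yield the same toy key as the word
they target, while U+0E01 on its own does not.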

Disturbingly, another possible answer is that there is no match for
<U+0E01 THAI CHARACTER KO KAI> in either string because it only occurs
in the legacy/extended grapheme cluster <U+0E01, U+0E48>.
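
That reading can be sketched as a boundary condition B. The code below
is a deliberate simplification of grapheme cluster segmentation,
assuming only that a cluster is one non-mark code point plus any
following marks (general category M*); toy_cluster_boundaries and
on_boundaries are hypothetical names, and the real UAX #29 rules have
many more cases.

```python
import unicodedata
from typing import Set


def toy_cluster_boundaries(q: str) -> Set[int]:
    """Offsets at which a simplified cluster (base + marks) starts or ends."""
    bounds = {0, len(q)}
    for i in range(1, len(q)):
        # A boundary precedes every code point that is not a mark.
        if not unicodedata.category(q[i]).startswith("M"):
            bounds.add(i)
    return bounds


def on_boundaries(q: str, s: int, e: int) -> bool:
    """Boundary condition B: both offsets fall on toy cluster boundaries."""
    bounds = toy_cluster_boundaries(q)
    return s in bounds and e in bounds


KAI = "\u0E44\u0E01\u0E48"  # <U+0E44, U+0E01, U+0E48>

# U+0E01 alone is Q[1,2]; offset 2 splits the cluster <U+0E01, U+0E48>.
print(on_boundaries(KAI, 1, 2))
# The whole cluster is Q[1,3], which does sit on boundaries.
print(on_boundaries(KAI, 1, 3))
```

Under this condition the bare U+0E01 at Q[1,2] is rejected, because the
following U+0E48 belongs to the same cluster.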

Richard.

Received on Sun Jun 01 2014 - 20:37:50 CDT
