Re: Regular Expressions and Canonical Equivalence from Doug Ewell on 2015-05-14 (Unicode Mail List Archive)

From: Doug Ewell <doug_at_ewellic.org>
Date: Thu, 14 May 2015 07:08:14 -0700

Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:

> For example, I believe that one should be able to find
> [...]
> the Vietnamese letter ô U+00F4 LATIN SMALL LETTER O WITH
> CIRCUMFLEX in the word _buộc_ 'to bind' <U+0062, U+0075, U+1ED9 LATIN
> SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW, U+0063>. As far as I
> can tell, U+1ED9 is not a letter of the Vietnamese alphabet; it is the
> combination <U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX, U+0323
> COMBINING DOT BELOW> of Vietnamese letter and tone mark.

What you're looking for in this case is neither an NFC match nor an NFD
match, but a language-dependent match, as you imply further down. <1ED9>
canonically decomposes to <006F 0323 0302>, whereas <00F4> decomposes to
<006F 0302>. Because U+0323 sits between the base letter and the
circumflex, a match with <00F4> requires your regex engine to reorder
the marks. It sounds unlikely that you'll find such an engine, but there is
a lot of Vietnamese-language–specific software out there, so you never
know.
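
To make the decompositions concrete, here is a minimal sketch in Python
using only the standard unicodedata module. The helper names
(nfd_codepoints, contains_allowing_reorder) are hypothetical and made up
for illustration; this is not how any particular regex engine works,
just one way to show why a plain NFD substring search misses <00F4>
inside <1ED9> and what "reordering the marks" could look like.

```python
import unicodedata


def nfd_codepoints(s):
    """Return the NFD form of s as a list of 4-digit hex code points."""
    return [f"{ord(c):04X}" for c in unicodedata.normalize("NFD", s)]


def contains_allowing_reorder(text, target):
    """Hypothetical helper: search NFD(text) for NFD(target), allowing
    combining marks in the text to be skipped when canonical reordering
    could move them out of the way (different non-zero combining class)."""
    t = unicodedata.normalize("NFD", text)
    p = unicodedata.normalize("NFD", target)
    for start in range(len(t)):
        i, j = start, 0
        skipped = set()  # combining classes of marks skipped so far
        while i < len(t) and j < len(p):
            ccc = unicodedata.combining(t[i])
            # After skipping a mark, only a mark of a different non-zero
            # class can be matched, since only those may be reordered.
            can_match = not skipped or (ccc != 0 and ccc not in skipped)
            if t[i] == p[j] and can_match:
                i += 1
                j += 1
            elif j > 0 and ccc != 0:
                skipped.add(ccc)  # interleaved mark, e.g. U+0323
                i += 1
            else:
                break
        if j == len(p):
            return True
    return False


word = "bu\u1ED9c"      # buộc
letter = "\u00F4"       # ô

print(nfd_codepoints("\u1ED9"))   # ['006F', '0323', '0302']
print(nfd_codepoints(letter))     # ['006F', '0302']

# A plain substring search on the NFD forms fails ...
plain = unicodedata.normalize("NFD", letter) in unicodedata.normalize("NFD", word)
print(plain)                                      # False
# ... while the mark-skipping search succeeds.
print(contains_allowing_reorder(word, letter))    # True
```

The sketch relies on U+0323 (combining class 220) and U+0302 (class 230)
having different classes, which is what lets the dot below be treated as
movable past the circumflex.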

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

Received on Thu May 14 2015 - 09:10:02 CDT
