From: Richard Wordingham <richard.wordingham_at_ntlworld.com>

Date: Tue, 20 May 2014 09:04:10 +0100

Date: Tue, 20 May 2014 09:04:10 +0100

On Mon, 19 May 2014 00:06:01 +0100

Richard Wordingham <richard.wordingham_at_ntlworld.com> wrote:

*> I have done a little mathematical work on regular expressions and
*

*> canonical equivalence, but these were true regular expressions, i.e.
*

*> recognisable by finite automata.
*

*> I came to the disappointing conclusion that
*

*>
*

*> (\p{name=COMBINING DOT BELOW}\p{name=COMBINING ACUTE ACCENT})*
*

*>
*

*> was not a true regular expression, at least, not in any sense that
*

*> allows the expression to denote an infinite set of strings. One could
*

*> define * to be restricted to character-wise concatenations that
*

*> yielded NFD strings, but this is potentially very confusing. It
*

*> might be argued to be in line with RL2.1 in UTS #18.
*

It has been observed off-list that a literal seach of an NFD string

will obviously not match \u0323\u0301\u0323\u0301. To eliminate any

confusion induced by using the example of searching, I will now talk

about the 'regular expression' x(\u0323\u0301)*y. (I did my with the

conventional task of recognising patterns.) My aim in the mathematical

work was to find a match if the pattern matches anything canonically

equivalent to the searched string. I accept that the matched substring

will in general be discontiguous, and that if the searched string is

not in NFD the substring will contain only parts of some characters.

So, when searching an NFD string for what I thought of as

'x(\u0323\u0301)*y', that is a search for any of 'xy',

'x\u0323\u0301y', 'x\u0323\u0323\u0301\u0301y',

'x\u0323\u0323\u0323\u0301\u0301\u0301y' and so on. Those who have

studied the mathematical theory of regular expressions should recall

that there is no *finite* automaton that will implement this search.

Therefore, by definition, the pattern is not a true regular

expression. That is how I came to my 'disappointing conclusion' above.

I think this limitation is not actually a problem, though it would be

something to bear in mind. Length marks in Tibetan vowels and nuktas

on Thai vowels looked the likeliest problem areas, but there seem not

to be problems.

Incidentally, although I have spoken of searching an NFD string, if one

has a finite automaton to do that, one can (given enough time and

memory) create another finite automaton that will do the same job on

unnormalised strings. The trick to the *proof* is to remember the

latest batch of non-starters in the unrearranged decomposition of the

search string, grouped by canonical combining class. (Equivalence

classes can be used to keep the numbers finite.) Whether this is

useful may be another matter!

Now, this process is very similar to converting the expansions of the

pattern string to NFD and then searching for them. However, that

conversion is itself a complicated business, similar to compiling a

regular expression. One difference is that I would like a search for the

Vietnamese vowel U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX

(decomposes to 0061 0302) to match the same vowel with a tone mark for

the nặng tone, U+1EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT

BELOW (decomposes to 0061 0323 0302).

As I asked before, is anyone using a regular expression system where the

potentially complicated pattern string does not imply NFD strings but

the search respects canonical equivalence? There are systems (e.g

HarfBuzz) that resort to changing the canonical combining classes to

make things work.

Richard.

_______________________________________________

Unicode mailing list

Unicode_at_unicode.org

http://unicode.org/mailman/listinfo/unicode

Received on Tue May 20 2014 - 03:05:25 CDT

*
This archive was generated by hypermail 2.2.0
: Tue May 20 2014 - 03:05:25 CDT
*