From: Philippe Verdy via Unicode <unicode_at_unicode.org>

Date: Sun, 28 Jan 2018 20:29:28 +0100

Date: Sun, 28 Jan 2018 20:29:28 +0100

2018-01-28 5:12 GMT+01:00 Richard Wordingham via Unicode <

unicode_at_unicode.org>:

*> On Sat, 27 Jan 2018 14:13:40 -0800The theory
*

*> of regular expressions (though you may not think that mathematical
*

*> regular expressions matter) extends to trace monoids, with the
*

*> disturbing exception that the Kleene star of a regular language is not
*

*> necessarily regular. (The prototypical example is sequences (xy)^n
*

*> where x and y are distinct and commute, i.e. xy and yx are canonically
*

*> equivalent in Unicode terms. A Unicode example is the set of strings
*

*> composed only of U+0F73 TIBETAN VOWEL SIGN II - there is no FSM that
*

*> will recognise canonically equivalent strings).
*

*>
*

I don't see why you can't write this as the regular expression:

(x | y)*

For the Unicode canonical equivalences, this applies to distinct characters

that have distinct non-zero combining classes.

But of course searching for

<LATIN SMALL LETTER A WITH CIRCUMFLEX, COMBINING DOT BELOW> or

<LATIN SMALL LETTER A WITH DOT BELOW, COMBINING CIRCUMFLEX >

requires transforming it to NFD first as:

<LATIN SMALL LETTER A, COMBINING DOT BELOW, COMBINING CIRCUMFLEX>

so thet the regexp transforms to:

<LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *

( <COMBINING CIRCUMFLEX> * [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]]

]] * <COMBINING DOT BELOW>

| <COMBINING DOT BELOW> * [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]]

]] * < COMBINING CIRCUMFLEX>

Note that the "complex" set of characters used three times above is finite,

it contains all combining characters of Unicode that have a non-zero

combining class except above and below, i.e. all Unicode characters whose

combining class is not 0, 220 (below) or 230 (above).

However, It is too simplified, because the allowed combining classes must

occur at most once in each possible non-zero combining class and not

arbitrary numbers of them: these allowed combining classs currently are in

{1, 7..36, 84, 91, 103, 107, 118, 122, 129, 130, 132, 202, 214, 216, 218,

222, 224, 226, 228, 232..234, 240} whose most member elements are used for

very few combining characters (the above and below combining classes are

the most populated ones but we exclude them, all the others have 1 to 9

combining characters assigned to them, or 22 characters with cc=7 (nukta),

or 32 characters with cc=1 (overlay), or 47 characters with cc=9 (virama).

Once again we can refine them also as a regexp, but this is combinatorial

because we have 52 combining classes (so we would need to consider the 52!

(factorial) alternates). But given the maximum length of what this can

match (0 to 52 combining characters: yes it is finite), this is best done

by not rewriting this as a regexp but by replacing the final "*" by {1,52},

and then just check each returned match of

[[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]{0,52}

with a simple scan of these short strings to see that they all have

distinct combining classes (this just requires 52 booleans, easily stored

in a single 64 bit integer initialized to 0 prior to scan the scan of these

small strings). But the theory does not prevent writing it as a regexp

(even if it would be extremely long). So a Kleene Star closure is possible

and can be efficiently implemented (all depends on the way you represent

the "current state" in the FSM: a single integer representing a single node

number in the traversal graph is not the best way to do that.

This is a valid regexp, the finite state machine DOES have a finite

lookahead (the full regexp above will match AT MOST 107 characters

including the combining marks, where 107=3+2*52), but this is general to

regexps that generally cannot be transformed directly into a FSM with

finite lookahead, but a FSM is possible: the regexp first transforms into a

simple graph of transitions with a finite number of node (this number is

bound to the length of the regexp itself) where there can be multiple

states active simultaneously; then a basic transform converts this small

graph by transforming nodes into new nodes representing the finite set of

the combinations of active states in the first graph :

There will be many more nodes, and generally this explodes in size because

the transform is combinatorial, and such size explosion has worst

perfomance (explosion of the memory needed to representing the new graph

with a single state active). So regexp engines use the alternative by

representing the current state of traversal of the first simple graph using

a stack of active states and transiting them all separately (this can be

implemented with a "bitset" whose size in bits is the number of states in

the first simple graph, or by using an associative array (dictionnary of

boolean properties whose keys are state numbers in the first graph, which

can be set or removed: this requires much less memory and it is relatively

fast, even if the full current state is not just a single small integer.

Received on Sun Jan 28 2018 - 13:30:24 CST

*
This archive was generated by hypermail 2.2.0
: Sun Jan 28 2018 - 13:30:25 CST
*