From: Philippe Verdy <verdy_p_at_wanadoo.fr>

Date: Thu, 14 May 2015 12:58:29 +0200

Date: Thu, 14 May 2015 12:58:29 +0200

2015-05-14 9:59 GMT+02:00 Richard Wordingham <

richard.wordingham_at_ntlworld.com>:

*> An elegant formal solution to the Kleene star problem interprets
*

*> (\u0323\u0302)* as (\u0323|\u0302)*. However, that is
*

*> counter-intuitive
*

Yes it is problematic: (ab)* is not the same as (a|b)* as this requires

matching pairs of letters "ab" in that order in the first expression, but

random strings of "a" and "b" i nthe second one (so the second matches

*more* input samples.

Even if you consider canonical equivalences (where the relative order of

"ab" does not matter for example because they have distinct non-zero

canonical) this does not mean that "a" alone will match in the first

expression "(ab*)", even though it MUST match in "(a|b)*".

So the solution is just elegant to simplify the first level of analysis of

"(ab)*" by using "(a|b)*" instead. But then you need to perform a second

pass on the match to make sure it is containing only complete sequences

"ab" in that order (or any other order if they are all combining with a

non-zero combining class) and no unpaired "a" or "b".

Such transform using two passes should only be made when subregexps within

a "(...)*" contain only alternatives (converted to NFD) such then each of

them contains ONLY combining characters with distinct non-zero combining

classes. If one of the alternatives "ab" contains any character with

combining class 0 or if they have blockers with identical non-zero

combining classes, we cannot use this transform.

But this transform using two passes is stil elegant: the alternatives where

we can use it and that requires a second pass have a bounded length (it's

impossible for them to be longer than 255 codepoints given there cannot be

more than 255 *distinct* non-zero combining classes. But even in this case,

the current UCD currently uses a much lower number of non-zero combining

classes, so this limit is even lower: the substrings where this transform

is possible will be extremely short and a second pass on them will be

extremely fast (using very small string buffers that can stay in memory).

For your example "(\u0323\u0302)*" the characters in the alternatives

(COMBINING DOT BELOW and COMBINING ACUTE ACCENT), once converted to NFD

(which is the same here) are just using at most two distinct non-zero

combining classes and no blocker; so it is safe to trasform it to

(\u0323|\u0302)* for a first pass matching that will then only check

candidate matchs in the second pass. or more efficiently, a second finite

state automata (FSA) running in parallel with its own state:

in your example this second FSA just has 2 states: the initial state 0

which is also the final/accept state, and state 1 after matching one

character of the pair. When you reach the point where matching (\u0323 |

\u0302)* with the first level of analysis would terminate, you just need to

check the state of the 2nd FSA to see if it is also in the

initial/final/accept state 0 (otherwise this is not an valid accept state

for the untransformed (\u0323\u0302)* regexp.

However, the most difficult part for regexps supporting canonical

equivalence is about what we can do to return submatches! they are not

necessarily contiguous in the input stream. You can still return a matching

substring but if you use it for performing search/replace operations, it

becomes difficult to know where to place the replacement, when that

replacement string (even if it was converted first to NFD) may also contain

combining characters. Or even worse if the replacement contains some

blockers that will be inserted in the middle of the non-replaced text (wnad

where can we safely place the remaining characters in the middle of the

match but that are not part of the match itself ???

One solution is to not exclude these characters in the middle of a match

and return them too. It's up to the replacement function to check their

existence:

The regexp engine can just provide an additional index of characters in the

returned matched substring, that are in fact not part of the actual match

but present in the middle, instead of just the substring for the match.

Or it can return just the exact matching substring, but also an index array

containing their relative position in the actual input string (in standard

matches those indexes would be the sequence of intergers 0 to N-1 where N

is the length of the matched substring, but if the sequence is

discontinusous in the input, the sequence will still be growing with some

steps higher than 1, leaving some holes (the last index in that sequence

will be equal to or higher than N).

Received on Thu May 14 2015 - 06:00:03 CDT

*
This archive was generated by hypermail 2.2.0
: Thu May 14 2015 - 06:00:04 CDT
*