Re: Regexes, Canonical Equivalence and Backtracking of Input from Philippe Verdy on 2015-05-18 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Mon, 18 May 2015 22:56:47 +0200

Isn't it possible for your basic substitution to transform \uf073 into a
character class [\uf071\uf072\uf073] that the regexp considers as a single
entity to check ?
In that case, backtracking for matching \u0F73*\u0F72 is simpler:
[\uF071\uF072\uF073]*\u0F72, as it just requires backtracking only one
character class (instead of one character).

It is also posible also to transform \u0F73*\u0F72 into the really
equivalent: (\u0F71\0F72)*\u0F72 | (\0F72\u0F71)*\u0F72 | (\0F73)*\u0F72
(assuming that in the non-capturing group you are already performing
canonical reorderings using counters (as many counters as there are
distinct ccc values in these groups, excluding blockers that create groups
always matched separately without any need to use backtrack "through" them:
if this does not match as at a blocking position, there's no other
alternative possible, so this is a definitive non-match)

2015-05-18 22:32 GMT+02:00 Richard Wordingham <
richard.wordingham_at_ntlworld.com>:

> On Mon, 18 May 2015 21:05:49 +0200
> Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
>
> > 2015-05-18 20:35 GMT+02:00 Richard Wordingham <
> > richard.wordingham_at_ntlworld.com>:
> >
> > > The algorithm itself should be tractable - Mark Davis has published
> > > an algorithm to generate all strings canonically equivalent to a
> > > Unicode string, and what we need might not be so complex.
> >
> >
> > Even this algorithm from Mark Davis will fail in this case:
>
> How so? The regexp is \u0F73*, which is converted to a non-capturing
> (\u0F71\u0F72)*.
>
> Given a string 0F40 0F71 0F73 0F42 representing the trace, matching
> will fail at 0F40 and an attempt will be made starting at the 0F71.
> The input string handling part will then present a run of three
> non-starters:
>
> \u0F71 \u0F71 \u0F72
>
> I think the process is even simpler than I first thought.
>
> The engine will look for a match for \u0F71, and take it from this
> list, leaving \u0F71 \u0F72.
>
> It will then look for a match for \u0F72, and take it form the list,
> leaving \u0F71.
>
> It will then look for a match for \u0F71, and take it from the list.
>
> It will then look for a match for \u0F72. It will fail, and then back
> track, disgorging the \0F71.
>
> The input 'stream' now looks like \u0F71 \u0F42. This will match
> nothing; it is after the matching substream.
>
> The matching substring is:
>
> None of 0F40, all of 0F71, the second part of 0F72 and none of 0F42.
>
> Its value, as a trace, is 0F71 0F72.
>
> > - You can use it easily to transform a regexp containing (\u0F73)
> > into a regexp containing (\u0F73|\u0F71\u0F72|\u0F71\u0F72)
>
> That is *not* what I am suggesting. The regex needs decomposing, but
> no other transformations. It is the string representing the input
> trace that is expanded.
>
> > - But this leaves the same problem for unbounded repetititions with
> > the "+" or "*" or "{m,}" operators.
>
> Not at all - that is the beauty of the scheme. On the regex
> side, \u0F73* is as straight forward as non-capturing (\u0061\u0062)*.
> Putting back the unused fragments of the run of non-starters in the
> input trace is the most difficult part.
>
> > Now all the problem is how to do the backtracking,
>
> Yes, that may be more difficult than I thought. Comparing against
> literal characters is simple, but it may be more complicated when
> matching against a list of alternative characters. Back-tracking
> schemes may not be set up to try the next character on a list of
> alternatives, e.g. so that pattern (\u0f72|\u0f71)\u0f72 matches input
> string 0F71 0F72. The alternative (\u0f72|\u0f71) would first take the
> 0F72, and only on backtracking would it take the 0F71 instead. This is
> an issue with traces that does not appear with strings.
>
> Richard.
>
Received on Mon May 18 2015 - 15:58:06 CDT

This archive was generated by hypermail 2.2.0 : Mon May 18 2015 - 15:58:06 CDT