Re: Regular Expressions and Canonical Equivalence from Philippe Verdy on 2015-05-15 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Fri, 15 May 2015 22:09:13 +0200

2015-05-15 9:10 GMT+02:00 Richard Wordingham <
richard.wordingham_at_ntlworld.com>:

> On Fri, 15 May 2015 02:10:36 +0200
> Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
>
> > 2015-05-14 20:13 GMT+02:00 Richard Wordingham <
> > richard.wordingham_at_ntlworld.com>:
> >
> > > On Thu, 14 May 2015 12:58:29 +0200
> > > Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
> > >
> > > > 2015-05-14 9:59 GMT+02:00 Richard Wordingham <
> > > > richard.wordingham_at_ntlworld.com>:
> > > >
> > > > > An elegant formal solution to the Kleene star problem interprets
> > > > > (\u0323\u0302)* as (\u0323|\u0302)*. However, that is
> > > > > counter-intuitive
> > >
> > > The technical term for this is the 'concurrent iteration' - or at
> > > least, that's the term used in the 'Book of Traces'.
> > >
> > > > For your example "(\u0323\u0302)*" the characters in the
> > > > alternatives (COMBINING DOT BELOW and COMBINING ACUTE ACCENT),
> > > > once converted to NFD (which is the same here) are just using at
> > > > most two distinct non-zero combining classes and no blocker; so
> > > > it is safe to trasform it to (\u0323|\u0302)* for a first pass
> > > > matching that will then only check candidate matchs in the second
> > > > pass. or more efficiently, a second finite state automata (FSA)
> > > > running in parallel with its own state:
> > >
> > > You've forgotten the basic problem. A *finite* state automaton
> > > cannot count very far; with only n states, it cannot count as far
> > > as n.
> > >
> >
> > I did not forget it, this is why there's a second pass (or a second
> > FSA running in parallel to indicate its own accept state). You have
> > to combine the two states variables to get the final combined state
> > to determine if it is a final accept state.
>
> Your description makes no sense to me as a description of a finite
> state automaton.

This is because you don't understand the issue !

> Now, a program to check whether a trace matching
> {\u0323|\u0302)* matches (\u0323\u0302)* is very simple. It just
> counts the number of times \u0323 occurs and the number of times
> \u0302 occurs, and returns whether they are equal.

This is wrong. \0323\0323\0302\0302 and \0323\0302\0323\0302 would pass
your counting test (which does not work in a FSA) but they are NOT
canonically equivalent because the identical combining characters are
blocking each other (so arbitrary ordering is not possible).

I maintain what I said: you don't need arbitrary counting and a FSA is
possible (both NFA, using compound states, and the derived DFA if ever you
want to resolve compound states to a single integer, but assume the fact
the the transition tables will explode dramatically)

Once again we cannot have pairs of strings where you cannot isolate BOUNDED
substrings (between blockers) where you can check their canonically
equivalence. At most you'll have only 255 combining characters to check
that have distinct non-zero combining classes.

So the FSA implementation is perfectly possible, for canonical equivalences
only... This evidently does not work if you are performing regexp searches
using looser equivalences, such as compatibility equivalence.
Received on Fri May 15 2015 - 15:10:52 CDT

This archive was generated by hypermail 2.2.0 : Fri May 15 2015 - 15:10:52 CDT