Re: Regular Expressions and Canonical Equivalence from Richard Wordingham on 2015-05-14 (Unicode Mail List Archive)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Thu, 14 May 2015 18:55:33 +0100

On Thu, 14 May 2015 12:25:06 -0500
Stephen E Slevinski Jr <slevin_at_signpuddle.net> wrote:

> On 5/14/15 5:58 AM, Philippe Verdy wrote:
> > Yes it is problematic: (ab)* is not the same as (a|b)* as this
> > requires matching pairs of letters "ab" in that order in the first
> > expression, but random strings of "a" and "b" i nthe second one (so
> > the second matches *more* input samples.
> >
> > Even if you consider canonical equivalences (where the relative
> > order of "ab" does not matter for example because they have
> > distinct non-zero canonical) this does not mean that "a" alone will
> > match in the first expression "(ab*)", even though it MUST match in
> > "(a|b)*".
> >
> > So the solution is just elegant to simplify the first level of
> > analysis of "(ab)*" by using "(a|b)*" instead. But then you need to
> > perform a second pass on the match to make sure it is containing
> > only complete sequences "ab" in that order (or any other order if
> > they are all combining with a non-zero combining class) and no
> > unpaired "a" or "b".
>
> If you always want to find "a" and "b" in a pair without regard to
> the order, how about the regex:
> ((ab)|(ba))*

In NFD, the language (\u0323\u0302)* consists of

ε (empty string)
\u0323\u0302
\u0323\u0323\u0302\u0302
\u0323\u0323\u0323\u0302\u0302\u0302
\u0323\u0323\u0323\u0323\u0302\u0302\u0302\u0302

and so on.

Therefore the finite automaton implied by your regex won't work. No
regular expression will work. That is mathematically proven. What I
have listed above is the standard example of a 'non-regular language',
a set of strings that cannot be defined by a finite site of regular
expressions.

Richard.
Received on Thu May 14 2015 - 12:56:45 CDT

This archive was generated by hypermail 2.2.0 : Thu May 14 2015 - 12:56:45 CDT