**From:** Philippe Verdy (*verdy_p@wanadoo.fr*)

**Date:** Sat Oct 13 2007 - 16:21:01 CDT

**Previous message:**Hans Aberg: "Re: FYI: Regex paper for UTC"**In reply to:**Hans Aberg: "Re: FYI: Regex paper for UTC"**Next in thread:**Hans Aberg: "Re: FYI: Regex paper for UTC"**Reply:**Hans Aberg: "Re: FYI: Regex paper for UTC"**Messages sorted by:**[ date ] [ thread ] [ subject ] [ author ] [ attachment ]**Mail actions:**[ respond to this message ] [ mail a new topic ]

Hans Aberg [mailto:haberg@math.su.se] wrote:

*> Envoyé : samedi 13 octobre 2007 21:57
*

*> À : verdy_p@wanadoo.fr
*

*> Cc : Mark Davis; Unicode List
*

*> Objet : Re: FYI: Regex paper for UTC
*

*>
*

*> On 13 Oct 2007, at 21:36, Philippe Verdy wrote:
*

*>
*

*> >> That operation is not very useful, because the language complement of
*

*> >> say a single character c is the set of all other strings. So if one
*

*> >> is finding the longest string in the language from a point on, and
*

*> >> the string isn't c, all will be eaten.
*

*> >
*

*> > No, such operation is typically used in association with a "&&"
*

*> > operator
*

*> > that restricts the set of matchable strings. They are used also for
*

*> > matching
*

*> > left and right contexts without including these contexts in the
*

*> > returned
*

*> > match.
*

*> >
*

*> > But as you said, "all will be eaten" ONLY IF "the string is not c",
*

*> > so the
*

*> > effect of negation is NOT producing the whole set of possible texts.
*

*>
*

*> The problem is that any string starting with c will also be in the
*

*> language complement and matched.
*

No, it won't. A string containing a c (at any position, and in any non-null

number of positions), contains a match for the regexp "c", so it is not part

of the complement.

The expression "a -- b" matches "a" only if it does not contain "b". Both

"a" and "b" are tried in parallel (trying to match "b" is performed

repeatedly on every position), and as soon as a match on "b" is found, it

eliminates the ongoing candidate match on "a".

The "--" and "~" operators are not operators acting on a single instance of

the language, they are set operations acting on ALL possible matches

simultaneously, i.e. in the domain of sets of strings, not on the domain of

isolated strings.

I perceive all regexps to be idempotent (linked bijectively) to the set of

strings where they find ***AT LEAST*** one match. The empty regexp is

idempotent to the whole set of strings (including the empty string).

When matching a regexp against a text, the result is not a single match, not

even just the longest one, it is the complete set of all matches found in

that text.

Then if you want to find just the leftmost shortest match, you just need to

fix an ordering on top of this set of matches, by ordering them first by

growing initial position, then by decreasing length: when you do that, you

have a well defined "nextMatch()" method for your need (so that you don't

need to build the complete set associated to the regexp, because ordered

sets of matches are iterable using the ordering condition, i.e. here the

initial position of the last match and its length).

You may as well want to find the leftmost longest match, in a method named

for example nextLongestMatch(), by calling repeatedly the nextMatch() method

described above as long as it returns the same initial position for the

returned interval; when another initial position is found or no further

match is returned by nextMatch(), store this last match interval for the

next call to nextLongestMatch(), and return the result of the last

nextMatch(); otherwise return a NULL match (this is how you would filter out

the set of matches).

The suite of calls to nextMatch() until it returns NULL will iterate over

the complete set of matches associated to the regexp and found in its input

text.

In other words, matching a regexp against an input text is defined as

computing the intersection of two sets:

* the complete set of strings generated by the regexp itself

* the complete set of unique intervals within the input text.

To compute the intersection,

* each string in the regexp set is uniquely associated in a (interval,

string) pair defining an infinite but enumerable set, with a "any-interval"

interval

* each distinct intervals in the input text is associated in a (interval,

string) pair defining an element of the set to intersect with the previous

set.

The result is then effectively a set of (interval, string) pairs.

For regexps that support N capturing groups, you just need to replace the

concept of "interval" by "N-tuple of intervals" (which are also enumerable)

in the definition above.

More generally, a regexp is not distinct from a (possibly infinite) set of

strings:

: RE("a") = {"a"};

: RE("a{0,3}") = {"", "a", "aa", "aaa"};

: RE("a+") = {"a", "aa", "aaa", ..., "aaaa...a", ... };

: RE("a*") = {"", "a", "aa", "aaa", ..., "aaaa...a", ... };

And each string element of these sets are bijectively associated to the

infinite set PS of positioned strings, for example:

: PS("aa") = {

((0,0),"aa"),

((0,1),"aa"), ((1,0),"aa"),

((0,2),"aa"), ((1,1),"aa"), ((2,0),"aa"),

...

};

Here the intervals are noted as (initial position, length) pairs (the way we

can enumerate this set in the general case is by fixing a value for

position+length and varyingthe position between 0 and this sum, before

incrementing this sum and restarting; however if the sum has an upper bound

MAX, the easiest way is to use the ordered enumeration by fixing a start

position between 0 and MAX, and varying length from 0 to MAX-position: this

is the ordered enumeration used generally for finding matches).

So the regexp is also bijectively associated with the union of these PS()

sets which are enumerable, and so, orderable.

You can easily demonstrate that regexps containing a finite character class

(containing a finite number of characters, because the input alphabet is

finite and a character class is included in this input alphabet) is also

enumerable despite it is also associated bijectively to an infinite set of

positioned strings for matching purposes.

You can demonstrate this same property also easily for regexps containing

concatenations of regexps, or alternation of regexps, or any bounded or

unbounded kleen closure of a regexp (hint: use the inference by recursion).

Matching the regexp "a+" against a string (input_text) is then computing the

intersection between:

* the union of all subsets like PS("a"), PS("aa"), PS("aaa")... : this set

has infinite cardinality

* and PS(input_text) : this enumerable set is finite if input_text is

finite.

The result is also an enumerable set of positioned strings.

All valid regexp operators are acting on enumerable sets of positioned

strings, and they return enumerable sets of positioned strings. Using an API

like nextMatch() is just implementing an iterator on such sets, by ordering

it: it is trivial to apply the same ordering on the infinite set of

positioned strings associated to the regexp independently of the input_text,

and so to support the API with two simple state variables (the current start

positions, growing when match longer strings from this position becomes

impossible due to the finite input_text length constraint, and the length of

the current match):

You don't need to enumerate completely the infinite set associated to the

regexp itself because the length of the input_text is finite, and defines

the maximum length of the intervals in the returned intersection.

**Next message:**Philippe Verdy: "RE: FYI: Regex paper for UTC"**Previous message:**Hans Aberg: "Re: FYI: Regex paper for UTC"**In reply to:**Hans Aberg: "Re: FYI: Regex paper for UTC"**Next in thread:**Hans Aberg: "Re: FYI: Regex paper for UTC"**Reply:**Hans Aberg: "Re: FYI: Regex paper for UTC"**Messages sorted by:**[ date ] [ thread ] [ subject ] [ author ] [ attachment ]**Mail actions:**[ respond to this message ] [ mail a new topic ]

*
This archive was generated by hypermail 2.1.5
: Sat Oct 13 2007 - 16:23:05 CDT
*