RE: FYI: Regex paper for UTC

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Oct 13 2007 - 16:21:01 CDT

Next message: Philippe Verdy: "RE: FYI: Regex paper for UTC"

Previous message: Hans Aberg: "Re: FYI: Regex paper for UTC"
In reply to: Hans Aberg: "Re: FYI: Regex paper for UTC"
Next in thread: Hans Aberg: "Re: FYI: Regex paper for UTC"
Reply: Hans Aberg: "Re: FYI: Regex paper for UTC"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

Hans Aberg [mailto:haberg@math.su.se] wrote:
> Envoyé : samedi 13 octobre 2007 21:57
> À : verdy_p@wanadoo.fr
> Cc : Mark Davis; Unicode List
> Objet : Re: FYI: Regex paper for UTC
>
> On 13 Oct 2007, at 21:36, Philippe Verdy wrote:
>
> >> That operation is not very useful, because the language complement of
> >> say a single character c is the set of all other strings. So if one
> >> is finding the longest string in the language from a point on, and
> >> the string isn't c, all will be eaten.
> >
> > No, such operation is typically used in association with a "&&"
> > operator
> > that restricts the set of matchable strings. They are used also for
> > matching
> > left and right contexts without including these contexts in the
> > returned
> > match.
> >
> > But as you said, "all will be eaten" ONLY IF "the string is not c",
> > so the
> > effect of negation is NOT producing the whole set of possible texts.
>
> The problem is that any string starting with c will also be in the
> language complement and matched.

No, it won't. A string containing a c (at any position, and in any non-null
number of positions), contains a match for the regexp "c", so it is not part
of the complement.

The expression "a -- b" matches "a" only if it does not contain "b". Both
"a" and "b" are tried in parallel (trying to match "b" is performed
repeatedly on every position), and as soon as a match on "b" is found, it
eliminates the ongoing candidate match on "a".

The "--" and "~" operators are not operators acting on a single instance of
the language, they are set operations acting on ALL possible matches
simultaneously, i.e. in the domain of sets of strings, not on the domain of
isolated strings.

I perceive all regexps to be idempotent (linked bijectively) to the set of
strings where they find ***AT LEAST*** one match. The empty regexp is
idempotent to the whole set of strings (including the empty string).

When matching a regexp against a text, the result is not a single match, not
even just the longest one, it is the complete set of all matches found in
that text.

Then if you want to find just the leftmost shortest match, you just need to
fix an ordering on top of this set of matches, by ordering them first by
growing initial position, then by decreasing length: when you do that, you
have a well defined "nextMatch()" method for your need (so that you don't
need to build the complete set associated to the regexp, because ordered
sets of matches are iterable using the ordering condition, i.e. here the
initial position of the last match and its length).

You may as well want to find the leftmost longest match, in a method named
for example nextLongestMatch(), by calling repeatedly the nextMatch() method
described above as long as it returns the same initial position for the
returned interval; when another initial position is found or no further
match is returned by nextMatch(), store this last match interval for the
next call to nextLongestMatch(), and return the result of the last
nextMatch(); otherwise return a NULL match (this is how you would filter out
the set of matches).

The suite of calls to nextMatch() until it returns NULL will iterate over
the complete set of matches associated to the regexp and found in its input
text.

In other words, matching a regexp against an input text is defined as
computing the intersection of two sets:
* the complete set of strings generated by the regexp itself
* the complete set of unique intervals within the input text.

To compute the intersection,
* each string in the regexp set is uniquely associated in a (interval,
string) pair defining an infinite but enumerable set, with a "any-interval"
interval
* each distinct intervals in the input text is associated in a (interval,
string) pair defining an element of the set to intersect with the previous
set.

The result is then effectively a set of (interval, string) pairs.

For regexps that support N capturing groups, you just need to replace the
concept of "interval" by "N-tuple of intervals" (which are also enumerable)
in the definition above.

More generally, a regexp is not distinct from a (possibly infinite) set of
strings:
: RE("a") = {"a"};
: RE("a{0,3}") = {"", "a", "aa", "aaa"};
: RE("a+") = {"a", "aa", "aaa", ..., "aaaa...a", ... };
: RE("a*") = {"", "a", "aa", "aaa", ..., "aaaa...a", ... };
And each string element of these sets are bijectively associated to the
infinite set PS of positioned strings, for example:
: PS("aa") = {
    ((0,0),"aa"),
    ((0,1),"aa"), ((1,0),"aa"),
    ((0,2),"aa"), ((1,1),"aa"), ((2,0),"aa"),
    ...
  };
Here the intervals are noted as (initial position, length) pairs (the way we
can enumerate this set in the general case is by fixing a value for
position+length and varyingthe position between 0 and this sum, before
incrementing this sum and restarting; however if the sum has an upper bound
MAX, the easiest way is to use the ordered enumeration by fixing a start
position between 0 and MAX, and varying length from 0 to MAX-position: this
is the ordered enumeration used generally for finding matches).

So the regexp is also bijectively associated with the union of these PS()
sets which are enumerable, and so, orderable.

You can easily demonstrate that regexps containing a finite character class
(containing a finite number of characters, because the input alphabet is
finite and a character class is included in this input alphabet) is also
enumerable despite it is also associated bijectively to an infinite set of
positioned strings for matching purposes.

You can demonstrate this same property also easily for regexps containing
concatenations of regexps, or alternation of regexps, or any bounded or
unbounded kleen closure of a regexp (hint: use the inference by recursion).

Matching the regexp "a+" against a string (input_text) is then computing the
intersection between:
* the union of all subsets like PS("a"), PS("aa"), PS("aaa")... : this set
has infinite cardinality
* and PS(input_text) : this enumerable set is finite if input_text is
finite.
The result is also an enumerable set of positioned strings.

All valid regexp operators are acting on enumerable sets of positioned
strings, and they return enumerable sets of positioned strings. Using an API
like nextMatch() is just implementing an iterator on such sets, by ordering
it: it is trivial to apply the same ordering on the infinite set of
positioned strings associated to the regexp independently of the input_text,
and so to support the API with two simple state variables (the current start
positions, growing when match longer strings from this position becomes
impossible due to the finite input_text length constraint, and the length of
the current match):

You don't need to enumerate completely the infinite set associated to the
regexp itself because the length of the input_text is finite, and defines
the maximum length of the intervals in the returned intersection.

Next message: Philippe Verdy: "RE: FYI: Regex paper for UTC"
Previous message: Hans Aberg: "Re: FYI: Regex paper for UTC"
In reply to: Hans Aberg: "Re: FYI: Regex paper for UTC"
Next in thread: Hans Aberg: "Re: FYI: Regex paper for UTC"
Reply: Hans Aberg: "Re: FYI: Regex paper for UTC"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Sat Oct 13 2007 - 16:23:05 CDT