RE: FYI: Regex paper for UTC

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Oct 13 2007 - 16:21:01 CDT

  • Next message: Philippe Verdy: "RE: FYI: Regex paper for UTC"

    Hans Aberg [mailto:haberg@math.su.se] wrote:
    > Envoyé : samedi 13 octobre 2007 21:57
    > À : verdy_p@wanadoo.fr
    > Cc : Mark Davis; Unicode List
    > Objet : Re: FYI: Regex paper for UTC
    >
    > On 13 Oct 2007, at 21:36, Philippe Verdy wrote:
    >
    > >> That operation is not very useful, because the language complement of
    > >> say a single character c is the set of all other strings. So if one
    > >> is finding the longest string in the language from a point on, and
    > >> the string isn't c, all will be eaten.
    > >
    > > No, such operation is typically used in association with a "&&"
    > > operator
    > > that restricts the set of matchable strings. They are used also for
    > > matching
    > > left and right contexts without including these contexts in the
    > > returned
    > > match.
    > >
    > > But as you said, "all will be eaten" ONLY IF "the string is not c",
    > > so the
    > > effect of negation is NOT producing the whole set of possible texts.
    >
    > The problem is that any string starting with c will also be in the
    > language complement and matched.

    No, it won't. A string containing a c (at any position, and in any non-null
    number of positions), contains a match for the regexp "c", so it is not part
    of the complement.

    The expression "a -- b" matches "a" only if it does not contain "b". Both
    "a" and "b" are tried in parallel (trying to match "b" is performed
    repeatedly on every position), and as soon as a match on "b" is found, it
    eliminates the ongoing candidate match on "a".

    The "--" and "~" operators are not operators acting on a single instance of
    the language, they are set operations acting on ALL possible matches
    simultaneously, i.e. in the domain of sets of strings, not on the domain of
    isolated strings.

    I perceive all regexps to be idempotent (linked bijectively) to the set of
    strings where they find ***AT LEAST*** one match. The empty regexp is
    idempotent to the whole set of strings (including the empty string).

    When matching a regexp against a text, the result is not a single match, not
    even just the longest one, it is the complete set of all matches found in
    that text.

    Then if you want to find just the leftmost shortest match, you just need to
    fix an ordering on top of this set of matches, by ordering them first by
    growing initial position, then by decreasing length: when you do that, you
    have a well defined "nextMatch()" method for your need (so that you don't
    need to build the complete set associated to the regexp, because ordered
    sets of matches are iterable using the ordering condition, i.e. here the
    initial position of the last match and its length).

    You may as well want to find the leftmost longest match, in a method named
    for example nextLongestMatch(), by calling repeatedly the nextMatch() method
    described above as long as it returns the same initial position for the
    returned interval; when another initial position is found or no further
    match is returned by nextMatch(), store this last match interval for the
    next call to nextLongestMatch(), and return the result of the last
    nextMatch(); otherwise return a NULL match (this is how you would filter out
    the set of matches).

    The suite of calls to nextMatch() until it returns NULL will iterate over
    the complete set of matches associated to the regexp and found in its input
    text.

    In other words, matching a regexp against an input text is defined as
    computing the intersection of two sets:
    * the complete set of strings generated by the regexp itself
    * the complete set of unique intervals within the input text.

    To compute the intersection,
    * each string in the regexp set is uniquely associated in a (interval,
    string) pair defining an infinite but enumerable set, with a "any-interval"
    interval
    * each distinct intervals in the input text is associated in a (interval,
    string) pair defining an element of the set to intersect with the previous
    set.

    The result is then effectively a set of (interval, string) pairs.

    For regexps that support N capturing groups, you just need to replace the
    concept of "interval" by "N-tuple of intervals" (which are also enumerable)
    in the definition above.

    More generally, a regexp is not distinct from a (possibly infinite) set of
    strings:
    : RE("a") = {"a"};
    : RE("a{0,3}") = {"", "a", "aa", "aaa"};
    : RE("a+") = {"a", "aa", "aaa", ..., "aaaa...a", ... };
    : RE("a*") = {"", "a", "aa", "aaa", ..., "aaaa...a", ... };
    And each string element of these sets are bijectively associated to the
    infinite set PS of positioned strings, for example:
    : PS("aa") = {
        ((0,0),"aa"),
        ((0,1),"aa"), ((1,0),"aa"),
        ((0,2),"aa"), ((1,1),"aa"), ((2,0),"aa"),
        ...
      };
    Here the intervals are noted as (initial position, length) pairs (the way we
    can enumerate this set in the general case is by fixing a value for
    position+length and varyingthe position between 0 and this sum, before
    incrementing this sum and restarting; however if the sum has an upper bound
    MAX, the easiest way is to use the ordered enumeration by fixing a start
    position between 0 and MAX, and varying length from 0 to MAX-position: this
    is the ordered enumeration used generally for finding matches).

    So the regexp is also bijectively associated with the union of these PS()
    sets which are enumerable, and so, orderable.

    You can easily demonstrate that regexps containing a finite character class
    (containing a finite number of characters, because the input alphabet is
    finite and a character class is included in this input alphabet) is also
    enumerable despite it is also associated bijectively to an infinite set of
    positioned strings for matching purposes.

    You can demonstrate this same property also easily for regexps containing
    concatenations of regexps, or alternation of regexps, or any bounded or
    unbounded kleen closure of a regexp (hint: use the inference by recursion).

    Matching the regexp "a+" against a string (input_text) is then computing the
    intersection between:
    * the union of all subsets like PS("a"), PS("aa"), PS("aaa")... : this set
    has infinite cardinality
    * and PS(input_text) : this enumerable set is finite if input_text is
    finite.
    The result is also an enumerable set of positioned strings.

    All valid regexp operators are acting on enumerable sets of positioned
    strings, and they return enumerable sets of positioned strings. Using an API
    like nextMatch() is just implementing an iterator on such sets, by ordering
    it: it is trivial to apply the same ordering on the infinite set of
    positioned strings associated to the regexp independently of the input_text,
    and so to support the API with two simple state variables (the current start
    positions, growing when match longer strings from this position becomes
    impossible due to the finite input_text length constraint, and the length of
    the current match):

    You don't need to enumerate completely the infinite set associated to the
    regexp itself because the length of the input_text is finite, and defines
    the maximum length of the intervals in the returned intersection.



    This archive was generated by hypermail 2.1.5 : Sat Oct 13 2007 - 16:23:05 CDT