RE: FYI: Regex paper for UTC

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Oct 13 2007 - 19:54:11 CDT

  • Next message: Philippe Verdy: "RE: Emoticons (was: Root and fraction (2 new symbols))"

    Hans Aberg wrote:
    > So you don't need this, I think. Essentially, what you are describing
    > is a new closure operator x$B!-(B, with the definition perhaps L(x') =
    > {y $B":(B C*| y is a substring of a string in L(x) }. (To be a "closure
    > operator", it should satisfy L(x'') = L('x).) Then your complement is
    > ~(x'), where L(~x) := C* \ L(x). But I am not sure of the usability
    > of this x' operator.

    Note exactly, because in my implementation y is not a simple string in C*,
    it is a set of positioned strings; and I need to match the positions too.

    (the positions are given as a tuple of half-open integer intervals, with as
    many member intervals in the tuple as there are capturing groups in the
    regexp plus one for matching the whole regexp itself);

    So it is more like the set:
        {( y, ( [s_0,e_0), [s_1,e_1), ... ) ) | y $B":(B C*
                  && i $B":(B [0, x.nbcaptures()]
                  && s_i $B":(B [0, y.length()]
                  && e_i $B":(B [s, y.length()]
                  && y = t.substring(s_i, e_i)
                  && t $B":(B L(x)
        }
    which defines (more or less) a set of positioned strings within L(x)...

    The tuple of positions is ((s_i, e_i), ...), here defined for each capture
    number i by the start index and end index in source text t of each captured
    interval (whose associated substring value in source text is y).

    Capture number i=0 is used to represent the position of the match for the
    whole regexp.

    There are other conditions for these captured intervals, hard to write in a
    single formula, but may be clearer using an Haskell syntax, or an expression
    of lambda-calculus. The definition is limpid in the Java source, but still
    too long for this email.

    To simplify this, I'll just suppose that captures are not supported, so each
    tuple of intervals of position become a single interval of positions:

        {( y, [s,e) ) | y $B":(B C*
                  && s $B":(B [0, y.length()]
                  && e $B":(B [s, y.length()]
                  && y = t.substring(s_i, e_i)
                  && t $B":(B L(x)
        }
    A regexp by itself captures all intervals of positions for the same
    substring matched (for example the RE "a" match equally the "a" in "bar" or
    each "a" in "Madagascar", but they are from distinct strings, and at
    distinct positions. On the opposite matching the regexp "a" on "Madagascar"
    returns in one atomic operation the set of intervals in Madagascar where
    there's a "a".

    More complex regexps would return multiple substrings with distinct values,
    each one associated with a set of distinct intervals, the two sets are
    mapped into positioned matches using a partial Cartesian product where a
    substring matched by the regexp is associated with all the intervals where
    the match occurred. Adding captures is even more complex because the
    returned "positioned matches" include the value and positions of each
    capture.

    Now the set of "positioned matches" can be ordered, but this is external to
    the expression of the regexp itself. Basically, all matches can occur in any
    order within the returned set.

    Note also that searching matches with the Regexp "(a*)(a*)" in "aaaaa" will
    return 5 possible strings: "", "a", "aa", "aaa", "aaaaa". Each one occurs at
    several intervals in the source string. But even for these intervals, there
    are different positions for the captured subgroups. The regexp will return
    one match for each combination. Removing the unnecessary matches is part of
    the filtering on output, but if only one interval is expected, you need some
    ordering of the set.

    The "leftmost first" rule further restricted by the "longest match" is the
    most wanted rule, but that is not the only choice (the longest match rule
    exposes a security issue for never returning and accumulating infinite
    length strings in an internal buffer). But even in that case, you need a
    priority order because the "longest match first" rule will conflict between
    two successive capturing groups as in "(a*)(a*)" if parentheses are used to
    capture two groups (with alternations, the interaction between capturing
    groups may even be more complex with partial overrides of these groups)

    Such strategy for generalizing regexps is not exposed in the Aho/Seti/Ullman
    book that just indicates a few tracks to follow after exposing some ideas to
    solve the problem. I've made sure that the generalization of regexps still
    remains decidable in finite, non combinatorial, time between each returned
    match.

    But these strategies are part of the implementation in the Purdue Compilers
    Construction Tools Set that mixes regexps and contextual parsers in a single
    generalized language (the language actually uses a suite of production rules
    that are all defined by generalized regexps, including the support for
    custom transition tests on partial matches), one of the very rare sets of
    tools that can successfully parse the complete FORTRAN syntax with optional
    space separators between tokens (something that Lex and Yacc can't fully
    recognize unambiguously like in "FORINDEX(1)=1TO10" where custom matching
    rules are needed with several capturing groups to match several lexical
    tokens at the same time the statement syntax has been partially validated.)
    I studied PCCTS nearly 20 years ago, when it was still written in C (now in
    Java and ported to .Net and best known now as ANTLR and integrated in
    ANTLRWorks).

    If you don't know ANTLR, have a look at it, it's certainly the most powerful
    regexp engine, based on strong mathematics backgrounds on formal languages.
    Once you know it, you will wonder why you learned to use Lex (whose regexps
    are derived from legacy ed/vi/sed) and Yacc/Bison (limited to LALR
    languages) that force you to generate code difficult to maintain.



    This archive was generated by hypermail 2.1.5 : Sat Oct 13 2007 - 19:56:31 CDT