# RE: FYI: Regex paper for UTC

Date: Sat Oct 13 2007 - 19:54:11 CDT

• Next message: Philippe Verdy: "RE: Emoticons (was: Root and fraction (2 new symbols))"

Hans Aberg wrote:
> So you don't need this, I think. Essentially, what you are describing
> is a new closure operator x\$B!-(B, with the definition perhaps L(x') =
> {y \$B":(B C*| y is a substring of a string in L(x) }. (To be a "closure
> operator", it should satisfy L(x'') = L('x).) Then your complement is
> ~(x'), where L(~x) := C* \ L(x). But I am not sure of the usability
> of this x' operator.

Note exactly, because in my implementation y is not a simple string in C*,
it is a set of positioned strings; and I need to match the positions too.

(the positions are given as a tuple of half-open integer intervals, with as
many member intervals in the tuple as there are capturing groups in the
regexp plus one for matching the whole regexp itself);

So it is more like the set:
{( y, ( [s_0,e_0), [s_1,e_1), ... ) ) | y \$B":(B C*
&& i \$B":(B [0, x.nbcaptures()]
&& s_i \$B":(B [0, y.length()]
&& e_i \$B":(B [s, y.length()]
&& y = t.substring(s_i, e_i)
&& t \$B":(B L(x)
}
which defines (more or less) a set of positioned strings within L(x)...

The tuple of positions is ((s_i, e_i), ...), here defined for each capture
number i by the start index and end index in source text t of each captured
interval (whose associated substring value in source text is y).

Capture number i=0 is used to represent the position of the match for the
whole regexp.

There are other conditions for these captured intervals, hard to write in a
single formula, but may be clearer using an Haskell syntax, or an expression
of lambda-calculus. The definition is limpid in the Java source, but still
too long for this email.

To simplify this, I'll just suppose that captures are not supported, so each
tuple of intervals of position become a single interval of positions:

{( y, [s,e) ) | y \$B":(B C*
&& s \$B":(B [0, y.length()]
&& e \$B":(B [s, y.length()]
&& y = t.substring(s_i, e_i)
&& t \$B":(B L(x)
}
A regexp by itself captures all intervals of positions for the same
substring matched (for example the RE "a" match equally the "a" in "bar" or
each "a" in "Madagascar", but they are from distinct strings, and at
distinct positions. On the opposite matching the regexp "a" on "Madagascar"
returns in one atomic operation the set of intervals in Madagascar where
there's a "a".

More complex regexps would return multiple substrings with distinct values,
each one associated with a set of distinct intervals, the two sets are
mapped into positioned matches using a partial Cartesian product where a
substring matched by the regexp is associated with all the intervals where
the match occurred. Adding captures is even more complex because the
returned "positioned matches" include the value and positions of each
capture.

Now the set of "positioned matches" can be ordered, but this is external to
the expression of the regexp itself. Basically, all matches can occur in any
order within the returned set.

Note also that searching matches with the Regexp "(a*)(a*)" in "aaaaa" will
return 5 possible strings: "", "a", "aa", "aaa", "aaaaa". Each one occurs at
several intervals in the source string. But even for these intervals, there
are different positions for the captured subgroups. The regexp will return
one match for each combination. Removing the unnecessary matches is part of
the filtering on output, but if only one interval is expected, you need some
ordering of the set.

The "leftmost first" rule further restricted by the "longest match" is the
most wanted rule, but that is not the only choice (the longest match rule
exposes a security issue for never returning and accumulating infinite
length strings in an internal buffer). But even in that case, you need a
priority order because the "longest match first" rule will conflict between
two successive capturing groups as in "(a*)(a*)" if parentheses are used to
capture two groups (with alternations, the interaction between capturing
groups may even be more complex with partial overrides of these groups)

Such strategy for generalizing regexps is not exposed in the Aho/Seti/Ullman
book that just indicates a few tracks to follow after exposing some ideas to
solve the problem. I've made sure that the generalization of regexps still
remains decidable in finite, non combinatorial, time between each returned
match.

But these strategies are part of the implementation in the Purdue Compilers
Construction Tools Set that mixes regexps and contextual parsers in a single
generalized language (the language actually uses a suite of production rules
that are all defined by generalized regexps, including the support for
custom transition tests on partial matches), one of the very rare sets of
tools that can successfully parse the complete FORTRAN syntax with optional
space separators between tokens (something that Lex and Yacc can't fully
recognize unambiguously like in "FORINDEX(1)=1TO10" where custom matching
rules are needed with several capturing groups to match several lexical
tokens at the same time the statement syntax has been partially validated.)
I studied PCCTS nearly 20 years ago, when it was still written in C (now in
Java and ported to .Net and best known now as ANTLR and integrated in
ANTLRWorks).

If you don't know ANTLR, have a look at it, it's certainly the most powerful
regexp engine, based on strong mathematics backgrounds on formal languages.
Once you know it, you will wonder why you learned to use Lex (whose regexps
are derived from legacy ed/vi/sed) and Yacc/Bison (limited to LALR
languages) that force you to generate code difficult to maintain.

This archive was generated by hypermail 2.1.5 : Sat Oct 13 2007 - 19:56:31 CDT