**From:** Philippe Verdy (*verdy_p@wanadoo.fr*)

**Date:** Sat Oct 13 2007 - 19:54:11 CDT

**Previous message:**Philippe Verdy: "RE: FYI: Regex paper for UTC"**In reply to:**Hans Aberg: "Re: FYI: Regex paper for UTC"**Next in thread:**Hans Aberg: "Re: FYI: Regex paper for UTC"**Reply:**Hans Aberg: "Re: FYI: Regex paper for UTC"**Messages sorted by:**[ date ] [ thread ] [ subject ] [ author ] [ attachment ]**Mail actions:**[ respond to this message ] [ mail a new topic ]

Hans Aberg wrote:

*> So you don't need this, I think. Essentially, what you are describing
*

*> is a new closure operator x$B!-(B, with the definition perhaps L(x') =
*

*> {y $B":(B C*| y is a substring of a string in L(x) }. (To be a "closure
*

*> operator", it should satisfy L(x'') = L('x).) Then your complement is
*

*> ~(x'), where L(~x) := C* \ L(x). But I am not sure of the usability
*

*> of this x' operator.
*

Note exactly, because in my implementation y is not a simple string in C*,

it is a set of positioned strings; and I need to match the positions too.

(the positions are given as a tuple of half-open integer intervals, with as

many member intervals in the tuple as there are capturing groups in the

regexp plus one for matching the whole regexp itself);

So it is more like the set:

{( y, ( [s_0,e_0), [s_1,e_1), ... ) ) | y $B":(B C*

&& i $B":(B [0, x.nbcaptures()]

&& s_i $B":(B [0, y.length()]

&& e_i $B":(B [s, y.length()]

&& y = t.substring(s_i, e_i)

&& t $B":(B L(x)

}

which defines (more or less) a set of positioned strings within L(x)...

The tuple of positions is ((s_i, e_i), ...), here defined for each capture

number i by the start index and end index in source text t of each captured

interval (whose associated substring value in source text is y).

Capture number i=0 is used to represent the position of the match for the

whole regexp.

There are other conditions for these captured intervals, hard to write in a

single formula, but may be clearer using an Haskell syntax, or an expression

of lambda-calculus. The definition is limpid in the Java source, but still

too long for this email.

To simplify this, I'll just suppose that captures are not supported, so each

tuple of intervals of position become a single interval of positions:

{( y, [s,e) ) | y $B":(B C*

&& s $B":(B [0, y.length()]

&& e $B":(B [s, y.length()]

&& y = t.substring(s_i, e_i)

&& t $B":(B L(x)

}

A regexp by itself captures all intervals of positions for the same

substring matched (for example the RE "a" match equally the "a" in "bar" or

each "a" in "Madagascar", but they are from distinct strings, and at

distinct positions. On the opposite matching the regexp "a" on "Madagascar"

returns in one atomic operation the set of intervals in Madagascar where

there's a "a".

More complex regexps would return multiple substrings with distinct values,

each one associated with a set of distinct intervals, the two sets are

mapped into positioned matches using a partial Cartesian product where a

substring matched by the regexp is associated with all the intervals where

the match occurred. Adding captures is even more complex because the

returned "positioned matches" include the value and positions of each

capture.

Now the set of "positioned matches" can be ordered, but this is external to

the expression of the regexp itself. Basically, all matches can occur in any

order within the returned set.

Note also that searching matches with the Regexp "(a*)(a*)" in "aaaaa" will

return 5 possible strings: "", "a", "aa", "aaa", "aaaaa". Each one occurs at

several intervals in the source string. But even for these intervals, there

are different positions for the captured subgroups. The regexp will return

one match for each combination. Removing the unnecessary matches is part of

the filtering on output, but if only one interval is expected, you need some

ordering of the set.

The "leftmost first" rule further restricted by the "longest match" is the

most wanted rule, but that is not the only choice (the longest match rule

exposes a security issue for never returning and accumulating infinite

length strings in an internal buffer). But even in that case, you need a

priority order because the "longest match first" rule will conflict between

two successive capturing groups as in "(a*)(a*)" if parentheses are used to

capture two groups (with alternations, the interaction between capturing

groups may even be more complex with partial overrides of these groups)

Such strategy for generalizing regexps is not exposed in the Aho/Seti/Ullman

book that just indicates a few tracks to follow after exposing some ideas to

solve the problem. I've made sure that the generalization of regexps still

remains decidable in finite, non combinatorial, time between each returned

match.

But these strategies are part of the implementation in the Purdue Compilers

Construction Tools Set that mixes regexps and contextual parsers in a single

generalized language (the language actually uses a suite of production rules

that are all defined by generalized regexps, including the support for

custom transition tests on partial matches), one of the very rare sets of

tools that can successfully parse the complete FORTRAN syntax with optional

space separators between tokens (something that Lex and Yacc can't fully

recognize unambiguously like in "FORINDEX(1)=1TO10" where custom matching

rules are needed with several capturing groups to match several lexical

tokens at the same time the statement syntax has been partially validated.)

I studied PCCTS nearly 20 years ago, when it was still written in C (now in

Java and ported to .Net and best known now as ANTLR and integrated in

ANTLRWorks).

If you don't know ANTLR, have a look at it, it's certainly the most powerful

regexp engine, based on strong mathematics backgrounds on formal languages.

Once you know it, you will wonder why you learned to use Lex (whose regexps

are derived from legacy ed/vi/sed) and Yacc/Bison (limited to LALR

languages) that force you to generate code difficult to maintain.

**Next message:**Philippe Verdy: "RE: Emoticons (was: Root and fraction (2 new symbols))"**Previous message:**Philippe Verdy: "RE: FYI: Regex paper for UTC"**In reply to:**Hans Aberg: "Re: FYI: Regex paper for UTC"**Next in thread:**Hans Aberg: "Re: FYI: Regex paper for UTC"**Reply:**Hans Aberg: "Re: FYI: Regex paper for UTC"**Messages sorted by:**[ date ] [ thread ] [ subject ] [ author ] [ attachment ]**Mail actions:**[ respond to this message ] [ mail a new topic ]

*
This archive was generated by hypermail 2.1.5
: Sat Oct 13 2007 - 19:56:31 CDT
*