RE: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Oct 09 2007 - 16:58:54 CDT

  • Next message: Hans Aberg: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"

    Hans Aberg [mailto:haberg@math.su.se] wrote:
    > > You can't answer to these questions as it interacts with the possible
    > > (optional and unspecified) rules of the longest match or leftmost
    > > match if
    > > they are implemented, or with the ordering of matches when multiple
    > > matches
    > > are returned in a sequence or with enumerating methods like
    > > "regexp.nextMatch()" or "regexp.previousMatch()".
    >
    > What is you point? - DFAs can be constructed for those.

    Exactly! What's the point of DFA here (it's implementation defined, but not
    related to the language itself)?

    Ordering of matches in results is a semantic issue (unlike DFA which implies
    not changes to the semantic), completely independant of the way your matcher
    works, whever it uses a completely flattened DFA or not (because such
    flattened DFA is in fact not needed, and explodes in a combinatorial way,
    meaning longcompilation time andunbound memory resources).

    And yes I know we can build a flattened DFA, but in most cases this is not
    needed, not even for performance (in fact I have measured that the flattened
    DFA, as soon as it starts exploding in size, is degrading the performance
    due to data cache misses and the need to use large tables through random
    indirections, a good reason not to use it for regexps).

    Just to make things clear: in which order the matches for in
    "([0-9]+)([0-9+]" will be returned in this string: "12345"?
    The result set constains these 10 matches:

    "12"
    "123"
    "1234"
    "12345"
    "23"
    "234"
    "2345"
    "34"
    "345"
    "45"

    Now if we use parentheses for noting in the results the capturing groups,
    the same set of 10 results is decomposed like this into 20 matches:
    "(1)(2)"
    "(1)(23)", "(12)(3)"
    "(1)(234)", "(12)(34)", "(1)(234)"
    "(1)(2345)", "(12)(345)", "(123)(45)", "(1234)(5)"
    "(2)(3)"
    "(2)(34)", "(23)(4)"
    "(2)(345)", "(23)(45)", "(234)(5)"
    "(3)(4)"
    "(3)(45)", "(34)(5)"
    "(4)(5)"

    If ordering is not defined, then all results are equally valid
    If ordering is considered by the client application, it may simply take only
    the first match and ignore the rest, so that the regexp will have different
    meanings. Note that here the simple rule of the longest match for a
    capturing group does not work (it works only for regexps without capturing
    groups, i.e. a matching only the whole pattern as the default capturing
    group generally noted \0 or \& in many regexp implementations).



    This archive was generated by hypermail 2.1.5 : Tue Oct 09 2007 - 17:02:10 CDT