RE: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Philippe Verdy (
Date: Mon Oct 08 2007 - 03:39:45 CDT

  • Next message: Richard Ishida: "UniView updated"

    > Consider  the expression
    >   (ch | hz | a-f)+
    > and the text
    >   chz
    > POSIX matching would match the entire string.
    > Perl (and Java, ICU, etc.) would match only the ch.

    It depends on how many match you want to accept. If the regexp.match()
    method will just return the first match, allowing you to continue (like in
    Java or ICU), then it will return *all* the matches, but in an ordered way;
    so these languages will return successively "ch" (according to first
    alternative), then "chz" (according to the last alternative), then "hz" (the
    second alternative).

    There will possibly exist other order for these matches, but generally, if
    the matcher needs to return all matches, it must do that by iterating on
    them first nby extending the length, before advancing the start position, if
    the matcher wants to avoid all backtracking.

    If the user just wants ONE match, then the question of order becomes
    important, and returning the longest match makes sense here (so "chz" is the
    best choice.

    Now if you don't need the exact matches, but just, for example to find all
    lines that contain some pattern, the order in which the matches are found
    within a line plays no role.

    On the opposite, the order of the matches found plays a critical important
    role if matches are used for substitutions: you could adopt the ed/vi/sed
    behaviour and return always the longest match on the line, before applying a
    subtitution, but then repeat again on the result, but you risk tocrete an
    infinite loop (for example with "s/a/ba/g" would in that case try inserting
    an infinite number of "b" before the first a found in a line, by trying to
    substitute "a", matched by the regexp /a/, into "ba").

    Such thing is for now not discussed in the UTS, despite it plays a critical
    role for applications using regexp matchers. To avoid creating such loop in
    substitution rules, it is critical to include a rule during substitution
    that forces the matcher to advance after this whole match, if it has been
    substituted. But if substitution is finally not performed (because it
    depends on some other external condition, testing the value of the match
    according, before determining that this is a "good" match, there's no reason
    to stop finding other matches, because it may be useful as well.

    In the POSIX's ed/sed/vi approach, the left-most longest match is preferred,
    and each time a match is found, the regexp matcher restarts scanning by
    resting its internal state as if it was just after the beginning of the
    file; but for the evaluation of the matches to return to the application,
    the other approach, (i.e. first returning the left-most shortest match and
    not resting the state) is clearly used, and is much more powerful and needed
    to allow matching regexps that contain aggregate operatots (like "*" or "?"
    or "+" or "{m,n}") in the middle, and to support also the capturing groups.

    This archive was generated by hypermail 2.1.5 : Mon Oct 08 2007 - 04:11:28 CDT