RE: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Oct 08 2007 - 03:39:45 CDT

Next message: Richard Ishida: "UniView updated"

Previous message: Javier SOLA: "Re: Khmer, display of consonant shifters"
In reply to: Andy Heninger: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
Next in thread: Philippe Verdy: "RE: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

> Consider the expression
> (ch | hz | a-f)+
> and the text
> chz
>
> POSIX matching would match the entire string.
> Perl (and Java, ICU, etc.) would match only the ch.

It depends on how many match you want to accept. If the regexp.match()
method will just return the first match, allowing you to continue (like in
Java or ICU), then it will return *all* the matches, but in an ordered way;
so these languages will return successively "ch" (according to first
alternative), then "chz" (according to the last alternative), then "hz" (the
second alternative).

There will possibly exist other order for these matches, but generally, if
the matcher needs to return all matches, it must do that by iterating on
them first nby extending the length, before advancing the start position, if
the matcher wants to avoid all backtracking.

If the user just wants ONE match, then the question of order becomes
important, and returning the longest match makes sense here (so "chz" is the
best choice.

Now if you don't need the exact matches, but just, for example to find all
lines that contain some pattern, the order in which the matches are found
within a line plays no role.

On the opposite, the order of the matches found plays a critical important
role if matches are used for substitutions: you could adopt the ed/vi/sed
behaviour and return always the longest match on the line, before applying a
subtitution, but then repeat again on the result, but you risk tocrete an
infinite loop (for example with "s/a/ba/g" would in that case try inserting
an infinite number of "b" before the first a found in a line, by trying to
substitute "a", matched by the regexp /a/, into "ba").

Such thing is for now not discussed in the UTS, despite it plays a critical
role for applications using regexp matchers. To avoid creating such loop in
substitution rules, it is critical to include a rule during substitution
that forces the matcher to advance after this whole match, if it has been
substituted. But if substitution is finally not performed (because it
depends on some other external condition, testing the value of the match
according, before determining that this is a "good" match, there's no reason
to stop finding other matches, because it may be useful as well.

In the POSIX's ed/sed/vi approach, the left-most longest match is preferred,
and each time a match is found, the regexp matcher restarts scanning by
resting its internal state as if it was just after the beginning of the
file; but for the evaluation of the matches to return to the application,
the other approach, (i.e. first returning the left-most shortest match and
not resting the state) is clearly used, and is much more powerful and needed
to allow matching regexps that contain aggregate operatots (like "*" or "?"
or "+" or "{m,n}") in the middle, and to support also the capturing groups.

Next message: Richard Ishida: "UniView updated"
Previous message: Javier SOLA: "Re: Khmer, display of consonant shifters"
In reply to: Andy Heninger: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
Next in thread: Philippe Verdy: "RE: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Mon Oct 08 2007 - 04:11:28 CDT