RE: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Oct 08 2007 - 17:08:50 CDT

Next message: Doug Ewell: "Re: Alignment of IANA language subtag registry to ISO 639-3"

Previous message: Marion Gunn: "Re: Emoticons (was: Root and fraction (2 new symbols))"
In reply to: Hans Aberg: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
Next in thread: Hans Aberg: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
Reply: Hans Aberg: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

On 2 Oct 2007, at 03:32, Mark Davis wrote:
> Getting back to this topic, given the upcoming meeting we
> should prepare some kind of summary and submit it.
> (Nothing on this list is seen by the UTC unless someone
> writes up a proposal or submits feedback.)
>
> Here is my attempt (grabbing some text from an email of
> Sep 23). Please let me know if you have any feedback before
> I send it in.
> There are a few viable approaches to matching negated sets.
> Let's take [a-z \q{ch} \q{rr}] as an example. It is
> productive to look at the "unrolled version of this, using
> the fact that the following are equivalent (ignoring
> capturing for the moment):
> [a-z \q{ch} \q{rr}]
> and
> ( [a-z] | ch | rr )

Agreed. If something is standardized for negated sets, it
should not contradict the rules for matching the second one
because it is intuitively clear, and both should be perceived
as completely equivalent (you can ignore the capturing issue,
notably because many regexp implementations do not assume
implicit capturing for parentheses, but require marking
more explicitly the groups you want to capture, and keep
the simple parentheses only as non-capturing groups.

So the question becomes really: What does a negated *regexp* match?

It is not as simple as you think, because any regexp can also be a part of
another larger regexp.
There are even case where regexps are used to match only conditional
contexts before or after another regexp, and it's clear that in this context
(notably for the regexps used in "before-contexts") that thy should not
implicitly match the longest text.

So the negated regexp matches BOTH the set of strings containing "ch", and
the set of strings not containing "ch". In a regexp implementation that will
be used to find all matches, and not just the first one, they will be both
returned, so "ch" will be returned, but in different matching contexts. In
implementations that just return the first match, the ordering of possible
matches becomes extremely important.

For example with the regexp "a*", the order of matches in text "aaaaaaaa"
matters, and the total number of possible matches is combinatorial, and you
can't avoid the question: which match should the regexp return first?
Ordering of matches is not something that a regexp specifies itself, it is
part of the regexp-engine behaviour (because a regexp normally includes ALL
the possibilities: the result of matching the regexp "a*" on "aaaaaaaa" is
an unordered set of matches, and only the implementation of the engine MAY
provide an ordering; then if the result of this operation is ordered and not
performed in a single canonical operation, the client of this matcher may
forget to look at the other possibilities, discarding all other possible
matches).

If we look at the question of negated sets, it's clear that "ch" will be
returned in an implementation that returns all matches in a single atomic
operation. But if the result is ordered, then we can make distinctions, by:
* ordering all the possible matches according to the position in the source
text (this is the most productive order as it can be made unique and stable)
* ignoring the order of alternative clauses in the regexp:

( [a-z] | ch | rr ) should be treated as completely identical to ( ch | rr |
[a-z] )
And this should be also the same for the negated regexp.

Suppose that you want to determine in a regexp which match should com first,
this is to be performed by ordering constraints in the regexp itself. But
the simplest form of alternatives within parentheses or within [] sets does
not clearly state a required ordering of these matches (that's why it's no
reasonnable to think that it specifies an ordering of matches.

Adding specific syntax to the regexp, in order to control the way it will
order the matches will have a severe cost on the implementation. Notably in
this case:

* the regexp matching engine will not be able to reduce its automata and to
order the alternative clauses, meaning that local optimizations of the
transition graph become nearly impossibleto peform due to the ordering
constraints on matches; and
* the regexp matching engine will no longer be allowed to reduce many
"epsilon"-transitions in the transition graph (transitions that don't
advance by matching any source character from the text, but just advance
from one matching context to the next);

or
* the regexp matching engine will have to support backtracking;
or even worse:
* it will not behave in a predictable way and may forget many matches !

If you want to control such ordering of matches, you need to remodel what is
a regexp:
* It is no longer a function that takes a regexp specification string and a
text, and that returns an unordered ***set*** of matches
* It is now a function that will return a **fully ordered** set of matches,
i.e. a ***vector*** of unique matches (this is equivalent to returning a
***map*** from natural integers to a specific match value or to a "no-match"
value.

The second alternative has several variants that are idempotent (and then
semantically equivalent). Let's say that the regexp function takes a third
parameter that represents its internal state and returns its new state after
returning only the first match of the fully ordered set of match, then:
* the function can return a ***list*** of unique matches. The initial state
is the empty list, and the state used as a parameter of the regexp function
is the list previously returned: it just needs to discard the initial node
of the list, to set its initial condition for finding the next match.

In that case, the regexp function behaves like a "monad" (In object
modelling, used in imperative languages like C, C++, Java..., monads are
modelled simply as instances that store their internal state, and a method
that takes the internal state of the instance as an additional input, and
then updates the internal state so that it is its output; in functional
languages, the input and output are physically separated and are immutable,
that's where "monads" are more formally used to describes functions whose
first input parameter represents the state, and first output parameter
represents the new state.)

At first glance this seems out-of-topic; but regexps are really all these.
The most general definition of regexp engines being the last one, where they
are functions taking three parameters (an initial state, a regexp
specification string, the input stream) and returning two (the new state and
a match in the input stream). The effect of "negating" a regexp (or a [set]
of characters or collation elements) is not so intuitive as it looks like,
and you need to consider and give value to the most general case, where the
order of matches is currently NOT specified by the syntax itself.

The effect of ordering in negated sets becomes very visible when regexps are
composed within larger regexps, this requires a semantic for the composition
of regexps, which is not defined directly by the operation of concatenation
of texts that regexps are parsing:

Intuitively, the effect of concatenating two regexps should match *all*
(non-defective) texts that are the concatenation of two texts matched
separately by the two regexps. When I say "non-defective" here, it means
that there's no collation element spanning the two texts being concatenated.
Such limitation disappears if you view texts as being vectors of immutable
collation elements, and if you allow breaking the texts only on collation
element boundaries, so that they can be parsed by separate regexps. If such
splitting as a side effect, it should not be permitted, or should be given a
clear semantic, and that's why the intuitive semantic of the concatenation
of two regexps is not as simple as the concatenation of the texts they
match.

Next message: Doug Ewell: "Re: Alignment of IANA language subtag registry to ISO 639-3"
Previous message: Marion Gunn: "Re: Emoticons (was: Root and fraction (2 new symbols))"
In reply to: Hans Aberg: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
Next in thread: Hans Aberg: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
Reply: Hans Aberg: "Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Mon Oct 08 2007 - 17:12:37 CDT