RE: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Oct 08 2007 - 17:08:50 CDT

  • Next message: Doug Ewell: "Re: Alignment of IANA language subtag registry to ISO 639-3"

    On 2 Oct 2007, at 03:32, Mark Davis wrote:
    > Getting back to this topic, given the upcoming meeting we
    > should prepare some kind of summary and submit it.
    > (Nothing on this list is seen by the UTC unless someone
    > writes up a proposal or submits feedback.)
    >
    > Here is my attempt (grabbing some text from an email of
    > Sep 23). Please let me know if you have any feedback before
    > I send it in.
    > There are a few viable approaches to matching negated sets.
    > Let's take [a-z \q{ch} \q{rr}] as an example. It is
    > productive to look at the "unrolled version of this, using
    > the fact that the following are equivalent (ignoring
    > capturing for the moment):
    > [a-z \q{ch} \q{rr}]
    > and
    > ( [a-z] | ch | rr )

    Agreed. If something is standardized for negated sets, it
    should not contradict the rules for matching the second one
    because it is intuitively clear, and both should be perceived
    as completely equivalent (you can ignore the capturing issue,
    notably because many regexp implementations do not assume
    implicit capturing for parentheses, but require marking
    more explicitly the groups you want to capture, and keep
    the simple parentheses only as non-capturing groups.

    So the question becomes really: What does a negated *regexp* match?

    It is not as simple as you think, because any regexp can also be a part of
    another larger regexp.
    There are even case where regexps are used to match only conditional
    contexts before or after another regexp, and it's clear that in this context
    (notably for the regexps used in "before-contexts") that thy should not
    implicitly match the longest text.

    So the negated regexp matches BOTH the set of strings containing "ch", and
    the set of strings not containing "ch". In a regexp implementation that will
    be used to find all matches, and not just the first one, they will be both
    returned, so "ch" will be returned, but in different matching contexts. In
    implementations that just return the first match, the ordering of possible
    matches becomes extremely important.

    For example with the regexp "a*", the order of matches in text "aaaaaaaa"
    matters, and the total number of possible matches is combinatorial, and you
    can't avoid the question: which match should the regexp return first?
    Ordering of matches is not something that a regexp specifies itself, it is
    part of the regexp-engine behaviour (because a regexp normally includes ALL
    the possibilities: the result of matching the regexp "a*" on "aaaaaaaa" is
    an unordered set of matches, and only the implementation of the engine MAY
    provide an ordering; then if the result of this operation is ordered and not
    performed in a single canonical operation, the client of this matcher may
    forget to look at the other possibilities, discarding all other possible
    matches).

    If we look at the question of negated sets, it's clear that "ch" will be
    returned in an implementation that returns all matches in a single atomic
    operation. But if the result is ordered, then we can make distinctions, by:
    * ordering all the possible matches according to the position in the source
    text (this is the most productive order as it can be made unique and stable)
    * ignoring the order of alternative clauses in the regexp:

    ( [a-z] | ch | rr ) should be treated as completely identical to ( ch | rr |
    [a-z] )
    And this should be also the same for the negated regexp.

    Suppose that you want to determine in a regexp which match should com first,
    this is to be performed by ordering constraints in the regexp itself. But
    the simplest form of alternatives within parentheses or within [] sets does
    not clearly state a required ordering of these matches (that's why it's no
    reasonnable to think that it specifies an ordering of matches.

    Adding specific syntax to the regexp, in order to control the way it will
    order the matches will have a severe cost on the implementation. Notably in
    this case:

    * the regexp matching engine will not be able to reduce its automata and to
    order the alternative clauses, meaning that local optimizations of the
    transition graph become nearly impossibleto peform due to the ordering
    constraints on matches; and
    * the regexp matching engine will no longer be allowed to reduce many
    "epsilon"-transitions in the transition graph (transitions that don't
    advance by matching any source character from the text, but just advance
    from one matching context to the next);

    or
    * the regexp matching engine will have to support backtracking;
    or even worse:
    * it will not behave in a predictable way and may forget many matches !

    If you want to control such ordering of matches, you need to remodel what is
    a regexp:
    * It is no longer a function that takes a regexp specification string and a
    text, and that returns an unordered ***set*** of matches
    * It is now a function that will return a **fully ordered** set of matches,
    i.e. a ***vector*** of unique matches (this is equivalent to returning a
    ***map*** from natural integers to a specific match value or to a "no-match"
    value.

    The second alternative has several variants that are idempotent (and then
    semantically equivalent). Let's say that the regexp function takes a third
    parameter that represents its internal state and returns its new state after
    returning only the first match of the fully ordered set of match, then:
    * the function can return a ***list*** of unique matches. The initial state
    is the empty list, and the state used as a parameter of the regexp function
    is the list previously returned: it just needs to discard the initial node
    of the list, to set its initial condition for finding the next match.

    In that case, the regexp function behaves like a "monad" (In object
    modelling, used in imperative languages like C, C++, Java..., monads are
    modelled simply as instances that store their internal state, and a method
    that takes the internal state of the instance as an additional input, and
    then updates the internal state so that it is its output; in functional
    languages, the input and output are physically separated and are immutable,
    that's where "monads" are more formally used to describes functions whose
    first input parameter represents the state, and first output parameter
    represents the new state.)

    At first glance this seems out-of-topic; but regexps are really all these.
    The most general definition of regexp engines being the last one, where they
    are functions taking three parameters (an initial state, a regexp
    specification string, the input stream) and returning two (the new state and
    a match in the input stream). The effect of "negating" a regexp (or a [set]
    of characters or collation elements) is not so intuitive as it looks like,
    and you need to consider and give value to the most general case, where the
    order of matches is currently NOT specified by the syntax itself.

    The effect of ordering in negated sets becomes very visible when regexps are
    composed within larger regexps, this requires a semantic for the composition
    of regexps, which is not defined directly by the operation of concatenation
    of texts that regexps are parsing:

    Intuitively, the effect of concatenating two regexps should match *all*
    (non-defective) texts that are the concatenation of two texts matched
    separately by the two regexps. When I say "non-defective" here, it means
    that there's no collation element spanning the two texts being concatenated.
    Such limitation disappears if you view texts as being vectors of immutable
    collation elements, and if you allow breaking the texts only on collation
    element boundaries, so that they can be parsed by separate regexps. If such
    splitting as a side effect, it should not be permitted, or should be given a
    clear semantic, and that's why the intuitive semantic of the concatenation
    of two regexps is not as simple as the concatenation of the texts they
    match.



    This archive was generated by hypermail 2.1.5 : Mon Oct 08 2007 - 17:12:37 CDT