Re: Proposal for matching negated sets (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Andy Heninger (andy.heninger@gmail.com)
Date: Sun Oct 07 2007 - 20:27:47 CDT

  • Next message: Javier SOLA: "Re: Khmer, display of consonant shifters"

    On 10/5/07, Mark Davis <mark.davis@icu-project.org> wrote:
    >
    > Going back to my doc, and tweaking it, I think what we are saying is that
    >
    > [a-z \q{ch} \q{rr}]
    >
    > is equivalent to
    >
    > ( ch | rr | [a-z] )
    >
    > in matching, so however the latter works, the former should work.

    How an expression like ( ch | rr | [a-z] ) works is in fact different
    between the Perl family of regular expression implementations and the POSIX
    family

    Consider the expression
       (ch | hz | a-f)+
    and the text
       chz

    POSIX matching would match the entire string.
    Perl (and Java, ICU, etc.) would match only the ch.

    Do we care? I don't know. Maybe not, maybe the situation doesn't come up
    often enough to worry about. Ignoring the differences has the practical
    advantage that each type of implementation could actually do an
    implementation within the structure of what they already have.

     -- Andy

    I think complement is the only really tricky problem.
    >
    > Mark
    >
    > On 10/5/07, Andy Heninger <andy.heninger@gmail.com> wrote:
    > >
    > >
    > >
    > > On 10/4/07, Mike <mike-list@pobox.com > wrote:
    > > >
    > > > > With strings in sets at all, separately from the question of how to
    > > > do
    > > > > set negation, I'm not sure how matching should work. Which choice
    > > > is
    > > > > selected if more than one is possible? Should backtracking try
    > > > > additional choices if the first one doesn't lead to an overall
    > > > match?
    > > > > If sets don't have an implied ordering, do we need to require a
    > > > POSIX
    > > > > style longest match, which could be slow?
    > > >
    > > > In a set, I keep track of the strings by their length, so the longest
    > > > match is always found. I don't think you want to backtrack and try a
    > > > shorter string since the longer match is supposed to be treated as a
    > > > unit....
    > > >
    > > > > Should the set [^xyz\q{ch}] match the 'c' in "ch" ?
    > > >
    > > > I don't think so; since the \q{ch} matches "ch", the negated set does
    > > > not match at the first position.
    > >
    > >
    > > The choices you have made seem reasonable to me.
    > >
    > > But what would implementations with a DFA (non-backtracking)
    > > implementation do? It would be very difficult for them to not take a
    > > shorter string from a set if that led to an overall longer match. Would it
    > > be OK - still useful- if the UTS left what happens unspecified?
    > >
    > > -- Andy
    > >
    > >
    > > > I'm half inclined to move strings, or literal clusters, into section
    > > > 3,
    > > > > then move the entire section 3 of UTS-18 into a separate document
    > > > for
    > > > > interesting, but not fully worked out, ideas.
    > > >
    > > > This seems like a good idea.
    > > >
    > > > Mike
    > > >
    > >
    > >
    >
    >
    > --
    > Mark



    This archive was generated by hypermail 2.1.5 : Sun Oct 07 2007 - 20:32:09 CDT