RE: New Public Review Issue: Proposed Update UTS #18

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Oct 01 2007 - 21:14:52 CST

  • Next message: Philippe Verdy: "RE: Fish (was Re: Marks)"

    Mike wrote:
    > >> All that matters is
    > >> that you find the longest match. [a-z\q{ch}] will match "ch"
    > >> in "chinchilla" rather than just "c".
    > >
    > > And what can you do with the negated class? How do you define it
    > > consistently?
    >
    > A negated class matches wherever the non-negated class doesn't
    > match and vice-versa. Haven't I said this numerous times?

    Yes. And I have said it too; you can reread my messages, I have not denied
    it! That's exactly why we need special behavior for collation elements
    when they are part of a negated class or a negated union.
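
    To make the rule quoted above concrete, here is a minimal sketch (not
    Mike's implementation, only an illustration with the class [a-z\q{ch}]
    hard-coded) of longest-match against a class containing the contraction
    "ch", and of its negation defined as matching wherever the positive class
    does not:

        import java.util.Arrays;
        import java.util.List;

        public class NegatedClassSketch {
            // Hypothetical class [a-z\q{ch}]: the letters a-z plus the
            // multi-character collation element "ch".
            static final List<String> CONTRACTIONS = Arrays.asList("ch");

            // Length matched by the class at position pos, or 0 if no match;
            // contractions are tried first so that the longest match wins.
            static int matchClass(String text, int pos) {
                for (String e : CONTRACTIONS) {
                    if (text.startsWith(e, pos)) return e.length();
                }
                char c = text.charAt(pos);
                return (c >= 'a' && c <= 'z') ? 1 : 0;
            }

            // The negated class matches exactly where the positive class does not.
            static boolean matchNegatedClass(String text, int pos) {
                return matchClass(text, pos) == 0;
            }

            public static void main(String[] args) {
                System.out.println(matchClass("chinchilla", 0));        // 2: "ch", not just "c"
                System.out.println(matchNegatedClass("chinchilla", 0)); // false
                System.out.println(matchNegatedClass("42", 0));         // true
            }
        }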

    I personally make no distinction between classes written with [] and
    unions written with (...|...|...). I already have a regexp matcher that
    works exactly this way. Although it does not yet recognize the special
    classes defined by Unicode character property values, nothing prevents me
    from adding that. My regexp parser compiles the regexp into a simple
    vector of 32-bit ints containing either single codepoints or a primitive
    (coded as a negative value, without any bitset, because it offers far
    better performance). There are only four operator primitives, for encoding
    unions, disjunctions, intersections and negations. There's a special
    primitive value for encoding transitions over ranges of characters, and
    I'll add another for encoding tests of character properties (which define
    the special classes).
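
    As a purely hypothetical sketch of what such an encoding can look like
    (the constants and the layout below are invented for illustration, not
    the exact layout of my implementation), the compiled program can be a
    flat int[] in which non-negative entries are code points and negative
    entries are operator primitives:

        // Sketch of the compiled form: a flat vector of 32-bit ints.
        // Non-negative entries are code points; negative entries are primitives.
        public class CompiledRegexpSketch {
            // The four operator primitives plus a range primitive.
            static final int OP_UNION        = -1;
            static final int OP_DISJUNCTION  = -2;
            static final int OP_INTERSECTION = -3;
            static final int OP_NEGATION     = -4;
            static final int OP_RANGE        = -5;  // followed by two code points: low, high

            // The class [a-cx] compiled by hand into this encoding, for illustration.
            static final int[] PROGRAM = {
                OP_UNION,
                OP_RANGE, 'a', 'c',
                'x',
            };

            // Test whether a single code point is matched by PROGRAM above.
            static boolean matches(int cp) {
                for (int i = 1; i < PROGRAM.length; i++) {
                    if (PROGRAM[i] == OP_RANGE) {
                        if (cp >= PROGRAM[i + 1] && cp <= PROGRAM[i + 2]) return true;
                        i += 2;
                    } else if (PROGRAM[i] >= 0 && PROGRAM[i] == cp) {
                        return true;
                    }
                }
                return false;
            }

            public static void main(String[] args) {
                System.out.println(matches('b'));  // true
                System.out.println(matches('x'));  // true
                System.out.println(matches('z'));  // false
            }
        }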

    The compiled vector has a size in O(n), where n is the length of the
    source regexp, with a constant factor of at worst 4. With this very
    reduced set of primitives, there are no complex switches and tests for
    transitions. The whole regexp parser is fully driven by tables. I am also
    currently adapting the parser so that it will compile several regexp
    syntaxes (this will also be table-driven, so that the number of supported
    syntaxes is easily extensible). It supports compilation flags too (for
    compatibility with ed/sed/vi or WinWord flags, or Emacs flags, or PCRE and
    PHP flags...). I also want to extend the parser for the flags string so
    that it supports the table-driven syntaxes as well.
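
    By "table-driven" I mean something along the lines of the following
    simplified sketch (an invented example, not my actual tables): each
    regexp dialect is described by a table of metacharacters, so adding a
    dialect means adding a table rather than adding code to the tokenizer.

        import java.util.HashMap;
        import java.util.Map;

        public class SyntaxTables {
            enum Tok { LITERAL, ALTERNATION, GROUP_OPEN, GROUP_CLOSE, CLASS_OPEN, CLASS_CLOSE }

            // PCRE-like dialect: ( ) | are metacharacters.
            static Map<Character, Tok> pcreLike() {
                Map<Character, Tok> t = new HashMap<>();
                t.put('|', Tok.ALTERNATION);
                t.put('(', Tok.GROUP_OPEN);
                t.put(')', Tok.GROUP_CLOSE);
                t.put('[', Tok.CLASS_OPEN);
                t.put(']', Tok.CLASS_CLOSE);
                return t;
            }

            // Basic (ed/sed-style) dialect: plain ( ) are literals.
            static Map<Character, Tok> basicLike() {
                Map<Character, Tok> t = new HashMap<>();
                t.put('[', Tok.CLASS_OPEN);
                t.put(']', Tok.CLASS_CLOSE);
                return t;
            }

            // The tokenizer consults the selected table instead of a hard-coded switch.
            static Tok classify(Map<Character, Tok> table, char c) {
                return table.getOrDefault(c, Tok.LITERAL);
            }

            public static void main(String[] args) {
                System.out.println(classify(pcreLike(), '('));   // GROUP_OPEN
                System.out.println(classify(basicLike(), '('));  // LITERAL
            }
        }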

    Note that it supports "restartable" matches, including backward searches
    (something really difficult with regexps), and also matching under
    user-driven conditions that are not stored in the regexp itself but
    evaluated through callbacks, such as computing the number of matches
    according to other fields already matched. This uses a special parameter
    value for one of the four operators, namely the disjunction operator
    primitive; it is exactly this feature that will allow me to evaluate
    Unicode character properties and reply whether a transition is or is not
    a match. This operator keeps both branches of a disjunction as valid
    states until the matcher reaches a final state in the transition graph.
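
    The callback idea, reduced to its simplest possible form (a hedged sketch
    with invented names, not the actual hook in my matcher): the decision for
    a transition is delegated to a predicate supplied from outside the
    compiled regexp, which can test a Unicode character property or a
    condition depending on fields already matched.

        import java.util.function.IntPredicate;

        public class CallbackTransitionSketch {
            // A transition whose acceptance is decided by a callback receiving
            // the code point at the current position.
            static boolean step(String text, int pos, IntPredicate condition) {
                return pos < text.length() && condition.test(text.codePointAt(pos));
            }

            public static void main(String[] args) {
                // A character-property test supplied from outside the compiled regexp:
                IntPredicate isLetter = Character::isLetter;
                System.out.println(step("été", 0, isLetter));   // true
                System.out.println(step("42", 0, isLetter));    // false

                // A user-driven condition, e.g. depending on a previously matched field
                // (illustrative only):
                String previousField = "abc";
                IntPredicate dependsOnPrevious = cp -> previousField.length() == 3;
                System.out.println(step("xyz", 0, dependsOnPrevious));  // true
            }
        }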

    My code has no real equivalent among other implementations. I have
    generalized all the concepts so that it can more easily optimize the
    speed of transitions by rearranging the various character classes, and so
    that it supports multiple locales. For now it cannot compile complete
    collation tailoring tables provided externally, but it supports the
    4-level DUCET and the single-level binary collation order. Maybe I'll
    integrate support for the syntax of external collation tailoring rules,
    or I'll use the ICU implementation (this is not decided; there may be
    performance or design issues in handling the many internal state
    variables in the matcher without having to make very complex decisions).
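
    To illustrate the difference between those two orders (using the stock
    JDK Collator here, not my engine): single-level binary order compares raw
    code points, while a multilevel UCA-derived order compares weights.

        import java.text.Collator;
        import java.util.Locale;

        public class OrderComparison {
            public static void main(String[] args) {
                String a = "a", b = "B";
                // Binary order of code points: 'B' (U+0042) sorts before 'a' (U+0061).
                System.out.println(a.compareTo(b) > 0);      // true

                // Collation order: primary weights ignore case, so "a" sorts before "B".
                Collator coll = Collator.getInstance(Locale.ROOT);
                coll.setStrength(Collator.PRIMARY);
                System.out.println(coll.compare(a, b) < 0);  // true
            }
        }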

    > > For what you are defining, you are not creating the necessary support
    > > for correct support of locales, i.e. you are restricting yourself only
    > > to the C locale, whose collation is defined strictly in binary order of
    > > code points and nothing else (you are working in the "C" locale only).
    >
    > What I'm saying is that I think ranges of characters in character
    > classes denoted by '[' and ']' should be limited to binary order.

    Why "should"? This will be a good option only for the "C" locale, if you
    want to ignore all searches taking into account the linguistic semantics and
    script properties. My intent is to be able to perform searches including in
    complex contexts, like in CSV tables where there are columns containing some
    data containing numeric values, date values, or text values used as
    selectors, and combining them to look for matching rows, or positions in
    tables that match some regexp plus some specific locales. This requires of
    course extensions to the syntax, but the model for doing that with a regexp
    is ready, and I've already implemented each examplar feature (need to be
    completed for similar features).
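
    The kind of combined condition I have in mind looks roughly like the
    following sketch (the column layout, the selector regexp and the numeric
    threshold are all invented here; my extended syntax would express this
    inside the regexp itself rather than in surrounding code):

        import java.util.regex.Pattern;

        public class CsvRowFilter {
            public static void main(String[] args) {
                String[] rows = { "alpha;12;2007-10-01", "beta;250;2007-09-30", "gamma;7;2007-10-02" };
                Pattern namePattern = Pattern.compile("^(a|g).*");  // text selector column
                for (String row : rows) {
                    String[] cols = row.split(";");
                    boolean nameOk  = namePattern.matcher(cols[0]).matches();
                    boolean valueOk = Integer.parseInt(cols[1]) >= 10;  // numeric condition
                    if (nameOk && valueOk) {
                        System.out.println(row);  // prints only the "alpha" row
                    }
                }
            }
        }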

    > If you want to define a character class that works according to a
    > locale, I'm suggesting that you want different syntax for that to
    > avoid confusion and unexpected behavior. My previous example of
    > Hawaiian [a-z] being equivalent to [aeiouhklmnpw] illustrates the
    > problem.

    Well, my regexp engine will support several syntaxes, including the basic
    one that you will document in the UTS... What I am suggesting is that you
    limit the use of the terms "SHOULD" or "MUST", so that the UTS does not
    forbid locale-sensitive searches using the same operators.

    > Obviously I don't have this restriction; this whole thread of
    > discussion started because I found a problem with the way the
    > UTS suggested an implementation of a negated character class
    > containing collation elements would work.

    And I see that we have different views on how to handle this problem in a
    consistent way. I prefer the approach that gives meaning to the most
    generic cases instead of focusing on a specific workaround.

    > > * You won't recognize any Unicode canonical equivalence in regexps. (But
    > > then why are you recognizing them in scanned texts? This is
    > > inconsistent.)
    >
    > You are assuming details about my implementation that aren't true.
    > Right now both the regular expression and scanned text are internally
    > converted to NFD to handle canonical equivalents. The exception is
    > inside character classes. Again I'll use Hangul for the example:

    OK. But I'm not enforcing this on regexps. I just want to make sure that
    all canonically equivalent strings representing regexps (in their own
    syntax) return exactly the same set of matches from any texts that are
    canonically equivalent to each other.
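
    In other words, something like the following should hold (a minimal
    sketch using the JDK Normalizer and Pattern classes, not my engine; here
    the property is obtained simply by normalizing both the pattern and the
    text to NFD before matching):

        import java.text.Normalizer;
        import java.util.regex.Pattern;

        public class EquivalentRegexps {
            static boolean find(String regexp, String text) {
                String p = Normalizer.normalize(regexp, Normalizer.Form.NFD);
                String t = Normalizer.normalize(text, Normalizer.Form.NFD);
                return Pattern.compile(p).matcher(t).find();
            }

            public static void main(String[] args) {
                String composed   = "caf\u00E9";    // é as U+00E9
                String decomposed = "cafe\u0301";   // e + U+0301 COMBINING ACUTE ACCENT
                String text = "un caf\u00E9 noir";
                // Both canonically equivalent regexps find the same match:
                System.out.println(find(composed, text));    // true
                System.out.println(find(decomposed, text));  // true
            }
        }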

    > Suppose you have the character class [\u1100-\u1110\u1161-\u1168],
    > but are able to enter the actual code points instead of using \u.
    > It represents a range of L code points, together with some V code
    > points. You are saying that the adjacent U+1110 and U+1161 should
    > combine together into a precomposed LV syllable, but that completely
    > messes up the meaning of the character class (making it nonsensical).
    >
    > > * You won't be able to recognize case-mappings consistently (for case
    > > insensitive searches), because collation elements will all be distinct
    > > with only primary differences, and no further levels.
    >
    > For case-insensitivity, I compile the regular expression differently
    > and also perform NFD->CaseFold->NFD on the text. I still don't
    > understand why you need to think in terms of collation....

    That's because case folding is not enough for locale-sensitive searches.
    We also need full case mapping, plus special case mappings including
    contractions or expansions. These can be handled in a mixed collation
    table, at the case level (because I also need the multilevel capabilities
    of the UCA).
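
    A minimal sketch of why the multiple levels matter, assuming ICU4J is
    available (this only uses a stock collator, not my mixed table): at
    primary strength the German sharp s is expected to compare equal to "ss",
    while the finer levels still keep them distinct for ordering purposes.

        import com.ibm.icu.text.Collator;
        import java.util.Locale;

        public class CaseLevelSketch {
            public static void main(String[] args) {
                Collator german = Collator.getInstance(Locale.GERMAN);
                german.setStrength(Collator.PRIMARY);
                System.out.println(german.compare("stra\u00DFe", "strasse") == 0);  // expected: true
                german.setStrength(Collator.TERTIARY);
                System.out.println(german.compare("stra\u00DFe", "strasse") == 0);  // expected: false
            }
        }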

    > > * Even if you restrict to only the set of primary differences, the only
    > > case-mappings you will be able to recognize are the simple one-to-one
    > > case mappings defined in the main UCD file, excluding special mappings
    > > (like the consistent distinctions of dotted and undotted Turkic "I"... or
    > > finding matches containing "SS" when a German sharp s is specified in the
    > > search string... or allowing matches for final variants of Greek letters)
    >
    > I don't see why these cases wouldn't work if they are supported by
    > CaseFolding.txt. I have tried matching SHARP S with "ss" and it
    > works.

    It breaks the ordering. I want to be able, for example, to perform
    searches like "give me all matches where this field matched by a
    subregexp has a value in a range specified by two strings", not only
    searches using simple ranges in classes. The collation will also support
    matches by numeric value (advanced collation). This is another reason why
    I don't make any distinction, at matching time, between classes and
    unions (they support exactly the same features, including aggregates; a
    class is just another convenient abbreviated syntax for the same thing):
    I've generalized the concept by redefining classes to be unsorted sets of
    matches for any regexp, instead of just unsorted sets of matches for only
    one character. A whole regexp can now be viewed as a class, given that
    all regexps are unions of matches using some exclusive or non-exclusive
    conditions.
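
    For the "range specified by two strings" condition, the test I have in
    mind is essentially the following (a hedged sketch assuming ICU4J; the
    helper name and the sample bounds are invented):

        import com.ibm.icu.text.Collator;
        import com.ibm.icu.text.RuleBasedCollator;
        import java.util.Locale;

        public class CollationRangeSketch {
            // True if value falls between low and high under the given collation order.
            static boolean inRange(Collator coll, String value, String low, String high) {
                return coll.compare(low, value) <= 0 && coll.compare(value, high) <= 0;
            }

            public static void main(String[] args) {
                RuleBasedCollator coll = (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
                System.out.println(inRange(coll, "cherry", "apple", "date"));  // true

                // Numeric collation: "9" sorts before "10", unlike plain string order.
                coll.setNumericCollation(true);
                System.out.println(inRange(coll, "9", "1", "10"));             // true
            }
        }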

    > This is tedious re-explaining things. My regex matcher is a
    > conforming process; it handles input text appropriately regardless
    > of normalization. Who ever said that the regular expression itself
    > needs to be invariant w.r.t. normalization? I have shown an example
    > of a character class that changes meaning if you convert it to NFC
    > (see above).

    And that's a caveat that I want to be able to avoid: I want full
    conformance for the regexp process, not just for the searched text inputs
    but also for the regexp inputs.

    > > You could also disable finding the canonical equivalences by using
    > > another flag, but then you must do it consistently, by disabling it BOTH
    > > in the input texts AND in the input regexp, but NOT only in texts like
    > > what you have done and you propose.
    >
    > I didn't propose that. (If I did, it was a mis-statement.)

    Maybe I misunderstood your statement, then. You have just said that
    converting a regexp to NFC will change the set of matches, so this is not
    conforming...

    > There is a minimum amount of complexity you need to deal with, and
    > I chose to use normalization instead of trying to figure out the
    > complete list of possible canonical equivalents (which if you have
    > n combining characters can explode as n!).

    I know that. My implementation avoids normalization steps in most cases,
    except when explicitly requested by using transition rules that depend on
    collation elements. This gives much more flexibility.

    > > It's not up to regexps to make normalizations, but it's up to regexps to
    > be
    > > able to recognize classes of canonically equivalent texts and find
    > identical
    > > sets of matches in this case if they want to be Unicode compliant
    > processes.
    >
    > I'm worn out....

    Well, it's true that the subject is quite complex to explain in clear
    terms. If you look at the documentation of PCRE regexps, it is most often
    less clear than experimenting with the various cases yourself to
    understand what is meant.

    Notably so when looking at restartable conditions, or at the rules for
    matching the leftmost-longest sub-regexps and controlling how the fields
    are matched and paired in regexps like /(x+)(x*)/: how many "x" will you
    match in the first field, and how many in the second, when you use the
    longest-matching rule?

    Disambiguating such things requires some tricky syntax, or tricky global
    flags submitted to the regexp compiler. Such complications usually do not
    occur if you just want to find a list of full regexp matches, but they do
    occur when you want to support substitution with subfields within the
    match.
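
    For reference, this is the answer a conventional greedy (PCRE-like)
    backtracking engine gives for that example: the first group takes
    everything it can and the second group is left empty.

        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class GreedyFields {
            public static void main(String[] args) {
                Matcher m = Pattern.compile("(x+)(x*)").matcher("xxxx");
                if (m.matches()) {
                    System.out.println("group 1 = \"" + m.group(1) + "\"");  // "xxxx"
                    System.out.println("group 2 = \"" + m.group(2) + "\"");  // ""
                }
            }
        }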

    There are also tricky cases in regexps like /(a*b*c*)(\1)/ where you want
    the two delimited fields to be equal in a match. This becomes even
    trickier when you want to look for two fields that have related but
    distinct values matching some other mutual condition (for this you need
    restartable matchers and backward capabilities, and this changes the way
    you represent the graph of transitions).
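
    With a stock backtracking engine, the equal-fields case already works
    through the backreference, as this small illustration shows; it is the
    "related but distinct values" case that has no standard syntax:

        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class BackreferenceFields {
            public static void main(String[] args) {
                Pattern p = Pattern.compile("(a*b*c*)(\\1)");
                Matcher m = p.matcher("abcabc");
                System.out.println(m.matches());                       // true
                System.out.println(m.group(1) + " / " + m.group(2));   // abc / abc
                System.out.println(p.matcher("abcab").matches());      // false: fields unequal
            }
        }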

    (Note: because of the complexity of the regexps my model supports, I have
    included the possibility of adding ignorable comments and layout spaces
    to make some parts readable, but most of the complexity will be hidden by
    the fact that I support separate definitions for sub-regexps, which are
    "imported" or included in the final regexp, as if it were a program. This
    is enabled either by a global compilation flag or by a special grouping
    operator anywhere in the regexp itself: these comment parts are discarded
    when the regexps are parsed, and don't increase the O(n) size of the
    basic vector representing the compiled regexp as a microprogram.)
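
    The closest existing analogue to those ignorable comments and layout
    spaces is the free-spacing mode of PCRE-style engines (the /x flag,
    exposed as Pattern.COMMENTS in Java), shown here only for comparison:
    whitespace and #-comments are discarded at parse time and do not change
    what the regexp matches.

        import java.util.regex.Pattern;

        public class ReadableRegexp {
            public static void main(String[] args) {
                String readable =
                    "(\\d{4})   # year  \n" +
                    "-(\\d{2})  # month \n" +
                    "-(\\d{2})  # day   \n";
                Pattern p = Pattern.compile(readable, Pattern.COMMENTS);
                System.out.println(p.matcher("2007-10-01").matches());  // true
            }
        }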



    This archive was generated by hypermail 2.1.5 : Mon Oct 01 2007 - 21:19:05 CST