Re: New Public Review Issue: Proposed Update UTS #18

From: Mike (mike-list@pobox.com)
Date: Mon Oct 01 2007 - 18:43:00 CST

    >> All that matters is
    >> that you find the longest match. [a-z\q{ch}] will match "ch"
    >> in "chinchilla" rather than just "c".
    >
    > And what can you do with the negated class? How do you define it
    > consistently?

    A negated class matches wherever the non-negated class doesn't
    match and vice-versa. Haven't I said this numerous times?
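
    To make the longest-match rule and its negation concrete, here is a
    minimal Python sketch (not my actual implementation): a class is just
    a set of strings, single code points plus any \q{...} entries,
    matching tries the longest entry first, and the negated class
    succeeds exactly where the positive one fails.

    def match_class(entries, text, pos, negated=False):
        """Return the length of the match at pos, or None if there is none."""
        for entry in sorted(entries, key=len, reverse=True):  # longest entries first
            if text.startswith(entry, pos):
                return None if negated else len(entry)
        # Nothing in the class matched; here a negated class is modeled as
        # consuming a single code point (one possible choice, not the only one).
        if negated and pos < len(text):
            return 1
        return None

    entries = set("abcdefghijklmnopqrstuvwxyz") | {"ch"}        # [a-z\q{ch}]
    print(match_class(entries, "chinchilla", 0))                # 2 -> "ch", not just "c"
    print(match_class(entries, "chinchilla", 0, negated=True))  # None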

    > With what you are defining, you are not creating the necessary support
    > for correct handling of locales, i.e. you are restricting yourself only
    > to the "C" locale, whose collation is defined strictly in binary order
    > of code points and nothing else.

    What I'm saying is that I think ranges of characters in character
    classes denoted by '[' and ']' should be limited to binary order.
    If you want to define a character class that works according to a
    locale, I'm suggesting that you want different syntax for that to
    avoid confusion and unexpected behavior. My previous example of
    Hawaiian [a-z] being equivalent to [aeiouhklmnpw] illustrates the
    problem.
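
    A rough Python illustration of the distinction (the Hawaiian letter
    set is simply the one from my earlier example, not taken from any
    locale data):

    def in_binary_range(ch, lo, hi):
        """Range membership by code point only -- no locale tailoring."""
        return ord(lo) <= ord(ch) <= ord(hi)

    print(in_binary_range("b", "a", "z"))    # True: U+0062 lies between U+0061 and U+007A
    hawaiian_letters = set("aeiouhklmnpw")   # what a locale-tailored class might mean
    print("b" in hawaiian_letters)           # False: not a letter of the Hawaiian alphabet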

    > So in this restricted "C" locale:
    > * The classes of collation elements will only contain single code points
    > (and effectively, in that locale, there's no possible extension of the set
    > of collation elements, which is exactly the range \u0000 to \u10FFFF in that
    > order, all of them with only primary differences, so they are equal to their
    > collation keys)

    Obviously I don't have this restriction; this whole thread of
    discussion started because I found a problem with the way the
    UTS suggested an implementation of a negated character class
    containing collation elements would work.

    > * You won't recognize any Unicode canonical equivalences in regexps. (But
    > then why are you recognizing them in scanned texts? This is inconsistent.)

    You are assuming details about my implementation that aren't true.
    Right now both the regular expression and scanned text are internally
    converted to NFD to handle canonical equivalents. The exception is
    inside character classes. Again I'll use Hangul for the example:

    Suppose you have the character class [\u1100-\u1110\u1161-\u1168],
    but you enter the actual code points instead of using \u escapes.
    It represents a range of L code points, together with some V code
    points. You are saying that the adjacent U+1110 and U+1161 should
    combine into a precomposed LV syllable, but that completely messes
    up the meaning of the character class (making it nonsensical).
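
    You can see the damage directly if you normalize the literal class
    text; a small Python check with unicodedata:

    import unicodedata

    class_text = "\u1100-\u1110\u1161-\u1168"           # literal text of the class
    composed = unicodedata.normalize("NFC", class_text)
    print([hex(ord(c)) for c in composed])
    # ['0x1100', '0x2d', '0xd0c0', '0x2d', '0x1168'] -- the adjacent endpoints
    # U+1110 and U+1161 have fused into the LV syllable U+D0C0, so the class
    # no longer denotes two ranges of L and V code points.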

    > * You won't be able to recognize case-mappings consistently (for case
    > insensitive searches), because collation elements will all be distinct with
    > only primary differences, and no further levels.

    For case-insensitivity, I compile the regular expression differently
    and also perform NFD->CaseFold->NFD on the text. I still don't
    understand why you need to think in terms of collation....
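
    For illustration, a minimal Python sketch of that text-side pipeline,
    using str.casefold() as a stand-in for whatever full case folding an
    implementation applies:

    import unicodedata

    def fold_for_matching(text):
        nfd = unicodedata.normalize("NFD", text)
        folded = nfd.casefold()                    # full case folding
        return unicodedata.normalize("NFD", folded)

    print(fold_for_matching("Straße") == fold_for_matching("STRASSE"))  # True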

    > * Even if you restrict to only the set of primary differences, the only
    > case-mappings you will be able to recognize are the simple one-to-one case
    > mappings defined in the main UCD file, excluding special mappings (like the
    > consistent distinctions of dotted and undotted Turkic "I"... or finding
    > matches containing "SS" when a German sharp s is specified in the search
    > string... or allowing matches for final variants of Greek letters)

    I don't see why these cases wouldn't work if they are supported by
    CaseFolding.txt. I have tried matching SHARP S with "ss" and it
    works.
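
    For what it's worth, the full foldings cover exactly these cases;
    checked here with Python's casefold(), which applies the default
    (non-Turkic) full mappings -- a locale-sensitive implementation would
    use the Turkic-specific (T) mappings from CaseFolding.txt instead:

    print("ß".casefold())                     # 'ss'  -- SHARP S folds to "ss"
    print("ς".casefold() == "σ".casefold())   # True  -- final sigma matches sigma
    print("İ".casefold())                     # 'i' + U+0307 (default folding of dotted capital I)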

    > * and many other restrictions.

    ....

    > This may finally be consistent for the "C" locale (with binary order), but
    > you have not solved any of the linguistic needs, and even worse, your regexp
    > matcher cannot be a Unicode conforming process (because it will return
    > distinct sets of matches depending on the encoding or normalization or
    > non-normalization of input text and input regexps.)

    Re-explaining all of this is getting tedious. My regex matcher is a
    conforming process; it handles input text appropriately regardless
    of normalization. Who ever said that the regular expression itself
    needs to be invariant w.r.t. normalization? I have shown an example
    of a character class that changes meaning if you convert it to NFC
    (see above).

    > What you have done for now is a partial mix, which is intrinsically
    > inconsistent, as soon as you have started converting input texts to NFD
    > (i.e. applying a normalization to them without applying the same rule to the
    > regexps...)

    See above.

    > I'm not advocating that Unicode regexps should support all locales. They
    > should support at least the legacy "C" locale (with binary order), and a
    > basic Unicode-based "U" locale (that is *reasonably* neutral to many
    > locales) based on the full set of Unicode properties, and the DUCET
    > collation elements (you have partly implemented it by recognizing many
    > Unicode properties, but not all those needed for consistency).

    What properties didn't I implement?

    > You could also disable finding the canonical equivalences by using another
    > flag, but then you must do it consistently, by disabling it BOTH in the
    > input texts AND in the input regexp, but NOT only in the texts, as you
    > have done and propose.

    I didn't propose that. (If I did, it was a misstatement.)

    > However I don't think that normalization of input texts (to NFD in your
    > implementation) is the best way to handle the found matches, as
    > normalization will not only change the input text before scanning, but it
    > will also reorder parts of the input text, which creates severe
    > difficulties for using the discovered matches, for example to apply
    > replacements or other Unicode transforms:

    There is a minimum amount of complexity you need to deal with, and
    I chose to use normalization instead of trying to figure out the
    complete list of possible canonical equivalents (which, with n
    combining characters, can explode as n!).
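
    A small Python sketch of the arithmetic: three combining marks of
    distinct combining classes already give 3! = 6 source orderings, all
    of which normalization collapses to a single form:

    import itertools
    import unicodedata

    base = "o"
    marks = ["\u031B", "\u0323", "\u0307"]   # horn (ccc 216), dot below (220), dot above (230)
    orderings = ["".join(p) for p in itertools.permutations(marks)]
    print(len(orderings))                                                     # 6 (= 3!)
    print(len({unicodedata.normalize("NFD", base + o) for o in orderings}))   # 1 after NFD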

    > My opinion is that input texts should not be altered, and normalization
    > should only be performed on output, and only if it is explicitly part of the
    > transforms applied on matches, and even if normalization is not performed on
    > the whole output text (if needed a user can perform it separately, or by
    > specifying an optional flag that will be off by default).

    The normalization and case folding are done internally; the input
    text is not altered.

    > It's not up to regexps to make normalizations, but it's up to regexps to be
    > able to recognize classes of canonically equivalent texts and find identical
    > sets of matches in this case if they want to be Unicode compliant processes.

    I'm worn out....

    Mike


