Re: New Public Review Issue: Proposed Update UTS #18

From: Mike (mike-list@pobox.com)
Date: Sun Sep 23 2007 - 12:03:51 CDT


    Philippe Verdy wrote:
    > The first intuitive approach to what [^set] means is that it should match
    > everywhere [set] does not match, and [set] should match everywhere [^set]
    > does not match, i.e. they should be perfect complements of each other.

    Sorry, Philippe, I just responded to your later message, and didn't
    realize you had said this. This is exactly how I implemented [^set]
    in my code, and as you say, it is intuitive. We should strive to
    have intuitive behavior; the opposite of intuitive is 'obtuse' or
    'unnatural'.

    > But already, they aren't perfect complements because both will exclude
    > line terminators in multiline mode.

    This is not true; [^abc] should match a line terminator, unless
    you do something like [[^abc] & \p{L}].
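
A minimal sketch in Python of the two behaviours described above. Python's `re` module has neither set intersection nor `\p{L}`, so the lookahead `(?=[^\W\d_])` is used here as an approximate stand-in for "is a letter"; that substitution is an assumption of this example, not the syntax from the message.

```python
import re

text = "a\nxb1"

# A plain negated set matches every character outside the set,
# including the line terminator.
negated = re.findall(r"[^abc]", text)

# Approximation of [[^abc] & \p{L}]: the lookahead admits only
# letter-like characters, so the newline and the digit drop out.
letters_only = re.findall(r"(?=[^\W\d_])[^abc]", text)

print(negated)
print(letters_only)
```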

    > Now if you accept digraphs or grapheme clusters in [set], you should accept
    > them also in [^set] and "." should also include all digraphs and grapheme
    > clusters, but this means that "." will need to include all possible texts,
    > because digraphs are not limited in size. As this seems unreasonable
    > (because it would make counting the number of matches with "." impossible to
    > perform), it seems reasonable to exclude the possibility of using digraphs
    > in [set].

    I played around with the ability to add digraphs to "." and came up
    with two methods. The first would be to specifically list them using
    syntax such as:

        (?.ch.ll.rr) # . now matches "ch" "ll" and "rr" as single entities

    Or you could specify a locale:

        \l{es} # adds digraphs from Spanish locale to .

    I don't yet support locales in my code, but I have reserved \l for
    that purpose.
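
The first method (listing the digraphs explicitly) can be approximated with ordinary alternation. The sketch below is in Python and is not the implementation discussed in this message; the helper name `dot_units` is hypothetical, and the digraph list is passed as an argument rather than through the `(?.ch.ll.rr)` syntax.

```python
import re

def dot_units(text, digraphs=()):
    """Split text into the units a digraph-aware "." would match."""
    # Longer alternatives go first so a digraph wins over its first letter.
    alternatives = sorted((re.escape(d) for d in digraphs), key=len, reverse=True)
    # Fall back to any single character (DOTALL lets "." match newlines too).
    alternatives.append(".")
    pattern = re.compile("|".join(alternatives), re.DOTALL)
    return pattern.findall(text)

print(dot_units("chorro", ("ch", "ll", "rr")))
```

With the Spanish digraphs listed, "chorro" is split into four units instead of six.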

    > So the idea of implementing regexps by making them find matches in the NFD
    > transformation of the input text is good as it creates a conforming process.
    > The bad thing is that E WITH ACUTE is no more a single character and is then
    > absent from the "." universe and can't be part of [set] and [^set].

    In my code, both the pattern and input text are converted to NFD, and
    "." will match E WITH ACUTE as a single character (two code points).
    This is done by keeping track of where the default grapheme cluster
    boundaries are.
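
A rough Python sketch of the idea: normalize to NFD, then group each base character with the combining marks that follow it. This is a deliberate simplification, assuming that "nonzero canonical combining class" is a good enough test for "extends the cluster"; the real default grapheme cluster rules (UAX #29) cover more cases, and the helper name `clusters` is hypothetical.

```python
import unicodedata

def clusters(text):
    """Split the NFD form of text into base-plus-combining-marks units."""
    nfd = unicodedata.normalize("NFD", text)
    units = []
    for ch in nfd:
        # A nonzero combining class attaches the mark to the previous unit.
        if units and unicodedata.combining(ch) != 0:
            units[-1] += ch
        else:
            units.append(ch)
    return units

units = clusters("caf\u00e9")
print(len(units), [len(u) for u in units])
```

Here E WITH ACUTE becomes one unit made of two code points, which is what a cluster-aware "." would consume.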

    > Another possibility is to include in the "." universe the NFD transformation
    > of every code point of the UCS, in such a way that the sequence <C,
    > COMBINING ACUTE ACCENT> is still counted as 1 unit, but <C, COMBINING ACUTE
    > ACCENT, COMBINING CEDILLA> will be counted as 2 "." units (but then
    > remember that "." is sensitive to Unicode versions).

    Using grapheme cluster boundaries is an easier way to do this, and
    allows you to match any combining character sequence, whether there
    is a code point assigned to it or not. You're correct that this
    depends on the Unicode version since the grapheme cluster boundaries
    depend on it.

    > But then, should a search for <C WITH COMBINING ACUTE ACCENT>, equivalent to
    > a search for <C, COMBINING ACUTE ACCENT> in NFD form, will easily match the
    > text <C WITH COMBINING ACUTE ACCENT, COMBINING CEDILLA>, but should it match
    > the text encoded as <C WITH CEDILLA, COMBINING ACUTE ACCENT>, which is
    > canonically equivalent? If the intent is to produce a Unicode conforming
    > process, it should be yes. So matches will be for two non-contiguous
    > substrings in the scanned text, excluding the CEDILLA part!

    I am experimenting with requiring a match to start and end on grapheme
    cluster boundaries, thus a search for C WITH ACUTE will not match
    C WITH ACUTE + CEDILLA or C WITH CEDILLA + ACUTE.
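
The boundary requirement can be illustrated with a small Python sketch. It is not the code described in this message: it searches the NFD text for the NFD pattern as a plain substring and keeps only hits that start and end between clusters, again using the simplified "nonzero combining class" test in place of the full UAX #29 rules. The names `is_boundary` and `find_on_boundaries` are hypothetical.

```python
import unicodedata

def is_boundary(nfd, i):
    """True if index i falls between two (simplified) clusters."""
    return i == 0 or i == len(nfd) or unicodedata.combining(nfd[i]) == 0

def find_on_boundaries(needle, haystack):
    """Return start indices, in the NFD haystack, of whole-cluster matches."""
    n = unicodedata.normalize("NFD", needle)
    h = unicodedata.normalize("NFD", haystack)
    hits = []
    start = h.find(n)
    while start != -1:
        # Reject hits that would split a base character from its marks.
        if is_boundary(h, start) and is_boundary(h, start + len(n)):
            hits.append(start)
        start = h.find(n, start + 1)
    return hits

C_ACUTE = "\u0106"    # LATIN CAPITAL LETTER C WITH ACUTE
C_CEDILLA = "\u00c7"  # LATIN CAPITAL LETTER C WITH CEDILLA

print(find_on_boundaries(C_ACUTE, C_ACUTE + "a"))
print(find_on_boundaries(C_ACUTE, C_CEDILLA + "\u0301"))
print(find_on_boundaries(C_CEDILLA, C_CEDILLA + "\u0301"))
```

In the last call the substring is found, but the hit ends in the middle of a cluster and is discarded.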

    I still have one problem to figure out, though: what happens when you
    add \m* to the pattern to allow such a match (\m* means 'plus any other
    marks'). If the NFD is <C, CEDILLA, ACUTE> and you try to match
    C + ACUTE + \m*, the intervening CEDILLA causes this not to match; I
    need to figure out a way to cause this to match....
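
The problem is easy to reproduce in Python. In the sketch below, `MARKS` is a stand-in for `\m` that covers only the Combining Diacritical Marks block (an assumption made to keep the example short). The second pattern is one hypothetical loosening, not a solution taken from this message: it lets other marks appear before the required one as well as after it.

```python
import re
import unicodedata

MARKS = "[\u0300-\u036f]"  # Combining Diacritical Marks block only

# NFD reorders the marks by combining class: <C, CEDILLA, ACUTE>.
text = unicodedata.normalize("NFD", "\u00c7\u0301")

# Literal reading of C + ACUTE + \m*: fails, CEDILLA sits before ACUTE.
strict = re.search("C\u0301" + MARKS + "*", text)

# Hypothetical loosening: skip marks before the required ACUTE too.
loose = re.search("C" + MARKS + "*?\u0301" + MARKS + "*", text)

print(strict is None, loose is not None)
```

The loosened pattern ignores combining classes entirely, so it would also skip an intervening mark of the same class as ACUTE; whether that is acceptable depends on the matching semantics wanted.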

    Mike


