Re: New Public Review Issue: Proposed Update UTS #18

From: Mark Davis (mark.davis@icu-project.org)
Date: Fri Sep 21 2007 - 11:32:47 CDT

    > allowing multiple values in a property definition such as \p{gc=L|M|N} or
    \p{nv>=10}.

    Allowing multiple values is a nice way to compact the regex. Similarly, in
    my implementation I actually allow a regex within the property value, so for
    example have \p{name=/.*MARK.*/} to pick up all the Unicode characters with
    "MARK" in their name. A bit squirrely, but very handy. We might mention some
    of these techniques as possibilities.
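    As a rough illustration of the two ideas above (not my actual
    implementation), here is a minimal Python sketch built on the standard
    unicodedata module. The function names chars_with_name and
    chars_with_category and the limit parameter are hypothetical, introduced
    only for this example.

```python
import re
import unicodedata

def chars_with_name(pattern, limit=0x110000):
    # Emulates \p{name=/pattern/}: the regex must match the whole name.
    name_re = re.compile(pattern)
    result = []
    for cp in range(limit):
        name = unicodedata.name(chr(cp), "")  # "" when the code point is unnamed
        if name and name_re.fullmatch(name):
            result.append(chr(cp))
    return result

def chars_with_category(prefixes, limit=0x110000):
    # Emulates \p{gc=L|M|N}: prefixes is a tuple such as ("L", "M", "N"),
    # compared against the start of the General_Category value.
    return [chr(cp) for cp in range(limit)
            if unicodedata.category(chr(cp)).startswith(prefixes)]

if __name__ == "__main__":
    # Restrict the scan to ASCII to keep the demonstration short.
    print(chars_with_name(r".*MARK.*", limit=0x80))
    print("".join(chars_with_category(("L", "M", "N"), limit=0x80)))
```

    Scanning every code point is slow; a real engine would precompute the
    set once. The results also depend on the Unicode version bundled with
    the Python build.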

    As for your other comments (copied below), the issue is what [^a-z
    ñ \q{ch} \q{ll} \q{rr}] would mean. Here is roughly our reasoning.

       - The meaning, without the ^, is a set of strings {"a", "b", ..., "z",
       "ñ", "ch", "ll", "rr"}.
       - The set inversion would be the set of all other strings. So that
       would include "0", "A", ... but also "New York", and "onomatopoeic", and so
       on. An infinite set.
       - So a match against /[^x...x]/ would be the equivalent of
       /(?![x...x]) .*/, and match, for example, this entire email.

    That would change the semantics of regex very substantially. Conversely, if
    we define [^x...x] as equivalent to [[\u0000-\U0010FFFF] - [x...x]], it is
    well-defined, and matches current regex usage for the cases where no
    grapheme clusters are involved.
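    To make the second definition concrete, here is a minimal Python sketch,
    assuming the bracket expression has already been parsed into a set of
    strings. The name in_negated_class is hypothetical. Because the
    complement is taken within the set of single code points, multi-character
    items such as "ch" cannot be removed from it and so have no effect.

```python
def in_negated_class(items, s):
    # items: the strings listed in the bracket expression,
    #        e.g. {"a", ..., "z", "ñ", "ch", "ll", "rr"}.
    # The negated class is [all code points] - items, so only a single
    # code point can ever be a member.
    if len(s) != 1:
        return False
    return s not in items

if __name__ == "__main__":
    items = set("abcdefghijklmnopqrstuvwxyz") | {"ñ", "ch", "ll", "rr"}
    for candidate in ["0", "A", "a", "ñ", "ch", "New York"]:
        print(repr(candidate), in_negated_class(items, candidate))
```

    Under this reading the negated class is finite and behaves the same as
    [^a-z ñ].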

    However, there may well be other useful alternatives that should be
    considered. So perhaps you can set out your suggestions in more detail. (For
    now, we can keep the discussion on this list; if it starts to get too boring
    for others, we can gather the interested parties and continue
    off-line.)

    from Mike:

        A typical implementation of the inverse of a set containing
        literal clusters simply removes those strings, thus
        [^a-z ñ \q{ch} \q{ll} \q{rr}] is equivalent to [^a-z ñ].

    I think this is bad implementation advice, and leads to strange
    behavior. In the example given, the behavior will be correct since
    all of the clusters begin with a letter also contained in the class.
    However, if you consider a character class containing only clusters,
    e.g. [^\q{ch} \q{ll} \q{rr}], simply removing the clusters will
    result in an empty character class that matches -anything-. This
    is incorrect behavior as it should not match the beginning of the
    word "chile" for instance.

    The way I implemented this was to create a "normal" character class
    containing all the listed characters and grapheme clusters, and
    then invert the result of the match operation. The classes above
    would match "chile" in the first position, and thus return a "no
    match" result.
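    The approach described above can be sketched in Python as follows. This
    is only an illustration of the idea, not the implementation referred to:
    the function names are hypothetical, the class is assumed to be given as
    a set of strings, and the sketch reports only whether the inverted class
    matches at a position, since the message does not say how much input a
    successful inverted match consumes.

```python
def positive_match_length(items, text, pos):
    # Length of the longest listed character or cluster found at text[pos:],
    # or 0 if none of them matches there.
    best = 0
    for item in items:
        if text.startswith(item, pos) and len(item) > best:
            best = len(item)
    return best

def inverted_class_matches(items, text, pos):
    # Run the "normal" class first, then invert its result.
    # At end of input there is nothing to match, so report no match.
    if pos >= len(text):
        return False
    return positive_match_length(items, text, pos) == 0

if __name__ == "__main__":
    clusters = {"ch", "ll", "rr"}
    print(inverted_class_matches(clusters, "chile", 0))  # "ch" is found here
    print(inverted_class_matches(clusters, "chile", 1))  # nothing listed starts at "h"
```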

    On 9/20/07, Mike <mike-list@pobox.com> wrote:
    >
    > > As regards Mike's new concerns with the language regarding multiline
    > > mode matching, I suggest that he post that to the feedback
    > > form, and it will be rolled up into the feedback document
    > > that will be considered by the UTC for this PRI.
    >
    > I have done that, and Rick has verified that the feedback
    > was received. I also included more of my implementation
    > details such as \m for combining marks, \i for ideographic
    > characters, and allowing multiple values in a property
    > definition such as \p{gc=L|M|N} or \p{nv>=10}.
    >
    > Mike
    >
    >

    -- 
    Mark
    

