Re: New Public Review Issue: Proposed Update UTS #18

From: Mike (
Date: Fri Sep 21 2007 - 12:55:20 CDT


    > > allowing multiple values in a property definition such as
    > \p{gc=L|M|N} or \p{nv>=10}.
    > Allowing multiple values is a nice way to compact the regex. Similarly,
    > in my implementation I actually allow a regex within the property value,
    > so for example have \p{name=/.*MARK.*/} to pick up all the Unicode
    > characters with "MARK" in their name. A bit squirrely, but very handy.
    > We might mention some of these techniques as possibilities.

    I allow a regular expression for the block, name, old name, and script
    properties: \p{Script=/latin|common|inherited/}, and the match is not
    anchored, so you could just say \p{name=/mark/} (and they're also case
    insensitive).
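
    To illustrate the unanchored name match described above, here is a
    minimal Python sketch. It is not my implementation; chars_with_name
    and its limit parameter are hypothetical names, and case-insensitive
    matching is assumed.

```python
import re
import sys
import unicodedata


def chars_with_name(pattern, limit=sys.maxunicode + 1):
    """Return the characters below `limit` whose Unicode name matches.

    Hypothetical helper: the match is unanchored (re.search) and,
    by assumption, case-insensitive.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    result = []
    for cp in range(limit):
        # unicodedata.name returns the default "" for unnamed code points.
        name = unicodedata.name(chr(cp), "")
        if name and rx.search(name):
            result.append(chr(cp))
    return result


# Roughly what \p{name=/mark/} would select, restricted to U+0000..U+00FF.
marks = chars_with_name("mark", limit=0x100)
```

    Scanning every code point on each call is slow; a real engine would
    presumably build the set once and cache it.
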

    Comparisons are allowed for Numeric_Value, Canonical_Combining_Class,
    and Age. The comparisons are : or = for equality, or != < <= > >=.
       \p{Age>=5.0} (Age is a special case where '=' means '<=')
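
    A rough Python sketch of how such comparisons could be evaluated,
    covering Numeric_Value and Canonical_Combining_Class only (Python's
    unicodedata module has no Age lookup, so Age is left out). The names
    OPS, PROPS, parse_property and matches are hypothetical, as are the
    short aliases nv and ccc for the ccc case.

```python
import operator
import unicodedata

# Two-character operators come first so ">=" is not mistaken for ">" or "=".
OPS = [
    ("!=", operator.ne), ("<=", operator.le), (">=", operator.ge),
    ("<", operator.lt), (">", operator.gt),
    ("=", operator.eq), (":", operator.eq),
]

# Hypothetical property table: alias -> function returning a float or None.
PROPS = {
    "nv": lambda ch: unicodedata.numeric(ch, None),
    "ccc": lambda ch: float(unicodedata.combining(ch)),
}


def parse_property(expr):
    """Split e.g. 'nv>=10' into (property name, comparison, float value)."""
    for symbol, func in OPS:
        name, sep, value = expr.partition(symbol)
        if sep:
            return name.strip().lower(), func, float(value)
    raise ValueError("no comparison operator in %r" % expr)


def matches(ch, expr):
    """True if character `ch` satisfies the property comparison `expr`."""
    name, func, value = parse_property(expr)
    actual = PROPS[name](ch)
    # Assumption: a character with no value for the property never matches.
    return actual is not None and func(actual, value)
```

    For example, matches("\u2169", "nv>=10") is true for ROMAN NUMERAL TEN,
    and matches("\u0301", "ccc>0") is true for COMBINING ACUTE ACCENT.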

    Other properties use : or = for equality or != for inequality:

    > As far as your other comments (copied below), the issue is as to what
    > [^a-z ñ \q{ch} \q{ll} \q{rr}] would mean. Here was roughly our reasoning.
    > * The meaning, without the ^, is a set of strings {"a", "b", ...,
    > "z", "ñ", "ch", "ll", "rr"}.
    > * The set inversion would be the set of all other strings. So that
    > would include "0", "A", ... but also "New York", and
    > "onomatopoeic", and so on. An infinite set.
    > * So a match against /[^x...x]/ would be the equivalent of
    > /(?![x...x]) .*/, and match, for example, this entire email.
    > That would change the semantics of regex very substantially. Conversely,
    > if we define [^x...x] as equivalent to [[\u0000-\u0010FFFF] - [x...x]]
    > it is well-defined, and matches current regex usage for the cases where
    > no grapheme clusters are involved.

    The way I approached it is a bit different. A Spanish user would
    consider "ch", "ll", and "rr" to be single characters, for example.
    If those characters had their own code points, then we wouldn't
    need to do anything special to represent [a-z ñ ch ll rr], and
    nobody would suggest removing any of the characters when creating
    the inverse of this character class.

    They don't have their own code points, though, so implementations
    need to be able to handle grapheme clusters in character classes.
    I think it's wrong to remove the grapheme clusters from a character
    class when negating it, because then those "characters" won't be
    represented properly.

    I disagree with your reasoning that "the set inversion would be the
    set of all other strings," and would change it to, "the set inversion
    would be the set of all other -characters-." So "0" and "A" would be
    in the inverse set, but "New York" would not.

    I would change your last point to:
          * So a match against /[^x...x]/ would be the equivalent of
            /(?![x...x])./ (I removed the *) and match the first
            character of this email.
    This is almost how my implementation works, and now that I see this,
    I may change it to be exactly this (where . matches a grapheme cluster).
    Right now my code will only match the next code point, which could be
    only a part of a grapheme cluster, such as the A in A + ACUTE, leaving
    a dangling mark. On the other hand, you may only want to match the
    next code point, so I'll need to think about this some more....
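
    As a sketch of the /(?![x...x])./ reading, with . matching a grapheme
    cluster, something like the following Python could work. It is an
    illustration, not my code: next_cluster, member_matches and
    match_negated are hypothetical names, a cluster is simplified to one
    code point plus any following combining marks (not the full UAX #29
    rules), and no normalization is done.

```python
import unicodedata


def is_mark(ch):
    """True for combining marks (general category Mn, Mc or Me)."""
    return unicodedata.category(ch).startswith("M")


def next_cluster(text, pos):
    """Return the end index of the simplified cluster starting at pos."""
    end = pos + 1
    while end < len(text) and is_mark(text[end]):
        end += 1
    return end


def member_matches(text, pos, member):
    """True if `member` occurs at pos and ends on a cluster boundary."""
    end = pos + len(member)
    if not text.startswith(member, pos):
        return False
    # "a" must not match the first half of "a" + COMBINING ACUTE ACCENT.
    return end >= len(text) or not is_mark(text[end])


def match_negated(text, pos, members):
    """Emulate /(?![x...x])./ at pos; return the match end, or None."""
    if pos >= len(text):
        return None
    # Negative lookahead: fail if any member of the class starts here.
    if any(member_matches(text, pos, m) for m in members):
        return None
    # Otherwise consume one whole cluster, so no mark is left dangling.
    return next_cluster(text, pos)


# The class [a-z ñ \q{ch} \q{ll} \q{rr}] as a set of strings.
spanish = set("abcdefghijklmnopqrstuvwxyz") | {"ñ", "ch", "ll", "rr"}
```

    With this, match_negated("New York", 0, spanish) consumes only "N",
    and match_negated("A\u0301b", 0, spanish) consumes A + ACUTE together.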

    One more feature I worked on was a way to specify that . would match
    ch, ll, or rr as a single character. I came up with (?.ch.ll.rr) as
    the possible syntax, but haven't implemented it yet.
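
    The intended behavior, though not the (?.ch.ll.rr) syntax itself, could
    be sketched in Python roughly as follows. The names dot_match and
    split_chars are hypothetical, matching is case-sensitive, and anything
    not listed falls back to a single code point.

```python
def dot_match(text, pos, digraphs=("ch", "ll", "rr")):
    """Return the end index of one '.' match at pos, or None at the end.

    A listed digraph is consumed as a single character; otherwise one
    code point is consumed.
    """
    if pos >= len(text):
        return None
    for d in digraphs:
        if text.startswith(d, pos):
            return pos + len(d)
    return pos + 1


def split_chars(text, digraphs=("ch", "ll", "rr")):
    """Split text into the units that '.' would match one at a time."""
    out = []
    pos = 0
    while pos < len(text):
        end = dot_match(text, pos, digraphs)
        out.append(text[pos:end])
        pos = end
    return out
```

    Here split_chars("churro") gives ["ch", "u", "rr", "o"].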


    This archive was generated by hypermail 2.1.5 : Fri Sep 21 2007 - 12:58:09 CDT