Re: Proposal for additional syntax (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Mark Davis (
Date: Tue Oct 02 2007 - 14:52:09 CST

    As far as I can tell, you are saying that there needs to be no syntactic
    difference between specifying an exact match and specifying with regex
    matching. I am never sure, however, because your messages are often so
    difficult for me to understand that I give up after the first paragraph.

    Anyway, assuming that this is what you are saying, I disagree, since there
    are two different operations.

    1. the set of characters whose names are (exact match to) "LATIN CAPITAL
    LETTER A". This is already provided for as \p{name=LATIN CAPITAL LETTER A},
    and is the same as [\u0041].
    2. the set of characters whose names contain "LATIN CAPITAL LETTER A" (so
    it will also match names that continue with " WITH DIAERESIS" and so on).
    This needs new syntax, because it has to differ from the syntax already
    used in #1.
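
To make the difference between the two operations concrete, here is a minimal sketch in Python. It does not use any regex property syntax at all; it simply walks the code space with the standard `unicodedata` module. The helper name `chars_with_name` is made up for this illustration, and the results depend on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata


def chars_with_name(predicate):
    """Return the set of characters whose Unicode name satisfies predicate.

    Hypothetical helper for illustration only; it scans every code point.
    """
    result = set()
    for cp in range(sys.maxunicode + 1):
        # unicodedata.name() returns the default "" for unnamed code points.
        name = unicodedata.name(chr(cp), "")
        if name and predicate(name):
            result.add(chr(cp))
    return result


TARGET = "LATIN CAPITAL LETTER A"

# Operation 1: the name is an exact match.
exact = chars_with_name(lambda name: name == TARGET)

# Operation 2: the name merely contains the target string.
containing = chars_with_name(lambda name: TARGET in name)
```

Under these assumptions, `exact` holds only U+0041, while `containing` also holds characters such as U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS). A plain substring test also picks up U+00C6 (LATIN CAPITAL LETTER AE), which is one reason a containment syntax may want word boundaries.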


    On 10/2/07, Philippe Verdy <> wrote:
    > Mark Davis wrote:
    > > Trying to parse your language, what I read you as saying is that a
    > > different equivalence operator could be used instead of the slashes,
    > > like propname~value instead of propname=/value/
    > If I also try to parse your language (that introduces the new concept of
    > "equivalence operator") I still don't see any difference you are seeing
    > between propname=value (which I correctly termed using "equals" or
    > "equality") and propname~value.
    > What do you really mean when you write "propname=/value/"? Is it a
    > containment relation (what you seem to call now "equivalence") or equality
    > relation?
    > If we accepted both your proposal 2 (multiple values, which is just a
    > particular kind of matching a regexp) and 3 (matching a regexp), then the
    > slashes in proposal 3 are superfluous around the regexp value (or become
    > optional and just complicate the syntax without any change in what will be
    > matched by the regexp).
    > So I don't see any semantic difference between "propname=value" and your
    > proposed "propname=/value/" as soon as regexp values are accepted. The
    > only important question is: do property values need to be matched
    > according to a regular expression, or must they be matched only by
    > equality?
    > In your example: \N{MARK} would match nothing because there's no Unicode
    > character named "MARK". If you want to match characters whose name
    > *contains* the word "MARK", then you just need to include a ".*" prefix
    > and suffix: "\N{.*MARK.*}".
    > Note that the Unicode character names have known constraints (according to
    > the stability rules for the assignment of unique names):
    > * they use only the letters [A-Z], the digits [0-9], and the space and
    >   hyphen.
    > * Letter case is not significant (so "\N{SPACE}" would match the same
    >   thing as "\N{space}")
    > * leading and trailing spaces or multiple spaces or hyphens are not
    >   significant (so "\N{LATIN SMALL LETTER A}" would match the same thing
    >   as "\N{ latin small letter - a }")
    > * the words "letter" or "digit" or "mark" are not significant.
    > * Other spaces and hyphens are normally not significant, so they can be
    >   removed from the name (but there's one exception for one Hangul vowel
    >   whose name makes a distinction between "O E" and "O-E")
    > So implicitly, when matching a name property value in a regexp character
    > class, the subregexp for the value can be compiled using case insensitive
    > rules and possibly weaker rules (according to the Unicode constraints
    > above). This is just global compilation behaviour, and we probably don't
    > need to complicate the syntax for something that is already invariable,
    > and there's no need to introduce a new "equivalence" operator for the
    > specific need of matching character names, given that the regexp already
    > encodes the property name using "\N{...}" or "\p{name=...}", which
    > clearly indicates we are trying to match Unicode character names (or
    > sequences).
    > Suppose you want to look for all characters that contain the *words* acute
    > accent. I would just encode it as:
    > \N{<ACUTE>} or \p{name=<ACUTE>} or as well \p{name=<acute>}
    > (the angle brackets here are part of the regexp value to match, and are
    > representing here a word boundary, replace them by the appropriate syntax
    > used in the regexp)
    > But I won't need the extra superfluous delimiting slashes in:
    > \N{/<ACUTE>/} or \p{name=/<ACUTE>/}
    > (it will match not only the combining accent itself, but also precomposed
    > characters with an acute accent and whose name contains the "ACUTE" word).
    > We can then create a negated character class matching all characters that
    > don't contain the same *words* using simply:
    > \N{!<ACUTE>} or \p{name!=<ACUTE>} or \p{name!=<ACUTE>} or
    > \p{^name=<ACUTE>}
    > or \P{name=<ACUTE>}
    > (the multiple possibilities come from the number of alternate notations
    > you support for classes of character names, or for negated classes, I'm
    > not saying which of them will be the preferred one.)
    > But I won't need any superfluous delimiting slashes around the regexp
    > value as suggested for your proposal 3:
    > \N{/!<ACUTE>/} or \p{name!=/<ACUTE>/} or \p{name!=/<ACUTE>/} or
    > \p{^name=/<ACUTE>/} or \P{name=/<ACUTE>/}
    > So your proposal 3 to support regexp values is good; I just don't see the
    > point of introducing slashes here when you don't need them in your
    > proposal 2. Your argument about the complication for the case of multiple
    > values supported by your proposal 2 is not relevant: we are already in
    > the context of evaluating regular expressions, so the complications are
    > already implemented elsewhere in the regexp parser and in the matching
    > engine, and will need to be supported anyway for accepting proposal 3,
    > i.e. regexp values.
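
For reference, the word-boundary match on character names discussed in the quoted message, and its negation, can be approximated with Python's `re` and `unicodedata` modules. This is only a sketch under stated assumptions: `\b` stands in for the `<` and `>` boundary markers, `re.IGNORECASE` stands in for case-insensitive name matching, and the names `ACUTE_WORD`, `name_matches` and `name_does_not_match` are invented for this example. None of it is proposed UTS #18 syntax.

```python
import re
import sys
import unicodedata

# \b plays the role of the "<" and ">" word-boundary markers; IGNORECASE
# reflects the assumption that letter case in names is not significant.
ACUTE_WORD = re.compile(r"\bACUTE\b", re.IGNORECASE)


def name_matches(ch, pattern=ACUTE_WORD):
    """True if the character has a name and the pattern is found in it."""
    name = unicodedata.name(ch, "")
    return bool(name) and pattern.search(name) is not None


def name_does_not_match(ch, pattern=ACUTE_WORD):
    """Negated class: everything else, including unnamed code points."""
    return not name_matches(ch, pattern)


# Regex search on the name, rather than equality with the whole name.
with_acute = {
    chr(cp) for cp in range(sys.maxunicode + 1) if name_matches(chr(cp))
}
```

With this sketch, `with_acute` contains U+0301 (COMBINING ACUTE ACCENT) as well as precomposed characters such as U+00E9 (LATIN SMALL LETTER E WITH ACUTE). Note that it searches inside the name, which is the containment operation; whether the same notation should also serve for exact matching is the point in dispute.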

