RE: Proposal for additional syntax (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Oct 02 2007 - 12:04:22 CST

    Mark Davis wrote:
    > Trying to parse your language, what I read you as saying is that a
    > different equivalence operator could be used instead of the slashes, like
    > propname~value instead of propname=/value/

    If I also try to parse your language (which introduces the new concept of
    an "equivalence operator"), I still don't see what difference you see
    between propname=value (which I correctly termed "equals" or
    "equality") and propname~value.

    What do you really mean when you write "propname=/value/"? Is it a
    containment relation (what you now seem to call "equivalence") or an
    equality relation?
    If we accepted both your proposal 2 (multiple values, which is just a
    particular kind of matching a regexp) and 3 (matching a regexp), then the
    slashes in proposal 3 are superfluous around the regexp value (or become
    optional and just complicate the syntax without any change in what will be
    matched by the regexp).

    So I don't see any semantic difference between "propname=value" and your
    proposed "propname=/value/" as soon as regexp values are accepted. The only
    important question is: do property values need to be matched against a
    regular expression, or must they be matched only by equality?
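
To illustrate the point, here is a minimal Python sketch. It assumes (this is only one possible reading of the proposals) that a property value is always compiled as a regexp that must match the whole value. The helper name has_property_value is hypothetical, and the General_Category property stands in for any property:

```python
import re
import unicodedata

def has_property_value(ch, value_pattern):
    """Test a character's General_Category against a value regexp.

    Assumed semantics: the pattern must match the whole property value,
    so a plain literal behaves exactly like an equality test.
    """
    return re.fullmatch(value_pattern, unicodedata.category(ch)) is not None

# A literal value needs no delimiters: it is a regexp that matches only itself.
print(has_property_value("A", "Lu"))     # uppercase letter
# Multiple values (proposal 2) are just an alternation.
print(has_property_value("a", "Lu|Ll"))
# A general regexp (proposal 3) uses the same slot, still without slashes.
print(has_property_value("a", "L."))
```

Under that assumption the equality test, the list of values and the general regexp all go through the same code path, which is why extra delimiters would add nothing to what is matched.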

    In your example: \N{MARK} would match nothing because there's no Unicode
    character named "MARK". If you want to match characters whose name
    *contains* the word "MARK", then you just need to include a ".*" prefix and
    suffix: "\N{.*MARK.*}".
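
A small Python sketch of that difference, assuming the value is matched against the whole name. The function chars_with_name and its limit parameter are hypothetical, and the scan is restricted to low code points only to keep it quick:

```python
import re
import unicodedata

def chars_with_name(pattern, limit=0x3000):
    """Return characters below `limit` whose complete name matches `pattern`."""
    compiled = re.compile(pattern)
    found = []
    for cp in range(limit):
        name = unicodedata.name(chr(cp), "")  # "" for code points without a name
        if name and compiled.fullmatch(name):
            found.append(chr(cp))
    return found

exact = chars_with_name("MARK")            # whole name must be "MARK"
containing = chars_with_name(".*MARK.*")   # name merely contains "MARK"

print(len(exact))
print("!" in containing, "?" in containing)
```

The first list is empty because no character is named just "MARK", while the second contains characters such as EXCLAMATION MARK and QUESTION MARK.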

    Note that the Unicode character names have known constraints (according to
    the stability rules for the assignment of unique names):
    * they use only the letters [A-Z], the digits [0-9] and the space and
    hyphen.
    * Letter case is not significant (so "\N{SPACE}" would match the same thing
    as "\N{space}")
    * leading and trailing spaces or multiple spaces or hyphens are not
    significant (so "\N{LATIN SMALL LETTER A}" would match the same thing as
    "\N{ latin small letter - a }")
    * the words "letter", "digit" and "mark" are not significant.
    * Other spaces and hyphens are normally not significant, so they can be
    removed from the name (but there's one exception for one Hangul vowel whose
    name makes a distinction between "O E" and "O-E")
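
The folding described in this list can be sketched in Python as follows. This is a simplified illustration, not the normative matching rule: loose_key and same_name are hypothetical helpers, the rule about ignorable words is left out, and every space and hyphen is dropped except in the one Hangul name mentioned above (U+1180 HANGUL JUNGSEONG O-E, as opposed to U+116C HANGUL JUNGSEONG OE):

```python
import unicodedata

# The one name in which the hyphen is distinctive.
_HYPHEN_EXCEPTION = "HANGUL JUNGSEONG O-E"

def loose_key(name):
    """Fold a character name into a key for loose comparison."""
    upper = " ".join(name.upper().split())   # fold case, trim and collapse spaces
    if upper == _HYPHEN_EXCEPTION:
        return "HANGULJUNGSEONGO-E"          # keep the significant hyphen
    return upper.replace(" ", "").replace("-", "")

def same_name(a, b):
    """True if two spellings of a name are equal under the folding above."""
    return loose_key(a) == loose_key(b)

print(same_name("SPACE", "space"))
print(same_name("LATIN SMALL LETTER A", " latin small letter - a "))
print(same_name(unicodedata.name("\u1180"), unicodedata.name("\u116c")))
```

The first two comparisons succeed, and the last one fails because the hyphen in O-E is preserved.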

    So implicitly, when matching a name property value in a regexp character
    class, the subregexp for the value can be compiled using case-insensitive
    rules and possibly weaker rules (according to the Unicode constraints
    above). This is just global compilation behaviour, so we probably don't
    need to complicate the syntax for something that is already invariable, and
    there's no need to introduce a new "equivalence" operator for the specific
    need of matching character names, given that the regexp already encodes
    the property name using "\N{...}" or "\p{name=...}", which clearly
    indicates that we are trying to match Unicode character names (or sequences).

    Suppose you want to look for all characters that contain the *words* acute
    accent. I would just encode it as:
    \N{<ACUTE>} or \p{name=<ACUTE>} or as well \p{name=<acute>}
    (the angle brackets here are part of the regexp value to match, and are
    representing here a word boundary, replace them by the appropriate syntax
    used in the regexp)
    But I won't need the extra superfluous delimiting slashes in:
    \N{/<ACUTE>/} or \p{name=/<ACUTE>/}
    (it will match not only the combining accent itself, but also precomposed
    characters with an acute accent whose names contain the word "ACUTE").
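
As a rough Python equivalent of the word-boundary search, with \b playing the role of the angle brackets used above (name_has_acute is a hypothetical helper, and the case-insensitive flag reflects the assumption that letter case in names is not significant):

```python
import re
import unicodedata

# \b is Python's word-boundary assertion, standing in for < and > above.
ACUTE_WORD = re.compile(r"\bACUTE\b", re.IGNORECASE)

def name_has_acute(ch):
    """True if the character's name contains ACUTE as a whole word."""
    return bool(ACUTE_WORD.search(unicodedata.name(ch, "")))

print(name_has_acute("\u0301"))  # COMBINING ACUTE ACCENT
print(name_has_acute("\u00e9"))  # LATIN SMALL LETTER E WITH ACUTE
print(name_has_acute("e"))       # LATIN SMALL LETTER E
```

No delimiters around the pattern are involved: the search is unanchored, so containment comes from the matching mode rather than from any extra syntax.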

    We can then create a negated character class matching all characters that
    don't contain the same *words* using simply:
    \N{!<ACUTE>} or \p{name!=<ACUTE>} or \p{name!=<ACUTE>} or \p{^name=<ACUTE>}
    or \P{name=<ACUTE>}
    (the multiple possibilities come from the number of alternate notations you
    support for classes of character names, or for negated classes, I'm not
    saying which of them will be the preferred one.)
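
Whatever notation is chosen, the negated class is simply the complement of the positive one. A minimal Python sketch, assuming that code points without a name belong to neither set (a real engine might decide otherwise) and using a hypothetical name_class helper over a small range of code points:

```python
import re
import unicodedata

def name_class(pattern, negate=False, limit=0x250):
    """Collect named characters below `limit` whose name does (or, with
    negate=True, does not) contain a match for `pattern`."""
    word = re.compile(pattern, re.IGNORECASE)
    members = set()
    for cp in range(limit):
        name = unicodedata.name(chr(cp), "")
        if not name:
            continue  # assumption: unnamed code points are in neither set
        if bool(word.search(name)) != negate:
            members.add(chr(cp))
    return members

with_acute = name_class(r"\bACUTE\b")
without_acute = name_class(r"\bACUTE\b", negate=True)

print("\u00e9" in with_acute, "e" in without_acute)
print(with_acute.isdisjoint(without_acute))
```

The same value pattern serves both classes; only the negation flag changes, which corresponds to the choice between notations such as \p{...} and \P{...}.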

    But I won't need any superfluous delimiting slashes around the regexp value
    as suggested for your proposal 3:
    \N{/!<ACUTE>/} or \p{name!=/<ACUTE>/} or \p{name!=/<ACUTE>/} or
    \p{^name=/<ACUTE>/} or \P{name=/<ACUTE>/}

    So your proposal 3 to support regexp values is good; I just don't see the
    point of introducing slashes here when you don't need them in your
    proposal 2. Your argument about the complication for the case of multiple
    values supported by your proposal 2 is not relevant: we are already in the
    context of evaluating regular expressions, so the complications are already
    implemented elsewhere in the regexp parser and in the matching engine, and
    will need to be supported anyway to accept proposal 3, i.e. regexp
    values.


