RE: Proposal for additional syntax (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Oct 02 2007 - 11:05:58 CST


    Mark Davis wrote:
    > Also, there were some interesting suggestions for syntax additions
    > that may be worth mentioning in informative text.
    > 1. not equals
    > As well as
    > * \P{propname=value} and [:^propname=value:]
    > to have:
    > * \{propname!=value}, \p{propname≠value}
    > * [:propname!=value:], [:propname≠value:]

    I'm not sure that \{propname!=value} should be defined, recommended, or
    even suggested: its contextual parsing may complicate things, unlike the 4
    others, which use a distinctive prefix that helps avoid conflicts with the
    various uses of the {} notation.

    Also, it conflicts with the frequent use of "\{" as the only way to escape
    the literal "{" character itself. That escape is needed whenever "{ ... }"
    has a special meaning in the supported regexp syntax, for example to
    distinguish non-capturing groups from "( ... )", or to allow non-matching
    spaces to be used as visual hints in complex regexps. Within those "{ ... }"
    non-capturing groups, the literal spaces that must be matched by the regexp
    need to be escaped, just like braces that must be matched literally instead
    of taking their default special grouping semantics.
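    As an illustration of the existing meaning of "\{", here is a minimal
    sketch using Python's re module. Python is chosen only for illustration:
    there, braces are a repetition quantifier rather than a grouping
    construct, but the escaping rule is the same, and a backslash turns the
    brace into a literal character.

```python
import re

# Unescaped braces are syntax: here they form the quantifier "exactly two".
assert re.fullmatch(r"a{2}", "aa") is not None
assert re.fullmatch(r"a{2}", "a{2}") is None

# Escaped braces are literal characters to be matched as-is.
literal_braces = re.compile(r"a\{2\}")
assert literal_braces.fullmatch("a{2}") is not None
assert literal_braces.fullmatch("aa") is None
```

    An engine that already reads "\{" this way cannot also read it as the
    start of a property expression without looking ahead at what follows.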

    Also you propose mixing \p and \P for similar use. The only good suggestion
    is the way to represent the "different" relation using an alternate operator
    replacing the equal sign, instead of using a leading negation (using a
    capital \P instead of \p, or a leading ^ operator in a class notation)
    before the encoded equality.
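    To make the two styles of negation concrete, here is a rough sketch in
    Python. The functions parse_property and matches are hypothetical names,
    only the gc property is handled (through unicodedata.category, with
    two-letter values), and the "negated" flag stands in for a capital \P or a
    leading ^. It is not meant as a reference implementation of UTS #18.

```python
import re
import unicodedata

# Hypothetical grammar for the inside of \p{...}: name, operator, value.
PROP_EXPR = re.compile(r"(?P<name>\w+)\s*(?P<op>!=|≠|=)\s*(?P<value>.+)")

def parse_property(expr):
    """Split 'gc=Lu', 'gc!=Lu' or 'gc≠Lu' into (name, is_equality, value)."""
    m = PROP_EXPR.fullmatch(expr)
    if m is None:
        raise ValueError("not a property expression: %r" % expr)
    return m["name"], m["op"] == "=", m["value"]

def matches(ch, expr, negated=False):
    """Test one character; negated=True models \\P{...} or [:^...:]."""
    name, is_equality, value = parse_property(expr)
    if name != "gc":
        raise ValueError("this sketch only supports gc")
    equal = unicodedata.category(ch) == value
    result = equal if is_equality else not equal
    return result != negated  # a leading negation flips the result

# \p{gc!=Lu}, \p{gc≠Lu} and \P{gc=Lu} select the same characters.
sample = [chr(cp) for cp in range(0x250)]
assert all(
    matches(c, "gc!=Lu") == matches(c, "gc≠Lu") == matches(c, "gc=Lu", negated=True)
    for c in sample
)
```

    With an explicit operator, the negation is carried by the expression
    itself, so the prefix letter no longer has to change case.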

    For the rest, the "[: ... :]" bracketing is easily perceived everywhere as
    equivalent to the "{ ... }" bracketing. But having to support it looks much
    like the use of multiple characters for representing the same "character" in
    programming languages using national versions of ISO 646 that did not have
    the "{ }" braces in their encoding. It looks ugly (but is used in POSIX
    regexps).

    > 2. multiple values(...)
    > * \p{gc=L|M|Nd} instead of [\p{gc=L}\p{gc=M}\p{gc=Nd}]

    Good suggestion, but it is closely related to your suggestion 3:

    > 3. regex values
    > * propname=/regexForValue/
    > eg
    > * \p{name=/MARK/} or equivalently \N{/MARK/}

    So multiple values would also be encoded using your suggestion 3 as:
     * \p{gc=/L|M|Nd/}
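    A small sketch of that reading, in Python: the value is treated as a
    regexp that must match the whole property value. The names gc_values and
    matches_gc are hypothetical. Since unicodedata.category only returns
    two-letter values such as "Lu", the sketch assumes that the one-letter
    group ("L", "M", ...) is simply the first letter of that value.

```python
import re
import unicodedata

def gc_values(ch):
    """Values a character carries for gc: e.g. 'Lu' and its group 'L'."""
    cat = unicodedata.category(ch)
    return (cat, cat[0])

def matches_gc(ch, value_pattern):
    """True if the regexp matches one of the character's gc values entirely."""
    rx = re.compile(value_pattern)
    return any(rx.fullmatch(v) for v in gc_values(ch))

# \p{gc=L|M|Nd} behaves like the union [\p{gc=L}\p{gc=M}\p{gc=Nd}].
sample = [chr(cp) for cp in range(0x400)]
assert all(
    matches_gc(c, "L|M|Nd")
    == (matches_gc(c, "L") or matches_gc(c, "M") or matches_gc(c, "Nd"))
    for c in sample
)
```

    Here "A" (Lu), "5" (Nd) and U+0301 (Mn) are accepted, while "!" (Po) and
    U+00BD (No) are rejected, because neither of their values is matched in
    full by the alternation.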

    What do you mean by \p{name=/MARK/}? Does this indicate that it will match
    any character whose property value "equals" the matched regexp, or
    "contains" the regexp? I would not suggest the "contains" meaning; it is
    not needed, because it could be written as:
     * \p{name=/.*MARK.*/}

    But then, why are the slashes needed? If you look at suggestion 2, the
    leading and trailing slashes are not used, but the multiple values are also
    encoded as a regexp. So your suggestion 3 (regexp values) could as well be
    supported using the notation in suggestion 2:
     * \p{name=MARK} or equivalently \N{MARK}

    If you need to encode the "contains" relation rather than the "equals"
    relation, I think this relation should be encoded explicitly:
     * \p{name=.*MARK.*} or equivalently \N{.*MARK.*}
    At least this way, the "=" operator is still read as "equals" in the
    notation, and it can then be replaced where needed by a "different"
    operator or a negated assertion containing the "=" operator (which amounts
    to "does not contain" if the regexp in the value starts and ends with
    ".*").
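    The difference between the two readings can be sketched in Python, using
    re.fullmatch for "equals" and re.search for "contains". The helper
    chars_with_name and its "limit" parameter are hypothetical, and only code
    points below U+0300 are scanned to keep the example small.

```python
import re
import unicodedata

def chars_with_name(pattern, contains=False, limit=0x300):
    """Characters below `limit` whose name equals (or contains) the regexp."""
    rx = re.compile(pattern)
    test = rx.search if contains else rx.fullmatch
    found = []
    for cp in range(limit):
        name = unicodedata.name(chr(cp), "")  # "" for unnamed code points
        if name and test(name):
            found.append(chr(cp))
    return found

# "Equals" reading: no character is named exactly MARK.
equals_mark = chars_with_name("MARK")

# "Equals" reading with explicit ".*" on both sides.
equals_any_mark = chars_with_name(".*MARK.*")

# "Contains" reading of the bare value.
contains_mark = chars_with_name("MARK", contains=True)

assert equals_mark == []
assert equals_any_mark == contains_mark
```

    Under the "equals" reading, "contains" stays expressible with an explicit
    ".*" on each side, whereas a built-in "contains" reading would leave no
    simple way to ask for an exact name.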


