RE: Proposal for additional syntax (was Re: New Public Review Issue: Proposed Update UTS #18)

From: Philippe Verdy ([email protected])
Date: Tue Oct 02 2007 - 12:04:22 CST

Next message: Philippe Verdy: "RE: Fish (was Re: Marks)"

Previous message: Michael Maxwell: "RE: New Public Review Issue: Proposed Update UTS #18"
In reply to: Mark Davis: "Re: Proposal for additional syntax (was Re: New Public Review Issue: Proposed Update UTS #18)"
Next in thread: Mark Davis: "Re: Proposal for additional syntax (was Re: New Public Review Issue: Proposed Update UTS #18)"
Reply: Mark Davis: "Re: Proposal for additional syntax (was Re: New Public Review Issue: Proposed Update UTS #18)"

Mark Davis wrote:
> Trying to parse your language, what I read you as saying is that a
> different equivalence operator could be used instead of the slashes, like
> propname~value instead of propname=/value/

If I also try to parse your language (that introduces the new concept of
"equivalence operator") I still don't see any difference you are seeing
between propname=value (which I correctly termed using "equals" or
"equality") and propname~value.

What do you really mean when you write "propname=/value/"? Is it a
containment relation (what you now seem to call "equivalence") or an
equality relation?
If we accepted both your proposal 2 (multiple values, which is just a
particular kind of matching a regexp) and 3 (matching a regexp), then the
slashes in proposal 3 are superfluous around the regexp value (or become
optional and just complicate the syntax without any change in what will be
matched by the regexp).

So I don't see any semantic difference between "propname=value" and your
proposed "propname=/value/" as soon as regexp values are accepted. The only
important question is: do property values need to be matched according to a
regular expression, or must they be matched only by equality?

In your example: \N{MARK} would match nothing because there's no Unicode
character named "MARK". If you want to match characters whose name
*contains* the word "MARK", then you just need to include a ".*" prefix and
suffix: "\N{.*MARK.*}".
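
To make the difference between equality and containment concrete, here is a
minimal sketch in Python. It is not an implementation of \N{...}; the helper
chars_with_name and its "limit" parameter are hypothetical, and it simply
applies a full-match regexp to the names reported by the standard
unicodedata module for code points below the limit.

```python
import re
import unicodedata


def chars_with_name(pattern, limit=0x3000):
    """Hypothetical helper: code points below `limit` whose whole
    character name matches the regexp `pattern`."""
    compiled = re.compile(pattern)
    result = []
    for cp in range(limit):
        # Characters without a name (controls, unassigned) are skipped.
        name = unicodedata.name(chr(cp), "")
        if name and compiled.fullmatch(name):
            result.append(cp)
    return result


# Equality on the whole name: no character is named just "MARK".
exact = chars_with_name("MARK")

# Containment, spelled with an explicit ".*" prefix and suffix.
containing = chars_with_name(".*MARK.*")
```

With these definitions, "exact" is empty, while "containing" includes U+0021
EXCLAMATION MARK among others. No extra delimiter is needed to tell the two
cases apart: the regexp itself says which one is meant.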

Note that the Unicode character names have known constraints (according to
the stability rules for the assignment of unique names):
* they use only the letters [A-Z], the digits [0-9] and the space and
hyphen.
* Letter case is not significant (so "\N{SPACE}" would match the same thing
as "\N{space}")
* leading and trailing spaces or multiple spaces or hyphens are not
significant (so "\N{LATIN SMALL LETTER A}" would match the same thing as
"\N{ latin small letter - a }")
* the words "letter" or "digit" or "mark" are not significant.
* Other spaces and hyphens are normally not significant, so they can be
removed from the name (but there's one exception for one Hangul vowel whose
name makes a distinction between "OE" and "O-E")
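
The constraints above can be sketched as a key function in Python. This is
only an illustration under stated assumptions: loose_name_key is a
hypothetical helper, it covers only the case, space and hyphen rules (it does
not drop words such as "letter"), and it special-cases the one known
collision between U+116C HANGUL JUNGSEONG OE and U+1180 HANGUL JUNGSEONG O-E.

```python
import re


def loose_name_key(name):
    """Hypothetical helper: reduce a character name to a loose
    comparison key (case, spaces and hyphens ignored)."""
    upper = name.upper()
    # Drop every run of whitespace and hyphens.
    key = re.sub(r"[\s-]+", "", upper)
    # Known exception: keep the hyphen of HANGUL JUNGSEONG O-E so that
    # it does not collide with HANGUL JUNGSEONG OE.
    if key == "HANGULJUNGSEONGOE" and re.search(r"O\s*-\s*E\s*$", upper):
        return "HANGULJUNGSEONGO-E"
    return key


same = loose_name_key("LATIN SMALL LETTER A") == loose_name_key(
    " latin small letter - a ")
distinct = loose_name_key("HANGUL JUNGSEONG O-E") != loose_name_key(
    "HANGUL JUNGSEONG OE")
```

Two names are then considered equal when their keys are equal, which is a
property of the comparison and needs no dedicated operator in the syntax.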

So implicitly, when matching a name property value in a regexp character
class, the subregexp for the value can be compiled using case-insensitive
rules and possibly weaker rules (according to the Unicode constraints
above). This is just global compilation behaviour, and we probably don't
need to complicate the syntax for something that is already invariable.
There's no need to introduce a new "equivalence" operator for the specific
need of matching character names, given that the regexp already encodes
the property name using "\N{...}" or "\p{name=...}", which clearly
indicates that we are trying to match Unicode character names (or sequences).

Suppose you want to look for all characters whose names contain the *word*
"ACUTE". I would just encode it as:
\N{<ACUTE>} or \p{name=<ACUTE>} or equally \p{name=<acute>}
(the angle brackets here are part of the regexp value to match and
represent a word boundary; replace them with the appropriate syntax
used in the regexp).
But I don't need the extra superfluous delimiting slashes in:
\N{/<ACUTE>/} or \p{name=/<ACUTE>/}
(either form will match not only the combining accent itself, but also
precomposed characters with an acute accent whose names contain the word
"ACUTE").

We can then create a negated character class matching all characters whose
names don't contain the same *word*, using simply:
\N{!<ACUTE>} or \p{name!=<ACUTE>} or \p{^name=<ACUTE>} or \P{name=<ACUTE>}
(the multiple possibilities come from the number of alternate notations you
support for classes of character names, or for negated classes; I'm not
saying which of them will be the preferred one).
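
As a rough illustration of the word match and its negation, here is a sketch
in Python. It assumes that "\b" plays the role of the angle brackets, that
the value is compiled case-insensitively, and that characters without a name
are simply left out of both sets. The helper split_by_name is hypothetical.

```python
import re
import unicodedata

# "\b" stands in for the word-boundary brackets; case is not significant.
WORD_ACUTE = re.compile(r"\bacute\b", re.IGNORECASE)


def split_by_name(pattern, code_points):
    """Hypothetical helper: split named characters into those whose
    name contains a match for `pattern` and those whose name does not."""
    matching, rest = [], []
    for cp in code_points:
        name = unicodedata.name(chr(cp), "")
        if not name:
            continue  # unnamed code points belong to neither set here
        (matching if pattern.search(name) else rest).append(cp)
    return matching, rest


# Positive class and its complement over U+0000..U+036F.
acute, not_acute = split_by_name(WORD_ACUTE, range(0x0370))
```

Here "acute" contains U+00B4 ACUTE ACCENT, U+0301 COMBINING ACUTE ACCENT and
precomposed letters such as U+00E9, while "not_acute" contains the rest of
the named characters in the range. Whichever negation notation is retained,
it only selects the second set; the regexp value itself is unchanged.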

But I don't need any superfluous delimiting slashes around the regexp value
as suggested for your proposal 3:
\N{/!<ACUTE>/} or \p{name!=/<ACUTE>/} or \p{^name=/<ACUTE>/} or
\P{name=/<ACUTE>/}

So your proposal 3 to support regexp values is good; I just don't see the
point of introducing slashes here when you don't need them in your
proposal 2. Your argument about complication for the case of multiple values
supported by your proposal 2 is not relevant: we are already in the context
of evaluating regular expressions, so the complications are already
implemented elsewhere in the regexp parser and in the matching engine, and
will need to be supported anyway to accept proposal 3, i.e. regexp
values.


This archive was generated by hypermail 2.1.5 : Tue Oct 02 2007 - 12:07:33 CST