Re: New Public Review Issue: Proposed Update UTS #18

From: Mark Davis (mark.davis@icu-project.org)
Date: Fri Sep 21 2007 - 11:32:47 CDT

Next message: Mike: "Re: New Public Review Issue: Proposed Update UTS #18"

Previous message: Erkki I. Kolehmainen: "Upcoming Meeting on Multilingual Extensions to the Regional Keyboard Layouts"
In reply to: Mike: "Re: New Public Review Issue: Proposed Update UTS #18"
Next in thread: Mike: "Re: New Public Review Issue: Proposed Update UTS #18"
Reply: Mike: "Re: New Public Review Issue: Proposed Update UTS #18"
Reply: Philippe Verdy: "RE: New Public Review Issue: Proposed Update UTS #18"

> allowing multiple values in a property definition such as \p{gc=L|M|N} or
> \p{nv>=10}.

Allowing multiple values is a nice way to compact the regex. Similarly, my
implementation allows a regex within the property value, so for example
\p{name=/.*MARK.*/} picks up all the Unicode characters with "MARK" in their
name. A bit squirrely, but very handy. We might mention some of these
techniques as possibilities.
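
As a rough illustration of what these property extensions select, here is a
small Python sketch built on the standard unicodedata module. The helper names
(name_matches, gc_in, nv_at_least) are hypothetical and exist only for this
example; they are not the syntax or API of either implementation discussed
here, and the results depend on the Unicode version shipped with Python.

```python
import re
import unicodedata


def name_matches(ch, pattern):
    # Rough analogue of \p{name=/pattern/}: the whole character name
    # must match the regex. Unnamed characters are treated as "".
    return re.fullmatch(pattern, unicodedata.name(ch, "")) is not None


def gc_in(ch, values):
    # Rough analogue of \p{gc=L|M|N}: a one-letter value such as "L"
    # matches every subcategory (Lu, Ll, ...); a two-letter value must
    # match the General_Category exactly.
    gc = unicodedata.category(ch)
    return any(
        gc == v or (len(v) == 1 and gc.startswith(v))
        for v in values.split("|")
    )


def nv_at_least(ch, threshold):
    # Rough analogue of \p{nv>=10}: characters without a numeric value
    # never match.
    nv = unicodedata.numeric(ch, None)
    return nv is not None and nv >= threshold
```

For instance, name_matches("!", ".*MARK.*") holds because the character is
named EXCLAMATION MARK, while nv_at_least("5", 10) does not.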

As for your other comments (copied below), the issue is what [^a-z
ñ \q{ch} \q{ll} \q{rr}] would mean. Here was roughly our reasoning.

   - The meaning, without the ^, is a set of strings {"a", "b", ..., "z",
   "ñ", "ch", "ll", "rr"}.
   - The set inversion would be the set of all other strings. So that
   would include "0", "A", ... but also "New York", and "onomatopoeic", and so
   on: an infinite set.
   - So a match against /[^x...x]/ would be the equivalent of
   /(?![x...x]) .*/, and match, for example, this entire email.

That would change the semantics of regex very substantially. Conversely, if
we define [^x...x] as equivalent to [[\u0000-\U0010FFFF] - [x...x]] it is
well-defined, and matches current regex usage for the cases where no
grapheme clusters are involved.
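
To make the contrast concrete, here is a minimal Python sketch of the two
readings, modelling the class as a set of strings. The names members,
in_string_complement and in_code_point_difference are made up for this
illustration and do not come from UTS #18 or any implementation.

```python
# The un-negated class [a-z ñ \q{ch} \q{ll} \q{rr}] as a set of strings.
members = {chr(cp) for cp in range(ord("a"), ord("z") + 1)}
members |= {"ñ", "ch", "ll", "rr"}


def in_string_complement(s):
    # First reading: the inverse is every string not in the set.
    # This complement is infinite, so it can only be tested, not listed.
    return s not in members


def in_code_point_difference(s):
    # Second reading: all single code points minus the set. Strings of
    # more than one code point are never members of the result.
    return len(s) == 1 and s not in members
```

Under the first reading "New York" belongs to the inverted set; under the
second it does not, and only single code points such as "0" or "A" do.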

However, there may well be other useful alternatives that should be
considered. So perhaps you can set out your suggestions in more detail. (For
now, we can keep the discussion on this list; if it starts to get too boring
for others we can gather the interested parties and continue off-line.)

from Mike:

    A typical implementation of the inverse of a set containing
    literal clusters simply removes those strings, thus
    [^a-z ñ \q{ch} \q{ll} \q{rr}] is equivalent to [^a-z ñ].

I think this is bad implementation advice, and leads to strange
behavior. In the example given, the behavior will be correct since
all of the clusters begin with a letter also contained in the class.
However, if you consider a character class containing only clusters,
e.g. [^\q{ch} \q{ll} \q{rr}], simply removing the clusters will
result in an empty character class that matches -anything-. This
is incorrect behavior as it should not match the beginning of the
word "chile" for instance.

The way I implemented this was to create a "normal" character class
containing all the listed characters and grapheme clusters, and
then invert the result of the match operation. The classes above
would match "chile" in the first position, and thus return a "no
match" result.
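
The difference between the two strategies Mike describes can be sketched in
Python as follows. This is only an illustration under stated assumptions, not
his actual code: the function names are invented here, the positive match is
assumed to prefer the longest member, and a successful negated match is
assumed to consume exactly one code point (the message does not say how much
it consumes).

```python
def match_positive(text, pos, members):
    """Return the length of the longest member found at pos, or 0."""
    best = 0
    for m in members:
        if text.startswith(m, pos) and len(m) > best:
            best = len(m)
    return best


def match_negated_by_inverting(text, pos, members):
    # Match the ordinary (positive) class first, then invert the result.
    # Assumption: success consumes one code point.
    if pos >= len(text):
        return 0
    return 0 if match_positive(text, pos, members) else 1


def match_negated_by_dropping_clusters(text, pos, members):
    # The advice being criticised: discard the multi-character strings
    # and negate what is left.
    singles = {m for m in members if len(m) == 1}
    if pos >= len(text):
        return 0
    return 0 if text[pos] in singles else 1


clusters = {"ch", "ll", "rr"}
```

With these definitions, match_negated_by_dropping_clusters("chile", 0,
clusters) reports a match of length 1, whereas match_negated_by_inverting
("chile", 0, clusters) reports no match, which is the behavior argued for
above.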

On 9/20/07, Mike <mike-list@pobox.com> wrote:
>
> > As regards Mike's new concerns with the language regarding multiline
> > mode matching, I suggest that he post that to the feedback
> > form, and it will be rolled up into the feedback document
> > that will be considered by the UTC for this PRI.
>
> I have done that, and Rick has verified that the feedback
> was received. I also included more of my implementation
> details such as \m for combining marks, \i for ideographic
> characters, and allowing multiple values in a property
> definition such as \p{gc=L|M|N} or \p{nv>=10}.
>
> Mike
>
>

-- 
Mark


This archive was generated by hypermail 2.1.5 : Fri Sep 21 2007 - 11:37:08 CDT