Re: New Public Review Issue: Proposed Update UTS #18

From: Mike (mike-list@pobox.com)
Date: Fri Sep 21 2007 - 12:55:20 CDT

Next message: Kenneth Whistler: "RE: Normalization in panlingual application"

Previous message: Mark Davis: "Re: New Public Review Issue: Proposed Update UTS #18"
In reply to: Mark Davis: "Re: New Public Review Issue: Proposed Update UTS #18"
Next in thread: Mike: "Unicode Regex Design (was Re: New Public Review Issue: Proposed Update UTS #18)"
Reply: Mike: "Unicode Regex Design (was Re: New Public Review Issue: Proposed Update UTS #18)"

> > allowing multiple values in a property definition such as
> > \p{gc=L|M|N} or \p{nv>=10}.
>
> Allowing multiple values is a nice way to compact the regex. Similarly,
> in my implementation I actually allow a regex within the property value,
> so for example have \p{name=/.*MARK.*/} to pick up all the Unicode
> characters with "MARK" in their name. A bit squirrely, but very handy.
> We might mention some of these techniques as possibilities.

I allow a regular expression for the block, name, old name, and script
properties, for example \p{Script=/latin|common|inherited/}. The match
is unanchored and case-insensitive, so you could just say
\p{name=/mark/}.
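
To make the unanchored, case-insensitive name match concrete, here is a
minimal sketch in Python. It is not the implementation described above:
it assumes the standard unicodedata module as the source of character
names, and chars_with_name is a hypothetical helper introduced only for
illustration.

```python
import re
import sys
import unicodedata


def chars_with_name(pattern, limit=sys.maxunicode + 1):
    """Return code points below `limit` whose Unicode name matches `pattern`.

    The match is unanchored (re.search) and case-insensitive, mirroring
    the behaviour described for \\p{name=/mark/}. Hypothetical helper.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for cp in range(limit):
        # Unnamed code points (controls, surrogates, unassigned) yield "".
        name = unicodedata.name(chr(cp), "")
        if name and rx.search(name):
            hits.append(cp)
    return hits


# Restrict the scan to Latin-1 to keep the example fast.
latin1_marks = chars_with_name("mark", limit=0x100)
print([unicodedata.name(chr(cp)) for cp in latin1_marks])
```

A real engine would presumably precompute such a set once when the
pattern is compiled rather than scanning names at match time.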

Comparisons are allowed for Numeric_Value, Canonical_Combining_Class,
and Age. The comparison operators are : or = for equality, plus !=, <,
<=, >, and >=. For example: \p{Age>=5.0}. (Age is a special case where
'=' means '<='.)

Other properties use : or = for equality or != for inequality:
\p{East_Asian_Width!=Wide}
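
The comparison operators can be sketched as a small lookup table. This
is only an illustration in Python, assuming unicodedata.numeric and
unicodedata.combining as stand-ins for Numeric_Value and
Canonical_Combining_Class (the standard library does not expose Age, so
it is left out). The names OPS, numeric_value_matches, and ccc_matches
are hypothetical, and treating characters with no numeric value as
never matching is an assumption, not something stated above.

```python
import operator
import unicodedata

# Operator spellings from the message; ':' and '=' both mean equality.
OPS = {
    ":": operator.eq,
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def numeric_value_matches(ch, op, bound):
    """Evaluate something like \\p{nv>=10} for a single character."""
    value = unicodedata.numeric(ch, None)
    if value is None:
        # Assumption: a character with no numeric value never matches.
        return False
    return OPS[op](value, bound)


def ccc_matches(ch, op, bound):
    """Evaluate something like \\p{ccc!=0} for a single character."""
    return OPS[op](unicodedata.combining(ch), bound)


print(numeric_value_matches("\u2169", ">=", 10))  # ROMAN NUMERAL TEN
print(ccc_matches("\u0301", "=", 230))            # COMBINING ACUTE ACCENT
```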

> As far as your other comments (copied below), the issue is as to what
> [^a-z ñ \q{ch} \q{ll} \q{rr}] would mean. Here was roughly our reasoning.
>
> * The meaning, without the ^, is a set of strings {"a", "b", ...,
> "z", "ñ", "ch", "ll", "rr"}.
> * The set inversion would be the set of all other strings. So that
> would include "0", "A", ... but also "New York", and
> "onomotopaeic", and so on. An infinite set.
> * So a match against /[^x...x]/ would be the equivalent of
> /(?![x...x]) .*/, and match, for example, this entire email.
>
> That would change the semantics of regex very substantially. Conversely,
> if we define [^x...x] as equivalent to [[\u0000-\u0010FFFF] - [x...x]]
> it is well-defined, and matches current regex usage for the cases where
> no grapheme clusters are involved.

The way I approached it is a bit different. A Spanish user would
consider "ch", "ll", and "rr" to be single characters, for example.
If those characters had their own code points, then we wouldn't
need to do anything special to represent [a-z ñ ch ll rr], and
nobody would suggest removing any of the characters when creating
the inverse of this character class.

They don't have their own code points, though, so implementations
need to be able to handle grapheme clusters in character classes.
I think it's wrong to remove the grapheme clusters from a character
class when negating it, because then those "characters" won't be
represented properly.

I disagree with your reasoning that "the set inversion would be the
set of all other strings," and would change it to, "the set inversion
would be the set of all other -characters-." So "0" and "A" would be
in the inverse set, but "New York" would not.

I would change your last point to:
      * So a match against /[^x...x]/ would be the equivalent of
        /(?![x...x])./ (I removed the *) and match the first
        character of this email.
This is almost how my implementation works, and now that I see this,
I may change it to be exactly this (where . matches a grapheme cluster).
Right now my code will only match the next code point, which could be
only a part of a grapheme cluster, such as the A in A + ACUTE, leaving
a dangling mark. On the other hand, you may only want to match the
next code point, so I'll need to think about this some more....
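
Here is a rough sketch of the /(?![x...x])./ reading in Python, to show
both the negation and the dangling-mark problem. It assumes the
standard re module, whose . matches one code point; negated_class and
match_cluster are hypothetical helpers, and "base character plus
following marks" is only a crude approximation of a grapheme cluster,
not the full segmentation rules.

```python
import re
import string
import unicodedata


def negated_class(singles, clusters):
    """Build (?!...). from single characters and multi-character strings."""
    alternatives = [re.escape(s) for s in list(clusters) + list(singles)]
    if not alternatives:
        # Negating an empty set: every character is allowed.
        return re.compile(".", re.DOTALL)
    return re.compile("(?!" + "|".join(alternatives) + ").", re.DOTALL)


def match_cluster(pattern, text, pos=0):
    """Match at `pos`, then absorb trailing combining marks (approximation)."""
    m = pattern.match(text, pos)
    if m is None:
        return None
    end = m.end()
    while end < len(text) and unicodedata.category(text[end]).startswith("M"):
        end += 1
    return text[pos:end]


# A small class so the effect of the "ch" cluster is visible.
not_n_or_ch = negated_class("ñ", ["ch"])
print(not_n_or_ch.match("chico"))          # "ch" is in the set, so no match
print(not_n_or_ch.match("hico").group())   # a lone "h" is outside the set

# The Spanish example from the thread: [^a-z ñ ch ll rr].
spanish = negated_class(string.ascii_lowercase + "ñ", ["ch", "ll", "rr"])
print(repr(spanish.search("año 2007").group()))

# Code-point matching leaves the accent behind; the cluster variant does not.
print(len(not_n_or_ch.match("A\u0301").group()))
print(len(match_cluster(not_n_or_ch, "A\u0301b")))
```

Whether the cluster-absorbing behaviour should be the default is
exactly the open question above; the sketch only shows that the two
behaviours differ on A + ACUTE.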

One more feature I worked on was a way to specify that . would match
ch, ll, or rr as a single character. I came up with (?.ch.ll.rr) as
the possible syntax, but haven't implemented it yet.
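
The intended effect of such a setting can be approximated today by
writing the alternation out by hand. The sketch below is in Python and
does not implement the (?.ch.ll.rr) syntax, which remains a proposal;
split_with_digraphs is a hypothetical helper that tries the listed
strings first and falls back to a single code point.

```python
import re


def split_with_digraphs(text, digraphs=("ch", "ll", "rr")):
    """Split text into "characters", treating each digraph as one unit."""
    # Listed strings are tried first; '.' is the single-code-point fallback.
    pattern = "|".join(re.escape(d) for d in digraphs) + "|."
    return re.findall(pattern, text, re.DOTALL)


print(split_with_digraphs("chorrillo"))
```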

Mike


This archive was generated by hypermail 2.1.5 : Fri Sep 21 2007 - 12:58:09 CDT