Re: UAX44: loose matching of symbolic values and the `is` prefix

From: Nova Patch <patch.nova_at_gmail.com>
Date: Mon, 6 Jun 2016 17:39:05 -0400

On Monday, 6 June 2016, Doug Ewell wrote:
>
> Mathias Bynens wrote:
>
> > The `is` prefix doesn’t provide any functionality that would otherwise
> > be unavailable. It doesn’t add any value, yet causes incompatibility,
> > author confusion, and it increases implementation complexity.
>
> I don't see any evidence that it adds no value. Support for existing
> implementations is value.

Markus has now confirmed that ICU doesn’t support this syntax, and I can
confirm that even Perl, which probably supports the widest variety of ways
to write the same regex, doesn’t support any form of the `is` prefix for
property values when the property name is provided.

$ perl -Mutf8 -E 'say "π" =~ /\p{Script=Greek}/'
1
$ perl -Mutf8 -E 'say "π" =~ /\p{Script=IsGreek}/'
Can't find Unicode property definition "Script=IsGreek" at -e line 1.
$ perl -Mutf8 -E 'say "π" =~ /\p{Script=Is_Greek}/'
Can't find Unicode property definition "Script=Is_Greek" at -e line 1.

That said, Perl does optionally support the `is` prefix for property names
and standalone property values:

$ perl -Mutf8 -E 'say "π" =~ /\p{IsScript=Greek}/'
1
$ perl -Mutf8 -E 'say "π" =~ /\p{IsGreek}/'
1

However, this syntax is notoriously inconsistent among different regex
engines. Perl’s specific rules are documented in *perluniprops*
(http://perldoc.perl.org/perluniprops.html): \p{Is_*} (case- and
underscore-insensitive) is a synonym for \p{*}, which explains the above
behavior. Based on my past research for *Unicode Regular Expression
Engines* at IUC38, I suspect that there might not be any regex engine that
actually supports syntax like Script=IsGreek as described in UAX44-LM3! If
anybody knows otherwise, I’d love to hear about it.
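
For reference, the loose matching that UAX44-LM3 describes, as I read it
(ignore case, whitespace, underscores, hyphens, and any initial "is"), can
be sketched in a few lines. This is only an illustrative Python sketch, not
code taken from ICU, Perl, or the standard; the names loose_key and
loose_equal are made up for this example.

```python
import re


def loose_key(name: str) -> str:
    """Reduce a symbolic value to a comparison key, in the spirit of UAX44-LM3.

    Hypothetical helper: drops whitespace, underscores, and hyphens,
    lowercases the result, then strips one initial "is" prefix.
    """
    key = re.sub(r"[\s_\-]+", "", name).lower()
    if key.startswith("is"):
        key = key[2:]
    return key


def loose_equal(a: str, b: str) -> bool:
    """Two symbolic values match loosely when their keys are identical."""
    return loose_key(a) == loose_key(b)


if __name__ == "__main__":
    for candidate in ("Greek", "IsGreek", "Is_Greek", "is greek"):
        print(candidate, loose_equal(candidate, "Greek"))
```

Under this rule Script=IsGreek and Script=Is_Greek would both match
Script=Greek, which is exactly what the Perl one-liners above reject. Note
that the sketch also strips "is" from names that genuinely begin with those
letters, so loose_key("isc") and loose_key("c") produce the same key; a real
implementation would need to guard against such collisions.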

Nova
Received on Mon Jun 06 2016 - 16:39:40 CDT
