Re: String Ranges in Unicode Sets

From: Asmus Freytag (t) <asmus-inc_at_ix.netcom.com>
Date: Mon, 7 Sep 2015 15:11:44 -0700
On 9/6/2015 11:23 PM, Richard Wordingham wrote:
On Thu, 03 Sep 2015 09:32:42 -0700
Rick McGowan <rick@unicode.org> wrote:

A proposed update to the LDML specification (UTS #35) will be
available for review as of Monday, September 7 at 06:00 GMT. The open
review period closes on Monday, September 14 at 06:00 GMT. (This is a
short review period, because CLDR 28 is scheduled for release in the
week of September 16.)

The proposed update will be at
http://unicode.org/reports/tr35/proposed.html

To report bugs in the specification, please use 
http://unicode.org/cldr/trac/newticket

Have the implications of adding string ranges to Unicode sets been
considered?  I'm mentioning them on the list because their impact goes
beyond locales, and I haven't worked out their implications myself.

By my reading, adding string ranges will initially make regular
expression engines that don't use ICU non-compliant with Level 1 of
UTS#18 Unicode Regular Expressions, in particular RL1.3 'subtraction and
intersection'.  I don't imagine the extra work of set operations on
Unicode sets containing string ranges will be popular.  It may be worst
for the minority of regular expression engines that use the regularity
of regular expressions.

I note that the safety feature of requiring the start and end points
to have the same length has been removed from their design.  

The restriction appears to have been weakened to the point where the left string is allowed to be longer, with the "excess" then understood as a common prefix. On the face of it, that seems a mere convenience.
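To make the "common prefix" reading concrete, here is a minimal sketch of how such a range might expand. The function name expand_string_range is hypothetical. The per-position cross product used for multi-character tails is my assumption about the intended semantics, not something confirmed by the proposed text, which remains authoritative.

```python
from itertools import product


def expand_string_range(start: str, end: str) -> list[str]:
    """Expand a string range {start}-{end} into an explicit list of strings.

    Assumptions (not taken from the specification text):
    - the left string may be longer; its excess leading code points
      are treated as a prefix shared by every generated string;
    - the remaining positions are paired code point by code point,
      and each pair is an ordinary inclusive code point range.
    """
    if not end or len(end) > len(start):
        raise ValueError("end must be non-empty and no longer than start")

    cut = len(start) - len(end)
    prefix, tail = start[:cut], start[cut:]

    columns = []
    for lo, hi in zip(tail, end):
        if ord(lo) > ord(hi):
            raise ValueError(f"descending range {lo!r}-{hi!r}")
        columns.append([chr(cp) for cp in range(ord(lo), ord(hi) + 1)])

    # Cross product of the per-position ranges, each with the prefix prepended.
    return [prefix + "".join(combo) for combo in product(*columns)]


print(expand_string_range("ab", "d"))   # shorter right-hand string, prefix "a"
print(expand_string_range("ax", "bz"))  # equal lengths, no prefix
```

Under these assumptions {ab}-{d} and {ab}-{ad} denote the same three strings, which is what makes the shorter form look like a mere convenience.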


String ranges seem particularly vulnerable to the ill-effects of
unpredictable normalisation.

If a string range is, as claimed, merely a more compact statement of what can be done with existing sets and patterns, this should be made explicit by giving the rewrite rules.

That would answer two of your issues.

1) a preprocessor can be used to change range expressions into expressions that work with older engines
2) the normalization issues are no worse than for other sets
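As an illustration of point 1, a preprocessor along these lines could rewrite string ranges into explicit string lists before handing the pattern to an older engine. This is only a sketch: rewrite_string_ranges and the regular expression are hypothetical, escapes and nested braces are not handled, and the expansion rule (excess on the left is a shared prefix, remaining positions are per-code-point ranges) is my assumption rather than the specification's wording.

```python
import re
from itertools import product

# Matches {start}-{end}; deliberately naive (no escapes, no nested braces).
STRING_RANGE = re.compile(r"\{([^{}]+)\}-\{([^{}]+)\}")


def _expand(match: re.Match) -> str:
    start, end = match.group(1), match.group(2)
    cut = len(start) - len(end)
    if cut < 0:
        raise ValueError(f"right-hand string longer than left: {match.group(0)}")

    # One inclusive code point range per paired position.
    columns = [
        [chr(cp) for cp in range(ord(lo), ord(hi) + 1)]
        for lo, hi in zip(start[cut:], end)
    ]
    if any(not column for column in columns):
        raise ValueError(f"descending range in {match.group(0)}")

    # Emit each generated string as its own {…} element.
    return "".join(
        "{" + start[:cut] + "".join(combo) + "}" for combo in product(*columns)
    )


def rewrite_string_ranges(pattern: str) -> str:
    """Replace every {start}-{end} in a set pattern with explicit strings."""
    return STRING_RANGE.sub(_expand, pattern)


print(rewrite_string_ranges("[a-c{ab}-{d}]"))
```

Because the output contains only the existing set syntax, whatever normalization behaviour the older engine has for explicit strings applies unchanged, which is the substance of point 2.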

There may be the issue of how these play with operations on the sets themselves, like union, intersection, and difference.

These cases should be covered by the required rewrite rules to make it verifiable that the ranges are simply syntactic sugar and do not have hidden new functionality.

A./

Richard.


Received on Mon Sep 07 2015 - 17:13:06 CDT