String Ranges in Unicode Sets

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Mon, 7 Sep 2015 07:23:21 +0100

On Thu, 03 Sep 2015 09:32:42 -0700
Rick McGowan <rick_at_unicode.org> wrote:

> A proposed update to the LDML specification (UTS #35) will be
> available for review as of Monday, September 7 at 06:00 GMT. The open
> review period closes on Monday, September 14 at 06:00 GMT. (This is a
> short review period, because CLDR 28 is scheduled for release in the
> week of September 16.)
>
> The proposed update will be at
> http://unicode.org/reports/tr35/proposed.html
>
> To report bugs in the specification, please use
> http://unicode.org/cldr/trac/newticket
>

Have the implications of adding string ranges to Unicode sets been
considered? I'm raising this on the list because the impact goes
beyond locales, and I haven't worked out all the implications myself.

By my reading, adding string ranges will initially make regular
expression engines that don't use ICU non-compliant with Level 1 of
UTS #18, Unicode Regular Expressions, in particular RL1.3 'Subtraction
and Intersection'. I don't imagine the extra work of implementing set
operations on Unicode sets containing string ranges will be popular.
It may be worst for the minority of regular expression engines that
actually exploit the regularity of regular expressions.
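
To make the set arithmetic concrete, here is a minimal Python sketch.
It assumes one reading of the proposal, which may not match the final
text: a shorter end string inherits its missing leading code points
from the start string, and each code point position then ranges
independently. The name expand_string_range is hypothetical, and this
is not ICU's implementation.

```python
from itertools import product

def expand_string_range(start, end):
    """Expand a string range into an explicit set of strings.

    Assumed semantics (an illustration, not the normative rule):
    a shorter `end` inherits the leading code points of `start`, and
    each position then ranges independently from start[i] to end[i].
    """
    if len(end) > len(start):
        raise ValueError("end must not be longer than start")
    # Pad the end string with the leading code points of the start string.
    end = start[:len(start) - len(end)] + end
    columns = []
    for lo, hi in zip(start, end):
        if ord(lo) > ord(hi):
            raise ValueError("each position must be non-decreasing")
        columns.append([chr(cp) for cp in range(ord(lo), ord(hi) + 1)])
    # The range is the Cartesian product of the per-position code point ranges.
    return {"".join(chars) for chars in product(*columns)}

a = expand_string_range("ab", "cd")   # 3 x 3 = 9 strings
b = expand_string_range("bb", "d")    # read as {bb}-{bd} under the assumption
difference = a - b                    # RL1.3-style subtraction
intersection = a & b                  # RL1.3-style intersection
```

Under these assumptions the difference is {ab, ac, ad, cb, cc, cd},
which cannot be written as a single string range, because a first
position running from a to c would bring the b-initial strings back.
Explicit expansion also grows as the product of the per-position
widths, so an engine would presumably need a smarter representation
for large ranges.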

I note that the safety feature of requiring the start and end points
to have the same length has been removed from their design. String
ranges seem particularly vulnerable to the ill-effects of unpredictable
normalisation.
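
As an illustration of the normalisation concern, the Python sketch
below uses the standard unicodedata module to show that the length of
an endpoint in code points can depend on the normalisation form. The
endpoint strings are made-up examples, not taken from the proposal.

```python
import unicodedata

# Hypothetical range endpoints, both written as letter + combining acute.
start = "a\u0301"   # a + COMBINING ACUTE ACCENT
end = "b\u0301"     # b + COMBINING ACUTE ACCENT

# NFC composes a + acute into U+00E1, but there is no precomposed
# b-with-acute, so the second endpoint keeps both code points.
nfc_start = unicodedata.normalize("NFC", start)
nfc_end = unicodedata.normalize("NFC", end)

lengths_before = (len(start), len(end))
lengths_after = (len(nfc_start), len(nfc_end))
```

Here the two endpoints have the same length before normalisation and
different lengths after it, which is the kind of mismatch a
same-length requirement would presumably have flagged.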

Richard.
Received on Mon Sep 07 2015 - 01:25:00 CDT