Re: String Ranges in Unicode Sets

From: Mark Davis ☕️ <mark_at_macchiato.com>
Date: Mon, 7 Sep 2015 16:54:16 +0200

Thanks for the feedback.

> By my reading, adding string ranges will initially make regular
> expression engines that don't use ICU non-compliant with Level 1 of
> UTS#18 Unicode Regular Expressions, in particular RL1.3 'subtraction and
> intersection'.

I don't see where you are getting that. UTS 35 isn't referenced by UTS 18
except for some examples of possible extensions in 1.2.3 Other Properties,
and locale id syntax in level 3. I may be missing something, however. Can
you tell me where #18 is referencing UnicodeSet?

> I don't imagine the extra work of set operations

String ranges need not be implemented internally (and I don't think the
CLDR committee would expect them to be, in general). They are simply a way
of expressing the *string format* of a UnicodeSet in a more compact
fashion. (And UnicodeSets themselves can have a variety of different
implementations, in any event).

> String ranges seem particularly vulnerable to the ill-effects of unpredictable
> normalisation.

UnicodeSets are low level constructs, as are their string representations.
Like all strings, the string format of a UnicodeSet may change if it is
normalized. That is nothing new.

   - The string format "[a-Ω]" (that is, U+0061 LATIN SMALL LETTER A through
   U+2126 OHM SIGN) represents a UnicodeSet that contains 8,390 code points.
   - Under NFC it would change to "[a-Ω]" (that is, U+0061 LATIN SMALL
   LETTER A through U+03A9 GREEK CAPITAL LETTER OMEGA), and contain 841
   code points.
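
The two counts above can be checked with a short Python sketch that uses the standard unicodedata module. The helper name range_size is made up for illustration, and it only handles the simple "[x-y]" form, not the full UnicodeSet syntax.

```python
import unicodedata


def range_size(pattern: str) -> int:
    """Count the code points in a pattern of the exact form "[x-y]".

    Hypothetical helper for illustration; it does not parse real
    UnicodeSet syntax.
    """
    if (len(pattern) != 5 or pattern[0] != "[" or pattern[2] != "-"
            or pattern[4] != "]"):
        raise ValueError("expected a pattern of the form [x-y]")
    return ord(pattern[3]) - ord(pattern[1]) + 1


original = "[a-\u2126]"  # range ends at U+2126 OHM SIGN
normalized = unicodedata.normalize("NFC", original)

print(f"U+{ord(normalized[3]):04X}")  # U+03A9
print(range_size(original))           # 8390
print(range_size(normalized))         # 841
```

The end point silently becomes U+03A9 because OHM SIGN has a singleton canonical decomposition to GREEK CAPITAL LETTER OMEGA, so the same five-character string describes a much smaller set after NFC.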

You really don't want to normalize the string format of UnicodeSets. Or, if
you suspect that those string formats might be normalized, then just use the
escaped format \x{...} for anything that might change under normalization.
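
One way to apply that advice is to escape any character whose form changes under normalization before the pattern is stored or transmitted. The Python sketch below assumes a hypothetical helper, escape_unstable, that is not part of ICU or CLDR. It tests each character on its own against all four normalization forms and leaves everything else untouched.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def escape_unstable(pattern: str) -> str:
    """Replace each character that any normalization form would change
    with a \\x{...} escape. Hypothetical helper for illustration only."""
    out = []
    for ch in pattern:
        if any(unicodedata.normalize(form, ch) != ch for form in FORMS):
            out.append(f"\\x{{{ord(ch):X}}}")
        else:
            out.append(ch)
    return "".join(out)


escaped = escape_unstable("[a-\u2126]")
print(escaped)  # [a-\x{2126}]
```

The escaped result is pure ASCII, so no normalization form alters it. Because the check is per character, this sketch does not catch sequences that only change in combination (for example a base letter followed by a combining mark that composes under NFC); a fuller version would have to escape those as well.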

===

Note that while it is fine to bring up topics for discussion here (or,
better yet, on the cldr-users_at_unicode.org list),
anything that requires a change will have to be filed as a CLDR ticket.
Richard, I'm sure you know this, and also raised this topic here because of
the relation to UTS18, so this is a reminder for others.

Mark <https://google.com/+MarkDavis>

*— Il meglio è l’inimico del bene —*

On Mon, Sep 7, 2015 at 8:23 AM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> On Thu, 03 Sep 2015 09:32:42 -0700
> Rick McGowan <rick_at_unicode.org> wrote:
>
> > A proposed update to the LDML specification (UTS #35) will be
> > available for review as of Monday, September 7 at 06:00 GMT. The open
> > review period closes on Monday, September 14 at 06:00 GMT. (This is a
> > short review period, because CLDR 28 is scheduled for release in the
> > week of September 16.)
> >
> > The proposed update will be at
> > http://unicode.org/reports/tr35/proposed.html
> >
> > To report bugs in the specification, please use
> > http://unicode.org/cldr/trac/newticket
> >
>
> Have the implications of adding string ranges to Unicode sets been
> considered? I'm mentioning them on the list because their impact goes
> beyond locales, and I haven't worked out their implications myself.
>
> By my reading, adding string ranges will initially make regular
> expression engines that don't use ICU non-compliant with Level 1 of
> UTS#18 Unicode Regular Expressions, in particular RL1.3 'subtraction and
> intersection'. I don't imagine the extra work of set operations on
> Unicode sets containing string ranges will be popular. It may be worst
> for the minority of regular expression engines that use the regularity
> of regular expressions.
>
> I note that the safety feature of requiring the start and end points
> to have the same length has been removed from their design. String
> ranges seem particularly vulnerable to the ill-effects of unpredictable
> normalisation.
>
> Richard.
>
Received on Mon Sep 07 2015 - 09:56:06 CDT
