Re: String Ranges in Unicode Sets

From: Richard Wordingham <>
Date: Mon, 7 Sep 2015 20:46:06 +0100

On Mon, 7 Sep 2015 16:54:16 +0200
Mark Davis ☕️ <> wrote:

> On Mon, Sep 7, 2015 at 8:23 AM, Richard Wordingham <
>> wrote:

>> By my reading, adding string ranges will initially make regular
>> expression engines that don't use ICU non-compliant with Level 1 of
>> UTS#18 Unicode Regular Expressions, in particular RL1.3 'subtraction
>> and

> I don't see where you are getting that. UTS 35 isn't referenced by
> UTS 18 except for some examples of possible extensions in 1.2.3 Other
> Properties, and locale id syntax in level 3. I may be missing
> something, however. Can you tell me where #18 is referencing
> UnicodeSet?

In ,
you stated that the Unicode sets referred to in UTS#18 RL1.3 are the
Unicode sets defined in UTS #35. We are now waiting for you to add the
reference under Action 141-A76 - 'Make changes in UTS #18 based on
general feedback in
L2/14-277' (
I presume no change has been made yet because there are no *urgent*
changes for UTS #18.

> String ranges need not be implemented internally (and I don't think
> the CLDR committee would expect them to be, in general). They are
> simply a way of expressing the *string format* of a UnicodeSet in a
> more compact fashion. (And UnicodeSets themselves can have a variety
> of different implementations, in any event).

[\x{0000 0000 0000 0000} - \x{DFFFF DFFFF DFFFF DFFFF}] is a
very compact way of expressing a lot of strings. You wouldn't
decompose that into a list of strings.
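A rough sketch of why one would not enumerate such a range, assuming
(as in the string-range proposal under discussion) that both bounds have
the same number of code points and that each position ranges
independently. The helper names range_size and in_range are hypothetical
and are not taken from ICU or CLDR.

```python
from math import prod

def range_size(first, last):
    """Count the strings in a string range without enumerating them.

    `first` and `last` are equal-length lists of code points; position i
    is assumed to range independently from first[i] to last[i].
    """
    if len(first) != len(last):
        raise ValueError("bounds must have the same number of code points")
    if any(lo > hi for lo, hi in zip(first, last)):
        raise ValueError("each lower bound must not exceed its upper bound")
    return prod(hi - lo + 1 for lo, hi in zip(first, last))

def in_range(candidate, first, last):
    """Membership test that only looks at the bounds, position by position."""
    return len(candidate) == len(first) and all(
        lo <= cp <= hi for cp, lo, hi in zip(candidate, first, last)
    )

first = [0x0000] * 4
last = [0xDFFFF] * 4
print(range_size(first, last))                # far too many strings to list
print(in_range([0x61] * 4, first, last))      # membership needs no list
```

Membership is decided from the two bounds alone, so the set never has to
be stored as a list of strings.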

>> String
>> ranges seem particularly vulnerable to the ill-effects of
>> unpredictable

> UnicodeSets are low level constructs, as are their string
> representations. Like all strings, the string format of a UnicodeSet
> may change if it is normalized. That is nothing new.

> - The string format "[a-Ω]" (that is, U+0061 LATIN SMALL LETTER A
> through U+2126 OHM SIGN) represents a UnicodeSet that contains 8,390
> code points.
> - Under NFC it would change to "[a-Ω]" (that is, U+0061 LATIN SMALL
> LETTER A through U+03A9 GREEK CAPITAL LETTER OMEGA), and would
> contain 841 code points.
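The two counts quoted above can be checked with a short sketch using
Python's standard unicodedata module. The helper name
code_point_range_size is hypothetical; it simply counts the code points
between two single-character bounds, inclusive.

```python
import unicodedata

def code_point_range_size(first, last):
    """Number of code points from `first` to `last`, inclusive."""
    return ord(last) - ord(first) + 1

ohm = "\u2126"                               # OHM SIGN
omega = unicodedata.normalize("NFC", ohm)    # canonical singleton mapping

print(f"U+{ord(omega):04X}")                 # U+03A9
print(code_point_range_size("a", ohm))       # 8390
print(code_point_range_size("a", omega))     # 841
```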

At least this gives the same range whether normalised to NFC or to
NFD. Using NFD, the preferred normalisation for regular
expressions semi-respecting canonical equivalence, [{x̀}-{ẍ}] would
not include the 2-character string "xa", as both bounds would decompose
to two characters. Using NFC, the preferred normalisation for LDML
(and for XML, I think), this would be a contraction for [{x̀}-{xẍ}],
and would include the 2-character string "xa". If the two strings had
to have the same length, [{x̀}-{ẍ}] would be flagged as erroneous if
interpreted in NFC, and with any luck, similar errors that were not
detected would then also be corrected. It's not perfect, but il meglio
è l’inimico del bene (the best is the enemy of the good).
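To illustrate how the lengths of the two bounds depend on the
normalisation form, here is a small sketch using Python's standard
unicodedata module. The function name bound_lengths is hypothetical; it
only reports how many code points each bound has after normalisation and
does not implement string ranges themselves.

```python
import unicodedata

X_GRAVE = "x\u0300"       # x + COMBINING GRAVE ACCENT (no precomposed form)
X_DIAERESIS = "\u1E8D"    # LATIN SMALL LETTER X WITH DIAERESIS

def bound_lengths(form):
    """Return (len(lower bound), len(upper bound)) in code points."""
    lo = unicodedata.normalize(form, X_GRAVE)
    hi = unicodedata.normalize(form, X_DIAERESIS)
    return len(lo), len(hi)

for form in ("NFC", "NFD"):
    print(form, bound_lengths(form))
# NFC (2, 1)
# NFD (2, 2)
```

Under a rule requiring both bounds to have the same length, the NFC
spelling would be rejected, while the NFD spelling would be accepted.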

> You really don't want to normalize the string format of UnicodeSets.
> Or if you suspect that those string formats might be normalized, then
> just use escaped format \x{...} for anything that might change under
> normalization.

It would probably be sensible to issue a warning if the specification
of a string bound had more than one canonical equivalent.

I'm thinking of accidents. While an XML processor must not be Unicode
compliant, I thought most regular expression engine environments were
allowed to be Unicode compliant.

TUS 8.0 Chapter 3 C6: "A process shall not assume that the
interpretations of two canonical-equivalent character sequences are
distinct."

> Note that while it is fine to bring up topics for discussion here (or,
> better yet, on the "" <>
> list),

As this impacts regular expressions in general, I think this is the
better list for the impact on Unicode sets outside CLDR.

> anything that requires a change will have to be filed as a
> CLDR ticket. Richard, I'm sure you know this, and also raised this
> topic here because of the relation to UTS18, so this is a reminder
> for others.


Received on Mon Sep 07 2015 - 14:47:37 CDT
