Re: String Ranges in Unicode Sets

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Tue, 8 Sep 2015 23:01:35 +0100

On Tue, 8 Sep 2015 13:46:48 +0200
Mark Davis ☕️ <mark_at_macchiato.com> wrote:

> On Tue, Sep 8, 2015 at 9:53 AM, Asmus Freytag (t)
> <asmus-inc_at_ix.netcom.com> wrote:

<snip>

> > What about set operations on sets with string ranges?

> Again, the range notation is just a formatting issue. Anything you
> can do with [{ax}-{bz}] you can also do with
> [{ax}{ay}{az}{bx}{by}{bz}], and vice versa, since the former is
> defined to be equivalent to the latter. These are just string
> representations of the same *logical* underlying implementation.
 
> > Can they be expressed (other than working them out and writing down
> > the full enumeration of the resulting set)?

> I'm not quite sure what you mean. That's like asking, "Can [a-z] be
> expressed, other than by writing out the full enumeration [a b c d
> e ... z]?". Well, yes. You could represent [a-z] in many ways:
> [\p{ASCII}&\p{lu}], for example. Or [\u0061 \u0062 ...]. Or....

> But I'm probably misunderstanding what you are trying to say.

I think Asmus is asking whether there is a more compact representation
of the result of a set operation on string ranges than just listing all
the string elements. The answer would then be yes. Just as
[a-z]~~[e-s] can be written (and represented internally) as [a-dt-z], so
[{aa}-{zz}]-[{ee}-{ss}] can be written (and represented internally) as
the union of four non-overlapping string ranges [{aa}-{dz} {ea}-{sd}
{et}-{sz} {ta}-{zz}]. Fortunately, unions of string ranges of the same
length commute, which is not necessarily the case for Unicode sets.
(It is possible that [[a][{ab}]] might preferentially match "a" while
[[{ab}][a]] preferentially matches "ab".)
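
As a rough check of that decomposition, here is a minimal brute-force
sketch in Python. It assumes the per-position reading of string ranges
quoted above ({ax}-{bz} means ax ay az bx by bz); the helper name
`expand` is hypothetical and is not part of any UnicodeSet
implementation. The last range is taken to run through {zz}, so that
every first letter from t to z is covered.

```python
from itertools import product

def expand(lo, hi):
    """Expand the string range {lo}-{hi} position by position (assumed semantics)."""
    if len(lo) != len(hi):
        raise ValueError("range endpoints must have the same length")
    # One list of candidate characters per position, e.g. [a..b] and [x..z].
    axes = [[chr(c) for c in range(ord(a), ord(b) + 1)] for a, b in zip(lo, hi)]
    return {"".join(chars) for chars in product(*axes)}

# The set difference, computed by full enumeration.
difference = expand("aa", "zz") - expand("ee", "ss")

# The compact form: four string ranges.
pieces = [("aa", "dz"), ("ea", "sd"), ("et", "sz"), ("ta", "zz")]
expanded = [expand(lo, hi) for lo, hi in pieces]
union = set().union(*expanded)

# The ranges cover exactly the difference, and since their sizes add up
# to the size of the union, no string is counted twice.
print(union == difference)
print(sum(len(p) for p in expanded) == len(union))
```

A real implementation would of course keep the ranges themselves rather
than enumerating them; the enumeration here is only to confirm that the
four ranges are disjoint and add up to the right set.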

Richard.
Received on Tue Sep 08 2015 - 17:02:48 CDT