Re: Unicode Sets in 'Unicode Regular Expressions'

From: Richard Wordingham <>
Date: Wed, 28 May 2014 01:19:26 +0100

On Wed, 28 May 2014 00:56:40 +0200
Charlie Ruland ☘ <> wrote:

> So I take “Unicode set” to mean “set of Unicode characters” with
> their respective codepoints, whether decomposable or not.

The decomposability issue arises when trying to follow RL2.1
"Canonical Equivalence". In a pattern such as "f\p{L}te",
\p{L} is not just a set of codepoints if the pattern is to be matched
by "fête" when processing NFD strings. This is one reason I think Ken
is right when he says the ICU meaning is intended. I believe I have a
coherent resolution of RL2.1, but I'm currently wrestling with the
other requirements that an implementation satisfying the spirit of
RL2.1 ought to address.
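
To make the "fête" case concrete, here is a minimal Python sketch. It is
only an illustration, not the resolution mentioned above and not ICU's
behaviour. The function names match_codepoints and match_canonical are
hypothetical, and the pattern "f\p{L}te" is hard-coded because Python's
standard re module has no \p{L}. The second matcher assumes that \p{L}
may consume any sequence whose NFC form is a single letter, which is just
one possible reading of RL2.1 (it ignores, for example, letters that NFC
never recomposes).

```python
import unicodedata


def is_letter(ch: str) -> bool:
    # General_Category L* (Lu, Ll, Lt, Lm, Lo) for a single code point.
    return unicodedata.category(ch).startswith("L")


def match_codepoints(s: str) -> bool:
    # Full match of f\p{L}te, reading \p{L} as a set of single code points.
    return len(s) == 4 and s[0] == "f" and is_letter(s[1]) and s[2:] == "te"


def match_canonical(s: str) -> bool:
    # Full match of f\p{L}te where \p{L} may also consume a sequence that
    # is canonically equivalent to one letter (assumption: checked via NFC).
    if not (s.startswith("f") and s.endswith("te")):
        return False
    middle = unicodedata.normalize("NFC", s[1:-2])
    return len(middle) == 1 and is_letter(middle)


nfc = "f\u00eate"                          # "fête" with precomposed U+00EA
nfd = unicodedata.normalize("NFD", nfc)    # f, e, U+0302, t, e

print(match_codepoints(nfc), match_codepoints(nfd))
print(match_canonical(nfc), match_canonical(nfd))
```

With the codepoint-set reading, the NFD string fails because U+0302 is a
combining mark (Mn), not the "t" the pattern expects next; the second
matcher accepts both forms.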


Unicode mailing list
Received on Tue May 27 2014 - 19:20:29 CDT
