Re: \p{} and \g{} in regexp

From: Mark Davis
Date: Tue Jul 23 2002 - 13:43:49 EDT

1. Here is my take, if you are trying to slim down:

> canonical combining class

This is really useful for matching. For example, if my source text is
NFD and I want to recognize whatever is canonically equivalent to
a-ring (with perhaps other accents), then I have to use something like
the following (syntax may vary, and I throw in some variables for
clarity):

$nonAboveAccent = [\p{ccc!=0} & \p{ccc!=above}] ;
$ringAbove = \u030A ;

$pattern = [aA] $nonAboveAccent* $ringAbove ;
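For anyone who wants to try the NFD idea outside a regex engine, here is a rough Python sketch of the same match. It assumes $nonAboveAccent is meant as "combining class neither 0 nor Above (230)". The function names are made up for illustration, and it only checks a prefix of the string, as the pattern does.

```python
import unicodedata

RING_ABOVE = "\u030A"   # COMBINING RING ABOVE
CCC_ABOVE = 230         # canonical combining class "Above"


def is_non_above_accent(ch):
    """True for combining marks whose class is neither 0 nor Above."""
    ccc = unicodedata.combining(ch)
    return ccc != 0 and ccc != CCC_ABOVE


def starts_with_a_ring(text):
    """Hypothetical helper: does text begin with something canonically
    equivalent to a-ring, possibly with intervening non-above accents?"""
    nfd = unicodedata.normalize("NFD", text)
    if not nfd or nfd[0] not in "aA":
        return False
    i = 1
    # Skip accents that cannot interact with the ring (e.g. dot below).
    while i < len(nfd) and is_non_above_accent(nfd[i]):
        i += 1
    return i < len(nfd) and nfd[i] == RING_ABOVE
```

With this, starts_with_a_ring("\u00E5") and starts_with_a_ring("a\u0323\u030A") both hold, while "a\u0301\u030A" (acute before ring) does not, since the acute is itself an Above accent.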

Note that it is more complicated to do the same thing in NFC; it looks
something like:

$aWithSomeRing = [\u00C5\u00E5\u01FA\u01FB] ; // precomposed a-ring forms
$aWithWithNonAboves = [ÀÁÂÃÄàáâãäĀā....] ; // many more

$pattern = ($aWithSomeRing | $aWithWithNonAboves $nonAboveAccent* $ringAbove) ;

> bidi class ?
> east asian width
> arabic joining type ?
> arabic joining group ?
> line break

These are all similar, and have to do with eventual appearance on the
screen. If you are trying to match expressions based on one of these
eventual display features, then they are useful; otherwise, they
aren't particularly useful.
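As a small illustration of what matching on a display-related property could look like, here is a Python sketch using the standard unicodedata module. The helper name select_by_property is hypothetical; only the bidi class and East Asian width lookups come from the library.

```python
import unicodedata


def select_by_property(text, getter, value):
    """Hypothetical helper: keep the characters of text whose property,
    as reported by getter, equals value."""
    return "".join(ch for ch in text if getter(ch) == value)


# East Asian width "W" (wide) and bidi class "R" (right-to-left).
wide = select_by_property("a\u6F22b", unicodedata.east_asian_width, "W")
rtl = select_by_property("a\u05D0b", unicodedata.bidirectional, "R")
```

Here wide keeps only U+6F22 and rtl keeps only U+05D0. Unless a pattern really depends on layout in this way, such lookups add little.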

> all binary properties

Of these, the Other_X properties are only contributory, and can be omitted.
The Expands_On_X properties are not, in my opinion, particularly useful, and
are mainly included for historical reasons.

You may want to look at the ICU property support for 2.2 (in beta)
just for comparison. See
html. The items marked "U" are included in UnicodeSet which
corresponds to [...] in regular-expression engines.

2. I'll give a more concrete example: [\p{script=latin}\g{aa} &

The way we interpret this is that [\p{lowercase}] is a set of code
points, as you do. I wouldn't try anything fancier. When you AND them
with a set of code points and strings, you end up with just code
points. That is well defined; you just need to caution the user that
it will exclude strings, such as "aa". If a user wanted to do a
broader match, s/he would write something like:

$lowercaseLetterString = (\p{lowercase} \p{gc=non-spacing mark}*)* ;

[\p{script=latin}\g{aa}] & $lowercaseLetterString
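To make the difference between the two interpretations concrete, here is a rough Python sketch. It models a bracketed set as a plain Python set of strings, uses str.islower() as an approximation of \p{lowercase}, and treats general category Mn as the non-spacing marks. All names are illustrative, not taken from any engine.

```python
import unicodedata


def intersect_code_points(items, has_property):
    """AND a set of code points and strings with a code point property:
    only single code points can survive, so strings drop out."""
    return {s for s in items if len(s) == 1 and has_property(s)}


def is_lowercase_letter_string(s):
    """Approximates (\\p{lowercase} \\p{gc=non-spacing mark}*)* :
    the first character must be lowercase, and every later character
    is either lowercase or a non-spacing mark."""
    for i, ch in enumerate(s):
        if ch.islower():
            continue
        if i > 0 and unicodedata.category(ch) == "Mn":
            continue
        return False
    return True


items = {"a", "B", "z", "aa"}   # stands in for a set containing \g{aa}
narrow = intersect_code_points(items, str.islower)
broad = {s for s in items if is_lowercase_letter_string(s)}
```

The narrow result contains only "a" and "z", so the string "aa" is excluded as described above; the broad result keeps "aa" as well.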

◄ “Eppur si muove” ►

----- Original Message -----
From: "Theo Veenker" <>
To: <>
Sent: Tuesday, July 23, 2002 00:18
Subject: \p{} and \g{} in regexp

> Hi,
> I have a few questions regarding unicode regular expressions.
> 1) I'm working on a regexp matcher and I'd like to know which properties
> are never needed in a \p{...} item. Currently I have included the properties
> listed below, but for efficiency reasons I'd like to throw out what isn't
> really necessary:
> general category
> bidi class ?
> canonical combining class ?
> decomposition type
> line break
> east asian width
> arabic joining type ?
> arabic joining group ?
> script name
> block name
> age
> numeric type
> all binary properties
> So can anyone tell me if the marked properties are really useful in
> a \p{...} item?
> 2) About grapheme clusters in a bracketed expression. It is clear what is
> meant by an expression like [a-z\g{aa}]. But how do I interpret something
> like [a-z\g{aa} & \p{foo}]? This reads as: accept any character in
> a-z or grapheme cluster aa, provided it has the foo property. The problem
> is that \p{...} only applies to single code points, not to grapheme clusters.
> I can do three things:
> 1. try if NFC of characters in \g{...} yields a single character and
> work with that, otherwise fail
> 2. only test first (base) character of the cluster
> 3. don't allow use of operators & and - (i.e. &^) in a bracketed
> expression in which one or more \g{...} are used
> What would be the most appropriate thing to do?
> Regards,
> Theo
