\p{} and \g{} in regexp

From: Theo Veenker (Theo.Veenker@let.uu.nl)
Date: Tue Jul 23 2002 - 03:18:25 EDT


Hi,

I have a few questions regarding Unicode regular expressions.

1) I'm working on a regexp matcher and I'd like to know which properties
are never needed in a \p{...} item. Currently I have included the properties
listed below, but for efficiency reasons I'd like to throw out what isn't
really necessary:

  general category
  bidi class ?
  canonical combining class ?
  decomposition type
  line break
  east asian width
  arabic joining type ?
  arabic joining group ?
  script name
  block name
  age
  numeric type
  all binary properties

So can anyone tell me if the marked properties are really useful in
a \p{...} item?

2) About grapheme clusters in a bracketed expression. It is clear what is
meant by an expression like [a-z\g{aa}]. But how do I interpret something
like [a-z\g{aa} & \p{foo}]? This reads as: accept any character in range
a-z or grapheme cluster aa, provided it has the foo property. The problem
is that \p{...} only applies to single code points, not to grapheme clusters.
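
To make the mismatch concrete, here is a minimal Python sketch. It uses the
standard unicodedata module rather than the matcher itself, and the cluster
"a" + COMBINING RING ABOVE is only an assumed example, not one taken from a
real pattern. A property lookup is defined per code point, so a decomposed
cluster has no single answer:

```python
import unicodedata

# Assumed example cluster: 'a' followed by COMBINING RING ABOVE (U+030A).
cluster = "a\u030a"

# Property lookups take exactly one code point, so a cluster yields
# one value per code point rather than one value for the cluster.
per_code_point = [unicodedata.category(ch) for ch in cluster]


def category_of(text):
    """Return the general category, or None if text is not one code point."""
    try:
        return unicodedata.category(text)
    except TypeError:
        # unicodedata.category() rejects strings longer than one code point.
        return None


print(per_code_point)
print(category_of(cluster))
```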

I can do three things:
  1. check whether the NFC form of the characters in \g{...} is a single
     character and work with that; otherwise fail
  2. test only the first (base) character of the cluster
  3. don't allow use of operators & and - (i.e. &^) in a bracketed
     expression in which one or more \g{...} are used
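
Options 1 and 2 above could be sketched roughly as follows in Python. This
is only an illustration under assumptions: nfc_single, base_char and
is_lowercase_letter are hypothetical helper names, the general category Ll
stands in for \p{foo}, and the two clusters are made-up test cases.

```python
import unicodedata


def nfc_single(cluster):
    """Option 1: return the single NFC code point, or None if there is none."""
    composed = unicodedata.normalize("NFC", cluster)
    return composed if len(composed) == 1 else None


def base_char(cluster):
    """Option 2: return the first (base) code point of the cluster."""
    return cluster[0]


def is_lowercase_letter(ch):
    # Hypothetical stand-in for \p{foo}: general category Ll.
    return unicodedata.category(ch) == "Ll"


a_ring = "a\u030a"  # composes to the single code point U+00E5
q_ring = "q\u030a"  # has no precomposed form, so NFC keeps two code points

for cluster in (a_ring, q_ring):
    single = nfc_single(cluster)
    option1 = is_lowercase_letter(single) if single is not None else None
    option2 = is_lowercase_letter(base_char(cluster))
    print(repr(cluster), option1, option2)
```

With option 1 the second cluster cannot be tested at all, while option 2
gives an answer for both but ignores the combining mark.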

What would be the most appropriate thing to do?

Regards,
Theo