Re: Regular expressions in Unicode (Was: Ethiopic text)

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Mar 12 1998 - 14:03:48 EST


And one of the fundamental issues addressed squarely by
the Single UNIX Specification for Regular Expressions,
in the document pointed to by Xiao-He, is the dependence
of regexp range expressions on locale-specific collations:

"Range expressions must not be used in portable applications
because their behaviour is dependent on the collating
sequence. Ranges will be treated according to the current
collating sequence, and include such characters that fall
within the range based on that collating sequence, regardless
of character values. This, however, means that the
interpretation will differ depending on collating sequence."
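
To make the quoted point concrete, here is a minimal sketch in Python.
It does not call any real locale machinery: both collating sequences
below are hypothetical tables made up for illustration, and
range_members is an assumed helper name, not part of any regexp
library. It only shows how the same range, a-z, can pick out different
characters depending on the ordering in force.

```python
import string

# Hypothetical collating sequences, not taken from any real locale definition.
# Code point order: 'A'..'Z' followed by 'a'..'z'.
CODEPOINT_ORDER = sorted(string.ascii_letters)
# Dictionary-style order: a A b B ... z Z.
DICTIONARY_ORDER = [
    ch
    for pair in zip(string.ascii_lowercase, string.ascii_uppercase)
    for ch in pair
]


def range_members(start, end, collation):
    """Return the characters lying between start and end (inclusive)
    in the given collating sequence, ignoring character values."""
    lo, hi = collation.index(start), collation.index(end)
    return set(collation[lo:hi + 1])


# The same range expression, [a-z], under the two orderings.
codepoint_range = range_members("a", "z", CODEPOINT_ORDER)
dictionary_range = range_members("a", "z", DICTIONARY_ORDER)

print(sorted(codepoint_range))
print(sorted(dictionary_range))
```

Under the code point table the range holds only the lowercase letters;
under the dictionary-style table it also takes in the uppercase letters
that sort between "a" and "z", while "Z" sorts after "z" and falls
outside.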

The specification also defines the "character class
expressions" pointed to by Gary Miller, e.g. [:alpha:],
etc. In general,

"In addition character class expressions of the form [:name:]
are recognized in those locales where the *name* keyword has
been given a charclass definition in the LC_CTYPE category."

What the specification doesn't say is that this mechanism,
too, is locale-specific, and hence not safe for portable
applications, since the LC_CTYPE category can also be redefined
on a per-locale basis.
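
The same kind of sketch applies to character classes. The two "alpha"
tables below are hypothetical stand-ins for per-locale LC_CTYPE
charclass definitions, and all_alpha is an assumed helper that mimics
matching ^[[:alpha:]]+$. Nothing here reads a real locale.

```python
import string

# Hypothetical per-locale definitions of the "alpha" class, standing in
# for LC_CTYPE charclass entries. Neither is copied from a real locale.
ACCENTED = "\u00e9\u00e8\u00e0"  # e-acute, e-grave, a-grave
ALPHA_BY_LOCALE = {
    "ascii_only": set(string.ascii_letters),
    "with_accents": set(string.ascii_letters) | set(ACCENTED),
}


def all_alpha(text, locale_name):
    """Mimic matching ^[[:alpha:]]+$ against text under the named
    (hypothetical) locale."""
    alpha = ALPHA_BY_LOCALE[locale_name]
    return len(text) > 0 and all(ch in alpha for ch in text)


word = "\u00e9t\u00e9"  # French "ete" with acute accents on both e's

ascii_result = all_alpha(word, "ascii_only")
accent_result = all_alpha(word, "with_accents")

print(ascii_result, accent_result)
```

The pattern and the input are identical in both calls; only the class
definition differs, and so does the answer.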

Extending regular expression syntax to the universal
character set opens up an enormous can of worms. In
my opinion, no amount of jiggering the locale model and
the regexp syntax is really going to "solve" this problem.
And while ISO 14651 and ISO 14652 are heroic attempts to
extend the model to take UCS into account, they have
gaping flaws as yet.

Keep in mind that end users don't use regular expressions
(unless forced to by user-vicious UIs) -- it is UNIX
programmers who use regular expressions. There is then
an enormous house of cards of programs and tools built
up on the basis of regexp pattern matching. The
foundations of that house of cards are rotten, however,
and the house will not stand when 38,000 characters
try to move in.

In my opinion, people should be thinking more generically
about how to extend and abstract the concepts of
string pattern matching in the context of the universal
character set, rather than focussing on how to "fix"
regexp syntax per se for Unicode.

--Ken Whistler


