Re: Regular expressions in Unicode (Was: Ethiopic text)

From: Mark Davis
Date: Thu Mar 12 1998 - 20:47:24 EST

I agree that in the usual case, being able to specify just "letter",
"uppercase letter", etc. covers a lot of usage, and can be locale-independent.

For ranges, one possibility is that when the user specifies [a-d], the
program determines the user's locale *at that time* and translates the
range into the appropriate locale-independent ranges:

e.g. the equivalent of [aAbBcCdD] plus a-ring and a-umlaut for my locale,
but excluding a-ring and a-umlaut for a Swede's locale: [aAbBcCdD].

That way, you end up with locale-independent ranges that can be exchanged.
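
A minimal sketch of that translation step, in Python. The two collation
tables are toy data invented for illustration (they assume that a-ring and
a-umlaut sort with "a" in a US English locale and after "z" in a Swedish
one), and expand_range is a hypothetical helper name, not an existing API.

```python
# Toy collation orders: assumptions for illustration, not real locale data.
# Each inner list is one collation position; its characters sort together.
EN_ORDER = [["a", "A", "å", "Å", "ä", "Ä"], ["b", "B"], ["c", "C"],
            ["d", "D"], ["e", "E"], ["z", "Z"]]
SV_ORDER = [["a", "A"], ["b", "B"], ["c", "C"], ["d", "D"], ["e", "E"],
            ["z", "Z"], ["å", "Å"], ["ä", "Ä"]]


def expand_range(first, last, order):
    """Expand the range first-last into an explicit character class.

    The result lists every character explicitly, so it no longer depends
    on the locale of whoever evaluates it later.
    """
    # Map each character to its collation position in this locale.
    positions = {ch: i for i, group in enumerate(order) for ch in group}
    lo, hi = positions[first], positions[last]
    # Collect every character whose position falls inside the range.
    chars = [ch for group in order[lo:hi + 1] for ch in group]
    return "[" + "".join(chars) + "]"


print(expand_range("a", "d", EN_ORDER))
print(expand_range("a", "d", SV_ORDER))
```

The expansion happens once, when the user writes the range; the resulting
explicit class is what would be stored or exchanged.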

Mark Edward Davis, Program Director,
IBM Centre for Java Technology SV
800 El Camino Real West, Mountain View, CA 94040
voice: (408) 777-5116, fax: (650) 335-2215

on 98/03/12 03:42:07 nm
Subject: Re: Regular expressions in Unicode (Was: Ethiopic text)

Hallvard replied:

> I'm a UNIX programmer, but also a UNIX user.

Tautologous, I'm afraid. Any UNIX user who uses a regexp is, as I
see it, by definition a UNIX programmer.

I'm not pooh-poohing the need for regexps, or their usefulness
in making things happen on a UNIX system. But the set of people
who know more about regexps (on any platform) than the use of "*"
in directory listings is basically disjoint from the "users" of
interest to 98% of the world software market.

> > In my opinion, people should be thinking more generically about how to
> > extend and abstract the concepts of string pattern matching in the
> > context of the universal character set, rather than focussing on how
> > to "fix" regexp syntax per se for Unicode.
> What exactly do you have in mind?

For example, a mechanism which is widely surfaced in database applications
for pattern matching is provision of a pattern mask. E.g. using "L" to
stand for any letter, "N" for any digit, "?" for any character, etc.
By virtue of their simplicity, these are much easier to explain to users,
who can make useful choices with them *despite* the lack of power and

Or the LIKE clause in a typical SQL implementation, which uses a
very, very pared down regexp syntax ("%" for a match of zero or more
characters, "_" for a match of a single character, [abc] for a set,
[a-f] for a range, and "^" for exclusion from a set or range).
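
To make the comparison concrete, here is a rough Python sketch that
translates the LIKE dialect described above into an ordinary regexp. The
function name like_to_regex is hypothetical, and the sketch assumes there
is no escape character and that bracketed sets are well formed; real SQL
implementations differ in both respects.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE-style pattern into a compiled Python regexp.

    Handles "%" (zero or more characters), "_" (one character),
    [abc] sets, [a-f] ranges and a leading "^" for exclusion.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%":
            out.append(".*")
        elif ch == "_":
            out.append(".")
        elif ch == "[":
            end = pattern.find("]", i + 1)
            if end == -1:
                # Unterminated bracket: treat "[" as a literal character.
                out.append(re.escape(ch))
            else:
                body = pattern[i + 1:end]
                negate = body.startswith("^")
                if negate:
                    body = body[1:]
                # Escape what is special inside a class; keep "-" for ranges.
                body = body.replace("\\", "\\\\").replace("^", "\\^")
                out.append("[" + ("^" if negate else "") + body + "]")
                i = end  # skip past the closing bracket
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("".join(out), re.DOTALL)


print(bool(like_to_regex("aardv__k").fullmatch("aardvark")))
print(bool(like_to_regex("[^a-f]_t").fullmatch("cat")))
```

Note that a range such as [a-f] is evaluated here by Python's re module in
code point order, which is exactly the locale question under discussion.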

Or the Microsoft Word find expression syntax (essentially another
example of masking), which uses "^?" for any character, "^#" for any digit,
"^$" for any letter, etc., as well as format specifications which aren't
carried in the find string itself.
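
The pattern-mask idea ("L" for any letter, "N" for any digit, "?" for any
character) is simple enough to sketch in a few lines of Python. The name
mask_matches is hypothetical, and mapping "letter" to str.isalpha() and
"digit" to str.isdecimal() is an assumption of this sketch; both follow
Unicode character properties rather than ASCII ranges.

```python
def mask_matches(mask, text):
    """Return True if text fits mask, compared one code point at a time.

    "L" matches any letter, "N" any decimal digit, "?" any character;
    every other mask character must match itself literally.
    """
    if len(mask) != len(text):
        return False
    for m, ch in zip(mask, text):
        if m == "L":
            ok = ch.isalpha()      # Unicode letters, not just A-Z
        elif m == "N":
            ok = ch.isdecimal()    # Unicode decimal digits, not just 0-9
        elif m == "?":
            ok = True
        else:
            ok = (m == ch)
        if not ok:
            return False
    return True


print(mask_matches("LLN-NN", "ab1-23"))
print(mask_matches("LLN", "äß7"))
```

The sketch compares single code points, so a letter written as a base
character plus a combining mark would count as two positions.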

Or the Microsoft Excel find expression "wildcard" syntax -- even simpler, using
"?" to match a single character or "*" to match any string.
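
For comparison, this "?" and "*" style of wildcard is roughly what the
Python standard library's fnmatch module provides. The wrapper name
wildcard_match is hypothetical; note that fnmatch also accepts [seq] sets,
which goes beyond the two wildcards described here.

```python
import fnmatch


def wildcard_match(text, pattern):
    """Case-sensitive match: "?" is one character, "*" is any string."""
    return fnmatch.fnmatchcase(text, pattern)


print(wildcard_match("aardvark", "aardv??k"))
print(wildcard_match("aardvark", "aardv*k"))
```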

Not that any of these is "better" than general regexp, but they are
in widespread use and are more comprehensible to end users. (Even the
LIKE clause syntax is often hidden completely behind the UI of
an application that finds some other way to surface choices to the
user for narrowing down a match.)

Actually, what I had in mind was more along the lines of a serious,
holistic analysis of what string pattern matching means in a universal
character set context, accompanied by some thinking about how layering
abstractions for pattern matching could result in different levels
(implemented differently), depending on application needs.

> I don't think the syntax is
> important, but the power and compactness *is* important - and then you
> end up with more or less the same syntax.

Sometimes *lack* of power is important. A Turing machine is not the
proper engineering answer for provision of a device to turn a light
on or off.

> People are not going to write
> 20 lines of grammar or whatever if they could write two 10-character
> regexps.

Substitute "UNIX programmers" for "People" in that statement, and I
would agree!

> Or did you mean to base this alternative on something else
> than Deterministic Finite Automatons? If so, what?

As an engineer, I try to build what is needed. If, for an application,
a DFA is required, then I would build that. But the needs might be
greater or lesser, depending.

Sandra O'Donnell responded:

> > Keep in mind that end users don't use regular expressions
> > (unless forced to by user-vicious UI's) -- it is UNIX
> > programmers who use regular expressions.
>
> I disagree. Programmers use regular expressions because
> that is the mechanism they have had to give users "logical"
> behavior. Just as users want to see lists in an order that
> makes sense to them, they often want to grab subsets of those
> lists -- and what a subset includes differs depending on
> the language they speak. Regular expressions have been an
> extremely common way to give users the varying subsets they want.

I certainly concur that regular expressions are a useful way
for programmers to implement logical behavior that then
gets surfaced to a user somehow. But whether regular expressions
per se are an "extremely common way to give users the varying
subsets they want" -- I dunno. I don't see them in use in
Microsoft Office, in browsers, or in Web search engines.
(Lycos finds "aardvark" just fine, but barfs on "aardv??k" or
"aardv*k".) And those I *would* consider extremely common applications.

> > In my opinion, people should be thinking more generically
> > about how to extend and abstract the concepts of
> > string pattern matching in the context of the universal
> > character set, rather than focussing on how to "fix"
> > regexp syntax per se for Unicode.
>
> That's reasonable. Some concepts don't expand infinitely
> well. However, whatever replaces regexp still has to deal
> with users' varying expectations of what a given range
> includes. Users definitely should not have to be aware of
> how characters are encoded or whether they're using a large
> or small coded character set; their ranges should "just work."

I agree completely. This is one of the essential reasons for
rethinking the whole issue in terms of the Universal
Character Set. We need to divorce the problem of what it means
to specify a range from the particular encoding contingencies.
A range should be defined *on* Unicode *for* a particular
collation. How that range is implemented for
software running in some other encoding than Unicode is
then a separate issue. (My tentative answer would be to
implement a Unicode engine inside and convert expressions
to Unicode before evaluation -- the same approach that parsers
in general should take.)
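
A small Python sketch of that "Unicode engine inside" approach: both the
expression and the text arrive in some legacy encoding, are converted to
Unicode, and only then evaluated. The function name search_via_unicode is
hypothetical, and the sketch assumes the caller knows the encoding.

```python
import re


def search_via_unicode(pattern_bytes, text_bytes, encoding):
    """Decode pattern and text to Unicode, then run the match there."""
    pattern = pattern_bytes.decode(encoding)
    text = text_bytes.decode(encoding)
    return re.search(pattern, text)


# The same user-level expression and text, stored in two Cyrillic encodings.
for enc in ("koi8_r", "iso8859_5"):
    m = search_via_unicode("[а-я]+".encode(enc), "привет".encode(enc), enc)
    print(enc, m.group() if m else None)
```

The range is interpreted once, on Unicode, so the result does not depend on
how a given legacy encoding happens to order its bytes (KOI8-R, for
instance, does not store Cyrillic letters in alphabetical order). The range
here is still a code point range; making it follow a collation is the
separate problem discussed next.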

Of course, that then raises the issue of what it means to
define a range *for* a particular collation. This is going
to be very messy, since the concept of a mapping between one
collation and another is very complex -- much more complex
than the concept of a mapping between one character set
encoding and another. But the work on ISO 14651 and on the
Default Unicode Collation is starting to address
these issues.


> -----------------------
> Sandra Martin O'Donnell
