Re: Regular expressions in Unicode (Was: Ethiopic text)

From: Hallvard B Furuseth
Date: Thu Mar 12 1998 - 16:32:01 EST

Kenneth Whistler writes:

about [:alpha:]:
> What the specification doesn't say is that this mechanism
> too, is locale-specific

Ah, yes. *That's* why I keep using ranges. 8-bit ranges are not
locale-dependent, and totally configurable (since I do them completely
myself) -- and until Unicode arrived, multi-charset or multi-byte stuff
was not a relevant problem for me. Sorry that I forgot to mention this;
I never think about isalpha & co in connection with non-ASCII chars:-)
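
To illustrate what I mean by doing the ranges myself, here is a minimal sketch in Python (chosen only for illustration; it is not the tool I was using). It assumes the text is ISO 8859-1, and the name LATIN1_WORD is made up for this example. The ranges are spelled out by byte value, so the match does not consult the locale the way [:alpha:] or isalpha() would.

```python
import re

# Assumption: input bytes are ISO 8859-1 (Latin-1).
# Letters are listed explicitly by byte value. 0xD7 (multiplication sign)
# and 0xF7 (division sign) are deliberately left out of the ranges.
LATIN1_WORD = re.compile(rb"[A-Za-z\xc0-\xd6\xd8-\xf6\xf8-\xff]+")


def words(data):
    """Return every run of Latin-1 letters found in a bytes object."""
    return LATIN1_WORD.findall(data)


sample = "blåbær × 2".encode("latin-1")
print(words(sample))  # [b'bl\xe5b\xe6r']
```

The price is visible in the pattern itself: the ranges are tied to one 8-bit charset and have to be rewritten for any other.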

> "Range expressions must not be used in portable applications
> because their behaviour is dependent on the collating
> sequence.

So what we need is a user-specified collating sequence (or programmer-
specified, depending on how you see it; I'm thinking of programmable
applications, after all). It could give an error if we try to sort
characters outside the known sequence. That should be portable. That's
what I have given up wishing for in regexps (so I could get rid of most
of my ranges). Some UNIX platforms provide user-specified locales, but
that's even less portable and less documented than locales themselves.
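
A rough sketch of the kind of thing I am wishing for, again in Python and only as an illustration. The class CollatingSequence, its methods, and the Norwegian example alphabet are all hypothetical names invented here, not an existing API. The sequence is given explicitly, ranges are expanded against it, and any character outside it is an error rather than a locale-dependent guess.

```python
class CollatingSequence:
    """A user-specified collating sequence (hypothetical, for illustration).

    Characters rank in exactly the order they are listed.
    """

    def __init__(self, chars):
        if len(set(chars)) != len(chars):
            raise ValueError("duplicate character in collating sequence")
        self.chars = chars
        self.rank = {c: i for i, c in enumerate(chars)}

    def position(self, ch):
        """Rank of ch, or an error if ch is outside the known sequence."""
        try:
            return self.rank[ch]
        except KeyError:
            raise ValueError(f"{ch!r} is outside the collating sequence") from None

    def expand_range(self, lo, hi):
        """All characters from lo to hi inclusive, in collating order."""
        i, j = self.position(lo), self.position(hi)
        if i > j:
            raise ValueError(f"range {lo!r}-{hi!r} is reversed")
        return self.chars[i:j + 1]

    def sort_key(self, s):
        """Sort key for a string; fails on any unknown character."""
        return [self.position(c) for c in s]


# Assumption: lowercase Norwegian alphabet, where æ, ø, å follow z.
norwegian = CollatingSequence("abcdefghijklmnopqrstuvwxyzæøå")

print(norwegian.expand_range("x", "å"))                       # xyzæøå
print(sorted(["ål", "zebra", "ære"], key=norwegian.sort_key))  # ['zebra', 'ære', 'ål']
```

Since the sequence travels with the program instead of with the platform's locale data, a range like x-å means the same thing everywhere, and a character the sequence does not know is reported instead of being silently ordered somewhere.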

(Or perhaps it's time for me to reread ISO 14651 - 14652 before I
reinvent too much of them?)

> Keep in mind that end users don't use regular expressions
> (unless forced to by user-vicious UI's) -- it is UNIX
> programmers who use regular expressions.

UNIX is full of user-vicious and/or beginner-vicious UIs. However, that
has nothing to do with regexps as such. They are simple and powerful
compared to a lot of features in "user-friendly" (or more correctly,
beginner-friendly) programs like Word. It's a pity if regexps are
considered user-vicious just because they come from user-vicious UNIX.

I'm a UNIX programmer, but also a UNIX user. I'm using regexps
constantly *as a user* -- when searching files, in Emacs, in kill files.
And you are right, I haven't seen regexps on Macintosh, if that's what
you consider user-friendly -- nor a lot of other things I take for
granted on UNIX -- but that's one of the reasons I seldom use Macintosh.

> There is then an enormous house of cards of programs and tools built
> up on the basis of regexp pattern matching. The foundations of that
> house of cards are rotten, however, and the house will not stand when
> 38,000 characters try to move in.

You may be right. I hope not.

> In my opinion, people should be thinking more generically about how to
> extend and abstract the concepts of string pattern matching in the
> context of the universal character set, rather than focussing on how
> to "fix" regexp syntax per se for Unicode.

What exactly do you have in mind? I don't think the syntax is
important, but the power and compactness *are* important - and then you
end up with more or less the same syntax. People are not going to write
20 lines of grammar or whatever if they could write two 10-character
regexps. Or did you mean to base this alternative on something other
than Deterministic Finite Automata? If so, what? *Is* there a simple
and powerful alternative?


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:39 EDT