Re: Regular expressions in Unicode (Was: Ethiopic text)

From: Glen Perkins (gperkins@netcom.com)
Date: Wed Mar 18 1998 - 04:45:00 EST


I missed most of this Unicode regex discussion since I was down at Internet
World in Los Angeles, so I hope you'll forgive me for bringing it back up.

Perl is probably the gold standard for regex usage these days. Those who
produce commercial regex libraries for languages such as C++ and Java
usually base their regexes on Perl 5, since that's what regex users demand.

The Perl people and the XML people just concluded a summit in which they
agreed that Perl was the ideal XML parsing language--except for the glaring
problem that XML is Unicode/10646-based and Perl is fundamentally "char ==
byte" based.

As a result, Larry Wall (Perl's creator) has agreed that upgrading Perl to
"Unicode compatibility" is his highest priority. Since Perl has regexes
built right into the syntax of the language itself, this will require him to
somehow implement a solution to regexes for Unicode. Once he has done so, it
is not unreasonable to assume that a lot of other developers will follow
suit and implement his regex solutions in their own software and in various
commercial class libraries.

Larry is a very bright guy, and has a linguistics degree, but I can't help
but think that the whole world would benefit from a bit of collaboration
between him and some of you on the Unicode/Unicore mailing lists. Getting it
right in Perl now will probably save a lot of aggravation in the future.

I encourage anyone who is interested to check out:

http://www.perl.com/perl-xml.html

and go ahead and contact Larry. He's working at O'Reilly right now, so he's
probably reachable at something like lwall@ora.com.

For what it's worth, here's my $0.02:

I think that all matches, ranges, and sorts should accept explicit arguments
that specify exactly what matches what, what series of chars is being
subsetted by a range, and what sorts before what. This way, all platform
dependence can be eliminated from the code. I also think that an extra layer
of abstraction is needed. Rather than comparing bits directly, you would
convert strings to a canonical form before attempting a match.
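To make the "canonical form" idea concrete, here's a minimal sketch in present-day Python (hypothetical relative to whatever Perl's internals end up doing), using Unicode normalization as the canonicalizing step. The function names are my own invention:

```python
import unicodedata

def canonical(s):
    # Reduce a string to a canonical form (here, NFD decomposition)
    # so that equivalent spellings compare equal.
    return unicodedata.normalize("NFD", s)

def canonical_match(needle, haystack):
    # Compare canonical forms instead of raw bits.
    return canonical(needle) in canonical(haystack)

# U+00E9 (precomposed e-acute) vs. "e" + U+0301 (combining acute):
# different bits, same canonical form.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"
assert decomposed != precomposed          # raw comparison fails
assert canonical_match(decomposed, precomposed)
```

A real regex engine would canonicalize during compilation and scanning rather than copying whole strings, but the layering is the same: match on the abstraction, not the encoding.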

On some machines, a given argument (such as a "use ISO 8859-1 for all
ranges" statement) would be compiled/interpreted right out of the code,
because it referred to the native encoding of the machine and could be
implemented in the DFA/NFA directly. The "canonical form" in many such cases
would just be the unconverted data. Matching Latin-1 search strings to
Latin-1 data shouldn't involve converting both to Unicode.

On other machines, that encoding/matching/range/collating argument might
have to be backed by a subroutine that could be either a standard feature,
customized by the programmer, or implemented by the programmer from scratch.
Such subroutines would have a standard interface, but no specified
implementation. Whether for matching (both char encoding conversion and char
equivalence info), ranges (series of chars in desired order from which
ranges/subsets are taken), or sorting (collation rules), they could either
be implemented via table lookup, or they could be algorithmic. This would
allow the programmer to either use pre-built subroutines representing
various standards, or override the subroutine with his own custom version.
It would also mean that a smart compiler/interpreter would be able to
completely optimize out the subroutines in many cases, making Unicode
regexes just as fast as non-Unicode regexes at doing plain ol' Latin-1 sorts
and matches on a plain ol' Latin-1 machine, while still allowing decomposed
Hangul in one encoding to be used as a search string for pre-composed Hangul
in another encoding--albeit more slowly.
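The "standard interface, no specified implementation" idea might look like the following Python sketch (the interface and names are hypothetical, not anything Perl has committed to). The Hangul case from above falls out naturally: a decomposed-jamo search string finds a precomposed syllable once both pass through the same canonicalizing subroutine:

```python
import unicodedata

def nfd_canonicalizer(s):
    # A prebuilt, algorithmic canonicalizer: Unicode decomposition.
    return unicodedata.normalize("NFD", s)

def identity_canonicalizer(s):
    # The trivial fast path a smart compiler could select when the
    # data is already in canonical form (Latin-1 against Latin-1):
    # nothing to convert.
    return s

def match(needle, haystack, canonicalize=identity_canonicalizer):
    # Standard interface, unspecified implementation: the caller can
    # pass a standard subroutine, a customized one, or one written
    # from scratch.
    return canonicalize(needle) in canonicalize(haystack)

# Precomposed syllable U+D55C vs. its conjoining-jamo decomposition.
precomposed = "\ud55c"
decomposed = "\u1112\u1161\u11ab"
assert not match(decomposed, precomposed)   # raw bits differ
assert match(decomposed, precomposed, canonicalize=nfd_canonicalizer)
```

The point of the indirection is exactly the optimization described above: when the compiler can prove the canonicalizer is the identity, the call disappears and you pay nothing; when it can't, you pay for the conversion and the match still works.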

I suppose that if the arguments are not specified explicitly, the regexes'
behavior could default to locale-specific behavior, but I'm not fond of code
that behaves differently on different platforms. It's asking for trouble.
I'd rather the defaults be fixed, but that can get pretty contentious. If
the arguments *are* explicit, though, the same code shouldn't function
differently in different locales. It should just have more efficient or less
efficient implementations, depending on how closely the underlying OS matches
the operations required by the code.
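As a sketch of what "explicit arguments" buys you, here's a sort driven by an explicitly supplied collation sequence rather than the platform locale (Python again; the function and the Swedish example are my own illustration, not proposed syntax):

```python
def explicit_sort(words, order):
    # Sort by an explicitly supplied series of chars instead of
    # whatever the platform locale happens to say; the same code
    # gives the same answer on every machine.
    rank = {ch: i for i, ch in enumerate(order)}
    return sorted(words, key=lambda w: [rank[ch] for ch in w])

# Swedish collation places å, ä, ö after z, unlike naive
# code-point order, so the explicit sequence matters.
swedish = "abcdefghijklmnopqrstuvwxyzåäö"
words = ["äpple", "zon", "bil"]
assert explicit_sort(words, swedish) == ["bil", "zon", "äpple"]
```

A locale-defaulted version would just pick `order` for you when the argument is omitted, which is precisely the behavior-varies-by-platform trap described above.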

Regarding Ken's suggestion of more user-friendly regex approaches, I'm all
in favor of user-friendly regex-building "wizards". You could allow the user
to see the resulting regex, even allowing them to edit it directly if
desired, or keep it hidden internally. I do think it's likely, though, that
there will be enough regex library code out there that most programmers will
want to use regexes internally for powerful searches, regardless of the
interface shown to the user.

__Glen Perkins__



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:39 EDT