Re: Regular expressions in Unicode (Was: Ethiopic text)

From: Mark Davis (medavis2@us.ibm.com)
Date: Wed Mar 18 1998 - 16:31:01 EST


I contacted them, and will be trying to set up a DL for this topic. I have a
few more thoughts on the issues gathered from some other material. The problems
are basically the following:

1. Types
Since it is a large character set, you want to extend the "types" of characters
that you can match against--for example, so that "digit" maps to any of the
Unicode digits, etc. The basic Unicode character categories consists of:
Letters, Punctuation, Symbols, Marks, Numbers, Separators, Other. Each of these
has different subcategories: for Letter, you have Uppercase, Lowercase,
Titlecase, Modifier, Other (in this case, "Other" includes uncased letters such
as Chinese). There are additional character properties that would be useful to
add.

A regular-expression mechanism may not choose to offer all of these, but it
should be extended to more than just the ones applicable to English. If a
regular expression could be expressed as much as possible in terms of
higher-level semantic constructs (such as "vowel" instead of [aAeEiIoOuU], it
makes it easier to work with different alphabets and languages.

Some of the Unicode categories are absolute (normative), while some may have
slight variations according to the locale. For example, Turkish considers I
(0049) and dotless-i (0131) to be case variants.

See ftp://ftp.unicode.org/Public/UNIDATA/README.TXT and Chapter 4 of The
Unicode Standard (TUS) for more info.

2. Ranges
The order of Unicode characters may differ substantially from the order
expected by users of a particular language. One has to decide, for example,
whether the list [a-] means:
- the Unicode characters between 0061 and 00E5 (including 'z', 'Z', '[', and
'') or
- the lowercase letters between them, or
- the letters in that order in the users' locale (which does not include 'z' in
English, but does include it in Swedish).

3. Boundaries
One or more Unicode characters may make up what the user thinks of as a
character (call this a "user character"). For example, "G" + acute-accent may
be thought of as a single character by users, yet is actually represented by
two Unicode characters. The general notion of the boundaries between characters
is fairly easy to describe.

What constitutes a boundary between words is trickier, since it is dependent
not just on the adjacent characters. You can have a general algorithm that
takes care of most of the world for character and word boundaries. Note: word
boundaries and line-break boundaries are not, in general, the same.

See Section 5-13 of TUS for more info.

4. Canonical Equivalencies
There are many instances where a user character can be equivalently expressed
by two different sequences of Unicode characters.

See Section 2-5 and 3.9 of TUS for more info.

4. For cases 1-3 above, there are two choices:
a. locale-independent: the same regular expression always produces the same
result.
b. locale-dependent: the results match more closely to a "user's" expectations,
such as for [a-] in Danish.

My opinion is that the base regular-expression mechanism should be
locale-independent; for performance, simplicity, and reliability. A general
specification can handle most of the world well without modification.

If someone wants particular modules that translate particular locale
requirements into a locale-independent form, that could be done. Thus, for
example, something like MakeRegex could be used to make a range of letters from
a particular alphabet, or transform a regular expression that is written in
terms of higher-level locale-independent constructs (such as "vowel") into one
that has variations for a particular language. For example, one could image a
MakeRange('a','') that produced the list [a] in German, and [a-z] in
Swedish.

Mark

_____________
Mark Edward Davis, Program Director,
IBM Centre for Java Technology SV
800 El Camino Real West, Mountain View, CA 94040
voice: (408) 777-5116, fax: (650) 335-2215
mailto:medavis2@us.ibm.com, mailto:president@unicode.org

======================
gperkins@netcom.com on 98/03/18 02:17:46
Please respond to unicode@unicode.org @ internet
To: unicode@unicode.org @ internet
cc:
Subject: Re: Regular expressions in Unicode (Was: Ethiopic text)

I missed most of this Unicode regex discussion since I was down at Internet
World in Los Angeles, so I hope you'll forgive me for bringing it back up.

Perl is probably the gold standard for regex usage these days. Those who
produce commercial regex libraries for languages such as C++ and Java
usually base their regexes on Perl 5, since that's what regex users demand.

The Perl people and the XML people just concluded a summit in which they
agreed that Perl was the ideal XML parsing language--except for the glaring
problem that XML is Unicode/10646-based and Perl is fundamentally "char ==
byte" based.

As a result, Larry Wall (Perl's creator) has agreed that upgrading Perl to
"Unicode compatibility" is his highest priority. Since Perl has regexes
built right into the syntax of the language itself, this will require him to
somehow implement a solution to regexes for Unicode. Once he has done so, it
is not unreasonable to assume that a lot of other developers will follow
suit and implement his regex solutions in their own software and in various
commercial class libraries.

Larry is a very bright guy, and has a linguistics degree, but I can't help
but think that the whole world would benefit from a bit of collaboration
between him and some of you on the Unicode/Unicore mailing lists. Getting it
right in Perl now will probably save a lot of aggravation in the future.

I encourage anyone who is interested to check out:

http://www.perl.com/perl-xml.html

and go ahead and contact Larry. He's working at O'Reilly right now, so he's
probably something like lwall@ora.com, or something similar.

For what it's worth, here's my $0.02:

I think that all matches, ranges, and sorts should accept explicit arguments
that specify exactly what matches what, what series of chars is being
subsetted by a range, and what sorts before what. This way, all platform
dependence can be eliminated from the code. I also think that an extra layer
of abstraction is needed. Rather than comparing bits directly, you would
convert strings to a canonical form before attempting a match.

On some machines, a given argument (such as a "use ISO 8859-1 for all
ranges" statement) would be compiled/interpreted right out of the code,
because it referred to the native encoding of the machine and could be
implemented in the DFA/NFA directly. The "canonical form" in many such cases
would just be the unconverted data. Matching Latin-1 search strings to
Latin-1 data shouldn't involve converting both to Unicode.

On other machines, that encoding/matching/range/collating argument might
have to be backed by a subroutine that could be either a standard feature,
customized by the programmer, or implemented by the programmer from scratch.
Such subroutines would have a standard interface, but no specified
implementation. Whether for matching (both char encoding conversion and char
equivalence info), ranges (series of chars in desired order from which
ranges/subsets are taken), or sorting (collation rules), they could either
be implemented via table lookup, or they could be algorithmic. This would
allow the programmer to either use pre-built subroutines representing
various standards, or override the subroutine with his own custom version.
It would also mean that a smart compiler/interpreter would be able to
completely optimize out the subroutines in many cases, making Unicode
regexes just as fast as non-Unicode regexes at doing plain 'ol Latin-1 sorts
and matches on a plain 'ol Latin-1 machine, while still allowing decomposed
Hangul in one encoding to be used as a search string for pre-composed Hangul
in another encoding--albeit more slowly.

I suppose that if the arguments are not specified explicitly, the regexes'
behavior could default to locale-specific behavior, but I'm not fond of code
that behaves differently on different platforms. It's asking for trouble.
I'd rather the defaults be fixed, but that can get pretty contentious. If
the arguments *are* explicit, though, the same code shouldn't function
differently in different locales. It should just have more efficient or less
efficient implementations, depending on how closely the underlying OS
matched the operations required by the code.

Regarding Ken's suggestion of more user-friendly regex approaches, I'm all
in favor of user-friendly regex-building "wizards". You could allow the user
to see the resulting regex, even allowing them to edit it directly if
desired, or keep it hidden internally. I do think it's likely, though, that
there will be enough regex library code out there that most programmers will
want to use regexes internally for powerful searches, regardless of the
interface shown to the user.

__Glen Perkins__



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:39 EDT