Glen, are you the person who used to work at BitStream?
Are you the person who hired me as a Japanese teacher?
If I'm wrong, please never mind.
From: Glen Perkins [SMTP:email@example.com]
Sent: Thursday, March 19, 1998 12:19 PM
To: Unicode List
Subject: Re: Regular expressions in Unicode
I'm very pleased to hear that you've contacted them. If I'm correct that
what's done in Perl will serve as the model for most Unicode-savvy regexes
in the future, then getting to Larry Wall right now should save a lot of
wailing and gnashing of teeth later on.
Perl already has the ability to specify the rules for a sort. If you write

    @sortedListOfStrings = sort Danish @unsortedListOfStrings;

and then you create a subroutine called "Danish" that follows the standard
API for sort routines and encapsulates Danish collation rules, it works.
You can then use the word "Danish" as a modifier whenever you use the
keyword "sort" and get the desired result.
This feature wasn't created for i18n purposes. It was primarily meant to
specify the field in a particular record that was to serve as the sort key
and whether it was to be sorted numerically or alphabetically. It just so
happens that once you have that capability, you can use it to apply a
specific set of collation rules for i18n work.
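
As a rough illustration of the same idea outside Perl, here is a minimal Python sketch of handing a named rule set to a sort. The `danish_key` function and its alphabet table are assumptions made up for this example; real Danish collation has more rules than a single alphabet string can express.

```python
# Simplified Danish alphabet order: æ, ø, å sort after z.
# This table is an illustrative assumption, not complete Danish collation.
DANISH_ORDER = "abcdefghijklmnopqrstuvwxyzæøå"
_RANK = {ch: i for i, ch in enumerate(DANISH_ORDER)}

def danish_key(word):
    """Map a word to a list of ranks under the simplified Danish order."""
    # Characters outside the table sort after everything in it.
    return [_RANK.get(ch, len(DANISH_ORDER)) for ch in word.lower()]

unsorted_words = ["ål", "zebra", "æble", "abe", "øl"]

# Default sort: plain code point order, which puts å before æ and ø.
by_code_point = sorted(unsorted_words)

# Sort with the rule set passed in explicitly as the key function.
by_danish = sorted(unsorted_words, key=danish_key)

print(by_code_point)
print(by_danish)
```

The collation rules live entirely in the function passed to the sort, so the calling code reads the same whichever rule set is chosen.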
If Larry is given a clear, reliable, authoritative checklist of what is
needed, it's likely that he'll be able to add that same ability (to take
a rule set as an argument) to matching, ranges, types, and so forth.
Once he has done so, everyone else who implements a regex package for a
programming language is likely to use Perl as the model (as they already
tend to do) and implement the same features.
Getting that checklist to him early is the key, IMO.
From: Mark Davis <firstname.lastname@example.org>
To: Unicode List <email@example.com>
Date: Wednesday, March 18, 1998 1:57 PM
Subject: Re: Regular expressions in Unicode (Was: Ethiopic text)
I contacted them, and will be trying to set up a DL for this topic. I have a
few more thoughts on the issues gathered from some other material. They
are basically the following:
Since it is a large character set, you want to extend the "types" of characters
that you can match against--for example, so that "digit" maps to any of the
Unicode digits, etc. The basic Unicode character categories consist of
Letters, Punctuation, Symbols, Marks, Numbers, Separators, Other. Each
has different subcategories: for Letter, you have Uppercase, Lowercase,
Titlecase, Modifier, Other (in this case, "Other" includes uncased
as Chinese). There are additional character properties that would be
A regular-expression mechanism may not choose to offer all of these,
should be extended to more than just the ones applicable to English.
regular expression could be expressed as much as possible in terms
higher-level semantic constructs (such as "vowel" instead of
makes it easier to work with different alphabets and languages.
Some of the Unicode categories are absolute (normative), while some have
slight variations according to the locale. For example, Turkish considers I
(0049) and dotless-i (0131) to be case variants.
See ftp://ftp.unicode.org/Public/UNIDATA/README.TXT and Chapter 4 of
Unicode Standard (TUS) for more info.
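
A small sketch of category-based matching, using Python's standard `unicodedata` module as a stand-in for whatever a regex engine would consult. The helper name `is_digit` and the sample characters are assumptions chosen for illustration only.

```python
import unicodedata

def is_digit(ch):
    """True for any Unicode decimal digit (general category Nd)."""
    return unicodedata.category(ch) == "Nd"

# ASCII digit, Arabic-Indic digit three, uppercase letter,
# a titlecase letter, and a Chinese (uncased, "Other") letter.
samples = ["7", "\u0663", "A", "\u01c5", "\u4e2d"]
for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_digit(ch))

# Locale-independent case mappings pair I (0049) with i (0069),
# while dotless-i (0131) also uppercases to I (0049).
default_lower_of_I = "I".lower()
upper_of_dotless_i = "\u0131".upper()
```

The two-letter codes returned here are the subcategories described above: the first letter is the basic category (L, P, S, M, N, Z, C) and the second the subcategory.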
The order of Unicode characters may differ substantially from the order
expected by users of a particular language. One has to decide, for example,
whether the list [a-ä] means:
- the Unicode characters between 0061 and 00E4 (including 'z'),
- the lowercase letters between them, or
- the letters in that order in the users' locale (which does not include 'z' in
English, but does include it in Swedish).
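
The three readings of the range can be sketched in Python as follows. The two alphabet strings and the `locale_range` helper are hypothetical simplifications for illustration, not real locale data.

```python
import unicodedata

lo, hi = "a", "\u00e4"  # 'a' and 'ä'

# 1. Every code point between the two endpoints, inclusive.
code_point_range = [chr(cp) for cp in range(ord(lo), ord(hi) + 1)]

# 2. Only the lowercase letters (general category Ll) in that span.
lowercase_only = [ch for ch in code_point_range
                  if unicodedata.category(ch) == "Ll"]

# 3. Letters between the endpoints in a locale's alphabet order.
#    Both alphabets below are simplified, hypothetical tables.
ENGLISH = "a\u00e4bcdefghijklmnopqrstuvwxyz"              # ä treated as a variant of a
SWEDISH = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"  # å, ä, ö follow z

def locale_range(alphabet, first, last):
    """Return the slice of the alphabet from first to last, inclusive."""
    return alphabet[alphabet.index(first):alphabet.index(last) + 1]

print("z" in code_point_range, "{" in code_point_range)
print("z" in lowercase_only, "{" in lowercase_only)
print("z" in locale_range(ENGLISH, lo, hi))
print("z" in locale_range(SWEDISH, lo, hi))
```

The same two endpoints give three different sets, which is why a regex engine has to pick one reading and document it.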
One or more Unicode characters may make up what the user thinks of as a
character (call this a "user character"). For example, "G" +
be thought of as a single character by users, yet is actually
two Unicode characters. The general notion of the boundaries between
is fairly easy to describe.
What constitutes a boundary between words is trickier, since it is based
not just on the adjacent characters. You can have a general specification that
takes care of most of the world for character and word boundaries. Word
boundaries and line-break boundaries are not, in general, the same.
See Section 5-13 of TUS for more info.
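
A minimal sketch of finding "user character" boundaries: each base character is grouped with any combining marks that follow it. This is a deliberate simplification (it ignores cases such as Hangul jamo sequences), and both the `user_characters` helper and the choice of an acute accent in the sample are assumptions for illustration.

```python
import unicodedata

def user_characters(text):
    """Split text into base characters plus any trailing combining marks."""
    clusters = []
    for ch in text:
        # General categories Mn, Mc, and Me all start with "M" (Marks).
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new user character
    return clusters

text = "G\u0301o"   # 'G' + COMBINING ACUTE ACCENT, then 'o'
clusters = user_characters(text)

print(len(text), "Unicode characters")
print(len(clusters), "user characters")
```

A regex "any character" match built on such boundaries would consume the base letter and its mark together rather than splitting them.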
4. Canonical Equivalencies
There are many instances where a user character can be equivalently represented
by two different sequences of Unicode characters.
See Section 2-5 and 3.9 of TUS for more info.
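
As an illustration of canonical equivalence, the sketch below compares two spellings of the same user character by first bringing both to a decomposed form with Python's `unicodedata.normalize`. The helper name `canonically_equal` is an assumption for this example.

```python
import unicodedata

precomposed = "\u00e4"   # ä as a single code point
decomposed = "a\u0308"   # 'a' followed by COMBINING DIAERESIS

def canonically_equal(s, t):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)

print(precomposed == decomposed)                   # raw code point comparison
print(canonically_equal(precomposed, decomposed))  # comparison after normalizing
```

A matcher that compares raw code points treats the two spellings as different, so any engine that wants them to match has to normalize the pattern and the text, or expand the pattern to cover both forms.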
5. For cases 1-3 above, there are two choices:
a. locale-independent: the same regular expression always produces the same results
b. locale-dependent: the results match more closely to a "user's"
such as for [a-ä] in Danish.
My opinion is that the base regular-expression mechanism should be
locale-independent, for performance, simplicity, and reliability. A
specification can handle most of the world well without
If someone wants particular modules that translate particular locale
requirements into a locale-independent form, that could be done.
example, something like MakeRegex could be used to make a range of
a particular alphabet, or transform a regular expression that is
terms of higher-level locale-independent constructs (such as
that has variations for a particular language. For example, one
MakeRange('a','ä') that produced the list [aä] in German, and
Mark Edward Davis, Program Director,
IBM Centre for Java Technology SV
800 El Camino Real West, Mountain View, CA 94040
voice: (408) 777-5116, fax: (650) 335-2215
firstname.lastname@example.org on 98/03/18 02:17:46
Please respond to email@example.com @ internet
To: firstname.lastname@example.org @ internet
Subject: Re: Regular expressions in Unicode (Was: Ethiopic text)
I missed most of this Unicode regex discussion since I was down at
World in Los Angeles, so I hope you'll forgive me for bringing it
Perl is probably the gold standard for regex usage these days. Those
produce commercial regex libraries for languages such as C++ and
usually base their regexes on Perl 5, since that's what regex users
The Perl people and the XML people just concluded a summit in which
agreed that Perl was the ideal XML parsing language--except for the
problem that XML is Unicode/10646-based and Perl is fundamentally
As a result, Larry Wall (Perl's creator) has agreed that upgrading Perl to
"Unicode compatibility" is his highest priority. Since Perl has regexes
built right into the syntax of the language itself, this will require him to
somehow implement a solution to regexes for Unicode. Once he has done so, it
is not unreasonable to assume that a lot of other developers will follow
suit and implement his regex solutions in their own software and in
commercial class libraries.
Larry is a very bright guy, and has a linguistics degree, but I
but think that the whole world would benefit from a bit of
between him and some of you on the Unicode/Unicore mailing lists.
right in Perl now will probably save a lot of aggravation in the
I encourage anyone who is interested to check out:
and go ahead and contact Larry. He's working at O'Reilly right now,
probably something like email@example.com, or something similar.
For what it's worth, here's my $0.02:
I think that all matches, ranges, and sorts should accept explicit arguments
that specify exactly what matches what, what series of chars is being
subsetted by a range, and what sorts before what. This way, all locale
dependence can be eliminated from the code. I also think that an extra layer
of abstraction is needed. Rather than comparing bits directly, you would
convert strings to a canonical form before attempting a match.
On some machines, a given argument (such as a "use ISO 8859-1 for
ranges" statement) would be compiled/interpreted right out of the
because it referred to the native encoding of the machine and could be
implemented in the DFA/NFA directly. The "canonical form" in many cases
would just be the unconverted data. Matching Latin-1 search strings against
Latin-1 data shouldn't involve converting both to Unicode.
On other machines, that encoding/matching/range/collating argument would
have to be backed by a subroutine that could be either a standard one,
customized by the programmer, or implemented by the programmer from scratch.
Such subroutines would have a standard interface, but no specified
implementation. Whether for matching (both char encoding conversion and
equivalence info), ranges (series of chars in desired order from which
ranges/subsets are taken), or sorting (collation rules), they could
be implemented via table lookup, or they could be algorithmic. This would
allow the programmer to either use pre-built subroutines implementing
various standards, or override the subroutine with his own custom version.
It would also mean that a smart compiler/interpreter would be able to
completely optimize out the subroutines in many cases, making Unicode
regexes just as fast as non-Unicode regexes at doing plain ol' searches
and matches on a plain ol' Latin-1 machine, while still allowing
Hangul in one encoding to be used as a search string for Hangul
in another encoding, albeit more slowly.
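
The "standard interface, no specified implementation" idea can be sketched in Python as a match function that takes its rule set as an explicit argument. All names here (`identity`, `nfd_casefold`, `find`) are hypothetical, and a substring test stands in for a real regex engine.

```python
import unicodedata

def identity(s):
    """No conversion: the data is already in the form the matcher expects."""
    return s

def nfd_casefold(s):
    """One possible rule set: canonical decomposition plus case folding."""
    return unicodedata.normalize("NFD", s).casefold()

def find(needle, haystack, canonical=identity):
    """Return True if needle occurs in haystack under the given rule set."""
    return canonical(needle) in canonical(haystack)

# With the identity rule set, the comparison is a plain code point match.
print(find("\u00e9", "Cafe\u0301"))

# With an explicit rule set, precomposed and decomposed forms match,
# as do different cases.
print(find("\u00e9", "Cafe\u0301", canonical=nfd_casefold))
print(find("CAFE", "cafe", canonical=nfd_casefold))
```

Because the rule set is an ordinary argument, the fast path (no conversion at all) and the slower, more general path share one calling convention, and the result does not depend on the locale of the machine running the code.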
I suppose that if the arguments are not specified explicitly, the
behavior could default to locale-specific behavior, but I'm not fond of code
that behaves differently on different platforms. It's asking for trouble.
I'd rather the defaults be fixed, but that can get pretty
If the arguments *are* explicit, though, the same code shouldn't behave
differently in different locales. It should just have more efficient or less
efficient implementations, depending on how closely the underlying platform
matched the operations required by the code.
Regarding Ken's suggestion of more user-friendly regex approaches, I'm
in favor of user-friendly regex-building "wizards". You could allow users
to see the resulting regex, even allowing them to edit it directly if
desired, or keep it hidden internally. I do think it's likely, though, that
there will be enough regex library code out there that most developers will
want to use regexes internally for powerful searches, regardless of the
interface shown to the user.