RE: Regular expressions in Unicode

From: Noriko Marshall (S&T Onsite)
Date: Fri Apr 17 1998 - 20:25:28 EDT

Glen, are you the person who used to work at BitStream?
Are you the person who hired me as a Japanese teacher?
If I'm wrong, please never mind.

        -----Original Message-----
        From: Glen Perkins
        Sent: Thursday, March 19, 1998 12:19 PM
        To: Unicode List
        Subject: Re: Regular expressions in Unicode

        I'm very pleased to hear that you've contacted them. If I'm correct
        that what's done in Perl will serve as the model for most
        Unicode-savvy regex packages in the future, then getting to Larry
        Wall right now should save a lot of wailing and gnashing of teeth
        later.

        Perl already has the ability to specify the rules for a sort. If you
        write the following:

        @sortedListOfStrings = sort Danish @unsortedListOfStrings;

        and then you create a subroutine called "Danish" that follows the
        API for sort routines and encapsulates Danish collation rules, it
        will work. You can then use the word "Danish" as a modifier whenever
        you use the keyword "sort" and get the desired result.
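As a rough illustration of the same idea outside Perl, here is a minimal
Python sketch in which the collation rule is passed to the sort as an
argument. The names danish, DANISH_ALPHABET and DANISH_RANK are made up for
this example, and the ordering is a simplification (ae, o-slash and a-ring
placed after z); real Danish collation has more rules than this.

```python
# Hypothetical, simplified Danish ordering: the three extra letters
# sort after z, in the order ae, o-slash, a-ring.
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
DANISH_RANK = {ch: i for i, ch in enumerate(DANISH_ALPHABET)}


def danish(word):
    """Sort key: map each letter to its rank in the table above."""
    # Characters outside the table sort after all known letters,
    # ordered among themselves by code point.
    return [DANISH_RANK.get(ch, len(DANISH_ALPHABET) + ord(ch))
            for ch in word.lower()]


unsorted_list_of_strings = ["ål", "zebra", "æble", "abe", "øl"]

# The collation rule is an explicit argument to the sort.
sorted_list_of_strings = sorted(unsorted_list_of_strings, key=danish)
print(sorted_list_of_strings)  # ['abe', 'zebra', 'æble', 'øl', 'ål']
```

A plain code-point sort of the same list would put "ål" before "æble",
which is the difference the rule set is there to capture.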

        This feature wasn't created for i18n purposes. It was primarily
        intended to specify the field in a particular record that was to
        serve as the sort key, and whether it was to be sorted numerically
        or alphabetically. It so happens that once you have that capability,
        you can use it to specify a specific set of collation rules for i18n
        work.

        If Larry is given a clear, reliable, authoritative checklist of
        what is needed, it's likely that he'll be able to add that same
        ability (to pass a rule set as an argument) to matching, ranges,
        types, and so forth.

        Once he has done so, everyone else who implements a regex package
        for any programming language is likely to use Perl as the model (as
        they tend to do) and implement the same features.

        Getting that checklist to him early is the key, IMO.

        __Glen Perkins__

        -----Original Message-----
        From: Mark Davis
        To: Unicode List
        Date: Wednesday, March 18, 1998 1:57 PM
        Subject: Re: Regular expressions in Unicode (Was: Ethiopic text)

        I contacted them, and will be trying to set up a DL for this topic.
        I have a few more thoughts on the issues gathered from some other
        material.

        The problems are basically the following:

        1. Types
        Since it is a large character set, you want to extend the "types" of
        characters that you can match against--for example, so that "digit"
        maps to any of the Unicode digits, etc. The basic Unicode character
        categories consist of Letters, Punctuation, Symbols, Marks, Numbers,
        Separators, Other. Each of these has different subcategories: for
        Letter, you have Uppercase, Lowercase, Titlecase, Modifier, Other
        (in this case, "Other" includes uncased letters such as Chinese).
        There are additional character properties that would be

        A regular-expression mechanism may choose not to offer all of these,
        but it should be extended to more than just the ones applicable to
        English. If a regular expression could be expressed as much as
        possible in terms of higher-level semantic constructs (such as
        "vowel" instead of an explicit list of letters), it makes it easier
        to work with different alphabets and languages.

        Some of the Unicode categories are absolute (normative), while some
        may have slight variations according to the locale. For example,
        Turkish considers I (0049) and dotless-i (0131) to be case variants.
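To make the Turkish case concrete, here is a small Python sketch of a
lowercasing helper that takes the locale rule as an explicit flag. The
function name lower and its turkish parameter are assumptions for this
example only, and it covers just the dotted/dotless i pairing, not Turkish
casing in general.

```python
def lower(text, turkish=False):
    """Lowercase text; with turkish=True, apply the dotted/dotless i rule."""
    if turkish:
        # Turkish pairs I (U+0049) with dotless i (U+0131), and
        # capital I with dot above (U+0130) with i (U+0069).
        text = text.replace("I", "\u0131").replace("\u0130", "i")
    # Everything else uses the default, locale-independent mapping.
    return text.lower()


print(lower("DIYARBAKIR"))                # diyarbakir
print(lower("DIYARBAKIR", turkish=True))  # dıyarbakır
```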

        See Chapter 4 of the Unicode Standard (TUS) for more info.
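For readers who want to see the categories in practice, the following
Python sketch queries them through the standard unicodedata module. The
helper names is_digit and major_category are invented here, and treating
"digit" as general category Nd is one possible reading of the "digit" type
described above, not the only one.

```python
import unicodedata


def is_digit(ch):
    """True for any Unicode decimal digit (general category Nd)."""
    return unicodedata.category(ch) == "Nd"


def major_category(ch):
    """First letter of the general category: L, M, N, P, S, Z or C."""
    return unicodedata.category(ch)[0]


samples = [
    "7",        # ASCII digit
    "\u0663",   # ARABIC-INDIC DIGIT THREE
    "\u096D",   # DEVANAGARI DIGIT SEVEN
    "A",        # uppercase letter
    "\u01C5",   # a titlecase letter
    "\u4E2D",   # a Chinese ideograph, an uncased "other" letter
    ",",        # punctuation
]
for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_digit(ch))
```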

        2. Ranges
        The order of Unicode characters may differ substantially from the
        order expected by users of a particular language. One has to decide,
        for example, whether the list [a-å] means:
        - the Unicode characters between 0061 and 00E5 (including 'z' and
          non-letters such as '{'), or
        - the lowercase letters between them, or
        - the letters in that order in the users' locale (which does not
          include 'z' in English, but does include it in Swedish).
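The first two readings of the range from 0061 to 00E5 can be compared
directly. This is only a sketch in Python; the variable names are made up,
and "lowercase letter" is taken to mean general category Ll.

```python
import unicodedata

# Reading 1: every code point from U+0061 to U+00E5 inclusive.
by_code_point = [chr(cp) for cp in range(0x61, 0xE5 + 1)]

# Reading 2: only the lowercase letters (category Ll) in that span.
lowercase_only = [ch for ch in by_code_point
                  if unicodedata.category(ch) == "Ll"]

print(len(by_code_point))                            # 133
print("z" in by_code_point, "{" in by_code_point)    # True True
print("z" in lowercase_only, "{" in lowercase_only)  # True False
```

The third, locale-based reading cannot be computed from code points alone;
it needs a table or rule describing the alphabet of the locale in question.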

        3. Boundaries
        One or more Unicode characters may make up what the user thinks of
        as a character (call this a "user character"). For example, "G" + a
        combining accent may be thought of as a single character by users,
        yet is actually represented by two Unicode characters. The general
        notion of the boundaries between user characters is fairly easy to
        describe.

        What constitutes a boundary between words is trickier, since it
        depends not just on the adjacent characters. You can have a general
        algorithm that takes care of most of the world for character and
        word boundaries. Word boundaries and line-break boundaries are not,
        in general, the same.

        See Section 5-13 of TUS for more info.
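A very rough approximation of user-character boundaries is to attach each
combining mark to the base character before it. The Python sketch below
does only that; the function name user_characters is invented, the choice
of a combining grave accent is an assumption for the example, and the rule
ignores cases such as Hangul jamo sequences that a full algorithm would
have to handle.

```python
import unicodedata


def user_characters(text):
    """Split text into base characters followed by their combining marks."""
    clusters = []
    for ch in text:
        # Marks (categories Mn, Mc, Me) attach to the preceding base.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


text = "G\u0300o"  # 'G' + COMBINING GRAVE ACCENT + 'o'
print(len(text))                   # 3 code points
print(len(user_characters(text)))  # 2 user characters
```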

        4. Canonical Equivalencies
        There are many instances where a user character can be equivalently
        represented by two different sequences of Unicode characters.

        See Section 2-5 and 3.9 of TUS for more info.
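As an illustration, the Python sketch below compares a precomposed
character with its decomposed spelling using unicodedata.normalize from the
standard library. The helper name canonically_equal is an assumption made
for this example.

```python
import unicodedata

precomposed = "\u00E5"  # LATIN SMALL LETTER A WITH RING ABOVE
decomposed = "a\u030A"  # 'a' + COMBINING RING ABOVE


def canonically_equal(s, t):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


print(precomposed == decomposed)                   # False
print(canonically_equal(precomposed, decomposed))  # True
```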

        5. For cases 1-3 above, there are two choices:
        a. locale-independent: the same regular expression always produces
           the same results
        b. locale-dependent: the results match more closely to a "user's"
           expectations, such as for [a-å] in Danish.

        My opinion is that the base regular-expression mechanism should be
        locale-independent, for performance, simplicity, and reliability. A
        locale-independent specification can handle most of the world well
        without locale-specific variations.

        If someone wants particular modules that translate particular locale
        requirements into a locale-independent form, that could be done.
        Thus, for example, something like MakeRegex could be used to make a
        range of letters in a particular alphabet, or transform a regular
        expression that is written in terms of higher-level
        locale-independent constructs (such as "vowel") into one that has
        variations for a particular language. For example, one could imagine
        MakeRange('a','') that produced the list [a] in German, and [a-z] in
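One way such a helper might look is sketched below in Python. Everything
here is hypothetical: make_range and ALPHABETS are names invented for the
example, loosely modeled on the MakeRange idea, and the two alphabet tables
are deliberately minimal. The helper expands a locale's notion of a range
into an explicit character class, so the regex engine itself stays
locale-independent.

```python
import re

# Hypothetical per-locale alphabets, in the order users expect.
ALPHABETS = {
    "en": "abcdefghijklmnopqrstuvwxyz",
    "sv": "abcdefghijklmnopqrstuvwxyzåäö",
}


def make_range(first, last, locale):
    """Expand first..last in the locale's alphabet into a character class."""
    alphabet = ALPHABETS[locale]
    # index() raises ValueError if a letter is not in this alphabet.
    start, end = alphabet.index(first), alphabet.index(last)
    if start > end:
        raise ValueError("range endpoints are out of order")
    return "[" + alphabet[start:end + 1] + "]"


pattern = make_range("x", "å", "sv")
print(pattern)                                   # [xyzå]
print(re.fullmatch(pattern, "z") is not None)    # True
```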


        Mark Edward Davis, Program Director,
        IBM Centre for Java Technology SV
        800 El Camino Real West, Mountain View, CA 94040
        voice: (408) 777-5116, fax: (650) 335-2215

        ====================== on 98/03/18 02:17:46
        Please respond to @ internet
        To: @ internet
        Subject: Re: Regular expressions in Unicode (Was: Ethiopic text)

        I missed most of this Unicode regex discussion since I was down at
        World in Los Angeles, so I hope you'll forgive me for bringing it
        back up.

        Perl is probably the gold standard for regex usage these days. Those
        who produce commercial regex libraries for languages such as C++ and
        others usually base their regexes on Perl 5, since that's what regex
        users expect.

        The Perl people and the XML people just concluded a summit in which
        they agreed that Perl was the ideal XML parsing language--except for
        the problem that XML is Unicode/10646-based and Perl is
        fundamentally "char == byte" based.

        As a result, Larry Wall (Perl's creator) has agreed that upgrading
        Perl to "Unicode compatibility" is his highest priority. Since Perl
        has regexes built right into the syntax of the language itself, this
        will require him to somehow implement a solution to regexes for
        Unicode. Once he has done so, it is not unreasonable to assume that
        a lot of other developers will follow suit and implement his regex
        solutions in their own software and in commercial class libraries.

        Larry is a very bright guy, and has a linguistics degree, but I
        can't help but think that the whole world would benefit from a bit
        of communication between him and some of you on the Unicode/Unicore
        mailing lists. Getting it right in Perl now will probably save a lot
        of aggravation in the future.
        I encourage anyone who is interested to go ahead and contact Larry.
        He's working at O'Reilly right now, so he can probably be reached
        at an address there.

        For what it's worth, here's my $0.02:

        I think that all matches, ranges, and sorts should accept explicit
        arguments that specify exactly what matches what, what series of
        chars is being subsetted by a range, and what sorts before what.
        This way, all locale dependence can be eliminated from the code. I
        also think that an extra layer of abstraction is needed. Rather than
        comparing bits directly, you would convert strings to a canonical
        form before attempting a match.

        On some machines, a given argument (such as a "use ISO 8859-1 for
        ranges" statement) would be compiled/interpreted right out of the
        code because it referred to the native encoding of the machine and
        could be implemented in the DFA/NFA directly. The "canonical form"
        in many such cases would just be the unconverted data. Matching
        Latin-1 search strings against Latin-1 data shouldn't involve
        converting both to Unicode.

        On other machines, that encoding/matching/range/collating argument
        would have to be backed by a subroutine that could be a standard
        one, one customized by the programmer, or one implemented by the
        programmer from scratch. Such subroutines would have a standard
        interface, but no specified implementation. Whether for matching
        (both char encoding conversion and char equivalence info), ranges
        (series of chars in desired order from which ranges/subsets are
        taken), or sorting (collation rules), they could be implemented via
        table lookup, or they could be algorithmic. This would allow the
        programmer to either use pre-built subroutines implementing various
        standards, or override the subroutine with his own custom version.
        It would also mean that a smart compiler/interpreter would be able
        to completely optimize out the subroutines in many cases, making
        Unicode regexes just as fast as non-Unicode regexes at doing plain
        ol' Latin-1 sorts and matches on a plain ol' Latin-1 machine, while
        still allowing decomposed Hangul in one encoding to be used as a
        search string for pre-composed Hangul in another encoding--albeit
        more slowly.
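A minimal Python sketch of this "standard interface, pluggable
implementation" idea follows. The names find, identity and nfc are invented
for the example, and it covers only canonical equivalence within Unicode,
not conversion between different encodings. The identity canonicalizer
stands in for the case where the data is already in the machine's native
form and nothing needs converting.

```python
import unicodedata


def identity(s):
    """Canonical form is the unconverted data (the cheap, native case)."""
    return s


def nfc(s):
    """Canonical form is Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s)


def find(needle, haystack, canonical=identity):
    """Search after passing both strings through the given canonicalizer.

    The returned index refers to the canonicalized haystack.
    """
    return canonical(haystack).find(canonical(needle))


jamo = "\u1112\u1161\u11AB"  # three conjoining Hangul jamo
syllable = "\uD55C"          # the same syllable, precomposed

print(find(jamo, syllable))                 # -1: raw code points differ
print(find(jamo, syllable, canonical=nfc))  # 0: equal once canonicalized
```

Because the canonicalizer is an explicit argument, the same call gives the
same answer on every platform; only its cost changes.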

        I suppose that if the arguments are not specified explicitly, the
        behavior could default to locale-specific behavior, but I'm not fond
        of code that behaves differently on different platforms. It's asking
        for trouble. I'd rather the defaults be fixed, but that can get
        pretty contentious. If the arguments *are* explicit, though, the
        same code shouldn't behave differently in different locales. It
        should just have more efficient or less efficient implementations,
        depending on how closely the underlying platform matched the
        operations required by the code.

        Regarding Ken's suggestion of more user-friendly regex approaches,
        I'm all in favor of user-friendly regex-building "wizards". You
        could allow the user to see the resulting regex, even allowing them
        to edit it directly if desired, or keep it hidden internally. I do
        think it's likely, though, that there will be enough regex library
        code out there that most programmers will want to use regexes
        internally for powerful searches, regardless of the interface shown
        to the user.

        __Glen Perkins__


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:40 EDT