RE: Regular expressions in Unicode

From: Noriko Marshall (S&T Onsite) (a-norikm@microsoft.com)
Date: Fri Apr 17 1998 - 20:25:28 EDT


Glen, are you the person who used to work at BitStream?
Are you the person who hired me as a Japanese teacher?
If I'm wrong, please never mind.

        -----Original Message-----
        From: Glen Perkins [SMTP:gperkins@netcom.com]
        Sent: Thursday, March 19, 1998 12:19 PM
        To: Unicode List
        Subject: Re: Regular expressions in Unicode

        I'm very pleased to hear that you've contacted them. If I'm correct,
that
        what's done in Perl will serve as the model for most Unicode-savvy
regexes
        in the future, then getting to Larry Wall right now should save a
lot of
        wailing and gnashing of teeth in the future.

        Perl already has the ability to specify the rules for a sort. If you
write
        the following:

        @sortedListOfStrings = sort Danish @unsortedListOfStrings;

        and then you create a subroutine called "Danish" that follows a
prescribed
        API for sort routines and encapsulates Danish collation rules, it
will work.
        You can then use the word "Danish" as a modifier whenever you use
the
        keyword "sort" and get the desired result.

        This feature wasn't created for i18n purposes. It was primarily
intended to
        specify the field in a particular record that was to serve as the
sort key,
        and whether it was to be sorted numerically or alphabetically. It
just
        happens that once you have that capability, you can use it to
specify a
        specific set of collation rules for i18n work.

        If Larry is given a clear, reliable, authoritative checklist of
what's
        needed, it's likely that he'll be able to add that same ability (to
specify
        a rule set as an argument) to matching, ranges, types, and so forth.

        Once he has done so, everyone else who implements a regex package
for any
        programming language is likely to use Perl as the model (as they
currently
        tend to do) and implement the same features.

        Getting that checklist to him early is the key, IMO.

        __Glen Perkins__
        gperkins@netcom.com

        -----Original Message-----
        From: Mark Davis <medavis2@us.ibm.com>
        To: Unicode List <unicode@unicode.org>
        Date: Wednesday, March 18, 1998 1:57 PM
        Subject: Re: Regular expressions in Unicode (Was: Ethiopic text)

        I contacted them, and will be trying to set up a DL for this topic.
I have a
        few more thoughts on the issues gathered from some other material.

        The problems
        are basically the following:

        1. Types
        Since it is a large character set, you want to extend the "types" of
        characters
        that you can match against--for example, so that "digit" maps to any
of the
        Unicode digits, etc. The basic Unicode character categories consists
of:
        Letters, Punctuation, Symbols, Marks, Numbers, Separators, Other.
Each of
        these
        has different subcategories: for Letter, you have Uppercase,
Lowercase,
        Titlecase, Modifier, Other (in this case, "Other" includes uncased
letters
        such
        as Chinese). There are additional character properties that would be
useful
        to
        add.

        A regular-expression mechanism may not choose to offer all of these,
but it
        should be extended to more than just the ones applicable to English.
If a
        regular expression could be expressed as much as possible in terms
of
        higher-level semantic constructs (such as "vowel" instead of
[aAeEiIoOuU],
        it
        makes it easier to work with different alphabets and languages.

        Some of the Unicode categories are absolute (normative), while some
may have
        slight variations according to the locale. For example, Turkish
considers I
        (0049) and dotless-i (0131) to be case variants.

        See ftp://ftp.unicode.org/Public/UNIDATA/README.TXT and Chapter 4 of
The
        Unicode Standard (TUS) for more info.

        2. Ranges
        The order of Unicode characters may differ substantially from the
order
        expected by users of a particular language. One has to decide, for
example,
        whether the list [a-] means:
        - the Unicode characters between 0061 and 00E5 (including 'z', 'Z',
'[', and
        '') or
        - the lowercase letters between them, or
        - the letters in that order in the users' locale (which does not
include 'z'
        in
        English, but does include it in Swedish).

        3. Boundaries
        One or more Unicode characters may make up what the user thinks of
as a
        character (call this a "user character"). For example, "G" +
acute-accent
        may
        be thought of as a single character by users, yet is actually
represented by
        two Unicode characters. The general notion of the boundaries between
        characters
        is fairly easy to describe.

        What constitutes a boundary between words is trickier, since it is
dependent
        not just on the adjacent characters. You can have a general
algorithm that
        takes care of most of the world for character and word boundaries.
Note:
        word
        boundaries and line-break boundaries are not, in general, the same.

        See Section 5-13 of TUS for more info.

        4. Canonical Equivalencies
        There are many instances where a user character can be equivalently
        expressed
        by two different sequences of Unicode characters.

        See Section 2-5 and 3.9 of TUS for more info.

        4. For cases 1-3 above, there are two choices:
        a. locale-independent: the same regular expression always produces
the same
        result.
        b. locale-dependent: the results match more closely to a "user's"
        expectations,
        such as for [a-] in Danish.

        My opinion is that the base regular-expression mechanism should be
        locale-independent; for performance, simplicity, and reliability. A
general
        specification can handle most of the world well without
modification.

        If someone wants particular modules that translate particular locale
        requirements into a locale-independent form, that could be done.
Thus, for
        example, something like MakeRegex could be used to make a range of
letters
        from
        a particular alphabet, or transform a regular expression that is
written in
        terms of higher-level locale-independent constructs (such as
"vowel") into
        one
        that has variations for a particular language. For example, one
could image
        a
        MakeRange('a','') that produced the list [a] in German, and
[a-z] in
        Swedish.

        Mark

        _____________
        Mark Edward Davis, Program Director,
        IBM Centre for Java Technology SV
        800 El Camino Real West, Mountain View, CA 94040
        voice: (408) 777-5116, fax: (650) 335-2215
        mailto:medavis2@us.ibm.com, mailto:president@unicode.org

        ======================
        gperkins@netcom.com on 98/03/18 02:17:46
        Please respond to unicode@unicode.org @ internet
        To: unicode@unicode.org @ internet
        cc:
        Subject: Re: Regular expressions in Unicode (Was: Ethiopic text)

        I missed most of this Unicode regex discussion since I was down at
Internet
        World in Los Angeles, so I hope you'll forgive me for bringing it
back up.

        Perl is probably the gold standard for regex usage these days. Those
who
        produce commercial regex libraries for languages such as C++ and
Java
        usually base their regexes on Perl 5, since that's what regex users
demand.

        The Perl people and the XML people just concluded a summit in which
they
        agreed that Perl was the ideal XML parsing language--except for the
glaring
        problem that XML is Unicode/10646-based and Perl is fundamentally
"char ==
        byte" based.

        As a result, Larry Wall (Perl's creator) has agreed that upgrading
Perl to
        "Unicode compatibility" is his highest priority. Since Perl has
regexes
        built right into the syntax of the language itself, this will
require him to
        somehow implement a solution to regexes for Unicode. Once he has
done so, it
        is not unreasonable to assume that a lot of other developers will
follow
        suit and implement his regex solutions in their own software and in
various
        commercial class libraries.

        Larry is a very bright guy, and has a linguistics degree, but I
can't help
        but think that the whole world would benefit from a bit of
collaboration
        between him and some of you on the Unicode/Unicore mailing lists.
Getting it
        right in Perl now will probably save a lot of aggravation in the
future.

        I encourage anyone who is interested to check out:

        http://www.perl.com/perl-xml.html

        and go ahead and contact Larry. He's working at O'Reilly right now,
so he's
        probably something like lwall@ora.com, or something similar.

        For what it's worth, here's my $0.02:

        I think that all matches, ranges, and sorts should accept explicit
arguments
        that specify exactly what matches what, what series of chars is
being
        subsetted by a range, and what sorts before what. This way, all
platform
        dependence can be eliminated from the code. I also think that an
extra layer
        of abstraction is needed. Rather than comparing bits directly, you
would
        convert strings to a canonical form before attempting a match.

        On some machines, a given argument (such as a "use ISO 8859-1 for
all
        ranges" statement) would be compiled/interpreted right out of the
code,
        because it referred to the native encoding of the machine and could
be
        implemented in the DFA/NFA directly. The "canonical form" in many
such cases
        would just be the unconverted data. Matching Latin-1 search strings
to
        Latin-1 data shouldn't involve converting both to Unicode.

        On other machines, that encoding/matching/range/collating argument
might
        have to be backed by a subroutine that could be either a standard
feature,
        customized by the programmer, or implemented by the programmer from
scratch.
        Such subroutines would have a standard interface, but no specified
        implementation. Whether for matching (both char encoding conversion
and char
        equivalence info), ranges (series of chars in desired order from
which
        ranges/subsets are taken), or sorting (collation rules), they could
either
        be implemented via table lookup, or they could be algorithmic. This
would
        allow the programmer to either use pre-built subroutines
representing
        various standards, or override the subroutine with his own custom
version.
        It would also mean that a smart compiler/interpreter would be able
to
        completely optimize out the subroutines in many cases, making
Unicode
        regexes just as fast as non-Unicode regexes at doing plain 'ol
Latin-1 sorts
        and matches on a plain 'ol Latin-1 machine, while still allowing
decomposed
        Hangul in one encoding to be used as a search string for
pre-composed Hangul
        in another encoding--albeit more slowly.

        I suppose that if the arguments are not specified explicitly, the
regexes'
        behavior could default to locale-specific behavior, but I'm not fond
of code
        that behaves differently on different platforms. It's asking for
trouble.
        I'd rather the defaults be fixed, but that can get pretty
contentious. If
        the arguments *are* explicit, though, the same code shouldn't
function
        differently in different locales. It should just have more efficient
or less
        efficient implementations, depending on how closely the underlying
OS
        matched the operations required by the code.

        Regarding Ken's suggestion of more user-friendly regex approaches,
I'm all
        in favor of user-friendly regex-building "wizards". You could allow
the user
        to see the resulting regex, even allowing them to edit it directly
if
        desired, or keep it hidden internally. I do think it's likely,
though, that
        there will be enough regex library code out there that most
programmers will
        want to use regexes internally for powerful searches, regardless of
the
        interface shown to the user.

        __Glen Perkins__

        



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:40 EDT