RE: Regular expressions in Unicode (Was: Ethiopic text)

From: Gianni Mariani (
Date: Fri Mar 13 1998 - 11:56:49 EST

        I'm confused about this discussion.

        Regular expressions translate to state machines. State machines
can be run over Unicode strings just as they can over strings in any other
encoding. I have most of the makings of a lexical scanner generator for
Unicode that I wrote years ago.

        Syntax like "*" and "?" has a very generic meaning and works fine with
Unicode; it translates to a set of states and transitions.
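
To make the point concrete, here is a minimal sketch of a hand-built state
machine for the pattern "zero or more U+1200, then an optional U+1208". The
state names and the table layout are my own assumptions for illustration,
not the output of any particular scanner generator. The table is keyed on
code points, so nothing in it depends on how the text is encoded.

```python
# Hand-built DFA for the pattern  HA* LA?  over Ethiopic code points.
# All names here are illustrative assumptions.
HA, LA = "\u1200", "\u1208"

START, AFTER_LA = 0, 1
ACCEPTING = {START, AFTER_LA}

# Transition table keyed by (state, character). A Python str is a sequence
# of code points, so the table never sees bytes or encoding details.
TRANSITIONS = {
    (START, HA): START,      # "*": loop on HA
    (START, LA): AFTER_LA,   # "?": take LA at most once
}


def dfa_match(text: str) -> bool:
    """Return True if the whole of text is accepted by the DFA."""
    state = START
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:    # no transition defined: reject
            return False
    return state in ACCEPTING
```

Swapping in Latin, Cyrillic, or any other code points changes only the
entries in the table, not the matching loop.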

        What am I missing ?

        Hallvard replied:

> I'm a UNIX programmer, but also a UNIX user.

        Tautologous, I'm afraid. Any UNIX user who uses a regexp, is, as I
        see it, by definition a UNIX programmer.

        I'm not pooh-poohing the need for regexp's, or their usefulness
        in making things happen on a UNIX system. But the set of people
        who know more about regexp (on any platform) than the use of "*"
        in directory listings is basically a disjunct set from the
        "users" of interest to 98% of the world software market.

> > In my opinion, people should be thinking more generically about how to
> > extend and abstract the concepts of string pattern matching in the
> > context of the universal character set, rather than focussing on how
> > to "fix" regexp syntax per se for Unicode.
> What exactly do you have in mind?

        For example, a mechanism which is widely surfaced in databases
        for pattern matching is provision of a pattern mask. E.g. using "L" to
        stand for any letter, "N" for any digit, "?" for any character, etc.
        By virtue of their simplicity, these are much easier to explain to
        users, who can make useful choices with them *despite* the lack of
        power.
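
        A pattern mask of this kind can be sketched in a few lines. The
        function name and the rule that any other mask character matches
        itself are assumptions made for illustration; real database
        products differ in the details. "Letter" and "digit" are taken
        here from Unicode character properties rather than from ASCII
        ranges.

```python
def mask_match(mask: str, text: str) -> bool:
    """Match text against a fixed-width mask (hypothetical syntax).

    "L" = any letter, "N" = any decimal digit, "?" = any character.
    Any other mask character must match itself exactly.
    """
    if len(mask) != len(text):
        return False
    for m, ch in zip(mask, text):
        if m == "L":
            ok = ch.isalpha()      # Unicode letters, not just A-Z
        elif m == "N":
            ok = ch.isdecimal()    # Unicode decimal digits, not just 0-9
        elif m == "?":
            ok = True
        else:
            ok = (m == ch)
        if not ok:
            return False
    return True
```

        With these definitions, mask_match("LLNN", "ab12") holds, and so
        does the same mask applied to two Cyrillic letters followed by
        two Arabic-Indic digits. The lack of power shows up at once: a
        literal "L" cannot be expressed in the mask.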

        Or the LIKE clause in a typical SQL implementation, which uses a
        very, very pared down regexp syntax ("%" for a match of zero or more
        characters, "_" for a match of a single character, [abc] for a set,
        [a-f] for a range, and "^" for exclusion from a set or range).
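
        As a rough sketch of how small this syntax is, the following
        translates the five constructs listed above into a Python regular
        expression. The function names are hypothetical, and the handling
        of edge cases (an unterminated "[" is treated as a literal, there
        is no escape character) is an assumption; actual SQL dialects vary.

```python
import re


def _bracket(pattern: str, i: int):
    """Translate a [...] group starting at index i.

    Returns (regex_class, next_index), or None if the group is
    unterminated or empty, in which case "[" is treated as a literal.
    """
    close = pattern.find("]", i + 1)
    if close == -1:
        return None
    body = pattern[i + 1:close]
    negated = body.startswith("^")
    if negated:
        body = body[1:]
    if not body:
        return None
    # Escape characters that are special inside a regex class; keep "-"
    # so that [a-f] remains a range.
    escaped = "".join("\\" + ch if ch in "\\^[" else ch for ch in body)
    return "[" + ("^" if negated else "") + escaped + "]", close + 1


def like_to_regex(pattern: str) -> str:
    """Translate a LIKE-style pattern into Python regex source."""
    parts, i = [], 0
    while i < len(pattern):
        c = pattern[i]
        group = _bracket(pattern, i) if c == "[" else None
        if group is not None:
            parts.append(group[0])
            i = group[1]
            continue
        if c == "%":
            parts.append(".*")          # zero or more characters
        elif c == "_":
            parts.append(".")           # exactly one character
        else:
            parts.append(re.escape(c))  # everything else is literal
        i += 1
    return "".join(parts)


def like_match(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches the LIKE-style pattern."""
    regex = like_to_regex(pattern)
    return re.fullmatch(regex, text, flags=re.DOTALL) is not None
```

        Here like_match("aardv%k", "aardvark") and
        like_match("aardv__k", "aardvark") both hold, and "_" consumes
        one code point whatever script it comes from.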

        Or the Microsoft Word find expression syntax (essentially another
        example of masking), which uses "^?" for any character, "^#" for any digit,
        "^$" for any letter, etc., as well as format specifications which
        carried in the find string itself.

        Or the Microsoft Excel find expression "wildcard" syntax -- even
        simpler, using "?" to match a single character or "*" to match any
        string.

        Not that any of these is "better" than general regexp, but they are
        in widespread use and are more comprehensible to end users. (Even
        LIKE clause syntax is often hidden completely behind the UI of
        an application that finds some other way to surface choices to the
        user for narrowing down a match.)

        Actually, what I had in mind was more along the lines of a serious,
        holistic analysis of what string pattern matching means in a
        character set context, accompanied by some thinking about how
        abstractions for pattern matching could result in different levels
        (implemented differently), depending on application needs.

> I don't think the syntax is
> important, but the power and compactness *is* important - and then
> end up with more or less the same syntax.

        Sometimes *lack* of power is important. A Turing machine is not the
        proper engineering answer for provision of a device to turn a light
        on or off.

> People are not going to write
> 20 lines of grammar or whatever if they could write two
> regexps.

        Substitute "UNIX programmers" for "People" in that statement, and I
        would agree!

> Or did you mean to base this alternative on something else
> than Deterministic Finite Automatons? If so, what?

        As an engineer, I try to build what is needed. If, for an application,
        a DFA is required, then I would build that. But the needs might be
        greater or lesser, depending.

        Sandra O'Donnell responded:

> > Keep in mind that end users don't use regular expressions
> > (unless forced to by user-vicious UI's) -- it is UNIX
> > programmers who use regular expressions.
> I disagree. Programmers use regular expressions because
> that is the mechanism they have had to give users "logical"
> behavior. Just as users want to see lists in an order that
> makes sense to them, they often want to grab subsets of those
> lists -- and what a subset includes differs depending on
> the language they speak. Regular expressions have been an
> extremely common way to give users the varying subsets they want.

        I certainly concur that regular expressions are a useful way
        for programmers to implement logical behavior that then
        gets surfaced to a user somehow. But whether regular expressions
        per se are an "extremely common way to give users the varying
        subsets they want," -- I dunno. I don't see any use in
        Microsoft Office, in Browsers, or in Web search engines.
        (Lycos finds "aardvark" just fine, but barfs on "aardv??k" or
        "aardv*k".) And those I *would* consider extremely common

> > In my opinion, people should be thinking more generically
> > about how to extend and abstract the concepts of
> > string pattern matching in the context of the universal
> > character set, rather than focussing on how to "fix"
> > regexp syntax per se for Unicode.
> That's reasonable. Some concepts don't expand infinitely
> well. However, whatever replaces regexp still has to deal
> with users' varying expectations of what a given range
> includes. Users definitely should not have to be aware of
> how characters are encoded or whether they're using a large
> or small coded character set; their ranges should "just work."

        I agree completely. This is one of the essential reasons for
        rethinking the whole issue in terms of the Universal
        Character Set. We need to divorce the problem of what it means
        to specify a range from the particular encoding contingencies.
        A range should be defined *on* Unicode *for* a particular
        collation. How that range is implemented for
        software running in some other encoding than Unicode is
        then a separate issue. (My tentative answer would be to
        implement a Unicode engine inside and convert expressions
        to Unicode before evaluation -- the same answer that parsers
        in general should take.)
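
        A minimal sketch of that separation, under loud assumptions: the
        "collation key" below is a toy that merely ignores case and
        accents. It is not ISO 14651 and not the Default Unicode
        Collation, and all the function names are made up for
        illustration. The point is only that the range test takes the
        ordering as a parameter, and that legacy-encoded input is
        converted to Unicode before the test is applied.

```python
import unicodedata


def codepoint_key(ch: str) -> int:
    """Order characters by raw code point (the encoding-bound view)."""
    return ord(ch)


def toy_collation_key(ch: str) -> str:
    """Toy stand-in for a real collation key: ignore case and accents."""
    decomposed = unicodedata.normalize("NFD", ch.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def in_range(ch: str, low: str, high: str, key) -> bool:
    """Test low <= ch <= high under the ordering defined by key."""
    return key(low) <= key(ch) <= key(high)


# Input arriving in another encoding is converted to Unicode first;
# the range test itself never sees bytes.
legacy_e_acute = b"\xe9".decode("latin-1")   # U+00E9

by_codepoint = in_range(legacy_e_acute, "a", "z", codepoint_key)
by_collation = in_range(legacy_e_acute, "a", "z", toy_collation_key)
```

        By code point, U+00E9 falls outside "a" to "z"; under the toy
        key it falls inside. A user asking for "a to z" most likely
        expects the second answer.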

        Of course, that then raises the issue of what it means to
        define a range *for* a particular collation. This is going
        to be very messy, since the concept of a mapping between one
        collation and another is very complex -- much more complex
        than the concept of a mapping between one character set
        encoding and another. But the work on ISO 14651 and the
        Default Unicode Collation work is starting to address
        these issues.


> -----------------------
> Sandra Martin O'Donnell
