Re: regular expressions

From: odonnell@zk3.dec.com
Date: Wed Feb 05 1997 - 14:10:14 EST


I've seen lots of replies to the original question about support for
non-English characters in UNIX regular expressions (REs), but I haven't
seen a detailed explanation of what support exists. Regular expressions
have been internationalized, and here is what is available on systems
that support POSIX and/or XPG (the X/Open Portability Guide).

There are three major concepts that didn't exist in
non-internationalized versions of REs:

1. Collating symbols -- a character or group of characters that
must be treated as a unit for collation. The "ch" in traditional
Spanish sorting is a classic example. Here's an example of the
syntax:

   ^[a-[.ch.]] matches any word that begins with
                a, a-grave, b, c, ch (in a typical
                Spanish locale)

2. Equivalence classes -- a group of characters or collating
symbols that sort to the same primary location. For example,
in some languages, the letter "e" with and without diacritics
and in both uppercase and lowercase sort to the same primary
location. They therefore make up an equivalence class. Here's
an example of the syntax:

   [[=e=]f] matches all characters that sort to the same
                primary location as "e", as well as matching
                the letter "f"

3. Character class expressions -- match the characters that are
defined as being part of a named character class. Certain classes
*must* be supported on standards-compliant systems (alnum, alpha,
blank, cntrl, digit, graph, lower, print, punct, space, upper,
and xdigit), while others are implementation-defined. Here's an
example of the syntax:

   [[:upper:]] matches all characters defined as uppercase in
                the current locale
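
Whether a given bracket expression is accepted depends on the current
locale: a collating symbol or class name that the locale does not
define makes regcomp() fail (POSIX defines the error codes REG_ECOLLATE
and REG_ECTYPE for these cases, though which one you get may vary by
implementation). Here is a minimal sketch in C for checking that. The
helper name bracket_compiles() is my own invention for illustration,
not a standard function:

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper, not part of POSIX: compile a pattern as an
 * extended RE in the current locale and return regcomp()'s result
 * (0 on success). If errbuf is non-NULL, it receives the message
 * that regerror() produces for a failure. */
int bracket_compiles(const char *pattern, char *errbuf, size_t errlen)
{
    regex_t re;
    int rc;

    assert(pattern != NULL);

    rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc != 0) {
        /* Compilation failed: report why, and do not call regfree(). */
        if (errbuf != NULL && errlen > 0)
            regerror(rc, &re, errbuf, errlen);
        return rc;
    }
    regfree(&re);
    return 0;
}
```

Calling it with "[[:upper:]]" should succeed in any locale, since upper
is one of the required classes. A pattern such as "^[a-[.ch.]]" is only
expected to compile under a locale that defines "ch" as a collating
element.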

All of these internationalized RE concepts can include characters
beyond what is in ASCII. And because all use the current locale,
the "same" RE can yield different results when run under different
locales. For example, [[:upper:]] matches only ASCII A-Z when run
under the C or POSIX locale, but matches a larger set of characters
when run under a typical French, Russian, or Japanese locale.
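
To see the locale dependence for yourself, something like the following
sketch may help. It switches to a named locale, runs one RE against one
string, and then restores the previous locale. The function name
match_in_locale() and its return codes are assumptions of mine, and
locale names other than "C" and "POSIX" differ from system to system:

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 on a match, 0 on no match,
 * -1 if the pattern does not compile in that locale, and
 * -2 if the locale cannot be set (or memory runs out). */
int match_in_locale(const char *locale_name, const char *pattern,
                    const char *text)
{
    regex_t re;
    const char *cur;
    char *saved;
    int result;

    assert(locale_name != NULL && pattern != NULL && text != NULL);

    /* Copy the current locale name; a later setlocale() call may
     * overwrite the string that this call returns. */
    cur = setlocale(LC_ALL, NULL);
    if (cur == NULL || (saved = malloc(strlen(cur) + 1)) == NULL)
        return -2;
    strcpy(saved, cur);

    if (setlocale(LC_ALL, locale_name) == NULL) {
        free(saved);
        return -2;
    }

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
        result = -1;
    } else {
        result = (regexec(&re, text, 0, NULL, 0) == 0) ? 1 : 0;
        regfree(&re);
    }

    setlocale(LC_ALL, saved);   /* restore the caller's locale */
    free(saved);
    return result;
}
```

Running the same "[[:upper:]]" pattern against an accented capital
under "C" and then under an installed French locale should show the
difference, provided the text is encoded in that locale's codeset.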

In addition to these RE concepts, POSIX allows implementations to
add non-standard classes to locales. The national standard Japanese
locale is a good example of one that contains such classes. They
include: kanji, kana, jpunct, and others. XPG provides two functions
for accessing implementation-defined classes (though they also
can be used for standard classes). They are:

   wctype_t wctype(const char *charclass);
   int iswctype(wint_t wc, wctype_t charclass);

The first tests whether the class name charclass is valid for the
current locale. If it is, it returns a nonzero value of type wctype_t
for use in the second function; if it is not, it returns zero.
iswctype() tests whether the wide character wc belongs to the class
that the wctype_t value describes.
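
The two calls are normally used together. Here is a small sketch; the
wrapper name wc_in_class() and its three-way return value are my own
choices for illustration, not part of XPG:

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical wrapper: returns 1 if wc belongs to the named class,
 * 0 if it does not, and -1 if the current locale does not define a
 * class with that name. */
int wc_in_class(wint_t wc, const char *class_name)
{
    wctype_t desc;

    assert(class_name != NULL);

    desc = wctype(class_name);      /* zero means "no such class" */
    if (desc == (wctype_t)0)
        return -1;
    return iswctype(wc, desc) ? 1 : 0;
}
```

With a standard class such as "upper" this behaves like iswupper().
With an implementation-defined class such as the Japanese ones
mentioned above, the -1 return tells you the current locale does not
provide it.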

For Unicode users: as implemented on most UNIX-like systems, you
can use UTF-8 encoded data in the REs and they'll work fine when run
under a UTF-8 locale. The XPG functions take wide characters
(wchar_t), which on these systems typically hold UCS-2 or UCS-4
values.
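
Since iswctype() works on wide characters rather than on multibyte
strings, multibyte text (UTF-8 or otherwise) has to be converted
first. One way to do that is the standard mbrtowc() function, which
decodes according to the current locale's codeset. The sketch below
counts the characters of a string that fall in a named class; the
function name count_in_class() is hypothetical:

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count the characters of the multibyte string
 * mbs that belong to class_name in the current locale. Returns -1 if
 * the class is unknown or mbs is not valid in the locale's codeset. */
long count_in_class(const char *mbs, const char *class_name)
{
    mbstate_t st;
    wctype_t desc;
    wchar_t wc;
    size_t left, n;
    long count = 0;

    assert(mbs != NULL && class_name != NULL);

    desc = wctype(class_name);
    if (desc == (wctype_t)0)
        return -1;

    memset(&st, 0, sizeof st);      /* start in the initial shift state */
    left = strlen(mbs);
    while (left > 0) {
        n = mbrtowc(&wc, mbs, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;              /* invalid or truncated sequence */
        if (n == 0)
            break;                  /* decoded a null character */
        if (iswctype((wint_t)wc, desc))
            count++;
        mbs += n;
        left -= n;
    }
    return count;
}
```

The program must have called setlocale() with a locale whose codeset
matches the data (a UTF-8 locale for UTF-8 text) before calling this,
or the conversion will reject non-ASCII bytes.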

I hope this helps clarify matters.

                -- Sandra
-----------------------
Sandra Martin O'Donnell
odonnell@zk3.dec.com


