Re: regular expressions

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jan 30 1997 - 17:00:08 EST


Alain LaBonté wrote:

>
> At 12:59 97-01-30 -0800, Mark Davis wrote:
> >[A-C,a-c,@Sc!$] meaning letters a-z and currency symbols, except for $
>     ^Z  ^z
>
> Sorry for my ignorance, but does it means a to z including <a!> (à), <a^>
> (â), <c,> (ç) [and upper case and so on] too? What about the Icelandic
> "thorn", which in CAN/CSA standard Z243.4.1 (Ordering) comes with "th" while
> in the European prenorm on ordering and in ISO/IEC CD 14651 it comes after "z"?

In traditional regular expressions, [A-Z] would mean ASCII 0x41..0x5A, which is
one of the reasons why people have been mentioning unexpected results which occur
on EBCDIC platforms (where 'A'..'Z' is not a contiguous range of characters).
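
To make the EBCDIC point concrete, here is a minimal sketch in Python. It assumes the "cp037" codec is an acceptable stand-in for "an EBCDIC platform" (other EBCDIC code pages differ in detail), and the helper name chars_in_byte_range is made up for this illustration. It decodes every byte value between the encodings of 'A' and 'Z' and counts what falls inside the range.

```python
import string

def chars_in_byte_range(first, last, codec):
    """Decode every byte value from first to last (inclusive) in the given codec."""
    lo = first.encode(codec)[0]
    hi = last.encode(codec)[0]
    return [bytes([b]).decode(codec) for b in range(lo, hi + 1)]

# In ASCII the byte range 'A'..'Z' holds exactly the 26 capital letters.
ascii_range = chars_in_byte_range("A", "Z", "ascii")

# In EBCDIC (cp037, assumed here) the letters sit in three separate runs,
# so the same byte range also picks up characters that are not A-Z.
ebcdic_range = chars_in_byte_range("A", "Z", "cp037")
ebcdic_extras = [c for c in ebcdic_range if c not in string.ascii_uppercase]

print(len(ascii_range), len(ebcdic_range), len(ebcdic_extras))
```

A byte-range reading of [A-Z] therefore matches more than the 26 letters on such a platform, which is the kind of unexpected result mentioned above.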

In keeping with the design paradigm of Unicode as "Wide ASCII" (cf. page 1-1
of the Unicode Standard), a regular expression syntax of the sort [\u0041-\u005a]
should refer to the contiguous range in the *universal* character set, namely
Unicode, and should have an unambiguous interpretation on all platforms which implement
Unicode correctly. Issues of arbitrary alphabet repertoires and language orderings
should be completely beside the point. A regular expression to specify all and
only the content of the Icelandic alphabet would simply contain the complete
list of Unicode characters required; a good regular expression syntax would have
a convenient way to refer to such a list through a symbolic shortcut (as suggested
by Mark Davis' and Mark Leisher's contributions).
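
As a rough illustration of the two approaches, the sketch below uses Python's re module, which accepts \uXXXX escapes inside a character class. The 32-letter Icelandic list is my own transcription of the modern alphabet, not something taken from this thread, and the names ASCII_UPPER and ICELANDIC_CLASS are hypothetical.

```python
import re

# A code point range: always U+0041..U+005A, whatever the platform.
ASCII_UPPER = re.compile(r"[\u0041-\u005a]")

# Assumption: the modern Icelandic alphabet, written out explicitly.
# It has no C, Q, W or Z, and it includes eth, thorn, ae and o-diaeresis.
ICELANDIC_UPPER = "AÁBDÐEÉFGHIÍJKLMNOÓPRSTUÚVXYÝÞÆÖ"
ICELANDIC = ICELANDIC_UPPER + ICELANDIC_UPPER.lower()

# The "symbolic shortcut" is just a name bound to the complete list.
ICELANDIC_CLASS = re.compile("[" + re.escape(ICELANDIC) + "]")

for ch in ["Z", "Þ", "þ", "C"]:
    print(ch, bool(ASCII_UPPER.fullmatch(ch)), bool(ICELANDIC_CLASS.fullmatch(ch)))
```

Where thorn sorts (with "th" or after "z") never enters into it: the range is defined by code points, and the alphabet is defined by membership in an explicit list.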

In Mark Davis' syntax:

>exp := @category

>So you could say

>[@L!A] meaning all letters except for A

The trick is getting everybody's usage of the categories to make
sense.

For that, see the category field in the latest update of the Unicode Character
Database, and the associated property list files, all available by anonymous
ftp from the unicode.org site:

ftp://unicode.org/pub/2.0-Update
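
For a feel of how a category test might behave, here is a small Python sketch. It is only one possible reading of the [@category!excluded] notation quoted above, not a parser for it; the function name matches_category is hypothetical. Python's unicodedata module reports the general category from whatever version of the Unicode Character Database ships with the interpreter, which is later than the 2.0 update.

```python
import unicodedata

def matches_category(ch, category, excluded=""):
    """Assumed reading of [@category!excluded]: ch is in the category and not excluded.

    A one-letter category such as "L" is treated as a prefix, so it covers
    Lu, Ll, Lt, Lm and Lo; a two-letter one such as "Sc" must match exactly.
    """
    return unicodedata.category(ch).startswith(category) and ch not in excluded

# [@L!A]: all letters except A
print(matches_category("B", "L", "A"), matches_category("A", "L", "A"))

# [@Sc!$]: currency symbols except the dollar sign
print(matches_category("£", "Sc", "$"), matches_category("$", "Sc", "$"))
```

The answers depend entirely on the category assignments in the database, which is why agreeing on the categories matters more than the notation.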

--Ken Whistler


