Regular expressions in Unicode (Was: Ethiopic text)

From: Hallvard B Furuseth
Date: Thu Mar 12 1998 - 03:43:20 EST

> On the subject of regular-expression support for Unicode, the POSIX
> definition of regexps includes recognition of character classes. I
> believe that the regexp package in GNU gawk, available at
> ,
> has the POSIX definition implemented. While it is still based on
> 8-bit characters, it might prove a suitable starting point for Unicode
> support.

Has anybody defined or implemented "Unicode regular expressions" for a
program which uses Unicode internally? In particular, I wonder about
character ranges: If the user says "[└-┼]" in his 8-bit charset (not
latin-1), then the program should use the characters from └ to ┼ in the
user's charset, not the range of iso10646 character codes from └ to ┼.
So it seems that Unicode strings containing regexps must be tagged with
their "source charset". OTOH, [\200-\377] probably means "all non-ASCII
characters". And how do you say "all non-ASCII Unicode characters"?
[\200-\3777777777]? :-)
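
To make the range question concrete, here is a rough Python sketch. It
assumes the user's 8-bit charset is cp437 (the post does not name one),
and charset_range_class is a hypothetical helper, not part of any regexp
package. It expands a range in the charset's byte order and compares it
with the same range taken over Unicode code points. The last pattern
assumes the code space ends at U+10FFFF.

```python
import re

def charset_range_class(lo: str, hi: str, charset: str) -> str:
    """Expand lo..hi in the byte order of an 8-bit charset (hypothetical helper)."""
    lo_bytes, hi_bytes = lo.encode(charset), hi.encode(charset)
    if len(lo_bytes) != 1 or len(hi_bytes) != 1:
        raise ValueError("endpoints must be single bytes in " + charset)
    chars = []
    for b in range(lo_bytes[0], hi_bytes[0] + 1):
        try:
            chars.append(bytes([b]).decode(charset))
        except UnicodeDecodeError:
            continue  # byte not assigned in this charset
    # Escape each member so characters like ']' or '-' stay literal.
    return "[" + "".join(re.escape(c) for c in chars) + "]"

# Range as the user meant it: bytes 0xC0..0xC5 in cp437 (assumed charset).
charset_class = re.compile(charset_range_class("└", "┼", "cp437"))

# Same endpoints taken as a code-point range: U+2514..U+253C.
codepoint_class = re.compile("[└-┼]")

# One way to say "all non-ASCII characters" over code points.
non_ascii = re.compile("[\u0080-\U0010FFFF]")
```

With these definitions, "─" (byte 0xC4 in cp437, but U+2500) matches
charset_class and not codepoint_class, while "┘" (U+2518, byte 0xD9 in
cp437) does the opposite. So the two readings of the same range really
do pick out different sets, which is why the pattern needs to carry its
source charset along.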


