Regular expressions in Unicode (Was: Ethiopic text)

From: Hallvard B Furuseth (h.b.furuseth@usit.uio.no)
Date: Thu Mar 12 1998 - 03:43:20 EST


> On the subject of regular-expression support for Unicode, the POSIX
> definition of regexps includes recognition of character classes. I
> believe that the regexp package in GNU gawk, available at
>
> ftp://prep.ai.mit.edu/pub/gnu/gawk-3.0.3.tar.gz ,
>
> has the POSIX definition implemented. While it is still based on
> 8-bit characters, it might prove a suitable starting point for Unicode
> support.

Has anybody defined or implemented "Unicode regular expressions" for a
program which uses Unicode internally? In particular, I wonder about
character ranges: If the user says "[À-Å]" in his 8-bit charset (not
Latin-1), then the program should use the characters from À to Å in the
user's charset, not the range of ISO 10646 character codes from À to Å.
So it seems that Unicode strings containing regexps must be tagged with
their "source charset". OTOH, [\200-\377] probably means "all non-ASCII
characters". And how do you say "all non-ASCII Unicode characters"?
[\200-\3777777777]? :-)
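
To make the range question concrete, here is a minimal Python sketch of
what "tagging a range with its source charset" could look like. The
helper name charset_range and the choice of cp037 (an EBCDIC code page)
as the user's 8-bit charset are my own assumptions for illustration, not
something an existing regexp package defines. The last pattern shows one
way to write "all non-ASCII characters" in Python's re module, assuming
U+10FFFF as the top of the code space.

```python
import re


def charset_range(lo: str, hi: str, charset: str) -> str:
    """Expand the range lo-hi by byte value in an 8-bit `charset` and
    return an equivalent character class over Unicode characters.

    Hypothetical helper: it illustrates resolving a range in the user's
    charset instead of by Unicode code point.
    """
    lo_bytes = lo.encode(charset)
    hi_bytes = hi.encode(charset)
    if len(lo_bytes) != 1 or len(hi_bytes) != 1:
        raise ValueError("endpoints must be single bytes in " + charset)
    start, end = lo_bytes[0], hi_bytes[0]
    if start > end:
        raise ValueError("range is reversed in " + charset)

    members = []
    for byte in range(start, end + 1):
        try:
            members.append(bytes([byte]).decode(charset))
        except UnicodeDecodeError:
            continue  # byte value is unassigned in this charset
    # Escape each member so that ']', '^', '-' and '\' stay literal.
    return "[" + "".join(re.escape(ch) for ch in members) + "]"


# The same range, read as Unicode code points U+00C0..U+00C5.
by_code_point = re.compile("[À-Å]")

# The same range, read in the user's charset (cp037 is an assumption).
by_charset = re.compile(charset_range("À", "Å", "cp037"))

# One spelling of "all non-ASCII characters" in Python's re module.
non_ascii = re.compile("[\u0080-\U0010FFFF]")

for ch in "ÀÁÂÃÄÅ":
    print(ch, bool(by_code_point.fullmatch(ch)), bool(by_charset.fullmatch(ch)))
```

In cp037 the bytes from À to Å cover only À, Á, Ã and Å, so Â and Ä
match the code-point reading but not the charset reading. With Latin-1
as the charset the two readings coincide.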

-- 
Hallvard


