> On the subject of regular-expression support for Unicode, the POSIX
> definition of regexps includes recognition of character classes.  I
> believe that the regexp package in GNU gawk, available at
> 
> 	ftp://prep.ai.mit.edu/pub/gnu/gawk-3.0.3.tar.gz ,
> 
> has the POSIX definition implemented.  While it is still based on
> 8-bit characters, it might prove a suitable starting point for Unicode
> support.
Has anybody defined or implemented "Unicode regular expressions" for a
program which uses Unicode internally?  In particular, I wonder about
character ranges: If the user says "[À-Å]" in his 8-bit charset (not
latin-1), then the program should use the characters from À to Å in the
user's charset, not the range of ISO 10646 character codes from À to Å.
So it seems that Unicode strings containing regexps must be tagged with
their "source charset".  OTOH, [\200-\377] probably means "all non-ASCII
characters".  And how do you say "all non-ASCII Unicode characters"?
[\200-\3777777777]? :-)
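
To make the range question concrete, here is a rough Python sketch, not a definitive implementation. It expands a bracket range by byte value in the user's 8-bit charset and compares the result with the same range taken over Unicode code points. The helper names (charset_range, codepoint_range, range_to_class) are made up for illustration, and KOI8-R with the range [а-я] is my own choice of a charset whose ordering differs from Unicode; any other single-byte charset known to Python's codecs could be substituted. The last lines show one way to write "all non-ASCII characters" in an engine that matches on code points.

```python
import re

def charset_range(lo, hi, encoding):
    """Expand [lo-hi] by byte value in an 8-bit charset, as Unicode chars."""
    lo_b, hi_b = lo.encode(encoding), hi.encode(encoding)
    if len(lo_b) != 1 or len(hi_b) != 1:
        raise ValueError("endpoints must be single bytes in " + encoding)
    chars = set()
    for b in range(lo_b[0], hi_b[0] + 1):
        try:
            chars.add(bytes([b]).decode(encoding))
        except UnicodeDecodeError:
            pass  # byte value is unassigned in this charset
    return chars

def codepoint_range(lo, hi):
    """Expand [lo-hi] by Unicode code point."""
    return {chr(cp) for cp in range(ord(lo), ord(hi) + 1)}

def range_to_class(chars):
    """Rewrite an expanded range as an explicit Unicode bracket expression."""
    if not chars:
        raise ValueError("empty range")
    return "[" + "".join(re.escape(c) for c in sorted(chars)) + "]"

# In Latin-1 the two interpretations agree, because Latin-1 byte values
# coincide with the first 256 Unicode code points.
latin1 = charset_range("À", "Å", "latin-1")
same = latin1 == codepoint_range("À", "Å")

# In KOI8-R (assumed example) the letters are ordered differently, so the
# user's range and the code-point range are different sets.
koi = charset_range("а", "я", "koi8_r")
uni = codepoint_range("а", "я")
differs = koi != uni

# The range the user meant, expressed against Unicode text.
koi_class = re.compile(range_to_class(koi))

# "All non-ASCII characters" as the complement of the ASCII range.
NON_ASCII = re.compile("[^\x00-\x7f]")
```

The point of range_to_class is that a pattern tagged with its source charset can be rewritten once, at compile time, into an explicit set of Unicode characters, so the matcher itself never needs to know about the original charset.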
-- Hallvard