Re: Regular expressions in Unicode (Was: Ethiopic text)

From: Jeroen Hellingman
Date: Thu Mar 12 1998 - 04:28:36 EST

> Have anybody defined or implemented "Unicode regular expressions" for a
> program which uses Unicode internally? In particular, I wonder about
> character ranges: If the user says "[-]" in his 8-bit charset (not
> latin-1), then the program should use the characters from to in the
> user's charset, not the range of iso10646 character codes from to .
> So it seems that Unicode strings containing regexps must be tagged with
> their "source charset". OTOH, [\200-\377] probably means "all non-ASCII
> characters". And how do you say "all non-ascii Unicode characters"?
> [\200-\3777777777]? :-)

This opens up a whole range of challenges. I would say that this
depends on the user's locale. For example, if I am Danish, I would
expect all letters from A to A-ring to match if I say [A-Å], according
to the Danish alphabet. In England I might expect to get all letters A,
irrespective of any accents on them. It would be quite unexpected to
match almost all characters if some user enters [A-Z], when the Z
happens to come from the compatibility zone at the high end of Unicode.
This means you'll have to do some locale-defined normalisation on your
data before pattern matching, comparable to what sorting and searching
operations need. I wouldn't bother about the original charset: when
using Unicode, the user expects Unicode.
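
A minimal sketch of the point above, in Python. It is only an
illustration under stated assumptions: the hard-coded alphabet stands in
for real locale data, and the names DANISH_ALPHABET, in_locale_range and
fold_accents are hypothetical, not part of any library.

```python
import re
import unicodedata

# Assumption: a hard-coded Danish alphabet order (A..Z, then AE, O-slash, A-ring).
# A real implementation would take this from locale/collation data.
DANISH_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ\u00c6\u00d8\u00c5"
RANK = {letter: i for i, letter in enumerate(DANISH_ALPHABET)}


def in_locale_range(ch, first, last):
    """True if ch lies between first and last in alphabet order.

    first and last must themselves be letters of the alphabet.
    """
    # Fold compatibility forms (e.g. fullwidth letters) and case first.
    key = unicodedata.normalize("NFKC", ch).upper()
    if key not in RANK:
        return False
    return RANK[first] <= RANK[key] <= RANK[last]


def fold_accents(text):
    """Compatibility-decompose text and drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# A plain code-point range [A-A-ring] misses O-slash (U+00D8 > U+00C5) ...
codepoint_hit = re.fullmatch("[A-\u00c5]", "\u00d8") is not None
# ... yet it accepts "[" (U+005B), which is not a letter at all.
bracket_hit = re.fullmatch("[A-\u00c5]", "[") is not None
# The alphabet-order range behaves as a Danish user would expect.
locale_hit = in_locale_range("\u00d8", "A", "\u00c5")

# A range ending in fullwidth Z (U+FF3A) spans most of the code space
# unless the pattern is normalised first.
folded_z = fold_accents("\uff3a")
```

Here fold_accents models the "all letters A, irrespective of accents"
expectation, while in_locale_range models the Danish one; which of the
two applies is exactly the locale-dependent choice.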

Jeroen Hellingman
+---- Jeroen Hellingman ---------------------------------------------------------+
| work: Ericsson Telecommunicatie B.V., Ericssonstraat 2, Rijen, The Netherlands |
| Department ETM/RPU, Room 17116 |
| Tel: +31 161 242022 (834 2022) |
| home: Aletta Jacobsstraat 5, 3404 XD IJsselstein, The Netherlands |
| Tel: +31 30 6875444 |
