Re: Regular expressions in Unicode (Was: Ethiopic text)

From: Jeroen Hellingman (etmjehe@genesis.etm.ericsson.se)
Date: Thu Mar 12 1998 - 04:28:36 EST

Next message: Hallvard B Furuseth: "Re: Regular expressions in Unicode (Was: Ethiopic text)"
Previous message: Bob Verbrugge: "unicodedata-2.0.14.txt"
Maybe in reply to: Hallvard B Furuseth: "Regular expressions in Unicode (Was: Ethiopic text)"
Next in thread: Hallvard B Furuseth: "Re: Regular expressions in Unicode (Was: Ethiopic text)"

> Have anybody defined or implemented "Unicode regular expressions" for a
> program which uses Unicode internally? In particular, I wonder about
> character ranges: If the user says "[À-Å]" in his 8-bit charset (not
> latin-1), then the program should use the characters from À to Å in the
> user's charset, not the range of iso10646 character codes from À to Å.
> So it seems that Unicode strings containing regexps must be tagged with
> their "source charset". OTOH, [\200-\377] probably means "all non-ASCII
> characters". And how do you say "all non-ascii Unicode characters"?
> [\200-\3777777777]? :-)

This opens up a whole range of challenges. I would
say that this will depend on the user's locale. For example,
if I am Danish, I would expect all letters from A to A-ring
to match if I say [A-Å], according to the Danish alphabet. In England
I might expect to get all letters A, irrespective of any accents on
them. It would be quite unexpected to match almost all characters
if some user enters [A-Z], when the Z happens to
come from the compatibility zone at the high end of Unicode. This
means you'll have to do some locale-defined normalisation on your
data before pattern matching, comparable with sorting and searching
operations. I wouldn't bother about the original charset; when using
Unicode, the user expects Unicode.
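
To make the [A-Z] surprise and the "all letters A" case concrete, here is a
minimal sketch using Python's standard re and unicodedata modules. The helper
names compile_normalized and strip_accents are hypothetical, chosen only for
this illustration, and NFKC/NFD normalisation is just one plausible reading of
"normalisation" here. It does not attempt locale-specific collation (such as
the Danish ordering, where Å comes after Z), which would need a collation
library.

```python
import re
import unicodedata

FULLWIDTH_Z = "\uFF3A"  # FULLWIDTH LATIN CAPITAL LETTER Z, a compatibility character


def compile_normalized(pattern: str) -> re.Pattern:
    """Hypothetical helper: NFKC-normalize a pattern before compiling it.

    NFKC folds compatibility characters such as fullwidth letters to their
    ordinary forms. It can also rewrite other characters in the pattern,
    so treat this as an illustration only.
    """
    return re.compile(unicodedata.normalize("NFKC", pattern))


def strip_accents(text: str) -> str:
    """Hypothetical helper: decompose (NFD) and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Interpreted by code point, this range runs from U+0041 to U+FF3A and so
# covers most of the Basic Multilingual Plane.
raw = re.compile("[A-" + FULLWIDTH_Z + "]")
assert raw.fullmatch("\u00E9") is not None  # e with acute matches
assert raw.fullmatch("\u1200") is not None  # ETHIOPIC SYLLABLE HA matches too

# After normalizing the pattern, the range is the plain ASCII [A-Z].
norm = compile_normalized("[A-" + FULLWIDTH_Z + "]")
assert norm.pattern == "[A-Z]"
assert norm.fullmatch("\u00E9") is None

# "All letters A, irrespective of accents": normalize the data, then match.
# The sample string is A, A-grave, A-acute, A-circumflex, A-tilde,
# A-diaeresis, A-ring, B.
sample = "A\u00C0\u00C1\u00C2\u00C3\u00C4\u00C5B"
letter_a = re.compile("A")
matches = [ch for ch in sample if letter_a.fullmatch(strip_accents(ch))]
assert len(matches) == 7  # every character except B
```

Normalizing the data rather than rewriting the pattern keeps the user's
pattern readable, at the cost of matching against a transformed copy of the
text.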

Jeroen Hellingman
+---- Jeroen Hellingman ---------------------------------------------------------+
| work: Ericsson Telecommunicatie B.V., Ericssonstraat 2, Rijen, The Netherlands |
| Department ETM/RPU, Room 17116 |
| Tel: +31 161 242022 (834 2022), E-mail: <etmjehe@etm.ericsson.se> |
| home: Aletta Jacobsstraat 5, 3404 XD IJsselstein, The Netherlands |
| Tel: +31 30 6875444, E-mail: <jehe@kabelfoon.nl> |
| Homepage: <http://members.tripod.com/~jhellingman> |
+--------------------------------------------------------------------------------+

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:39 EDT