Re: regular expressions

From: Geoffrey Waigh (gwaigh@teklogix.com)
Date: Thu Jan 30 1997 - 14:44:51 EST


At 10:20 AM 30/01/97 -0800, Tony Harminc wrote:
>> 2. Has anyone given any serious thought to extensions of said Unixoid regular
>> expression syntax to handle non-English alphabets used as "ranges" for pattern
>> matching?
>
>Presumably any such extensions would need to include concepts of sort
>order if they are going to handle ranges. In other words the whole
>Unicode character properties database is not a sufficient resource;
>some sort standard (preferably the nascent ISO 14651) tables are also
>needed. Doesn't make for a small grep (or whatever).

Well, people specifying ranges in the form [a-z] have always been taking a
risk with the collation order. In particular, it probably doesn't mean what
you want on your EBCDIC host, where the lowercase letters are not contiguous.
(If your EBCDIC regex package interprets such ranges to follow the ASCII
convention, what will it do with [0-_] ?)
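
To make the EBCDIC point concrete, here is a rough sketch of what a range
expands to when it is read as "every code between the two endpoints" in a
given encoding. It assumes Python's built-in cp037 codec as a stand-in for
"your EBCDIC host"; the helper names code() and range_members() are made up
for illustration and are not part of any regex package.

```python
def code(ch, encoding):
    """Return the single-byte code of ch in the given encoding."""
    return ch.encode(encoding)[0]


def range_members(lo, hi, encoding):
    """Expand the bracket range lo-hi by raw code value in `encoding`.

    Illustrative helper (an assumption, not a real regex API): it mimics
    an engine that treats a range as all codes between the endpoints.
    """
    start, end = code(lo, encoding), code(hi, encoding)
    if start > end:
        # The endpoints are in the wrong order for this encoding.
        raise ValueError(
            f"[{lo}-{hi}] is reversed in {encoding}: "
            f"0x{start:02X} > 0x{end:02X}"
        )
    return [bytes([b]).decode(encoding) for b in range(start, end + 1)]


if __name__ == "__main__":
    ascii_az = range_members("a", "z", "ascii")
    ebcdic_az = range_members("a", "z", "cp037")
    print(len(ascii_az), len(ebcdic_az))
    # Characters that sneak into [a-z] when the range is read by EBCDIC code.
    print([c for c in ebcdic_az if c not in ascii_az])
    try:
        range_members("0", "_", "cp037")
    except ValueError as exc:
        print(exc)
```

Under these assumptions [a-z] covers 26 codes in ASCII but 41 in cp037,
including '~', and [0-_] is a valid range in ASCII (0x30 to 0x5F) but a
reversed one in cp037, where '0' is 0xF0 and '_' is 0x6D.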

I think that extending POSIX-style character classes such as [[:digit:]] to
cover Unicode concepts would be the first and most useful step. Being able to
specify [a-e]* to match "cha" when your locale is Spanish might be handy for
some people, but it will probably open a can of worms that prevents the
typical user from figuring out how the regular expression automaton is
interacting with the collation mechanism.
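
A small sketch of both ideas, for what it's worth. The first part uses
Python's re module, where \d in a str pattern already behaves like a
Unicode-aware digit class. The second part is a toy: the SPANISH table
(a hand-written traditional alphabet with "ch" and "ll" as single collating
elements) and the helpers collating_elements() and all_in_range() are
assumptions for illustration, not how any real locale-aware matcher is
implemented.

```python
import re

# ARABIC-INDIC DIGITS THREE, FOUR, FIVE: digits, but not in [0-9].
ARABIC_INDIC = "\u0663\u0664\u0665"

# Hypothetical collation table: traditional Spanish alphabet order, where
# "ch" sorts between "c" and "d" and "ll" between "l" and "m".
SPANISH = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l",
    "ll", "m", "n", "\u00f1", "o", "p", "q", "r", "s", "t", "u", "v",
    "w", "x", "y", "z",
]


def collating_elements(text):
    """Split text into collating elements, preferring two-letter ones."""
    elements, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in SPANISH:
            elements.append(pair)
            i += 2
        else:
            elements.append(text[i])
            i += 1
    return elements


def all_in_range(text, lo, hi):
    """True if every collating element of text sorts between lo and hi."""
    first, last = SPANISH.index(lo), SPANISH.index(hi)
    return all(
        e in SPANISH and first <= SPANISH.index(e) <= last
        for e in collating_elements(text)
    )


if __name__ == "__main__":
    # Class-based matching: \d covers Unicode decimal digits, [0-9] does not.
    print(bool(re.fullmatch(r"\d+", ARABIC_INDIC)))
    print(bool(re.fullmatch(r"[0-9]+", ARABIC_INDIC)))
    # Code-point range: "h" is outside a-e, so "cha" does not match.
    print(bool(re.fullmatch(r"[a-e]*", "cha")))
    # Collation-aware range: "cha" is the elements "ch" + "a", both in a..e.
    print(collating_elements("cha"), all_in_range("cha", "a", "e"))
```

The can of worms shows up right away: the same pattern and the same input
give different answers depending on whether the matcher walks characters or
collating elements, and nothing in the pattern itself says which.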

Geoffrey Waigh
gwaigh@teklogix.com


