Re: regular expressions

From: Rick McGowan (Rick_McGowan@next.com)
Date: Thu Jan 30 1997 - 14:34:43 EST


Hmm... I guess this has provoked a bit of discussion... Good... I might
eventually get what I'm looking for...!

Tony Harminc brought up a good topic...

>> 2. Has anyone given any serious thought to extensions of said Unixoid
>> regular expression syntax to handle non-English alphabets used as
>> "ranges" for pattern matching?
>
> Presumably any such extensions would need to include concepts of sort order
> if they are going to handle ranges. In other words the whole Unicode
> character properties database is not a sufficient resource;

I think sort order a la 14651 might be overkill here. What I want in a regex
language/syntax is not so much ranges per se as ranges relative to some
reference. (The default regex used by "ed", "grep" and other tools implicitly
has an "alphabet" of the ASCII range, in that order.) I read the spirit of
"A-Za-z" as "the alphabet, upper and lower case". The extension I'd like is
to be able to specify "the alphabet" I'm concerned with, in some specified
order, and then use the regex shorthand to point out ranges within it. So my
"alphabet" would need to be defined somewhere (maybe in the environment),
along with its start and end points.
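
To make the idea concrete, here is a minimal sketch in Python of ranges
resolved against a user-supplied alphabet rather than against code order.
The names expand_ranges, class_pattern and RUSSIAN are made up for
illustration, and the Russian ordering (with "ё" placed after "е") is just
one plausible choice of "alphabet"; none of this is an existing tool or
syntax.

```python
import re

# Assumed alphabet for illustration: Russian lowercase, with "ё" after "е".
RUSSIAN = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"


def expand_ranges(char_class: str, alphabet: str) -> str:
    """Expand 'x-y' ranges in char_class using the order given by alphabet."""
    position = {ch: i for i, ch in enumerate(alphabet)}
    members = []
    i = 0
    while i < len(char_class):
        ch = char_class[i]
        # A range needs a start, a hyphen, and an end character.
        if i + 2 < len(char_class) and char_class[i + 1] == "-":
            start, end = ch, char_class[i + 2]
            if start not in position or end not in position:
                raise ValueError(f"range {start}-{end} is outside the alphabet")
            if position[start] > position[end]:
                raise ValueError(f"range {start}-{end} is reversed")
            members.extend(alphabet[position[start]:position[end] + 1])
            i += 3
        else:
            # Anything else is taken as a literal member of the class.
            members.append(ch)
            i += 1
    return "".join(members)


def class_pattern(char_class: str, alphabet: str) -> str:
    """Build an ordinary bracket expression listing every expanded member."""
    members = expand_ranges(char_class, alphabet)
    return "[" + "".join(re.escape(c) for c in members) + "]"
```

With this, class_pattern("д-ж", RUSSIAN) yields a class that includes "ё",
whereas the plain code-order range "[д-ж]" does not, because "ё" sits
outside the а-я block in Unicode.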

Sort order for "the alphabet of current concern" is an issue; overall
non-binary collation of all Unicode isn't. Grep and similar tools are
typically applied to program source, regularized data files, or readable
text. The patterns we search for are typically linguistically or
programmatically meaningful, with only a (relatively speaking) small range
of variability in their character repertoire.

It might make sense to provide syntax to refer to external tables (by name or
some other method)... but we lose one of the really nice things about the
Unixoid regular expression system: compactness.
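
As a rough illustration of that trade-off, here is a sketch in Python of a
preprocessing step that swaps named-table references for ordinary bracket
expressions. The "{@name}" reference syntax, the TABLES dictionary and the
function resolve_tables are all invented for this example; in a real tool
the tables would presumably live in external files or the environment.

```python
import re

# Hypothetical named tables, inlined here to keep the sketch self-contained.
TABLES = {
    "digits": "0123456789",
    "greek-lower": "αβγδεζηθικλμνξοπρστυφχψω",
}

# Hypothetical reference syntax: {@table-name}
_TABLE_REF = re.compile(r"\{@([A-Za-z0-9_-]+)\}")


def resolve_tables(pattern: str, tables: dict) -> str:
    """Replace each {@name} with a bracket expression listing that table."""

    def replace(match):
        name = match.group(1)
        if name not in tables:
            raise KeyError(f"unknown table: {name}")
        return "[" + "".join(re.escape(c) for c in tables[name]) + "]"

    return _TABLE_REF.sub(replace, pattern)
```

A pattern such as "{@digits}+ {@greek-lower}" then resolves to something
the ordinary matcher can run, but the reference is already longer than the
"0-9" or "a-z" shorthand it stands in for, which is the loss of compactness
I mean.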

Also, I do realize that 'alphabet' is a bit of a misnomer here, and I know
that the idea doesn't necessarily scale well "as is" to all non-Latin
scripts or to Han characters, and so forth. I'm looking for something
comparable to the Unix Regular Expression syntax, beefed up to handle a
similar problem space within the Unicode data realm.

While I've had replies about proprietary or "not released" technology, it
sounds like Mark Leisher at UNM may have a solution, and its free
availability would be an advantage.

I do hope that the Unix people, whoever they are, are listening and we don't
end up with too many different syntaxes for doing basically the same thing.

        Rick


