Re: regular expressions

From: Mark Davis (mark_davis@taligent.com)
Date: Mon Feb 03 1997 - 12:12:08 EST


unicode@Unicode.ORG wrote:
>
> La Bonte' says...
>
> > What about the Icelandic "thorn", which in CAN/CSA standard Z243.4.1
> > (Ordering) comes with "th" while in the European prenorm on ordering and in
> > ISO/IEC CD 14651 it comes after "z"?
>
> This is precisely the problem. Somehow a balance has to be achieved to make
> it possible to get this RIGHT, both ways. I have to be able to specify each
> of:
>
> A. the stuff that is a letter of *my* alphabet, in proper order
> B. the stuff that is in some RANGE of my alphabet
> C. the stuff that is "universally alphabetic" (at Beckers' arm's length)
>
> Each of these has its place in regex searches; sometimes I want the one and
> not t'other. I have to know, in the syntax of the regular expression what the
> request to "give me all the stuff from a to thorn" means; and it has to mean
> the same thing at the time the regular expression is defined, compiled (if
> so), and actually executed.
>
> I presume that Mr. Leisher has this problem licked, too. :-) ??
>
> Rick

1. I suspect that in most cases, people don't care whether a pattern
matches a letter "outside of *my* alphabet" as long as it is in the same
script. That is, if I specify "Latin & UppercaseLetter", I don't care
that THORN happens to match, because I don't expect it in my target
material, which happens to be English (Modern English, at least).

2. In each case, what we have is some way of specifying a set of Unicode
characters; ideally, the syntax would let me form arbitrary set
manipulations based on those: union, intersection and inversion (since
the set of Unicode characters is closed, inversion is well defined, and
you don't need to restrict yourself to set difference). To make that fit
with current syntax, you can just use "," as an OR operator, but you
would need to add & and ! (AND and NOT). Then you could do stuff like

  Latin & UppercaseLetter & !X-Z, CurrencySymbol

for every Latin uppercase letter except X, Y, and Z, plus all currency symbols.
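
As a rough illustration of how such an expression could be evaluated,
here is a small Python sketch. It is not any existing regex syntax: the
helper names (latin, category, code_range, AND, OR, NOT) are made up for
this example, a character set is modelled as a predicate on a single
character, and "Latin" is only approximated by looking at the character
name, because Python's standard unicodedata module has no script lookup.
I am also assuming that "&" binds tighter than ",".

```python
import unicodedata

# A character set is modelled as a predicate: one-character str -> bool.

def latin(ch):
    # Assumption: approximate the Latin script by the character name.
    return unicodedata.name(ch, "").startswith("LATIN ")

def category(cat):
    # Set of all characters with the given General Category (e.g. "Lu").
    return lambda ch: unicodedata.category(ch) == cat

def code_range(lo, hi):
    # Plain code point range, like "X-Z" in today's character classes.
    return lambda ch: lo <= ch <= hi

def AND(*sets):
    return lambda ch: all(s(ch) for s in sets)

def OR(*sets):
    return lambda ch: any(s(ch) for s in sets)

def NOT(s):
    # Inversion is well defined because the universe of characters is fixed.
    return lambda ch: not s(ch)

uppercase_letter = category("Lu")
currency_symbol = category("Sc")

# Latin & UppercaseLetter & !X-Z, CurrencySymbol
matches = OR(
    AND(latin, uppercase_letter, NOT(code_range("X", "Z"))),
    currency_symbol,
)
```

With these definitions, matches("A") and matches("$") are true,
matches("X") is false, and matches("\u00de") (capital THORN) is true,
which is the "same script, not my alphabet" case from point 1.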

3. One could have ranges follow a sort order, that is, "a-z" would mean
all letters that are greater than or equal to "a" and less than or equal
to "z" in some sorting sequence. I suspect that is overkill for regular
expressions, given that they are primarily programmer tools. You would
need to add some syntax for specifying the sorting order, or else the
regular expression would do different things in different locales;
sometimes you want that, but sometimes you don't.
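
To make the locale dependence concrete, here is a toy Python sketch of a
range test that takes the sorting order as an explicit argument. The two
key functions are invented for the example and only handle lowercase
ASCII plus thorn; they loosely mimic the two treatments mentioned in the
quoted message (thorn after "z", and thorn ordered with "th"), and are
not real collation tables.

```python
def key_thorn_after_z(ch):
    # Toy ordering: thorn sorts after "z".
    return "z\uffff" if ch == "\u00fe" else ch

def key_thorn_as_th(ch):
    # Toy ordering: thorn sorts as if it were spelled "th".
    return "th" if ch == "\u00fe" else ch

def in_range(ch, lo, hi, key):
    # True if ch lies between lo and hi (inclusive) under the given order.
    return key(lo) <= key(ch) <= key(hi)

THORN = "\u00fe"

# "a to thorn" contains "z" under one ordering but not the other.
z_after = in_range("z", "a", THORN, key_thorn_after_z)
z_as_th = in_range("z", "a", THORN, key_thorn_as_th)

# "a to z" contains thorn under one ordering but not the other.
thorn_after = in_range(THORN, "a", "z", key_thorn_after_z)
thorn_as_th = in_range(THORN, "a", "z", key_thorn_as_th)
```

The same range expression selects different characters depending on
which key is passed in, so the pattern would have to name its ordering
somehow in order to mean the same thing everywhere.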


