Re: regular expressions

From: Alain LaBonté
Date: Mon Feb 03 1997 - 15:48:03 EST

At 10:18 97-02-03 -0800, Mark Davis wrote:
>Rick wrote:
>> La Bonte' says...
>> > What about the Icelandic "thorn", which in CAN/CSA standard Z243.4.1
>> > (Ordering) comes with "th" while in the European prenorm on ordering and in
>> > ISO/IEC CD 14651 it comes after "z"?

>> This is precisely the problem. Somehow a balance has to be achieved to make
>> it possible to get this RIGHT, both ways. I have to be able to specify each
>> of:
>> A. the stuff that is a letter of *my* alphabet, in proper order
>> B. the stuff that is in some RANGE of my alphabet
>> C. the stuff that is "universally alphabetic" (at Beckers' arm's length)
>> [...]

>1. I suspect that in most cases, people don't care if it matches a
>letter "outside of *my* alphabet" if it is in the same script. That is,
>if I specify "Latin & UppercaseLetter", I don't care if THORN happens to
>match, because I don't expect it in my target material, which happens to
>be English (at least in Modern English).

I fundamentally disagree with this. Unless you deal only with your
parish (and even there, nowadays, I'm not so sure!), you may have to send
mail to other countries, for example, and then your applications have to
care about characters that are not those of your language.
Furthermore, even Modern English uses "foreign characters" (read TIME
magazine, to which I subscribe -- even if they are clearly francophobe, they
use accented characters and French words with correctly written accents on
every page!). If these characters are entered by a cultured person, you have
to care.

As an anecdote, I just received, two weeks late, the "Alta Vista
Search-my-computer" CD-ROM in an envelope on which my name was distorted, my
employer's name distorted, and my address completely distorted, just because
some filtering was done somewhere... it is a miracle that it was not
returned to the sender (really, they work miracles -- even the postal code
was garbled!)... And yet when this company sends me advertisements
by email, my name is OK! So all is not perfect in the US world...

>2. In each case, what we have is some way of specifying a set of Unicode
>characters; ideally, the syntax would let me form arbitrary set
>manipulations based on those: union, intersection and inversion (since
>Unicode is a closed set, you don't need to be restricted to
>set-difference). To make that match with current syntax, you can just
>use "," as an OR operator, but you would need to add & and ! (AND and
>NOT). Then you could do stuff like
> Latin & UppercaseLetter & !X-Z, CurrencySymbol
>for every latin uppercase letter but X,Y,Z, plus all currency symbols.

To me this range feature is bad practice unless it is redesigned to
include internationalization. But even that might not remove the
misconceptions in programmers' heads; they need fundamental training, for
instance to learn that even in the Latin script, a to z is not everything!
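
For readers who want to see the set syntax from point 2 in concrete form,
here is a minimal sketch in Python. It is not anyone's proposed
implementation: the helper names (char_set, is_latin) are hypothetical,
"Latin" is only approximated by looking at Unicode character names, and the
universe is cut off at an arbitrary code point to keep the example small.

```python
import unicodedata

def char_set(predicate, limit=0x3000):
    """Collect the characters below `limit` that satisfy `predicate`.

    `limit` is an arbitrary cut-off chosen for this sketch only.
    """
    return {chr(cp) for cp in range(limit) if predicate(chr(cp))}

def is_latin(ch):
    # Assumption: approximate the Latin script by the character name,
    # because the standard library exposes no script property.
    return unicodedata.name(ch, "").startswith("LATIN ")

latin = char_set(is_latin)
uppercase = char_set(lambda ch: unicodedata.category(ch) == "Lu")
currency = char_set(lambda ch: unicodedata.category(ch) == "Sc")
x_to_z = {chr(cp) for cp in range(ord("X"), ord("Z") + 1)}

# Latin & UppercaseLetter & !X-Z, CurrencySymbol
# "& !S" on a closed universe is the same as subtracting S; "," is union.
result = ((latin & uppercase) - x_to_z) | currency
```

Note that under this reading THORN and accented capitals such as É are
members of the resulting set, which is exactly the behaviour being debated
in this thread.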

>3. One could have ranges sort, that is that "a-z" means all letters that
>are greater than or equal to "a" and less than or equal to "z" in some
>sorting sequence. I suspect that that is overkill for regular
>expressions, given that they are primarily programmer tools.

If it is a mere programmer tool, it is a programmer tool for
English-speaking programmers only... with extremely parochial needs. And if
it is internal to a machine, why use the alphabet, which is mainly aimed at
humans... Do programmers only work for programmers?

> You would
>need to add some syntax for specifying the sorting order, or else the
>regular expression would do different things in different locales;
>sometimes you want that, but sometimes you don't.

You should always depend on a locale for this, but if this is a
function that deals with the alphabet, it should not even be hard-wired in a
program; the program should be written generically, with the
locale-specific behaviour supplied from the outside.
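
To illustrate ranges defined by a sorting sequence, with the ordering
supplied from outside the matching code, here is a rough sketch in Python.
The two key functions are hypothetical stand-ins, not the actual tables of
CAN/CSA Z243.4.1 or ISO/IEC CD 14651; they only mimic the two thorn
placements mentioned at the top of this thread, and real ordering rules
have several comparison levels that this ignores.

```python
import unicodedata

def base_key(ch):
    """Very rough sort key: fold case and strip combining accents."""
    decomposed = unicodedata.normalize("NFD", ch.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def key_thorn_as_th(ch):
    # Hypothetical tailoring: thorn is ordered with "th".
    return base_key(ch).replace("\u00fe", "th")

def key_thorn_after_z(ch):
    # Hypothetical tailoring: thorn is ordered after "z".
    return base_key(ch).replace("\u00fe", "z\uffff")

def in_range(ch, low, high, key):
    """True if `ch` falls between `low` and `high` under the given key."""
    return key(low) <= key(ch) <= key(high)

thorn = "\u00de"     # LATIN CAPITAL LETTER THORN
e_acute = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

# The same "a-z" range gives different answers for thorn depending on
# which ordering is plugged in from the outside.
thorn_with_th = in_range(thorn, "a", "z", key_thorn_as_th)
thorn_after_z = in_range(thorn, "a", "z", key_thorn_after_z)

# A naive code point range leaves out an accented Latin letter that
# both orderings place between "a" and "z".
e_by_codepoint = "a" <= e_acute <= "z"
e_by_ordering = in_range(e_acute, "a", "z", key_thorn_after_z)
```

The matching function itself never mentions a language or an alphabet;
only the key function passed in changes.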

Alain LaBonté

cc Þorvarður Kári Ólafsson,
   Staðlaráð Íslands (STRÍ),
   Reykjavík, Ísland
