Re: regular expressions

From: Mark Davis (
Date: Mon Feb 03 1997 - 16:07:15 EST

Alain LaBont/e'/ wrote:
> At 10:18 97-02-03 -0800, Mark Davis wrote:
> >Rick wrote:
> >>
> >> La Bonte' says...
> >>
> >> > What about the Icelandic "thorn", which in CAN/CSA standard Z243.4.1
> >> > (Ordering) comes with "th" while in the European prenorm on ordering and in
> >> > ISO/IEC CD 14651 it comes after "z"?
> >>
> [Rick]:
> >> This is precisely the problem. Somehow a balance has to be achieved to make
> >> it possible to get this RIGHT, both ways. I have to be able to specify each
> >> of:
> >>
> >> A. the stuff that is a letter of *my* alphabet, in proper order
> >> B. the stuff that is in some RANGE of my alphabet
> >> C. the stuff that is "universally alphabetic" (at Beckers' arm's length)
> >>
> >> [...]
> [Mark]:
> >1. I suspect that in most cases, people don't care if it matches a
> >letter "outside of *my* alphabet" if it is in the same script. That is,
> >if I specify "Latin & UppercaseLetter", I don't care if THORN happens to
> >match, because I don't expect it in my target material, which happens to
> >be English (at least in Modern English).
> [Alain]:
> I happen to fundamentally disagree with this. Unless you only deal with your
> parish (and even there, nowadays, I'm not even sure!), you may have to send
> mail in other countries, for example, and then you have to care in
> applications about characters that are not those of your language.
> Furthermore even Modern English uses "foreign characters" (read the TIME
> magazine, to which I subscribe -- even if they are clearly francophobe, they
> use accented characters and French words with correctly written accents on
> every page!). If these characters are entered by a cultured person, you have
> to care.
> As an anecdote, I just received with 2 weeks of delay the CD-ROM on "Alta
> Vista Search-my-computer" written on the envelope with my name distorted, my
> employer's name distorted, my address completely distorted, just because
> some filtering was made somewhere... it is a miracle that it was not
> returned to the sender (really, they make miracles -- even the postal code
> was garbled!)... And nevertheless when this company writes me advertisement
> by email, my name is OK! So all is not perfect in the US world...

I realize as you do that filtering unknown characters is a problem.
However, I think you are missing my point. In regular expressions, you
are producing a pattern that will match certain characters. Rather than
list them all, there is a shorthand that people use, which is to list
ranges of code points. My point is:
For this *particular* application, usually when people list "a-z", they
really mean "Latin Letters", or often, just "Letters". The latter is
actually usually BETTER for the problems that you list than restricting
it to a particular range for a particular language.
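
The difference between "a-z" and "Letters" can be sketched in Python. This is only an illustration, not anything proposed in the thread: the helper names are hypothetical, "letter" is taken to mean any Unicode general category starting with "L", and the "Latin" test relies on the character name beginning with "LATIN ", which is a heuristic rather than a real script property.

```python
import re
import unicodedata
from itertools import groupby


def ascii_runs(text):
    """Runs matched by the literal code point ranges A-Z and a-z."""
    return re.findall(r"[A-Za-z]+", text)


def is_letter(ch):
    """True for any character whose general category is a letter (L*)."""
    return unicodedata.category(ch).startswith("L")


def is_latin_letter(ch):
    """Heuristic: a letter whose Unicode name starts with 'LATIN '."""
    return is_letter(ch) and unicodedata.name(ch, "").startswith("LATIN ")


def letter_runs(text, predicate=is_letter):
    """Maximal runs of characters accepted by the predicate."""
    # Normalize to NFC so an accented letter is one code point.
    text = unicodedata.normalize("NFC", text)
    return ["".join(group) for ok, group in groupby(text, key=predicate) if ok]


print(ascii_runs("LaBonté, Québec"))   # the range splits words at the accent
print(letter_runs("LaBonté, Québec"))  # the category keeps words whole
```

With the range, "Québec" comes back as two fragments; with the letter category it stays one word, which is the sense in which "Letters" is the safer shorthand.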

Even better would be to look at common practice and separate out *more*
higher-level divisions, such as "Vowel", which often arise in regular
expressions.

> [Mark]:
> >2. In each case, what we have is some way of specifying a set of Unicode
> >characters; ideally, the syntax would let me form arbitrary set
> >manipulations based on those: union, intersection and inversion (since
> >Unicode is a closed set, you don't need to be restricted to
> >set-difference). To make that match with current syntax, you can just
> >use "," as an OR operator, but you would need to add & and ! (AND and
> >NOT). Then you could do stuff like
> >
> > Latin & UppercaseLetter & !X-Z, CurrencySymbol
> >
> >for every latin uppercase letter but X,Y,Z, plus all currency symbols.
> [Alain]:
> To me this range feature is a mispractice, unless it is redesigned to
> include internationalization. But even so that might not remove
> misconceptions in programmers' heads, they need fundamental courses, like
> to learn that even with the Latin script, a to z is not all!
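
The set expression quoted in point 2 can be modelled with predicates, which makes inversion trivial because a predicate is defined over every character. This is a sketch under assumptions: the combinator names AND, OR and NOT are hypothetical, "," is read as an OR that binds more loosely than "&", and "Latin" is again approximated by the Unicode character name.

```python
import unicodedata


def latin(ch):
    # Heuristic stand-in for a script property.
    return unicodedata.name(ch, "").startswith("LATIN ")


def uppercase_letter(ch):
    return unicodedata.category(ch) == "Lu"


def currency_symbol(ch):
    return unicodedata.category(ch) == "Sc"


def code_range(lo, hi):
    """Plain code point range, as in today's 'X-Z' syntax."""
    return lambda ch: lo <= ch <= hi


def AND(*preds):
    return lambda ch: all(p(ch) for p in preds)


def OR(*preds):
    return lambda ch: any(p(ch) for p in preds)


def NOT(pred):
    return lambda ch: not pred(ch)


# Latin & UppercaseLetter & !X-Z, CurrencySymbol
matcher = OR(
    AND(latin, uppercase_letter, NOT(code_range("X", "Z"))),
    currency_symbol,
)

for ch in "AXÞ$aΩ":
    print(repr(ch), matcher(ch))
```

Capital thorn is accepted here because it is a Latin uppercase letter, while X is excluded and the dollar sign comes in through the currency branch.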

However, there are times when software recognizes only certain
letters, and has to be able to do so. A C compiler, unlike Java, doesn't
allow accented letters in identifiers. If you have to mimic that
behavior, then you want to use a precise description of the characters.
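
A precise description of that kind might look like the following sketch. It assumes the ASCII-only identifier rule described above (a letter or underscore followed by letters, digits or underscores); the function name is hypothetical, and reserved words are not checked.

```python
import re

# Explicit ranges on purpose: the point is to mimic a tool that accepts
# exactly these characters and nothing else.
ASCII_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_ascii_identifier(name):
    """True if the whole string is an ASCII-only identifier."""
    return ASCII_IDENTIFIER.fullmatch(name) is not None


print(is_ascii_identifier("count_1"))  # accepted
print(is_ascii_identifier("café"))     # rejected: accented letter
print("café".isidentifier())           # Python's own rule accepts it
```

Here a broad "Letters" class would be wrong, because it would accept names that the tool being mimicked rejects.
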
> [Mark]:
> >3. One could have ranges follow a sort order, that is, "a-z" means all
> >letters that are greater than or equal to "a" and less than or equal to
> >"z" in some
> >sorting sequence. I suspect that that is overkill for regular
> >expressions, given that they are primarily programmer tools.
> [Alain]:
> If it is a mere programmer tool, it is a programmer tool for
> English-speaking programmers only... with extremely parochial needs. And if
> it is internal to a machine, why use the alphabet, which is mainly aimed at
> humans... Do programmers only work for programmers?

Not exactly. See above.
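
To make the sort-order reading of ranges in point 3 concrete, here is a toy sketch. The two key functions are hypothetical stand-ins for the two placements of thorn mentioned at the top of the thread (with "th", or after "z"); they are not the actual CAN/CSA Z243.4.1 or ISO/IEC 14651 tables, and accents are simply dropped by taking the first code point of the decomposed form.

```python
import unicodedata

THORN = "\u00fe"  # LATIN SMALL LETTER THORN


def base(ch):
    """Lower-case the character and strip any combining accents."""
    return unicodedata.normalize("NFD", ch.lower())[0]


def key_thorn_with_th(ch):
    """Toy ordering: thorn sorts as if it were spelled 'th'."""
    b = base(ch)
    return "th" if b == THORN else b


def key_thorn_after_z(ch):
    """Toy ordering: thorn sorts after every other letter."""
    b = base(ch)
    return (1, b) if b == THORN else (0, b)


def in_range(ch, lo, hi, key):
    """Is ch between lo and hi under the given sort key?"""
    return key(lo) <= key(ch) <= key(hi)


print(in_range(THORN, "a", "z", key_thorn_with_th))  # inside "a-z"
print(in_range(THORN, "a", "z", key_thorn_after_z))  # outside "a-z"
```

The same pattern text gives two different answers depending on the ordering in force, which is the locale dependence discussed next.
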
> [Mark]:
> > You would
> >need to add some syntax for specifying the sorting order, or else the
> >regular expression would do different things in different locales;
> >sometimes you want that, but sometimes you don't.
> [Alain]:
> You should always be dependent on a locale for this but if this is a
> function that deals with alphabet it should not even be hard-wired in a
> program, it should be functionally programmed with the functionality
> localized from the outside.

Precisely. And since the pattern string (where not coming from input
from a power-user) should come from a localized resource, then it *can*
have explicit character ranges; it is up to the localizer to specify the
appropriate list.
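
As a rough sketch of that arrangement, the letter list can live in a per-locale resource and the pattern can be built from it at run time. Everything named here is an assumption for illustration: the LETTER_RESOURCES table, its locale keys and the word_pattern helper are hypothetical, and the Icelandic entry is an example list, not a vetted localization.

```python
import re
import unicodedata

# Hypothetical localized resource: each localizer supplies the letters
# that count as "alphabetic" for that locale.
LETTER_RESOURCES = {
    "en": "abcdefghijklmnopqrstuvwxyz",
    "is": "aábdðeéfghiíjklmnoóprstuúvxyýþæö",
}


def word_pattern(locale_id):
    """Build a word-matching regex from the locale's letter list."""
    letters = unicodedata.normalize("NFC", LETTER_RESOURCES[locale_id])
    char_class = "[" + re.escape(letters) + "]"
    return re.compile(char_class + "+", re.IGNORECASE)


text = "Þorvarður í Reykjavík"
print(word_pattern("is").findall(text))  # whole words
print(word_pattern("en").findall(text))  # fragments around þ, ð and í
```

The program itself contains no alphabet; swapping the resource changes what the pattern accepts.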

> Alain LaBonté
> Québec
> cc Þorvarður Kári Ólafsson,
> Staðlaráð Íslands (STRÍ),
> Reykjavík, Ísland
> CEN/TC304
