From: Asmus Freytag (firstname.lastname@example.org)
Date: Tue Oct 02 2007 - 13:04:14 CST
On 10/2/2007 10:59 AM, Michael Maxwell wrote:
> I hesitate to jump into this thread, but:
> Asmus Freytag wrote:
>> Depending on how many accented letters a language uses,
>> writing the equivalent expression manually can be both
>> tedious and error-prone.
> Aren't there two issues here that need to be separated:
> (1) the issue of what some regex *means*, e.g. what ^X means, where X is some regex.
> (2) the question of how easy it is to enter X on a computer.
In ASCII/English these are tied up inextricably, so that you can't
always get good guidance on what is the correct (expected) way to extend
these to other sets/scripts/languages.
Does ^[a-k] mean "search for terms with initial a,b,c,d,e,f,g,h,i,j,k"
or does it mean, "search for any term where the initial falls between
'a' and 'k' inclusive"?
As long as you *strictly* match by code points, the former
interpretation is clearly preferred. But the minute you start treating A
WITH RING and A + COMBINING RING ABOVE as equivalent, this becomes less
clear.
And if you throw in the ability to specify collation elements inside the
[ ], then you've left behind the assumption that what you are matching
is strings of character codes and entered the realm where what you are
matching is strings of grapheme clusters, or collation elements.
What I'm trying to point out is that you can define regex notations for
both, but you should probably be consistent and not mix the models.
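To illustrate the difference, here is a minimal sketch in Python, assuming the standard `re` module (which matches strictly by code point) and `unicodedata` for normalization. The function names and the sample word are made up for this illustration; this is not a proposal for how any particular regex engine should behave.

```python
import re
import unicodedata

# Strict code-point pattern: the first code point must lie in U+0061..U+006B.
PATTERN = re.compile(r"^[a-k]")

# The same word in two canonically equivalent spellings (illustrative sample).
precomposed = "\u00e5ngstr\u00f6m"                      # starts with U+00E5 (a with ring)
decomposed = unicodedata.normalize("NFD", precomposed)  # starts with 'a' + U+030A

def matches_by_code_point(text):
    """Model 1: match the raw code points exactly as stored."""
    return PATTERN.match(text) is not None

def matches_after_normalizing(text, form):
    """Model 2 (hypothetical): normalize first, so that equivalent
    spellings always give the same answer, then match code points."""
    return PATTERN.match(unicodedata.normalize(form, text)) is not None

# Raw matching gives different answers for equivalent strings.
print(matches_by_code_point(precomposed))  # False
print(matches_by_code_point(decomposed))   # True

# After normalization the answer is consistent, but which answer you
# get depends on the form chosen, i.e. on what [a-k] is taken to mean.
print(matches_after_normalizing(precomposed, "NFC"))  # False
print(matches_after_normalizing(decomposed, "NFC"))   # False
print(matches_after_normalizing(precomposed, "NFD"))  # True
print(matches_after_normalizing(decomposed, "NFD"))   # True
```

Either normalized model is self-consistent. The trouble only appears when raw code-point matching is combined with the expectation that the two spellings are "the same".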
> I would hate to make the meaning of some regex counter-intuitive just because it's hard to type with today's software.
I don't think I was advocating that.