Ken Whistler writes:
> And one of the fundamental issues addressed squarely by
> the Single UNIX Specification for Regular Expressions,
> in the document pointed to by Xiao-He, is the dependence
> of regexp range expressions on locale-specific collations:
> "Range expressions must not be used in portable applications
> because their behaviour is dependent on the collating
> sequence. Ranges will be treated according to the current
> collating sequence, . . . This, however, means that the
> interpretation will differ depending on collating sequence."

I think this is poorly worded. Applications that are
depending on a single interpretation of ranges should
not use range expressions because they are indeed
locale-specific. But an application can use internationalized
range expressions and be perfectly portable as long as it
doesn't depend on one single behavior for those expressions.

. . .

> Extending regular expression syntax to the universal
> character set opens up an enormous can of worms. In
> my opinion, no amount of jiggering the locale model and
> the regexp syntax is really going to "solve" this problem.

. . .

> Keep in mind that end users don't use regular expressions
> (unless forced to by user-vicious UI's) -- it is UNIX
> programmers who use regular expressions.

I disagree. Programmers use regular expressions because
that is the mechanism they have had to give users "logical"
behavior. Just as users want to see lists in an order that
makes sense to them, they often want to grab subsets of those
lists -- and what a subset includes differs depending on
the language they speak. Regular expressions have been an
extremely common way to give users the varying subsets they want.
That said, regular expressions obviously were designed with
small, related character repertoires in mind.

> There is then
> an enormous house of cards of programs and tools built
> up on the basis of regexp pattern matching. The
> foundations of that house of cards are rotten, however,
> and the house will not stand when 38,000 characters
> try to move in.

Yes, the regexp stuff was not designed for large repertoires,
and trying to expand it to that uncovers some "rotten foundations."

> In my opinion, people should be thinking more generically
> about how to extend and abstract the concepts of
> string pattern matching in the context of the universal
> character set, rather than focussing on how to "fix"
> regexp syntax per se for Unicode.

That's reasonable. Some concepts don't expand infinitely
well. However, whatever replaces regexp still has to deal
with users' varying expectations of what a given range
includes. Users definitely should not have to be aware of
how characters are encoded or whether they're using a large
or small coded character set; their ranges should "just work."
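
To make the collation dependence of ranges concrete, here is a rough
sketch in Python. It does not use any real locale data: codepoint_key
and dictionary_key are made-up collation orders invented for this
illustration, and in_range and range_members are hypothetical helpers,
not part of any regexp library. The only point is that the same range
"a-z" selects different characters under different collating sequences.

```python
import string

# Both key functions are illustrative assumptions, not real locale data.

def codepoint_key(ch):
    """Collate by raw code point (roughly what the C locale gives for ASCII)."""
    return (ord(ch),)

def dictionary_key(ch):
    """Hypothetical dictionary-style order: a A b B c C ... z Z."""
    return (ch.lower(), ch.isupper())

def in_range(ch, low, high, key):
    """True if ch collates between low and high (inclusive) under key."""
    return key(low) <= key(ch) <= key(high)

def range_members(low, high, key, repertoire=string.ascii_letters):
    """Characters of repertoire that fall inside the range low-high."""
    return "".join(ch for ch in repertoire if in_range(ch, low, high, key))

if __name__ == "__main__":
    # Same range expression, two collating sequences, two different subsets.
    print(range_members("a", "z", codepoint_key))
    print(range_members("a", "z", dictionary_key))
```

Under the code-point order the range holds only the lowercase letters;
under the dictionary-style order it also picks up "A" through "Y" but
not "Z". An application that needs exactly one of those answers cannot
rely on the range, while one that simply wants "whatever a-z means to
this user" can.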
Sandra Martin O'Donnell