Re: Regular expressions in Unicode (Was: Ethiopic text)

Date: Thu Mar 12 1998 - 15:52:05 EST

Ken Whistler writes:
   And one of the fundamental issues addressed squarely by
   the Single UNIX Specification for Regular Expressions,
   in the document pointed to by Xiao-He, is the dependence
   of regexp range expressions on locale-specific collations:
   "Range expressions must not be used in portable applications
   because their behaviour is dependent on the collating
   sequence. Ranges will be treated according to the current
   collating sequence, . . . This, however, means that the
   interpretation will differ depending on collating sequence."

I think this is poorly worded. Applications that depend
on a single interpretation of ranges should not use range
expressions, because ranges are indeed locale-specific.
But an application can use internationalized range
expressions and be perfectly portable as long as it
doesn't depend on one single behavior for those expressions.
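
To make the distinction concrete, here is a minimal sketch in Python.
The two collating sequences are invented for illustration and are not
taken from any real locale definition, and in_range is a hypothetical
helper, not part of any regexp library. It only shows how the same
range, a-z, can select different characters under different orders.

```python
import string

# Hypothetical collating sequences, invented for illustration only.
# Order 1: every uppercase letter sorts before every lowercase letter.
UPPER_THEN_LOWER = string.ascii_uppercase + string.ascii_lowercase
# Order 2: each lowercase letter is immediately followed by its uppercase form.
INTERLEAVED = "".join(
    lo + up for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)
)


def in_range(ch, low, high, sequence):
    """Return True if ch collates between low and high (inclusive)."""
    positions = {c: i for i, c in enumerate(sequence)}
    if ch not in positions:
        return False
    return positions[low] <= positions[ch] <= positions[high]


if __name__ == "__main__":
    for name, seq in (("upper-then-lower", UPPER_THEN_LOWER),
                      ("interleaved", INTERLEAVED)):
        matched = "".join(c for c in seq if in_range(c, "a", "z", seq))
        print(name, "->", matched)
```

Under the first order the range a-z covers only the 26 lowercase
letters; under the second it also covers A through Y, but not Z. An
application that assumes exactly one of those results is the one that
is not portable.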

   . . .
   Extending regular expression syntax to the universal
   character set opens up an enormous can of worms. In
   my opinion, no amount of jiggering the locale model and
   the regexp syntax is really going to "solve" this problem.
   . . .
   Keep in mind that end users don't use regular expressions
   (unless forced to by user-vicious UI's) -- it is UNIX
   programmers who use regular expressions.

I disagree. Programmers use regular expressions because
that is the mechanism they have had available for giving users
"logical" behavior. Just as users want to see lists in an order that
makes sense to them, they often want to grab subsets of those
lists -- and what a subset includes differs depending on
the language they speak. Regular expressions have been an
extremely common way to give users the varying subsets they want.

That said, regular expressions obviously were designed with
small, related character repertoires in mind.

   There is then
   an enormous house of cards of programs and tools built
   up on the basis of regexp pattern matching. The
   foundations of that house of cards are rotten, however,
   and the house will not stand when 38,000 characters
   try to move in.

Yes, the regexp stuff was not designed for large repertoires,
and trying to expand it to that uncovers some "rotten foundations."

   In my opinion, people should be thinking more generically
   about how to extend and abstract the concepts of
   string pattern matching in the context of the universal
   character set, rather than focussing on how to "fix"
   regexp syntax per se for Unicode.

That's reasonable. Some concepts don't expand infinitely
well. However, whatever replaces regexp still has to deal
with users' varying expectations of what a given range
includes. Users definitely should not have to be aware of
how characters are encoded or whether they're using a large
or small coded character set; their ranges should "just work."

Sandra Martin O'Donnell
