Re: Regular expressions in Unicode (Was: Ethiopic text)

From: x.zhang@utoronto.ca
Date: Wed Mar 11 1998 - 23:51:48 EST


Hello Alain

You may be interested to see the definition for regular expressions
in XPG4 (and now single UNIX spec). Please see URL
http://www.opengroup.org/onlinepubs/7908799/xbd/re.html
The "range" is also a defined term, and as you correctly said, it's
locale dependent. This should perhaps have answered the original question,
given the context. I think POSIX has almost the same definition.
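
To illustrate what "locale dependent" means for a range, here is a small
sketch in Python. It is only an illustration under stated assumptions:
DANISH_UPPER is a hand-written stand-in for a real locale's collation
table, and both function names are invented for this example. It compares
a range read as code point values with the same range read as positions
in a collation sequence.

```python
# Hypothetical stand-in for a locale's collation table: the upper-case
# Danish alphabet, in which Æ, Ø and Å sort after Z.
DANISH_UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅ"


def in_codepoint_range(ch, lo, hi):
    """Range membership decided by raw code point value."""
    return ord(lo) <= ord(ch) <= ord(hi)


def in_collation_range(ch, lo, hi, alphabet=DANISH_UPPER):
    """Range membership decided by position in a collation sequence."""
    if ch not in alphabet:
        return False
    return alphabet.index(lo) <= alphabet.index(ch) <= alphabet.index(hi)


# Compare both readings of the range [A-Å] for a few sample characters.
for ch in "MØa":
    print(ch, in_codepoint_range(ch, "A", "Å"), in_collation_range(ch, "A", "Å"))
```

With this table, "Ø" falls inside [A-Å] by collation but outside it by
code point, while lower-case "a" does the opposite.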

Second, your approach to limiting the use of ".." is really a sensible
one. But, the ".." notation is different from the "-" notation, right?
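
To make the difference concrete, here is a rough sketch in Python. As
described in the quoted message below, the two-dot form takes its order
from UCS, whatever the user's charset. The first function instead walks
the byte codes of an 8-bit charset, which is the charset-dependent reading
discussed in the quoted thread. The function names are made up for
illustration, and cp850 is just one example of an 8-bit charset that is
not Latin-1.

```python
def byte_range_chars(lo, hi, charset):
    """Characters whose single-byte code in `charset` lies between the
    codes of `lo` and `hi` (a charset-dependent range)."""
    lo_byte = lo.encode(charset)[0]
    hi_byte = hi.encode(charset)[0]
    chars = []
    for b in range(lo_byte, hi_byte + 1):
        try:
            chars.append(bytes([b]).decode(charset))
        except UnicodeDecodeError:
            pass  # byte value not assigned in this charset
    return "".join(chars)


def ucs_range_chars(lo, hi):
    """Characters between `lo` and `hi` in UCS code order
    (a code-independent range)."""
    return "".join(chr(cp) for cp in range(ord(lo), ord(hi) + 1))


latin1_range = byte_range_chars("Ø", "ß", "latin-1")
cp850_range = byte_range_chars("Ø", "ß", "cp850")
ucs_range = ucs_range_chars("Ø", "ß")

print(latin1_range == ucs_range)  # Latin-1 codes coincide with UCS here
print(len(ucs_range), len(cp850_range))
```

In Latin-1 the byte range and the UCS range for Ø to ß are the same eight
characters. In cp850 the same two endpoints enclose a much larger and
quite different set.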

And finally, in general, I don't think using ranges in a regular
expression is that bad. I'd say it's a "good thing". Now, being restricted
to a M$ platform, I feel quite handicapped at being unable to search
easily for things that have digits 1 to 3 in them, or words starting with
a capital letter from A to M, for example. As an i18n example, I'd feel much
happier if I could search for Chinese characters containing a range of
radicals. One reason I think your ISO project is so useful is that
it can provide a sorting order (of single characters) to regular
expressions, no?
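
For what it's worth, the two ASCII searches mentioned above might look
like this with Python's re module. This is only a sketch: the sample text
is invented, and the ranges are read as plain code point ranges, not as
locale-aware ones.

```python
import re

# Invented sample text for illustration.
text = "Room 12 belongs to Alice, room 47 to Nora, and room 3 to Max."

# Tokens that contain at least one of the digits 1, 2 or 3.
with_low_digits = re.findall(r"\b\w*[1-3]\w*\b", text)

# Words starting with a capital letter from A to M.
a_to_m_words = re.findall(r"\b[A-M]\w*", text)

print(with_low_digits)
print(a_to_m_words)
```

Here the first search picks out "12" and "3", and the second picks out
"Alice" and "Max".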

It's certainly a programming issue as to how and when to use regular
expressions. When presenting to an end-user, the programmer should have
defined everything for the user except for the only choice that's open.
The rest of the choices we call "options", right? ;-)

Regards,

Xiao-He

On Thu, 12 Mar 1998, Alain LaBonté wrote:

> At 02:37 98-03-12 -0800, Hallvard B Furuseth wrote:
> >I wrote:
> >
> >>> In particular, I wonder about
> >>> character ranges: If the user says "[À-Å]" in his 8-bit charset (not
> >>> latin-1),
> >
> >Please note that I used two accented characters. Let's use the example
> >[Ø-ß] instead, it's easier to see that they are non-ASCII.
> >
> >I'd be delighted if anyone has made a sensible definition of [7-bit ->
> >8-bit] ranges too, but I didn't want to complicate the issue too much.
> >Maybe that was silly, though -
> >
> >Jeroen Hellingman wrote:
> >
> >> This opens up a whole range of challenges. I would say that this will
> >> depend on the user's locale. For example, if I am Danish, I would
> >> expect all letters from A to A-ring to match, if I say [A-Å],
> >> according to the Danish alphabet. In England I might expect to get all
> >> letters A, irrespective of any accents on them.
> >
> >I disagree there. I would expect a character range like [A-Å] to be the
> >characters numbered from char('A') to char('Å') in the charset I am
> >using. (For me, latin-1. Which is an uninteresting example because the
> >useful character codes are the same as in Unicode).
> >
> >Or maybe non-ASCII character ranges would simply be forbidden. If so,
> >can anything replace them? They exist because they are useful...
> >
> >I won't expect programs to give all characters the correct collating
> >sequence in my language -- if nothing else, because a program often
> >can't know which language it is looking at. It only knows the charset.
> >Sometimes it can ask the user about the language, but not always.
> >
> >> It would be quite unexpected to match almost all characters
> >> if some user enters [A-Z], when the Z happens to
> >> come from the compatibility zone at the high end of Unicode.
> >> This means you'll have to do some locale-defined normalisation on your
> >> data before pattern matching, comparable with sorting and searching
> >> operations.
> >
> >Agreed.
> >
> >> I wouldn't bother about the original charset; when using
> >> Unicode, the user expects Unicode.
> >
> >The user may not even *know* about Unicode. If he does and that's "his"
> >charset, everything is wonderful. But I was thinking of the situation
> >where the *user* is basically using some 8-bit character set and the
> >*program* is using Unicode (and translates input from the user's charset
> >to Unicode). Then we'll either have to dump regexp character ranges, or
> >define some way the program can know when the user means a range of his
> >native characters, and when he means a range of Unicode characters, or
> >define some equally useful alternative to ranges.
> >
> >--
> >Hallvard
>
>
> [Alain] :
> Although this kind of practice is, for general-purpose applications,
> a very bad programming technique, and as long as there is no
> firm international standard convention (unfortunately there is a de facto
> standard [quite "international" among computer specialists] in some
> programming languages to that effect, exactly what Hallvard expects), there
> is, in the pair of standards projects ISO/IEC 14651 and 14652
> (currently under ISO/IEC FCD ballot), a convention established in
> practice to ease ellipsis definitions.
>
> 14652 describes the form <character symbol 1>...<character symbol 2> to
> define a coded-character-dependent ellipsis (well, what you call a "regular
> expression"; thanks for reminding me of this very ambiguous term, whose
> meaning I had forgotten. We just saw it while revising the ISO 14652
> standard last week and we did not know exactly what it referred to. It
> seems to be a C-language-specific expression, but I'm not a C-language
> specialist)...
>
> 14652 also defines (it is Keld Simonsen's proposal, fine with me) for the
> needs of ISO/IEC 14651 (International String Ordering [and matching]
> Standard) two dots instead of three, to define a *code-independent*
> ellipsis, using the UCS code as the international ellipsis reference,
> regardless of the actual coding used...
>
> i.e. in ISO/IEC 14651:
>
> <U00000001>..<UEFFFFFFF> means the range of all characters from 1 to
> xEFFFFFFF in UCS order, regardless of the coded character set under the hood.
>
> I agree that this should *not* be locale-dependent though and that it
> should only be used in defining tables that have no
> natural-language-specific dependency. We should not spread the bad
> programming technique, precisely because what it aims at in programming
> is absolutely locale-dependent if used in general-purpose
> applications, and using this to find matching strings makes a very
> parochial program, not localizable at all, not only between countries, but
> also between different platforms in the same country. And it guarantees a
> mess for end-users, greatly affecting their productivity in real life.
> That they do not find something does not mean it is not in the data
> base they are searching, but they will conclude so, and this will raise
> the adrenaline level in the blood of their customers... Imagine a
> bank robber who has to wait for the cash just because they cannot retrieve
> something in a data base with a regular expression (: Poor thief! (;
>
> Alain LaBonté
> Québec
>


