Re: Regular expressions in Unicode (Was: Ethiopic text)

From: Alain LaBonté (alb@sct.gouv.qc.ca)
Date: Thu Mar 12 1998 - 09:25:47 EST


At 02:37 98-03-12 -0800, Hallvard B Furuseth wrote:
>I wrote:
>
>>> In particular, I wonder about
>>> character ranges: If the user says "[À-Å]" in his 8-bit charset (not
>>> latin-1),
>
>Please note that I used two accented characters. Let's use the example
>[Ø-ß] instead, it's easier to see that they are non-ASCII.
>
>I'd be delighted if anyone has made a sensible definition of [7-bit ->
>8-bit] ranges too, but I didn't want to complicate the issue too much.
>Maybe that was silly, though -
>
>Jeroen Hellingman wrote:
>
>> This opens up a whole range of challenges. I would say that this will
>> depend on the user's locale. For example, if I am Danish, I would
>> expect all letters from A to A-ring to match, if I say [A-Å],
>> according to the Danish alphabet. In England I might expect to get all
>> letters A, irrespective of any accents on them.
>
>I disagree there. I would expect a character range like [A-Å] to be the
>characters numbered from char('A') to char('Å') in the charset I am
>using. (For me, latin-1. Which is an uninteresting example because the
>useful character codes are the same as in Unicode).
>
>Or maybe non-ASCII character ranges would simply be forbidden. If so,
>can anything replace them? They exist because they are useful...
>
>I won't expect programs to give all characters the correct collating
>sequence in my language -- if nothing else, because a program often
>can't know which language it is looking at. It only knows the charset.
>Sometimes it can ask the user about the language, but not always.
>
>> It would be quite unexpected to match almost all characters
>> if some user enters [A-Z], when the Z happens to
>> come from the compatibility zone at the high end of Unicode.
>> This means you'll have to do some locale-defined normalisation on your
>> data before pattern matching, comparable with sorting and searching
>> operations.
>
>Agreed.
>
>> I wouldn't bother about the original charset, when using
>> Unicode, the user expects Unicode.
>
>The user may not even *know* about Unicode. If he does and that's "his"
>charset, everything is wonderful. But I was thinking of the situation
>where the *user* is basically using some 8-bit character set and the
>*program* is using Unicode (and translates input from the user's charset
>to Unicode). Then we'll either have to dump regexp character ranges, or
>define some way the program can know when the user means a range of his
>native characters, and when he means a range of Unicode characters, or
>define some equally useful alternative to ranges.
>
>--
>Hallvard

[Alain] :
For general-purpose applications, this kind of practice is a very bad
programming technique. Unfortunately there is so far no firm international
standard convention, only a de facto standard in some programming languages
(quite "international" among computer specialists) that does exactly what
Hallvard expects. In the meantime, the pair of standards projects ISO/IEC
14651 and 14652 (currently under ISO/IEC FCD ballot) establishes a
convention, already used in practice, that eases ellipsis definitions.

14652 describes the form <character symbol 1>...<character symbol 2> to
define a coded-character-dependent ellipsis. (Well, this is what you call a
"regular expression". Thanks for reminding me of this very ambiguous term,
whose meaning I had forgotten. We came across it while revising the ISO
14652 standard last week and did not know exactly what it referred to. It
seems to be a C-language-specific expression, but I'm not a C-language
specialist.)

14652 also defines (it is Keld Simonsen's proposal, fine with me), for the
needs of ISO/IEC 14651 (International String Ordering [and matching]
Standard), two dots instead of three to define a *code-independent*
ellipsis, using the UCS code as the international ellipsis reference,
regardless of the actual coding used...

i.e. in ISO/IEC 14651:

<U00000001>..<UEFFFFFFF> means the range of all characters from 1 to
xEFFFFFFF in UCS order, regardless of the coded character set under the hood.
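
The difference between the two kinds of ellipsis can be sketched in a few
lines of Python. This is only an illustration under assumptions, not the
14651/14652 syntax itself: the function names are made up for this example,
and cp850 is picked arbitrarily as "an 8-bit charset that is not latin-1".
The range is Hallvard's [Ø-ß].

```python
def expand_ucs_range(first, last):
    """Code-independent ellipsis: every character between two UCS code points."""
    return [chr(cp) for cp in range(ord(first), ord(last) + 1)]


def expand_charset_range(first, last, charset):
    """Charset-dependent ellipsis: walk the byte values of an 8-bit charset."""
    lo = first.encode(charset)[0]
    hi = last.encode(charset)[0]
    chars = []
    for byte in range(lo, hi + 1):
        try:
            chars.append(bytes([byte]).decode(charset))
        except UnicodeDecodeError:
            pass  # byte value not assigned in this charset
    return chars


ucs = expand_ucs_range("Ø", "ß")
latin1 = expand_charset_range("Ø", "ß", "latin-1")
cp850 = expand_charset_range("Ø", "ß", "cp850")

print("UCS order:    ", "".join(ucs))
print("latin-1 order:", "".join(latin1))
print("cp850 order:   %d characters" % len(cp850))
```

With latin-1 the two expansions coincide, because latin-1 shares its code
positions with the first 256 UCS positions. With cp850 the same two end
points enclose a much larger and quite different set of characters, which is
exactly the ambiguity under discussion.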

I agree that this should *not* be locale-dependent, though, and that it
should only be used in defining tables that have no
natural-language-specific dependency. We should not spread the bad
programming technique merely because it is what programmers are after. Used
in general-purpose applications, it is absolutely locale-dependent, and
using it to find matching strings makes a very parochial program, not
localizable at all, not only between countries but also between different
platforms in the same country. It also guarantees a mess for end-users,
greatly affecting their productivity in real life. When they fail to find
something, it is not because it is missing from the data base they are
searching, but they will conclude that it is, and this will raise the
adrenaline level in the blood of their customers... Imagine a bank robber
who has to wait for the cash just because they cannot retrieve something
from a data base with a regular expression (: Poor thief! (;
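
A small Python sketch of that failure mode, with made-up sample data: a
code-position range such as [A-Za-z] silently skips entries that contain
accented letters. The accent-stripping helper `fold` is a hypothetical and
deliberately crude stand-in for proper locale-aware matching. It is not what
ISO/IEC 14651 specifies, and it would be wrong for languages such as Danish,
where Å is a letter in its own right.

```python
import re
import unicodedata

# Hypothetical data base entries.
names = ["Labonte", "LaBonté", "Québec", "Quebec"]

# Naive search: a range defined purely by code positions.
naive = [n for n in names if re.fullmatch(r"[A-Za-z]+", n)]


def fold(text):
    """Crude normalisation: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Same pattern, applied after normalising the data.
folded = [n for n in names if re.fullmatch(r"[A-Za-z]+", fold(n))]

print("naive range match:  ", naive)
print("after normalisation:", folded)
```

The naive search finds only the unaccented spellings, so a user would wrongly
conclude that the accented entries are not in the data base.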

Alain LaBonté
Québec



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:39 EDT