Re: Regular expressions in Unicode (Was: Ethiopic text)

From: Hallvard B Furuseth (
Date: Thu Mar 12 1998 - 05:47:32 EST

I wrote:

>> In particular, I wonder about
>> character ranges: If the user says "[-]" in his 8-bit charset (not
>> latin-1),

Please note that I used two accented characters. Let's use the example
[-] instead, it's easier to see that they are non-ASCII.

I'd be delighted if anyone has made a sensible definition of [7-bit ->
8-bit] ranges too, but I didn't want to complicate the issue too much.
Maybe that was silly, though -

Jeroen Hellingman wrote:

> This opens up a whole range of challenges. I would say that this will
> depend on the user's locale. For example, if I am Danish, I would
> expect all letters from A to A-ring to match, if I say [A-Å],
> according to the Danish alphabet. In England I might expect to get all
> letters A, irrespective of any accents on them.
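To make the difference concrete, here is a minimal Python sketch that compares a range read in alphabet order with the same range read in code point order. The DANISH table and both helper names are hypothetical, made up for illustration; a real program would take the ordering from locale data instead.

```python
# Hypothetical alphabet table; real code would get this from locale data.
DANISH = "ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅ"


def alphabet_range(first, last, alphabet):
    """Return the letters from first to last in the given alphabet order."""
    start = alphabet.index(first)
    end = alphabet.index(last)
    if start > end:
        raise ValueError("range endpoints are out of order")
    return alphabet[start:end + 1]


def codepoint_range(first, last):
    """Return every character numbered from first to last in Unicode."""
    return "".join(chr(c) for c in range(ord(first), ord(last) + 1))


by_alphabet = alphabet_range("A", "Å", DANISH)
by_number = codepoint_range("A", "Å")

# Alphabet order includes Æ and Ø, which sort before A-ring in Danish.
print("Æ" in by_alphabet, "Æ" in by_number)
# Code point order picks up punctuation such as "[" instead.
print("[" in by_alphabet, "[" in by_number)
```

Read by code point, the range from A to A-ring leaves out AE and O-slash (both are numbered above A-ring) but takes in lowercase letters and punctuation, which is probably not what a Danish user means.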

I disagree there. I would expect a character range like [A-Å] to be the
characters numbered from char('A') to char('Å') in the charset I am
using. (For me, latin-1. Which is an uninteresting example because the
useful character codes are the same as in Unicode).

Or maybe non-ASCII character ranges would simply be forbidden. If so,
can anything replace them? They exist because they are useful...

I won't expect programs to give all characters the correct collating
sequence in my language -- if nothing else, because a program often
can't know which language it is looking at. It only knows the charset.
Sometimes it can ask the user about the language, but not always.

> It would be quite unexpected to match almost all characters
> if some user enters [A-Z], when the Z happens to
> come from the compatibility zone at the high end of Unicode.
> This means you'll have to do some locale-defined normalisation on your
> data before pattern matching, comparable with sorting and searching
> operations.
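A small Python sketch of the effect described above. It assumes the "Z" in question is FULLWIDTH LATIN CAPITAL LETTER Z (U+FF3A), and it uses Unicode NFKC normalisation as a stand-in for the locale-defined normalisation; both are assumptions for illustration, and normalized_class is a hypothetical helper.

```python
import re
import unicodedata

FULLWIDTH_Z = "\uFF3A"  # assumed example of a compatibility-zone "Z"

# Taken literally, the range runs from U+0041 up to U+FF3A.
naive = re.compile("[A-" + FULLWIDTH_Z + "]")


def normalized_class(first, last):
    """Build a character class after NFKC-normalising both endpoints."""
    lo = unicodedata.normalize("NFKC", first)
    hi = unicodedata.normalize("NFKC", last)
    if len(lo) != 1 or len(hi) != 1 or lo > hi:
        raise ValueError("endpoints do not form a usable range")
    return re.compile("[" + re.escape(lo) + "-" + re.escape(hi) + "]")


def matches(pattern, text):
    """Normalise the data the same way before matching."""
    return pattern.fullmatch(unicodedata.normalize("NFKC", text)) is not None


strict = normalized_class("A", FULLWIDTH_Z)

print(bool(naive.fullmatch("\u4e2d")))  # a CJK ideograph falls inside the naive range
print(matches(strict, "\u4e2d"))        # but not inside the normalised one
print(matches(strict, "\uFF31"))        # fullwidth Q normalises to plain Q
```

The pattern and the data have to be normalised the same way, otherwise a fullwidth letter in the text would stop matching once the range has been narrowed.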


> I wouldn't bother about the original charset, when using
> Unicode, the user expects Unicode.

The user may not even *know* about Unicode. If he does and that's "his"
charset, everything is wonderful. But I was thinking of the situation
where the *user* is basically using some 8-bit character set and the
*program* is using Unicode (and translates input from the user's charset
to Unicode). Then we'll either have to dump regexp character ranges, or
define some way the program can know when the user means a range of his
native characters, and when he means a range of Unicode characters, or
define some equally useful alternative to ranges.
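One way the second option could look, as a rough Python sketch: the program expands the range by byte value in the user's charset and hands the regexp engine an explicit list of Unicode characters. The function name native_range_to_class is hypothetical, and cp437 is only picked as an example of an 8-bit charset whose ordering differs from Unicode's.

```python
import re


def native_range_to_class(first, last, charset):
    """Expand [first-last] by byte value in charset into a Unicode class."""
    lo = first.encode(charset)
    hi = last.encode(charset)
    if len(lo) != 1 or len(hi) != 1:
        raise ValueError("endpoints must be single bytes in " + charset)
    if lo[0] > hi[0]:
        raise ValueError("range endpoints are out of order in " + charset)
    chars = []
    for value in range(lo[0], hi[0] + 1):
        try:
            chars.append(bytes([value]).decode(charset))
        except UnicodeDecodeError:
            pass  # byte value is unassigned in this charset
    return "[" + "".join(re.escape(c) for c in chars) + "]"


# In cp437, C-cedilla is byte 0x80 and A-ring is byte 0x8F, so this is a
# valid 16-character range for a cp437 user.
native = re.compile(native_range_to_class("Ç", "Å", "cp437"))
print(bool(native.fullmatch("é")))

# Read as Unicode, the same two endpoints are out of order (U+00C7 > U+00C5).
try:
    re.compile("[Ç-Å]")
except re.error as exc:
    print("Unicode reading fails:", exc)
```

This only settles what the range means once the program has decided it is a native range; how the user signals that choice is still the open question.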

