Re: Regular expressions in Unicode (Was: Ethiopic text)

From: Jeroen Hellingman (etmjehe@genesis.etm.ericsson.se)
Date: Thu Mar 12 1998 - 07:27:57 EST


> I'd be delighted if anyone has made a sensible definition of [7-bit ->
> 8-bit] ranges too, but I didn't want to complicate the issue too much.
> Maybe that was silly, though -

I am afraid that, when using Unicode, many apparently simple things become
complicated. A range can probably be defined most sensibly if it is based
on a collation order.

> Please note that I used two accented characters. Let's use the example
> [-] instead, it's easier to see that they are non-ASCII.

This is an example that is very hard to resolve, as it mixes letters
from two different writing systems (Danish and German). Using the
Unicode code points to define the range would give unexpected results as well.
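
A minimal sketch in Python can make the code point objection concrete. It
assumes a range from A to A-ring as a stand-in example, and relies on the fact
that Python's re module interprets a character class range by code point; the
sample characters are chosen only for illustration.

```python
import re

A_RING = "\u00c5"     # Å, LATIN CAPITAL LETTER A WITH RING ABOVE
AE = "\u00c6"         # Æ
O_SLASH = "\u00d8"    # Ø
COPYRIGHT = "\u00a9"  # ©

# Range taken purely by code point: U+0041 (A) through U+00C5 (Å).
by_code_point = re.compile("[A-" + A_RING + "]")

samples = ["B", "Z", "a", "[", COPYRIGHT, A_RING, AE, O_SLASH]
matched = [s for s in samples if by_code_point.fullmatch(s)]
rejected = [s for s in samples if not by_code_point.fullmatch(s)]

print(matched)   # B, Z, a, '[', the copyright sign and Å all match
print(rejected)  # Æ and Ø do not, although a Danish reader sorts them before Å
```

The range sweeps in punctuation, lowercase letters and symbols, yet leaves out
two letters of the Danish alphabet, which is probably not what anyone typing
such a range has in mind.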
 
> > This opens up a whole range of challenges. I would say that this will
> > depend on the user's locale. For example, if I am Danish, I would
> > expect all letters from A to A-ring to match if I say [A-Å],
> > according to the Danish alphabet. In England I might expect to get all
> > letters A, irrespective of any accents on them.
>
> I disagree there. I would expect a character range like [A-Å] to be the
> characters numbered from char('A') to char('Å') in the charset I am
> using. (For me, latin-1. Which is an uninteresting example because the
> useful character codes are the same as in Unicode).
>
> Or maybe non-ASCII character ranges would simply be forbidden. If so,
> can anything replace them? They exist because they are useful...

The char('A') to char('Å') solution seems too confusing to me. I am
thinking of Unicode; what if the Å happens to be the Ångström symbol?
I also don't like forbidding characters beyond U+007F.
It will make Russian users, for example, very unhappy. It is also important
to note that some scripts are included in Unicode out of alphabetical
order (like Gurumukhi), and order is not well defined on the huge range
of ideographs. For the latter, using them in a range may well be forbidden,
just like ranges that are not well defined in your locale.
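
The Ångström case can be sketched in a few lines of Python. The helper
in_code_point_range is a hypothetical name introduced only for this
illustration; unicodedata is Python's standard module, and normalizing before
matching is one possible remedy, not something this thread has agreed on.

```python
import unicodedata

A_RING = "\u00c5"         # LATIN CAPITAL LETTER A WITH RING ABOVE
ANGSTROM_SIGN = "\u212b"  # ANGSTROM SIGN, looks identical on screen

def in_code_point_range(ch, low, high):
    """Hypothetical helper: range membership decided purely by code point."""
    return ord(low) <= ord(ch) <= ord(high)

letter_matches = in_code_point_range(A_RING, "A", A_RING)
sign_matches = in_code_point_range(ANGSTROM_SIGN, "A", A_RING)
print(letter_matches, sign_matches)  # True False

# Canonical normalization (NFC) maps the sign onto the letter, so a matcher
# that normalizes its input first treats the two alike.
normalized = unicodedata.normalize("NFC", ANGSTROM_SIGN)
print(normalized == A_RING)  # True
```

Two characters that the user cannot tell apart thus fall on opposite sides of
the same range unless the program takes extra care.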
 
> I won't expect programs to give all characters the correct collating
> sequence in my language -- if nothing else, because a program often
> can't know which language it is looking at. It only knows the charset.
> Sometimes it can ask the user about the language, but not always.

I _will_ expect exactly that. Most sensible Unicode applications will
have to know the language anyway, and users cannot be expected to know
all of Unicode, or even Latin-1 and ASCII. I realise this is a very tough
thing. Work is under way (ISO 14651) to define a default collating
sequence.
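
As a rough sketch of what a range based on collation order could look like,
the Python below resolves a range against an explicit alphabet. The list
DANISH_UPPER and the function range_by_alphabet are hypothetical and
hand-written for this example; a real application would take its ordering
from locale data or a standard collation table.

```python
# Hypothetical, hand-written ordering of the Danish capital letters.
DANISH_UPPER = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ") + [
    "\u00c6",  # Æ
    "\u00d8",  # Ø
    "\u00c5",  # Å
]

def range_by_alphabet(low, high, alphabet):
    """Return the letters from low to high in the order the alphabet gives.

    Raises ValueError when an endpoint is not in the alphabet or when the
    range is reversed, i.e. when the range is not well defined here.
    """
    start = alphabet.index(low)  # ValueError if low is unknown
    end = alphabet.index(high)   # ValueError if high is unknown
    if start > end:
        raise ValueError("range is reversed in this alphabet")
    return alphabet[start:end + 1]

danish_a_to_a_ring = range_by_alphabet("A", "\u00c5", DANISH_UPPER)
print(len(danish_a_to_a_ring))  # 29: the whole Danish alphabet
```

With this ordering a range from A to A-ring covers every Danish letter,
including Æ and Ø, while an endpoint outside the alphabet is rejected
instead of being silently interpreted by code point.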
 
> The user may not even *know* about Unicode. If he does and that's "his"
> charset, everything is wonderful. But I was thinking of the situation
> where the *user* is basically using some 8-bit character set and the
> *program* is using Unicode (and translates input from the user's charset
> to Unicode). Then we'll either have to dump regexp character ranges, or
> define some way the program can know when the user means a range of his
> native characters, and when he means a range of Unicode characters, or
> define some equally useful alternative to ranges.

You are right that the user may not know about Unicode; he may not even
know about any code at all, but think of the letters in the alphabetic
order he knows, in which case he will expect that order when he
gives a range. The problem is what to do with letters outside
his field of knowledge. ASCII is small enough to take in at a glance, but
Unicode is too large for most people to foresee the effects of a range
selection.

I was not thinking of the situation you describe, and although we
may have to live with such things for a while, I certainly hope
applications translating between Unicode and 8-bit sets will
disappear sooner or later. What is needed to promote this is a good,
easy-to-use API that handles most of the Unicode trouble and makes
text processing easy for programmers again. Pattern matching will
certainly need to be part of such an API.

Jeroen

+---- Jeroen Hellingman ---------------------------------------------------------+
| work: Ericsson Telecommunicatie B.V., Ericssonstraat 2, Rijen, The Netherlands |
| Department ETM/RPU, Room 17116 |
| Tel: +31 161 242022 (834 2022), E-mail: <etmjehe@etm.ericsson.se> |
| home: Aletta Jacobsstraat 5, 3404 XD IJsselstein, The Netherlands |
| Tel: +31 30 6875444, E-mail: <jehe@kabelfoon.nl> |
| Homepage: <http://members.tripod.com/~jhellingman> |
+--------------------------------------------------------------------------------+


