Re: Regular expressions in Unicode (Was: Ethiopic text)

From: Peter Westlake (peter@harlequin.co.uk)
Date: Fri Mar 13 1998 - 19:13:21 EST


At 10:20 1998/03/13 -0800, Keld Jørn Simonsen wrote:
>Kolbjørn Aambø writes:
>
>> Peter Westlake <peter@harlequin.co.uk> wrote:
>> :
>> >Now, if I want to find a word beginning with A in a list of
>> >scientific words used in English, then I would hope to find
>> >"Ångstrøm". But if I were searching for names beginning with
>> >A in the Danish telephone directory, it would be a mistake to
>> >find "Ångstrøm". So I need to say what I mean. If I want to
>> >match A-F in English, I need a short way of saying whether to
>> >include accents and case and of saying that I mean English.
>> >Something like [A-F::u,a,uk] where u means upper case, a means
>> >any accent, uk is from a standard list of codes. The range is
>> >interpreted in the context of the UK collating sequence. To
>> >omit Ångstrøms, I would ask for ^[A::u,a,dk]* meaning "a string
>> >beginning with a letter that matches A in Danish". In this context,
>> >"Danish" and "English" can be seen as equivalence relations that
>> >partition the character set into equivalence classes. Kolbjørn
>> >gave an example of such a relation.
>
>You should normally treat a search pattern according to the
>locale of the user, not the originator. So the user will get
>things matched and sorted according to his/her own expectations,
>the rules that the producer used should not matter. It is
>quite difficult to know all the rules of the data producer,
>eg the Danish telephone directory, would you know the rules there?
>I would bet that most people in the world do not know
>about Danish æøå sorting and matching rules, and even less
>the rules for aa ü ð þ ö ä and other letters.
>
>So rule number one: always sort and match according to
>the expectations of the user.

True, but you have to guess what those are, and you can't
always be right. If your search mechanism has an explicit
notion of which collation sequence is being used in
an expression, the decision passes from the implementor of the
matching package to the application writer, who can decide
whether or when to pass it on to the user.
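
To make that concrete, here is a minimal sketch in Python of a
matcher that takes the collating language as an explicit argument.
It is only an illustration: the names fold, starts_with and
DISTINCT_LETTERS are hypothetical, the "uk" and "dk" codes are
borrowed from the [A::u,a,dk] notation quoted above, and the tiny
tables stand in for real collation data. It does not handle
digraphs such as the Danish "aa".

```python
import unicodedata

# Hypothetical per-language tables: letters that count as separate
# letters of the alphabet and so must not be folded onto a base letter.
DISTINCT_LETTERS = {
    "uk": set(),            # English: accented forms fold onto the base
    "dk": set("ÆØÅæøå"),    # Danish: Æ, Ø and Å are letters in their own right
}


def fold(ch, lang, ignore_accents=True, ignore_case=False):
    """Return a representative of ch's equivalence class under lang."""
    ch = unicodedata.normalize("NFC", ch)
    if ignore_accents and ch not in DISTINCT_LETTERS[lang]:
        # Decompose, then drop combining marks (accents, rings, ...).
        decomposed = unicodedata.normalize("NFD", ch)
        stripped = "".join(
            c for c in decomposed if not unicodedata.combining(c)
        )
        ch = stripped or ch
    if ignore_case:
        ch = ch.upper()
    return ch


def starts_with(word, letter, lang, **options):
    """True if word begins with a character equivalent to letter in lang."""
    word = unicodedata.normalize("NFC", word)
    if not word:
        return False
    return fold(word[0], lang, **options) == fold(letter, lang, **options)
```

With these tables, starts_with("Ångstrøm", "A", "uk") is true and
starts_with("Ångstrøm", "A", "dk") is false. The caller, not the
matching package, chooses the language code, so the application
writer decides whether it comes from the user's locale, from a
preference, or from the expression itself.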

I didn't mean to suggest "use the originator's sequence" as
a rule. I just picked the Danish telephone book as an example
because it was different to my first example, English.
In the example I am a sophisticated user who (thinks he) knows
enough to ask for a particular sequence.

>For sophisticated users, you could then say, I expect results
>according to this specific foreign collating sequence.

That's right. Such users might be allowed to type in regular
expressions using the full syntax. Less sophisticated users
may have less freedom, depending on what the application
writer thinks they can understand. There might be an Advanced
Options section in a Preferences dialog, where anyone who
wants more choices can ask for them. Or whatever. It all
depends on the application.

Peter.



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:39 EDT