Re: Regular expressions in Unicode (Was: Ethiopic text)

From: Keld Jørn Simonsen (keld@dkuug.dk)
Date: Fri Mar 13 1998 - 13:26:30 EST


Kolbjørn Aambø writes:

> Peter Westlake <peter@harlequin.co.uk> wrote:
> :
> >Now, if I want to find a word beginning with A in a list of
> >scientific words used in English, then I would hope to find
> >"Ångstrøm". But if I were searching for names beginning with
> >A in the Danish telephone directory, it would be a mistake to
> >find "Ångstrøm". So I need to say what I mean. If I want to
> >match A-F in English, I need a short way of saying whether to
> >include accents and case and of saying that I mean English.
> >Something like [A-F::u,a,uk] where u means upper case, a means
> >any accent, uk is from a standard list of codes. The range is
> >interpreted in the context of the UK collating sequence. To
> >omit Ångstrøms, I would ask for ^[A::u,a,dk]* meaning "a string
> >beginning with a letter that matches A in Danish". In this context,
> >"Danish" and "English" can be seen as equivalence relations that
> >partition the character set into equivalence classes. Kolbjørn
> >gave an example of such a relation.

You should normally interpret a search pattern according to the
locale of the user, not that of the originator. That way the user
gets things matched and sorted according to his/her own expectations;
the rules that the producer used should not matter. It is
quite difficult to know all the rules of the data producer.
Take the Danish telephone directory: would you know the rules there?
I would bet that most people in the world do not know
the Danish sorting and matching rules, and even fewer know
the rules for aa and other letters.

So rule number one: always sort and match according to
the expectations of the user.

Sophisticated users could then be given a way to say: I expect
results according to this specific foreign collating sequence.
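
To make this concrete, here is a rough sketch in Python of matching by
the user's locale with an optional explicit override. Everything in it
is my own illustration: the names DISTINCT_LETTERS, base_letter and
starts_with are hypothetical, the locale table is a toy with only two
entries, and it ignores the Danish aa rule entirely. A real system
would take its rules from proper locale data, not from a table like this.

```python
import unicodedata

# Hypothetical, heavily simplified per-locale rules: letters that a locale
# treats as letters in their own right, not as accented variants.
DISTINCT_LETTERS = {
    "en": set(),
    "da": {"\u00c6", "\u00d8", "\u00c5"},  # AE ligature, O with stroke, A with ring
}

def base_letter(ch, locale_code):
    """Return the letter that ch is considered to match under locale_code."""
    upper = ch.upper()
    if upper in DISTINCT_LETTERS[locale_code]:
        return upper
    # NFD splits an accented letter into base letter + combining marks,
    # e.g. U+00C5 becomes "A" + U+030A; dropping the marks leaves "A".
    decomposed = unicodedata.normalize("NFD", upper)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def starts_with(word, letter, user_locale, collation=None):
    """Does word begin with letter, judged by the user's locale?

    collation is the opt-in for sophisticated users: when given, it
    replaces the user's locale as the source of matching rules.
    """
    locale_code = collation or user_locale
    # Compose first, so a decomposed "A" + ring is seen as one letter.
    first = unicodedata.normalize("NFC", word)[0]
    return base_letter(first, locale_code) == base_letter(letter, locale_code)

word = "\u00c5ngstr\u00f8m"
print(starts_with(word, "A", user_locale="en"))
print(starts_with(word, "A", user_locale="da"))
print(starts_with(word, "A", user_locale="en", collation="da"))
```

With these toy rules the English user finds the word under A, the
Danish user does not, and the English user who explicitly asks for
Danish collation gets the Danish answer.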

Keld


