Re: Regular expressions in Unicode (Was: Ethiopic text)

From: Peter Westlake (peter@harlequin.co.uk)
Date: Fri Mar 13 1998 - 07:06:23 EST


At 00:01 1998-03-13 -0800, Kolbjørn Aambø@unicode.org wrote:
>After reading this discussion through for the last few minutes I just
>wonder if there is, with UNICODE characters, any alternative to specifying a
>sequence of interesting characters and their relations like this
>"aboriginal character specification":
>
>Aa:á:Àà:â:Ãã,Bb,Cc:Çç,Dd,Ee:Ééèêë,Ff,Gg,Hh,I:¡iíìîï,Jj,Kk,Ll,Mm,Nn:Ññ,Oo:óòô:Õõ:
>‘¦,Pp,Qq,Rr,Ss,Tt,Uu:úùû,Vv,Ww,Xx,Yy:Üü,Zz,Ææ:Ää,Øø:Öö,Åå.
>
>call it a local collating sequence if you wish...
>
>THEN the regular expression [A..Å] will at least mean all characters in
>the above sequence as I see it. All characters that are NOT mentioned in
>the aboriginal character specification will then be deemed outside by the
>regular expression...

Thinking about matching from first principles, it is surely the
case that any given user has some set of strings in mind when
entering a search expression. So what we need is a syntax in
which they can say what they mean. For instance, if I'm looking
for an English word that ends in E, I surely want to match
"café", so I want to ask for the set of words ending with
"e with any accent that can appear in English", i.e. with a
member of a particular set of characters. So far, so good.

Now, if I want to find a word beginning with A in a list of
scientific words used in English, then I would hope to find
"Ångstrøm". But if I were searching for names beginning with
A in the Danish telephone directory, it would be a mistake to
find "Ångstrøm". So I need to say what I mean. If I want to
match A-F in English, I need a short way of saying whether to
include accents and case and of saying that I mean English.
Something like [A-F::u,a,uk], where u means upper case, a means
any accent, and uk comes from a standard list of country codes. The
range is interpreted in the context of the UK collating sequence. To
omit Ångstrøms, I would ask for ^[A::u,a,dk].* meaning "a string
beginning with a letter that matches A in Danish". In this context,
"Danish" and "English" can be seen as equivalence relations that
partition the character set into equivalence classes. Kolbjørn
gave an example of such a relation.
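
To make the "Danish" versus "English" distinction concrete, here is a
minimal Python sketch. It is only an illustration under stated
assumptions: the table OWN_LETTERS and the functions letter_class and
starts_with are hypothetical names, and the "uk" rule (strip any accent
by canonical decomposition) is a simplification, not a real collation
definition.

```python
import unicodedata

# Hypothetical table: letters that a collation treats as letters in their
# own right rather than as accented variants of another letter.
OWN_LETTERS = {
    "uk": set(),                             # every accented letter folds to its base
    "dk": {"\u00c6", "\u00d8", "\u00c5"},    # AE, O-slash, A-ring are separate letters
}


def letter_class(ch, collation):
    """Return the name of the letter class that ch belongs to."""
    upper = unicodedata.normalize("NFC", ch.upper())
    if upper in OWN_LETTERS[collation]:
        return upper
    # Canonical decomposition puts the base letter first, accents after it.
    return unicodedata.normalize("NFD", upper)[0]


def starts_with(word, letter, collation):
    """True if word begins with a character equivalent to letter."""
    return letter_class(word[0], collation) == letter_class(letter, collation)
```

With this sketch, starts_with("Ångstrøm", "A", "uk") is true and
starts_with("Ångstrøm", "A", "dk") is false, which is the behaviour
asked for above.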

The parts of the "u,a,uk" notation can all be seen as describing
equivalence classes, but the country code (or arbitrary name of
a collating sequence) modifies the other parts. Being very formal
about it for a minute, the other parts name classes of equivalence
relations, and those relations come from families indexed by
collation sequence name, so "[A::u,a,uk]" is really saying:

Intersection of
  Class named A or containing A in LetterEquivalences(uk)
    and
  Class Upper of Cases(uk)

Accents don't appear in this, because any accent will do,
and the union of all equivalence classes of Accents(uk) is
the whole character set. I'm using () for indexing rather
than [] to avoid confusion with character classes. You
could drop the idea of indexing and think of relations
with names like BritishLetterEquivalences, but that makes
it a little harder to work out which relation is meant from
the notation.
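
The intersection can be sketched in Python over a tiny character set.
Everything here is an assumption made for illustration: UNIVERSE,
OWN_LETTERS, letter_equivalences, cases and char_class are hypothetical
names, and the rules for "uk" and "dk" are simplified stand-ins for
real collation data.

```python
import unicodedata

# A deliberately tiny "whole character set": A a À à Å å B b Ç ç
UNIVERSE = [chr(c) for c in (0x41, 0x61, 0xC0, 0xE0, 0xC5, 0xE5,
                             0x42, 0x62, 0xC7, 0xE7)]

# Hypothetical: letters each collation treats as letters in their own right.
OWN_LETTERS = {"uk": set(), "dk": {"\u00c6", "\u00d8", "\u00c5"}}


def letter_key(ch, collation):
    """Name of the letter class containing ch under the given collation."""
    upper = unicodedata.normalize("NFC", ch.upper())
    if upper in OWN_LETTERS[collation]:
        return upper
    return unicodedata.normalize("NFD", upper)[0]


def letter_equivalences(collation):
    """LetterEquivalences(collation): class name -> set of characters."""
    classes = {}
    for ch in UNIVERSE:
        classes.setdefault(letter_key(ch, collation), set()).add(ch)
    return classes


def cases(collation):
    """Cases(collation); in this sketch the result ignores the collation."""
    return {
        "Upper": {c for c in UNIVERSE if c.isupper()},
        "Lower": {c for c in UNIVERSE if c.islower()},
    }


def char_class(letter, case, collation):
    """[letter::case,a,collation]: 'any accent' adds no constraint."""
    return letter_equivalences(collation)[letter] & cases(collation)[case]
```

Here char_class("A", "Upper", "uk") gives the set A, À, Å, while
char_class("A", "Upper", "dk") gives only A and À, because Å falls in
a class of its own in the Danish relation.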

Ranges are taken to be defined in the context of the collation
sequence too, so "a-f" means the first six classes of the
(ordered) equivalence relation Letters(uk).
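
A range is then just a slice of the ordered list of classes. The sketch
below assumes two hypothetical orderings (ORDERED_CLASSES is an
illustrative table, not real collation data): plain A to Z for "uk",
and A to Z followed by Æ, Ø, Å for "dk".

```python
import string

# Hypothetical ordered class names per collation sequence.
ORDERED_CLASSES = {
    "uk": list(string.ascii_uppercase),
    "dk": list(string.ascii_uppercase) + ["\u00c6", "\u00d8", "\u00c5"],
}


def range_classes(first, last, collation):
    """Return the class names from first to last, inclusive, in collation order."""
    order = ORDERED_CLASSES[collation]
    lo, hi = order.index(first), order.index(last)
    if lo > hi:
        raise ValueError("range endpoints are out of order for this collation")
    return order[lo:hi + 1]
```

So range_classes("A", "F", "uk") yields the first six classes, and
range_classes("A", "Å", "dk") yields all 29 classes of the Danish
ordering, while the same range in "uk" raises an error because Å is
not a class name there.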

All we need now is a nice notation, and plenty of standard
equivalence classes.

Peter.


