Re: sequences and stuff

From: Roozbeh Pournader (roozbeh@sharif.edu)
Date: Fri Dec 01 2000 - 06:14:20 EST


On Thu, 30 Nov 2000, Brendan Murray/DUB/Lotus wrote:

> There are similar situations in many languages. Possibly more complicated
> is the use of graphemes which usually contract but don't in some cases. For
> example, the "aa" sequence as in "gaard" in Danish is traditionally sorted
> as (a-ring), after (o-slash), but in other situations, particularly in
> names, the "aa" is really "a"+"a", and should be sorted before "b". How can
> this be catered for algorithmically?
>
> My guess is that there are only two possible solutions:
> 1. use an exceptions list, or
> 2. break the grapheme with some marker like ZWNJ to prevent the
> contraction.
>
> Obviously the first creates a maintenance nightmare, and the latter has to
> be somehow tagged to store the data correctly. In any case there's no
> simple solution.
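
A minimal sketch of option 2 above, for illustration only. It assumes a simplified Danish alphabet (a-z followed by ae-ligature, o-slash, a-ring), lower-cased input, and that ZWNJ (U+200C) has been typed between the two letters wherever "aa" must not contract. The names danish_key, ALPHABET and RANK are made up for this example, and "ga" + ZWNJ + "ard" is just a stand-in for a name in which "aa" is really "a"+"a".

```python
# Sketch only: sort key that contracts "aa" to a-ring unless a ZWNJ breaks it.
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER, used here as the "do not contract" marker

# Simplified alphabet (assumption): a-z, then ae-ligature, o-slash, a-ring.
ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e6\u00f8\u00e5"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}
A_RING = "\u00e5"


def danish_key(word):
    """Return a tuple of ranks usable as a sort key."""
    s = word.lower()
    key = []
    i = 0
    while i < len(s):
        if s[i] == ZWNJ:
            # The marker itself carries no weight; it only blocks the contraction.
            i += 1
        elif s.startswith("aa", i):
            # Unbroken "aa" contracts and sorts as a-ring.
            key.append(RANK[A_RING])
            i += 2
        else:
            # Characters outside the table sort after everything else.
            key.append(RANK.get(s[i], len(ALPHABET)))
            i += 1
    return tuple(key)


words = ["gaard", "gb", "g\u00f8l", "ga" + ZWNJ + "ard"]
print(sorted(words, key=danish_key))
```

With this key the marked form sorts before "gb", while the plain "gaard" sorts after the o-slash word. The cost is the one already mentioned: the marker has to be stored with the data.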

The situation is somewhat worse with Persian. The letter U+0622, ALEF
WITH MADDA ABOVE, when it occurs in the middle of a word, is sorted
according to the root of the word. Although it is pronounced the same in
both cases, it may be a letter of its own (in words with a Persian root),
or it may stand for Hamza+Alef, in which case it is treated like a
ligature for sorting. Librarians, who know the meanings of the words, have
no problem sorting them, but the poor computer programs, you know. Any
ideas for a different markup? If you need examples, take "MEEM ALEF-MADDA
KHAH THAL", which is sorted like "MEEM HAMZA ALEF KHAH THAL" (Hamza is
sorted after Alef in Persian), and "MEEM FARSI-YEH REH ALEF-MADDA BEH", in
which the Alef-Madda is considered a single unit, sorted before Alef.
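
To make the question concrete, here is a rough sketch of what a marker-based approach could look like. Everything in it beyond the two facts above is an assumption: DECOMPOSE_MARK is a hypothetical marker (U+034F is an arbitrary pick), persian_key and rank are invented names, and every letter other than Alef-Madda, Alef and Hamza is simply ranked by code point, which is not the real Persian alphabet order.

```python
# Sketch only: a tagged Alef-Madda sorts as Hamza+Alef, an untagged one as a single unit.
ALEF_MADDA = "\u0622"      # ARABIC LETTER ALEF WITH MADDA ABOVE
ALEF = "\u0627"            # ARABIC LETTER ALEF
HAMZA = "\u0621"           # ARABIC LETTER HAMZA
DECOMPOSE_MARK = "\u034f"  # hypothetical marker, arbitrary choice for this example

# Only the relative order stated in the message: Alef-Madda < Alef < Hamza.
SPECIAL = {ALEF_MADDA: 0, ALEF: 1, HAMZA: 2}


def rank(ch):
    # Placeholder: all other letters are ordered by code point (an assumption).
    return SPECIAL.get(ch, 10 + ord(ch))


def persian_key(word):
    """Return a tuple of ranks usable as a sort key."""
    key = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch == DECOMPOSE_MARK:
            # A stray marker carries no weight.
            i += 1
        elif ch == ALEF_MADDA and word[i + 1:i + 2] == DECOMPOSE_MARK:
            # Tagged Alef-Madda: sort as the sequence Hamza, Alef.
            key += [rank(HAMZA), rank(ALEF)]
            i += 2
        else:
            # Untagged Alef-Madda (and everything else) is a single unit.
            key.append(rank(ch))
            i += 1
    return tuple(key)


MEEM, KHAH, THAL = "\u0645", "\u062e", "\u0630"
tagged = MEEM + ALEF_MADDA + DECOMPOSE_MARK + KHAH + THAL
spelled_out = MEEM + HAMZA + ALEF + KHAH + THAL
print(persian_key(tagged) == persian_key(spelled_out))
```

The tagged word gets the same key as the Hamza+Alef spelling, and the untagged Alef-Madda still sorts before Alef. Someone who knows the root still has to put the marker in, of course.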

--roozbeh


