On Thu, Nov 30, 2000 at 05:18:59AM -0800, Brendan Murray/DUB/Lotus wrote:
> Branislav Tichy <firstname.lastname@example.org> wrote:
> > b) there are compound words, which have these sequences on a word border,
> > and in this case, they stand for two separate graphemes and _are_ sorted
> > as c+h, d+z, and so forth.
> > the proper collation algorithm would therefore have to realise (imho),
> > whether there is one or two graphemes (whether the word is compound)!
> There are similar situations in many languages. Possibly more complicated
> is the use of graphemes which usually contract but don't in some cases. For
> example, the "aa" sequence as in "gaard" in Danish is traditionally sorted
> as å (a-ring), after ø (o-slash), but in other situations, particularly in
> names, the "aa" is really "a"+"a", and should be sorted before "b". How can
> this be catered for algorithmically?
Yes, the Slovak problem may look like the Danish "aa" problem.
Just for the record, "aa" normally means "å" in Danish names,
e.g. Søndergaard is the last name of one of the people who
have been responsible for SC2 matters at Danish Standards.
"gaard" is pronounced like "gård". I have no examples off the top of my head of
Danish names where "aa" actually means two a's, pronounced as two sounds.
The rule from the Danish orthography book is that if the two
a's are pronounced as two sounds, they are also sorted as two letters, as
two a's. If the pair is pronounced as one sound, then it is sorted as an "å"
(irrespective of whether the sound is an "a" sound).
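
As a rough illustration of the default case only (one sound, sorted as
"å"), here is a minimal Python sketch. The alphabet string, the RANK
table and the danish_key function are my own hypothetical names, not part
of any standard or library, and the sketch blindly contracts every "aa",
so it cannot tell the two-sound case apart.

```python
# Assumed Danish alphabet order: a..z, then æ, ø, å.
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def danish_key(word):
    """Sort key that treats every "aa" as a single "å" (hypothetical helper)."""
    folded = word.lower().replace("aa", "å")
    # Characters outside the assumed alphabet sort after everything else.
    return [RANK.get(ch, len(ALPHABET)) for ch in folded]


words = ["gøre", "gaard", "gade", "gård"]
print(sorted(words, key=danish_key))
```

With this key "gaard" and "gård" compare equal and both land after
"gøre", since å follows ø.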
> My guess is that there are only two possible solutions:
> 1. use an exceptions list, or
> 2. break the grapheme with some marker like ZWNJ to prevent the
> Obviously the first creates a maintenance nightmare, and the latter has to
> be somehow tagged to store the data correctly. In any case there's no
> simple solution.
The two a sounds occur in combined words, like ekstraarbejde (extra work).
The recommendation from Danish Standards is to introduce a soft hyphen (SHY)
between the a's. This also works for ISO-8859-1.
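
A small sketch of how a sort key could honour such a SHY, again in
Python and again under my own assumptions: the alphabet order, the names
RANK and danish_key_shy, and the comparison word "ekstrabetaling" are
only there for illustration. The SHY itself is ignored for ordering but
blocks the "aa" contraction across it.

```python
SHY = "\u00ad"  # SOFT HYPHEN, byte 0xAD in ISO-8859-1

# Assumed Danish alphabet order: a..z, then æ, ø, å.
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def danish_key_shy(word):
    """Sort key: "aa" contracts to "å" unless a SHY separates the two a's."""
    key = []
    for part in word.lower().split(SHY):
        # The contraction is applied inside each part only, so
        # "a" + SHY + "a" stays two separate letters.
        folded = part.replace("aa", "å")
        key.extend(RANK.get(ch, len(ALPHABET)) for ch in folded)
    return key


with_shy = "ekstra" + SHY + "arbejde"
without_shy = "ekstraarbejde"
print(sorted([without_shy, "ekstrabetaling", with_shy], key=danish_key_shy))
```

With the SHY in place the word sorts before "ekstrabetaling" (a+a before
a+b); without it the "aa" is taken as "å" and the word sorts last.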