Re: Addition of remaining two Maltese Characters to Unicode

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Aug 01 2000 - 14:08:11 EDT


John Cowan asked:

> > Suppose it were the case that Maltese alphabetic order put the letter l
> > before f.
>
> I have a recollection of seeing a list of Chinese words written in pinyin
> but alphabetized according to bopomofo rules. Is this commonplace?

In addition to the Indic instances cited by Jörg, there are many, many
instances of Latin orthographies invented by linguists for languages that
had no writing system (in the Americas, Africa, Southeast Asia, Oceania,
Northeast Asia, etc.) where the orthography was deliberately harmonized
with the phonology of the language and where the alphabetic order used
for reference materials is some variant of the articulation order first
devised by Panini for Sanskrit. For example, Nootka materials are in
Panini articulation order: uvulars and velars first, then coronals and
dentals, then labials, then liquids. Other systems simply group the
voicing and manner variants of each consonant together, but keep an
otherwise somewhat Latinate order. For example: b p ph p' d t th t' g k kh k' ...
There are as many variations on these kinds of schemes for reference
lexicons as there are types of phonological systems and imaginative
linguists out there.
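
A minimal sketch of how such a reference order could be applied in
software, assuming the ordering fragment above (b p ph p' ...) is the
whole alphabet. The names ALPHABET, tokenize and sort_key and the sample
strings are hypothetical and chosen only for illustration; a real
orthography would supply its own full letter list.

```python
# Hypothetical alphabet: only the fragment quoted in the message,
# treated here as if it were complete (an assumption for the demo).
ALPHABET = ["b", "p", "ph", "p'", "d", "t", "th", "t'", "g", "k", "kh", "k'"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}
MAX_LEN = max(len(letter) for letter in ALPHABET)


def tokenize(word):
    """Split a word into letters, preferring the longest matching multigraph."""
    letters = []
    i = 0
    while i < len(word):
        for size in range(MAX_LEN, 0, -1):
            chunk = word[i:i + size]
            if chunk in RANK:
                letters.append(chunk)
                i += len(chunk)
                break
        else:
            raise ValueError(f"no letter matches at position {i} in {word!r}")
    return letters


def sort_key(word):
    """Map a word to the list of ranks of its letters in ALPHABET."""
    return [RANK[letter] for letter in tokenize(word)]


# Made-up letter strings, not real words.
words = ["tk", "thb", "pb", "p'b", "phk", "kb"]
print(sorted(words, key=sort_key))
```

The point of the sketch is only that the multigraphs (ph, p', th, ...)
are weighted as single letters, so the result differs from plain
code-point order.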

> > As a result, if we
> > had a mixture of English (or French - more likely for Benin) and Adja
> > words, we couldn't sort them using a single set of rules and have them come
> > out correct for both languages at once.
>
> To be sure. The trouble arises when the digraph sometimes sorts one way
> and sometimes another. IIRC, in Danish "aa" sorts as "å" when it is an
> archaic rendering of it, as in "Aarhus", but as "aa" when it is a borrowing,
> as in "aardvark". That's the same case as Maltese, no? What is commonly
> done for Danish sorting?
>
> > In general, the best solution is: if a language has borrowings from another
> > language, they take on the conventions of the receptor language.
>
> A good principle, but it doesn't always work in the Real World.

Basically, I agree with Peter here. The first-order approximation is to
always follow the conventions of the receptor language. Beyond that, you
are dealing with mixed-language conventions or other special cases (e.g.
sorting numbers or abbreviations as if they were spelled out, rather than
by their character values) that require sorting capabilities beyond
tailored multilevel weighting algorithms for collation.

> Summary: Of course I agree with you that adding characters is not the answer,
> and that tailored collation sequences mostly are --- but some languages
> may have collation rules that look internally inconsistent when represented
> in Unicode, and may require tricks with ZWNBSP, which I think is the Right
> Thing in this case.
>

To handle the general mixed-language sorting case by actually mixing
language ordering rules, you need language markup (or a good dictionary
lookup mechanism) to identify which set of rules applies to which item.

Maltese may be a special case because it follows basically a single set
of rules, but wants a single exception for "ie" sequences that occur
in non-nativized borrowed words from English (or presumably other languages
as well). (I suspect this is a matter of language politics and policy
as much as anything else.)

Well, in that type of situation, I agree with John Cowan that a solution
using ZWNBSP would be sufficient. This would act as the "language tag"
in this particular case, but could do so in a way that generic collation
algorithms could deal with. All you need is:

Maltese ie = <i><e> ==> weight as an {ie} unit.
English ie = <i><ZWNBSP><e> ==> weight as sequence of weights {i} {e}

(Or the other way around, depending on whether you want the Maltese
or the English data to be the "marked" case.)
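
The two rules above can be sketched as a toy single-level collation key.
This is an illustration under stated assumptions, not a real collation
implementation: the ORDER list is a hypothetical mini-alphabet holding
only the letters needed here, placing the {ie} unit right after i is an
assumption for the demo, and "frit" is a made-up string used to show
where the two forms land.

```python
ZWNBSP = "\ufeff"  # U+FEFF ZERO WIDTH NO-BREAK SPACE

# Hypothetical toy alphabet (lowercase only); "ie" is one collation unit.
ORDER = ["d", "e", "f", "i", "ie", "n", "r", "t"]
WEIGHT = {unit: n for n, unit in enumerate(ORDER, start=1)}


def collation_key(text):
    """Return a tuple of weights, treating unbroken <i><e> as one unit."""
    weights = []
    i = 0
    while i < len(text):
        if text[i] == ZWNBSP:
            # Contributes no weight of its own, but it sits between
            # <i> and <e> and so prevents the contraction from matching.
            i += 1
        elif text.startswith("ie", i):
            weights.append(WEIGHT["ie"])
            i += 2
        else:
            weights.append(WEIGHT[text[i]])
            i += 1
    return tuple(weights)


unmarked = "friend"                 # <i><e>         -> {ie} unit
marked = "fri" + ZWNBSP + "end"     # <i><ZWNBSP><e> -> {i} {e}
entries = ["frit", unmarked, marked]
ordered = sorted(entries, key=collation_key)
print([e.replace(ZWNBSP, "|") for e in ordered])
```

With these assumed weights the marked form sorts before "frit" (i, then
e) while the unmarked form sorts after it (the {ie} unit follows i).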

You are still faced with getting your data properly marked up for sorting.
But that is the equivalent problem for any of the proposed solutions,
as Peter pointed out. If you need to maintain a distinction *in data*
between the two forms, then you have to enter them differently. And
there is no need for a new digraph character to accomplish that.

Frankly, I don't think heading down this road is advisable for
Maltese, since it would just create maintenance and consistency problems
for Maltese data. The better route is to live with the slightly "off"
results for placement of English words including "ie" in otherwise
Maltese sorted lists -- none of which would cause failures to find
items by Maltese users, though pedants would point to cases where the
"dumb software" sorted the English "ie" wrong. And then in those instances
where the necessity for detailed sorting can justify the extra cost
of the software development (i.e., where special sorting is going to
be done anyway, for telephone listings, for dictionary publishing,
and so on), the extra work to identify the English-derived vocabulary
for special handling can be done. This can be done either by stop-lists
of words known to be at issue (like "friend"), or if the lists get
too long, by hooking into generic dictionaries for lookup.
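
A rough sketch of the stop-list approach, assuming the marked form is
produced by inserting ZWNBSP between <i> and <e> before sorting. The
function name mark_borrowings and the one-entry list are hypothetical; a
production list would be built and maintained separately, or replaced by
a dictionary lookup.

```python
import re

ZWNBSP = "\ufeff"  # U+FEFF ZERO WIDTH NO-BREAK SPACE

# Hypothetical stop-list; "friend" is the example from the message.
STOP_LIST = {"friend"}


def mark_borrowings(word, stop_list=STOP_LIST):
    """Insert ZWNBSP inside each 'ie' of a word found on the stop-list.

    Words not on the list are returned unchanged, so their "ie"
    continues to be treated as a single unit by the collation.
    """
    if word.lower() in stop_list:
        return re.sub(r"(?i)(i)(e)", r"\1" + ZWNBSP + r"\2", word)
    return word


for w in ["friend", "Friend", "bieb"]:
    print(repr(mark_borrowings(w)))
```

The marking is applied only at sort time in this sketch, so the stored
data itself would not need to carry the distinction.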

In any case, encoding a digraph letter for Maltese doesn't solve
the problem, because it cannot prevent the need to deal with Maltese
data not spelled with the digraph -- and as Peter pointed out, it is
just special pleading for one language, when there are literally
thousands of cases out there of Latin digraphs or multigraphs which
have particular implications for ordering data. Encoding them all
as Unicode *characters* is clearly not where we want to head.

--Ken


