Re: Addition of remaining two Maltese Characters to Unicode

From: Peter_Constable@sil.org
Date: Tue Aug 01 2000 - 10:13:18 EDT


>However if the 'ie' in friend is simply assumed to be 'ie', not 'i' + 'e'
>the list would be sorted (incorrectly) as,
>
>> frigate
>> friend
>> id-dar
>> iehor
>> liema
>
>This issue is very important in Maltese since most Maltese persons are
>bi-lingual (and even tri-lingual) resulting in frequent borrowing of foreign
>words (especially of English origin).

But do "friend" and "frigate" appear in Maltese dictionaries? Is it
reasonable to expect that a single collating spec can correctly order words
that follow two different collating conventions all at once? There is no
single correct way to sort multilingual text; this is obvious when talking
about mixing, say, Thai words and Russian words since the scripts are
different, but it is no less true in the case of two languages that happen
to share the same script: each writing system follows its own conventions.

Suppose it were the case that Maltese alphabetic order put the letter l
before f. (I realise this is hypothetical, but it's a realistic scenario
for some other language, and points out the difficulties in adding
characters to a universal character set in order to deal with
language-specific processing issues. Even if the example doesn't apply to
Maltese, there will be some other language out there for which it does, as
I illustrate below.) Then even if you could deal with the <i><e> vs. <ie>
distinction, you'd still end up with the following (incorrect) order to the
same list (with a couple of additions):

lieutenant
little
liema
friend
frigate
id-dar
iehor

Here, the four English words are out of order. To correct this, would we be
asking for LATIN CAPITAL (or SMALL) LETTER MALTESE L? Don't count on it!
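
To make the hypothetical concrete, here is a minimal Python sketch. It assumes
a simplified ASCII-only alphabet in which 'ie' is a letter sorting right after
'i' and 'l' has been moved before 'f'; the names HYPOTHETICAL_ORDER and
sort_key, and the per-word is_maltese flag, are invented for illustration and
are not part of any real collation API.

```python
# Hypothetical alphabet: plain ASCII letters, with the digraph 'ie' placed
# after 'i' and (per the thought experiment) 'l' moved before 'f'.
HYPOTHETICAL_ORDER = (
    "a b c d e l f g h i ie j k m n o p q r s t u v w x y z".split()
)
WEIGHT = {letter: rank for rank, letter in enumerate(HYPOTHETICAL_ORDER)}


def sort_key(word, is_maltese):
    """Build a list of weights; 'ie' is one letter only in Maltese words."""
    # Assumption: hyphens and other non-letters are simply ignored.
    letters = [c for c in word.lower() if c in WEIGHT]
    key = []
    i = 0
    while i < len(letters):
        if is_maltese and letters[i:i + 2] == ["i", "e"]:
            key.append(WEIGHT["ie"])
            i += 2
        else:
            key.append(WEIGHT[letters[i]])
            i += 1
    return key


# (word, is_maltese): the <i><e> vs. <ie> distinction is handled by the flag.
words = [
    ("friend", False), ("frigate", False), ("id-dar", True),
    ("iehor", True), ("liema", True), ("lieutenant", False),
    ("little", False),
]
ordered = [w for w, _ in sorted(words, key=lambda p: sort_key(*p))]
print(ordered)
# ['lieutenant', 'little', 'liema', 'friend', 'frigate', 'id-dar', 'iehor']
```

Even with the digraph handled perfectly, the English words land wherever the
receptor alphabet puts their first letter.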

The fact that the situation you're dealing with happens to involve a digraph
is coincidental to the more significant point: words from different
languages sort differently, and in general you can't expect to mix them and
be able to have them all sort *correctly* according to the conventions of
each language since the two conventions may be in conflict with each other.

The number of cases of digraphs in other languages that have special
collation requirements that contradict the collation of the same sequence
of letters in English is *enormous*. Let me give some other examples: I'm
holding in my hands a book that discusses alphabets for a sample of African
languages. For the very first language (Adja, spoken in Benin), there are
four digraphs - gb, kp, ny, sh - that sort independently of the graphemes
that consist only of the initial letters - g, k, n, s. As a result, if we
had a mixture of English (or French - more likely for Benin) and Adja
words, we couldn't sort them using a single set of rules and have them come
out correct for both languages at once. That's just the first language, and
already it's "four more digraphs in Unicode, please." Oh, and did I mention
that in this language the letter x sorts between h and i (here's exactly the
situation I described earlier)? "Can we have LATIN SMALL/CAPITAL LETTER ADJA X
please?"
Then I move on to the second language, Bariba (also in Benin). It uses two
of the same digraphs, so nothing new on that front, but it sorts d before
c, so again a mixture of English (or French) and Bariba words can't sort
correctly for the two different languages' conventions at once. "Can we
have either LATIN SMALL/CAPITAL LETTER BARIBA D or LATIN SMALL/CAPITAL
LETTER BARIBA C please?" Need I continue?
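
The conflict is easy to reproduce. The following Python sketch uses an
"Adja-like" order that is an assumption on my part: plain ASCII letters only,
each of the four digraphs placed directly after its initial letter, and x
placed between h and i. The real Adja alphabet has more letters than this, and
the test words are ordinary English words, not Adja.

```python
# Assumed, simplified order: digraphs gb, kp, ny, sh follow g, k, n, s;
# x sits between h and i. Not the full Adja alphabet.
ADJA_LIKE_ORDER = (
    "a b c d e f g gb h x i j k kp l m n ny o p q r s sh "
    "t u v w y z"
).split()
WEIGHT = {g: rank for rank, g in enumerate(ADJA_LIKE_ORDER)}
DIGRAPHS = {g for g in ADJA_LIKE_ORDER if len(g) == 2}


def adja_like_key(word):
    """Tokenize greedily (digraph first) and map each grapheme to a weight."""
    w = word.lower()
    key = []
    i = 0
    while i < len(w):
        pair = w[i:i + 2]
        if pair in DIGRAPHS:
            key.append(WEIGHT[pair])
            i += 2
        elif w[i] in WEIGHT:
            key.append(WEIGHT[w[i]])
            i += 1
        else:
            i += 1  # assumption: ignore anything outside the alphabet
    return key


english = ["rugby", "rugs", "ice", "xylophone"]
print(sorted(english))
# ['ice', 'rugby', 'rugs', 'xylophone']
print(sorted(english, key=adja_like_key))
# ['xylophone', 'ice', 'rugs', 'rugby']
```

Under the tailored rules the g+b in "rugby" is read as the single grapheme gb,
and x outranks i, so the English words come out in an order no English reader
would expect. No single key function can satisfy both conventions.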

It is simply impractical to have Unicode live up to the expectations of
being able to do what you're asking in general: make it possible to mix
words from different languages and have a single collating spec sort them
so that they are correct for the different conventions of the different
languages all at once. Even if we added hundreds of new characters for
these kinds of needs, then we'd have a real mess because data would get
encoded in all kinds of ways. (Just because there is a character for your
special digraph, that doesn't mean that people won't generate data in which
that grapheme is encoded as a character sequence, as has been mentioned
happens for ij in Dutch). Again, the fact that your situation involves a
digraph is only coincidental, and doesn't change the principle.

Nevertheless the fact that you've got a digraph may also allow you to get
away with what you're wanting to do in the specific case of Maltese, if
you're willing to use a ZERO WIDTH GRAPHEME JOINER (and assuming it gets
accepted).
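
As a rough sketch of how that might work: the Python below assumes the joiner
is placed between i and e wherever the Maltese digraph is intended, and that
the collation treats only the marked sequence as the letter 'ie'. Since the
character is only proposed, JOINER here is a private-use stand-in, and the
alphabet is again simplified to ASCII; all names are hypothetical.

```python
JOINER = "\uE000"  # private-use placeholder, NOT an assigned joiner character
MARKED_IE = "i" + JOINER + "e"

# Simplified order: ASCII letters with the digraph 'ie' right after 'i'.
ORDER = "a b c d e f g h i ie j k l m n o p q r s t u v w x y z".split()
WEIGHT = {g: rank for rank, g in enumerate(ORDER)}


def key_with_joiner(word):
    """Only an explicitly marked i+JOINER+e counts as the letter 'ie'."""
    w = word.lower()
    key = []
    i = 0
    while i < len(w):
        if w.startswith(MARKED_IE, i):
            key.append(WEIGHT["ie"])
            i += len(MARKED_IE)
        elif w[i] in WEIGHT:
            key.append(WEIGHT[w[i]])
            i += 1
        else:
            i += 1  # assumption: hyphens and stray marks are ignored
    return key


def show(words):
    """Strip the placeholder so the result is readable."""
    return [w.replace(JOINER, "") for w in words]


# Maltese words carry the marker; the English borrowings do not.
marked = ["l" + MARKED_IE + "ma", "frigate", MARKED_IE + "hor",
          "friend", "id-dar"]
print(show(sorted(marked, key=key_with_joiner)))
# ['friend', 'frigate', 'id-dar', 'iehor', 'liema']

# Treating every "ie" as the digraph reproduces the order complained about.
naive = [w.replace(JOINER, "").replace("ie", MARKED_IE) for w in marked]
print(show(sorted(naive, key=key_with_joiner)))
# ['frigate', 'friend', 'id-dar', 'iehor', 'liema']
```

This only works because the data itself says which reading is meant; it does
nothing for differences in letter order such as the l-before-f case.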

In general, the best solution is: if a language has borrowings from another
language, they take on the conventions of the receptor language. In most
cases, people are likely to end up doing that anyway.

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>
