Re: Addition of remaining two Maltese Characters to Unicode

From: Mark Davis (markdavis@ispchannel.com)
Date: Tue Aug 01 2000 - 13:32:05 EDT


We do not currently have a character that would serve the purpose being discussed.

The functions of the ZWNBSP and ZWSP are to forbid/allow linebreak, which is
orthogonal to the issue of whether two characters form a grapheme. Although
graphemes shouldn't linebreak, not every pair of letters that disallows a
linebreak forms a grapheme. Moreover, ZWSP is completely unsuited to breaking
graphemes, since it will allow a linebreak in places where one should not occur,
such as between "fri" and "end".

The functions of the ZWNJ and ZWJ are to forbid/encourage cursive connection
(including ligation, as per recent UTC decision). This is also orthogonal to the
issue of whether two letters form a grapheme. Letters that form a grapheme may or
may not have a cursive connection; letters that have a cursive connection may or
may not form a grapheme.
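
For reference, here is a small Python sketch that lists the four characters
discussed above with their code points and formal names. The abbreviations are
the ones used in this message; the CONTROLS table and the describe() helper are
just illustrative names, not part of any standard API.

```python
import unicodedata

# The four zero-width format characters discussed above.
CONTROLS = {
    "ZWSP": "\u200B",    # allows a linebreak
    "ZWNBSP": "\uFEFF",  # forbids a linebreak
    "ZWNJ": "\u200C",    # forbids cursive connection
    "ZWJ": "\u200D",     # encourages cursive connection
}

def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for abbr, ch in CONTROLS.items():
    print(abbr, describe(ch))
```

None of these has a visible glyph, so "fri" + ZWSP + "end" has seven code points
but normally displays the same as the six-letter "friend".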

There is a proposal under discussion to add two new characters for controlling
grapheme production, at http://www.macchiato.com/unicode/GraphemeJoiner.html.

That being said, I am extremely wary of the purported general requirement to sort
sequences of letters differently based on the origin of the word, for the same
reasons expressed earlier on this thread. Even if the grapheme join/break or
digraphs were encoded, having the sort order depend on features that are completely
invisible is very dangerous. Maltese dictionaries may use such ordering (an
interesting question is whether they all do or not), but dictionaries possess a lot
more information about words (such as word-origin) than you would want to maintain
in a general sort, for file names, customer names, etc.

There are valid reasons to distinguish graphemes in specialized databases, such as
those for linguistic research. In such circumstances there is a lot more control
over the content of the text. For such purposes, it would probably be better to
use the grapheme break for marking the foreign words like "friend" in Maltese,
and not sprinkle the grapheme join throughout all domestic words; probably far
fewer words would be affected.

Since GB and GJ are not yet encoded (and may never be -- it is only a proposal at
this stage) one could use PUA characters for internal purposes in the meantime.
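
A rough sketch of what that might look like for internal use, in Python. The
choice of U+E000 as the stand-in for GB is arbitrary, the alphabet order is my
assumption of the usual Maltese order and should be checked against an
authoritative source, and sort_key() is a hypothetical helper, not a real
collation API. It treats "ie" and "għ" as single letters unless the PUA
character sits between the two halves.

```python
# Hypothetical private-use stand-in for the proposed grapheme break (GB).
GB = "\uE000"

# Assumed Maltese alphabet order, with the digraphs "għ" and "ie" as letters.
ALPHABET = ["a", "b", "ċ", "d", "e", "f", "ġ", "g", "għ", "h", "ħ", "i", "ie",
            "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v",
            "w", "x", "ż", "z"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}
DIGRAPHS = ("għ", "ie")

def sort_key(word):
    """Return a list of letter ranks; GB suppresses digraph formation."""
    text = word.lower()
    key = []
    i = 0
    while i < len(text):
        if text[i] == GB:
            i += 1  # GB itself carries no weight
            continue
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            key.append(RANK[pair])  # digraph sorts as one letter
            i += 2
        else:
            # Characters outside the alphabet sort after all letters.
            key.append(RANK.get(text[i], len(ALPHABET)))
            i += 1
    return key

marked = "fri" + GB + "end"   # foreign word: "i" and "e" kept separate
unmarked = "friend"           # would be read as f, r, ie, n, d
print(sorted(["frigate", marked], key=sort_key))
print(sorted(["frigate", unmarked], key=sort_key))
```

With the mark, "friend" sorts before "frigate"; without it, the "ie" is taken as
one letter and the word sorts after "frigate". The two spellings look the same
on screen, which is exactly the hazard with invisible sort controls described
above.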

Mark

John Cowan wrote:

> Peter_Constable@sil.org wrote:
>
> > But do "friend" and "frigate" appear in Maltese dictionaries? Is it
> > reasonable to expect that a single collating spec can correctly order words
> > that follow two different collating conventions all at once?
>
> My understanding of the post was that "friend" *is* a Maltese word
> (in the same sense that "résumé" is an English word), but that it does
> not contain the Maltese letter "ie". Therefore, there needs to be
> a way to know when "ie" is to collate as a single letter and when
> it is not to do so.
>
> > Suppose it were the case that Maltese alphabetic order put the letter l
> > before f.
>
> I have a recollection of seeing a list of Chinese words written in pinyin
> but alphabetized according to bopomofo rules. Is this commonplace?
>
> > As a result, if we
> > had a mixture of English (or French - more likely for Benin) and Adja
> > words, we couldn't sort them using a single set of rules and have them come
> > out correct for both languages at once.
>
> To be sure. The trouble arises when the digraph sometimes sorts one way
> and sometimes another. IIRC, in Danish "aa" sorts as "å" when it is an
> archaic rendering of it, as in "Aarhus", but as "aa" when it is a borrowing,
> as in "aardvark". That's the same case as Maltese, no? What is commonly
> done for Danish sorting?
>
> > In general, the best solution is: if a language has borrowings from another
> > language, they take on the conventions of the receptor language.
>
> A good principle, but it doesn't always work in the Real World.
>
> Summary: Of course I agree with you that adding characters is not the answer,
> and that tailored collation sequences mostly are --- but some languages
> may have collation rules that look internally inconsistent when represented
> in Unicode, and may require tricks with ZWNBSP, which I think is the Right
> Thing in this case.
>
> --
>
> Schlingt dreifach einen Kreis um dies! || John Cowan <jcowan@reutershealth.com>
> Schliesst euer Aug vor heiliger Schau, || http://www.reutershealth.com
> Denn er genoss vom Honig-Tau, || http://www.ccil.org/~cowan
> Und trank die Milch vom Paradies. -- Coleridge (tr. Politzer)


