re: Ordering of scripts in DUCET?

From: verdy_p (verdy_p@wanadoo.fr)
Date: Wed Dec 03 2008 - 21:12:08 CST


"Kenneth Whistler" <kenw@sybase.com> wrote:
> Philippe noted:
>
> > There's at least a document structure by type of entries:
> > - ignored
> > - ignorable
> > - diacritics
> > - symbols and punctuation
> > - numbers
> > - letters and alphabets.
>
> And that is certainly the case.
>
> > But I've not seen a clear statement about the order of letters
> > (or of numbers) according to the script to which
> > they belong. From what I've seen, it looks like scripts are
> > ordered by the first Unicode/ISO 10646 character block
> > in which they appear for the first time, and so you could
> > "predict" the layout for future scripts as being more or
> > like what is displayed as a preview in the Unicode Road Map.
>
> That really isn't the case, as I indicated in my response.
>
> >
> > But there may exist other reasons why this order would not be kept:
> > it is more important to keep the DUCET with
> > their scripts ordered in a way that is consistent with at least
> > one of the major languages that use this script. So
> > the effective order is based on what is expected for collating
> > this primary language, in order to minimize the
> > number of tailoring rules needed for supporting that language
> > in its primary collation order (that other languages
> > will simply borrow by default, simply because they don't regulate
> > these other scripts).
>
> But I think at this point, Philippe has moved on to a different
> issue, irrelevant to the question that Henry had asked. What
> Philippe is talking about here is the primary order of characters
> *within* a script in the table.

Wrong guess. I was in fact referring to the stability of collation of ALL characters (or sequences),
independently of the script to which they may belong.

Yes, it was RELEVANT to the question, because it covers not only the stability within the same script but
*also* the stability of the relative ordering of scripts within DUCET (even if, for most applications, it does not
matter much).

Why? Simply because some countries use multiple scripts, sometimes for the same language, and they need to
order things like personal names, toponyms, book titles, and so on that are written in distinct scripts. It may even
happen that some strings that need ordering also use characters from several scripts at the same time, sometimes in
the same "word" (Japanese is a good example).

On the other hand, the need for stability of collation for multi-script texts is mitigated by the fact that some
of the locales using them would probably need to tailor their collation explicitly anyway, in order to set
the relative ordering of scripts (and possibly to "merge" several scripts within their collation order). Maybe some
Japanese applications do this to merge Hiragana and Katakana into the same sequence. This could also happen
in India when merging the collation orders of the various Brahmic scripts, so that a Ka in Hindi will sort
identically with all letters Ka in Devanagari, Gurmukhi, and so on (ordered according to the ISCII standard). For
such cases a collation tailoring may be required and, if this happens, the stability (or not) of the DUCET default
order of scripts will not matter.
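
To make the idea of "merging" two scripts concrete, here is a toy sketch in Python. It is not the UCA algorithm and it does not read DUCET: it simply assumes that Katakana letters U+30A1..U+30F6 can be folded onto Hiragana by subtracting 0x60, and it uses the original string only as a tie-breaker. The names fold_kana and merged_kana_key are made up for this illustration.

```python
# Toy illustration only: this is NOT the UCA and not a real DUCET tailoring.
# Assumption: Katakana U+30A1..U+30F6 maps onto Hiragana by subtracting 0x60.
KATAKANA_START = 0x30A1  # KATAKANA LETTER SMALL A
KATAKANA_END = 0x30F6    # KATAKANA LETTER SMALL KE
KANA_OFFSET = 0x60       # distance between the two kana ranges


def fold_kana(text: str) -> str:
    """Replace each Katakana letter by the corresponding Hiragana letter."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_START <= ord(ch) <= KATAKANA_END
        else ch
        for ch in text
    )


def merged_kana_key(text: str) -> tuple:
    """Hypothetical sort key: folded form first, original string as tie-breaker."""
    return (fold_kana(text), text)


words = ["カキ", "あい", "かさ", "アイ"]

# Plain code point order keeps the two scripts in separate groups.
sorted_default = sorted(words)

# With the merged key, both scripts interleave in a single sequence.
sorted_merged = sorted(words, key=merged_kana_key)

print(sorted_default)
print(sorted_merged)
```

With such a key the relative position of the Hiragana and Katakana blocks no longer affects the result, except as a tie-breaker between otherwise equal strings, which is the point made above about tailored locales.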


