RE: Ordering of scripts in DUCET?

From: verdy_p (verdy_p@wanadoo.fr)
Date: Wed Dec 03 2008 - 21:28:45 CST


> From: "Harold S. Henry" <harold@talerian.com>
> To: unicode@unicode.org
> Cc:
> Subject: RE: Ordering of scripts in DUCET?
>
>
> Thank you, Mr. Kenneth Whistler and M. Philippe Verdy, for your extremely
> prompt and very informative answers to my question!
>
> My curiosity was prompted by the desire to reproduce the default sort order
> represented by DUCET using sort keys based on script, so that the sort keys
> would continue to be usable under future versions of Unicode. I'm
> contemplating a distributed data environment in which it will be difficult
> to update indices consistently and reliably as Unicode continues to develop.

Note that the absolute values of the collation weights in the DUCET are NOT stable. They are renumbered as needed
when new characters are added. What is, or should be, (approximately) stable is the relative order of scripts (even
if there's no policy for determining how they were initially positioned in the DUCET) and the relative order of
characters within each script.
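
To make that concrete, here is a minimal Python sketch. The weights are invented for illustration (they are not real
DUCET values), and the names weights_v1, weights_v2 and sort_key are hypothetical. It only shows that renumbering
keeps the resulting order while changing the keys themselves:

```python
# Hypothetical primary weights under two versions of a collation table.
# The numbers are invented for illustration; they are NOT real DUCET values.
weights_v1 = {"a": 10, "b": 12, "α": 20, "β": 22}
# A later version adds "ä" after "a" and renumbers everything that follows.
weights_v2 = {"a": 10, "ä": 12, "b": 14, "α": 22, "β": 24}


def sort_key(word, weights):
    """Build a simplified (primary-level only) sort key as a tuple of weights."""
    return tuple(weights[ch] for ch in word)


words = ["βα", "ab", "ba", "αβ"]
order_v1 = sorted(words, key=lambda w: sort_key(w, weights_v1))
order_v2 = sorted(words, key=lambda w: sort_key(w, weights_v2))

# The relative order is the same under both versions...
print(order_v1 == order_v2)
# ...but the stored keys differ, so keys from different versions cannot be mixed.
print(sort_key("ba", weights_v1), sort_key("ba", weights_v2))
# Mixing them can even make distinct strings collide ("b" in v1, "ä" in v2).
print(sort_key("b", weights_v1) == sort_key("ä", weights_v2))
```

An index that stores such keys would therefore have to be rebuilt whenever the table is renumbered, even though the
visible sort order of the existing strings has not changed.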

Also, the absolute values are not meant to be universal: when tailoring is used, the gaps between the numbers of the
DUCET (whose default is to keep one position free between two successive numbers) may need to be changed.

* A UCA implementation may increase the gaps simply by multiplying the values from the DUCET by a constant; for
example you can multiply the values by 2 to set the default gaps to 4 instead of 2, creating 3 free positions
between two used positions in the DUCET, so that you can reposition more collation sequences between them.

* A UCA implementation that wants to reorder some scripts may need to "slide" all primary collation key elements
within a segment, in order to reposition a script in the middle of the sequence.

* The DUCET also provides some labelled positions that allow "sliding" all scripts at the same time to another
absolute position (one that separates, for example, numbers and letters), just in order to move one script (or
several) to the initial position, without having to know which letter in which script is the first used position
that needs to be moved (by a simple addition of a constant).

* The computed collation keys are then NOT stable and not portable. They are not meant to be transported from one
system to another, even if they are based on the same version of the DUCET and even if they are not tailored at
all. The UCA and DUCET are not meant to be handled like a character encoding standard. All precomputed collation
keys are for local use only. The only safe way to transport text from one system to another is to transport it
encoded as sequences of Unicode code points with some character encoding, and then to compute the collation keys on
the receiving system, according to its local implementation of the UCA and DUCET.
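
The gap-widening and sliding operations in the list above can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the weights are invented (not real DUCET values), only the primary level is
modelled, and the helper names widen_gaps and slide are hypothetical, not part of any UCA library.

```python
# Hypothetical primary weights with the default gap of 2
# (one free position between two successive used positions).
base = {"1": 2, "2": 4, "a": 6, "b": 8, "α": 10, "β": 12}


def widen_gaps(weights, factor):
    """Multiply every weight by a constant, which widens every gap."""
    return {ch: w * factor for ch, w in weights.items()}


def slide(weights, chars, offset):
    """Add a constant to the weights of the given characters only."""
    return {ch: (w + offset if ch in chars else w) for ch, w in weights.items()}


# Doubling turns gaps of 2 into gaps of 4: three free positions per gap.
wide = widen_gaps(base, 2)
tailored = dict(wide)
tailored["ä"] = wide["a"] + 1  # a tailoring can now place "ä" between "a" and "b"

# Sliding the Latin letters past the Greek ones puts Greek first among letters.
greek_first = slide(base, {"a", "b"}, 8)

print(sorted(tailored, key=tailored.get))
print(sorted(greek_first, key=greek_first.get))
```

Both transformations change every absolute value they touch while keeping the order inside each script, which is
exactly why the resulting keys are only meaningful to the implementation that produced them.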



This archive was generated by hypermail 2.1.5 : Fri Jan 02 2009 - 15:33:07 CST