L2/05-301
Date: October 13, 2005
Title: UCA Default Table Criteria for New Characters
Source: Ken Whistler
Action: For discussion by the UTC

Background

I had an action item (100-A077) to draft a proposal for criteria for the default UCA table, based on the discussion that was raised about L2/04-277 -- which covered a lot of issues for collation.

Some of the issues for the default UCA table (DUCET) have been addressed separately in the proposal regarding change management for the UCA. That covered issues of stability regarding the existing data in the table, and the process for tracking changes proposed for the table.

There is another bit of unfinished business, however, and that has to do with the criteria which the UTC might want to apply in deciding how to create *initial* orderings for the large collections of new characters added to DUCET at each minor or major version of the Unicode Standard.

Document L2/04-277 proposed some criteria, but they were somewhat hard to extract in an actionable way for application to new collections of characters, because they mixed in issues regarding stability of the existing weights and change management issues.

In this document, I have extracted a few ideas from L2/04-277 and extended them to make a much more explicit set of proposed guidelines for how to establish initial ordering for new collections added to DUCET.

************************************************************

Criteria for Ordering New Scripts

1. When a new script is added to the standard, the establishment of its primary ordering should, as much as possible, be based on information provided with the Summary Proposal Form and other supporting documents for the proposed encoding.

2. Failing that, or given ambiguity in the proposal documentation, primary ordering should be based on whatever lexicographical evidence can be gathered for the language that is the best documented, in the most widespread use for that script, or both.

3. If a script is in multilingual use and has character extensions provided for specific languages, then following the choice of primary order for the first language (by criterion #2), weights for character extensions should be interpolated so as to get the ordering for other languages (if known) as nearly right as possible without requiring tailoring.

4. If characters with accents are included, then the accents should be given secondary weights unless overriding concerns based on established practice for primary letter weighting dictate otherwise.

5. If characters with distinctions comparable to case are included, then the case (or presentation form) differences should be given tertiary weights unless overriding concerns based on established practice for ordering dictate otherwise.

6. Weighting for digits, symbols, and punctuation in a new script should, as much as possible, follow the established patterns in the DUCET for other scripts, so as not to introduce idiosyncratic treatments of such characters on a script-by-script basis.

7. In some instances, particularly for historic scripts, there may be no established native lexicographical order, or none documented well enough to be usable. In such cases, a primary order based simply on code point order in the charts or, alternatively, based on a well-known academic catalog order for the characters, may be an acceptable alternative for placing the characters in the DUCET.

8. The impact on the overall size and complexity of the DUCET also needs to be considered when adding weights for a new script. Particularly complex approaches to the specific weighting for a new script should be eschewed if they would have a significant impact on the table's use for all other scripts and languages, even if that approach might produce a marginally better default ordering for the new script.

****************************************************************

Criteria for Addition of Small Numbers of Characters to Existing Collections

1. As much as possible, when adding additional characters to scripts (or other collections) already in the DUCET -- as, for example, adding small numbers of additional Latin, Cyrillic, or Arabic characters -- weights for such characters should be interpolated in the table following the *predominant* principles of ordering already established in the table for that script. This is to minimize the chances that such characters will simply get lost in the table by being ordered in some haphazard, ad hoc manner for the script. (Thus if a z-like character with some overlay diacritic is added to the table, it should be weighted as much as possible like other z-like characters with diacritics.)

2. In most instances, characters added after the fact for a script, in support of some small, minority language use or specialized orthography, will be added in full knowledge that a tailoring of the DUCET will be necessary in order to support ordering for that language or specialized orthography. However, in certain, limited cases, it may be appropriate to attempt to place such an additional character in a primary order other than the one that would be chosen by principle #1, if it is known that the character is used *only* for that language or specialized orthography. Such exceptions should, however, be just that: exceptions.

3. When additional characters have formal decomposition mappings in the standard, their ordering weight should simply be derived automatically from the decomposition, unless there is a clear, overriding reason to do otherwise. This is because overriding the decomposition, in all cases, marginally complicates the process of regenerating the DUCET, may often introduce unanticipated edge cases or interactions with other weights, and is seldom sufficient to produce a "perfect" ordering.

4. Additional sets of punctuation or other symbols that fall into clear classes that have been grouped together in the DUCET should be grouped, as much as possible, with like characters already present in the DUCET. Thus if a new quotation mark of some sort is added, it should be grouped with the existing batch of quotation marks in the table. This eases maintenance and will make sense for some kinds of ordering, even though for most lexicographical sorting, punctuation and such symbols are basically ignored.

5. Other symbols should simply default to getting weights based on the code point order, along with the existing collection of otherwise unclassified symbols.
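
As an illustration only, the following sketch shows how the accent, case, and decomposition criteria above might interact: accents receive secondary weights, case receives tertiary weights, and a precomposed character takes its weights from its canonical decomposition. The tables PRIMARY and SECONDARY, the helper names, and all numeric weights are hypothetical and are not taken from the DUCET; the level-by-level key is a simplified stand-in for a full UCA sort key.

```python
import unicodedata

# Hypothetical toy weights for illustration; these are NOT DUCET values.
PRIMARY = {"a": 1, "e": 2, "z": 3}        # base letters
SECONDARY = {"\u0301": 2, "\u0308": 3}    # combining acute, combining diaeresis


def collation_elements(text):
    """Return (primary, secondary, tertiary) triples derived from the NFD form."""
    elements = []
    for ch in unicodedata.normalize("NFD", text):
        if ch in SECONDARY:
            # Accent: ignorable at the primary level, distinct at the secondary level.
            elements.append((0, SECONDARY[ch], 1))
        else:
            base = ch.lower()
            # Case difference is recorded only at the tertiary level.
            tertiary = 2 if ch != base else 1
            elements.append((PRIMARY[base], 1, tertiary))
    return elements


def sort_key(text):
    """Build a key that compares all primaries, then secondaries, then tertiaries."""
    ces = collation_elements(text)
    return tuple(
        tuple(ce[level] for ce in ces if ce[level] != 0)  # zero weights are ignorable
        for level in range(3)
    )


words = ["z\u00e9", "ze", "Ze", "za", "z\u00e4"]
ordered = sorted(words, key=sort_key)
```

With these toy weights, the precomposed U+00E9 and the sequence "e" followed by U+0301 produce the same key, so no separate table entry is needed for the precomposed form. An accent difference also outranks a case difference, because the secondary level is compared before the tertiary level.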