L2/05-301


Date: October 13, 2005

Title: UCA Default Table Criteria for New Characters

Source: Ken Whistler

Action: For discussion by the UTC


Background

I had an action item (100-A077) to draft a proposal for
criteria for the default UCA table, based on the discussion
that was raised about L2/04-277 -- which covered a lot of
issues for collation.

Some of the issues for the default UCA table (DUCET) have been
separately addressed in the proposal regarding change
management for the UCA. That proposal covered the stability
of the existing data in the table and the process for
tracking changes proposed for the table.

There is another bit of unfinished business, however, and
that has to do with the criteria which the UTC might want
to apply in deciding how to create *initial* orderings for
the large collections of new characters added to DUCET
at each minor or major version of the Unicode Standard.

Document L2/04-277 proposed some criteria, but they were
somewhat hard to extract in an actionable way for application
to new collections of characters, because they mixed in
issues regarding stability of the existing weights and
change management issues.

In this document, I have extracted a few ideas from
L2/04-277 and extended them to make a much more explicit
set of proposed guidelines for how to establish initial
ordering for new collections added to DUCET.

************************************************************

Criteria for Ordering New Scripts

1. When a new script is added to the standard, the establishment
of its primary ordering should, as much as possible, be based
on information provided with the Summary Proposal Form and
other supporting documents for the proposed encoding.

2. Failing that, or given ambiguity in the proposal documentation,
primary ordering should be based on whatever lexicographical
evidence can be gathered for the language that is best
documented or in most widespread use for that script.

3. If a script is in multilingual use and has character extensions
provided for specific languages, then following the choice of
primary order for the first language (by criterion #2), weights
for character extensions should be interpolated so as to get
the ordering for other languages (if known) as nearly right as
possible without requiring tailoring.

4. If characters with accents are included, then the accents should
be given secondary weights unless overriding concerns based on
established practice for primary letter weighting dictate otherwise.

5. If characters with distinctions comparable to case are included,
then the case (or presentation form) differences should be given
tertiary weights unless overriding concerns based on
established practice for ordering dictate otherwise.

6. Weighting for digits, symbols, and punctuation in a new script
should, as much as possible, follow the established patterns in
the DUCET for other scripts, so as not to introduce idiosyncratic
treatments of such characters on a script-by-script basis.

7. In some instances, particularly for historic scripts, there
may be no established native lexicographical order, or none
documented well enough to be usable. In such cases, a primary
order based simply on code point order in the charts or, alternatively,
based on a well-known academic catalog order for the characters,
may be an acceptable alternative for placing the characters in
the DUCET.

8. The impact on the overall size and complexity of the DUCET
also needs to be considered when adding weights for a new script.
Particularly complex approaches to the specific weighting for
a new script should be eschewed if they would have a significant
impact on the table's use for all other scripts and languages,
even if that approach might produce a marginally better default
ordering for the new script.
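
As an illustration of criteria 4 and 5 above, the following Python
sketch shows how giving accents secondary weights and case tertiary
weights plays out when sort keys are compared. The table TOY_WEIGHTS,
its weight values, and the function sort_key are all hypothetical;
they are not actual DUCET data, and the key formation is simplified
(no contractions, expansions, or variable weighting).

```python
import unicodedata

# Hypothetical toy weights as (primary, secondary, tertiary).
# These values are invented for illustration; they are NOT DUCET values.
TOY_WEIGHTS = {
    "a": (0x10, 0x20, 0x02),
    "A": (0x10, 0x20, 0x08),       # same primary as "a"; case is tertiary
    "b": (0x11, 0x20, 0x02),
    "B": (0x11, 0x20, 0x08),
    "\u0301": (0x00, 0x24, 0x02),  # COMBINING ACUTE ACCENT: no primary weight
}


def sort_key(s):
    """Build a simplified three-level sort key for s.

    Raises KeyError for any character missing from TOY_WEIGHTS.
    """
    elements = [TOY_WEIGHTS[ch] for ch in unicodedata.normalize("NFD", s)]
    key = []
    for level in range(3):
        # Collect the non-zero weights of this level, then a separator.
        key.extend(w[level] for w in elements if w[level] != 0)
        key.append(0)
    return tuple(key)
```

With these assumed weights, sorting "b", "á", "A", "a" by sort_key
puts all three a-like strings before "b" (a primary difference), with
the accented form after the unaccented ones (secondary) and "a"
before "A" (tertiary).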

****************************************************************

Criteria for Addition of Small Numbers of Characters to Existing
Collections

1. As much as possible, when adding additional characters to
scripts (or other collections) already in the DUCET -- as,
for example, adding small numbers of additional Latin, Cyrillic,
or Arabic characters -- weights for such characters should be
interpolated in the table following the *predominant* principles
of ordering already established in the table for that script.
This is to minimize the chances that such characters will simply
get lost in the table by being ordered in some haphazard, ad
hoc manner for the script. (Thus if a z-like character with
some overlay diacritic is added to the table, it should be
weighted as much as possible like other z-like characters with
diacritics.)

2. In most instances, characters added after the fact for a script,
in support of some small, minority language use or specialized
orthography, will be added in full knowledge that a tailoring
of the DUCET will be necessary in order to support ordering for
that language or specialized orthography. However, in certain,
limited cases, it may be appropriate to attempt to place such
an additional character in a primary order other than the one
that would be chosen by criterion #1, if it is known that the
character is used *only* for that language or specialized orthography.
Such exceptions should, however, be just that: exceptions.

3. When additional characters have formal decomposition mappings
in the standard, their ordering weight should simply be
derived automatically from the decomposition, unless there
is a clear, overriding reason to do otherwise. This is because
each override of the decomposition marginally complicates
the process of regenerating the DUCET, may introduce
unanticipated edge cases or interactions with other weights,
and is seldom sufficient to produce a "perfect" ordering.

4. Additional sets of punctuation or other symbols that fall
into clear classes that have been grouped together in the DUCET
should be grouped, as much as possible, with like characters
already present in the DUCET. Thus if a new quotation mark of
some sort is added, it should be grouped with the existing batch
of quotation marks in the table. This eases maintenance and will
make sense for some kinds of ordering, even though for most
lexicographical sorting, punctuation and such symbols are basically
ignored.

5. Other symbols should simply default to getting weights based
on the code point order, along with the existing collection of
otherwise unclassified symbols.
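
As a rough illustration of criterion 3 in this section, the Python
sketch below derives the collation elements for a precomposed
character from its canonical decomposition. The table BASE_WEIGHTS,
its values, and the function derived_elements are hypothetical and
are not actual DUCET data. The sketch handles only canonical
decompositions; compatibility decompositions, which would also need
tertiary weight adjustments, are outside its scope.

```python
import unicodedata

# Hypothetical toy weights as (primary, secondary, tertiary).
# These values are invented for illustration; they are NOT DUCET values.
BASE_WEIGHTS = {
    "z": (0x30, 0x20, 0x02),
    "\u030C": (0x00, 0x28, 0x02),  # COMBINING CARON: no primary weight
}


def derived_elements(ch, table=BASE_WEIGHTS):
    """Return the collation elements for ch.

    If ch has an explicit entry, use it. Otherwise derive the elements
    from the canonical decomposition of ch, one element per decomposed
    character. Raises KeyError if a needed entry is missing.
    """
    if ch in table:
        return [table[ch]]
    decomposed = unicodedata.normalize("NFD", ch)
    if decomposed == ch:
        raise KeyError(f"no weights and no decomposition for U+{ord(ch):04X}")
    return [table[c] for c in decomposed]
```

For example, under these assumptions U+017E LATIN SMALL LETTER Z WITH
CARON needs no table entry of its own: it receives the primary weight
of "z" followed by the secondary weight of the caron, so it lands with
the other z-like characters without any explicit override.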
