L2/06-086

From: Mark Davis
Date: 2006-03-17
Subject: Preferred ordering of marks

Please add the following as a document and on the agenda.

Peter Constable recently proposed documenting ordering information for 
Thai, i.e., "BelowMarks* AboveMarks*". That is the mechanism we currently 
use whenever the customary typing order differs from the NFC 
ordering: a BNF that indicates the ordering.

But this raises a broader issue. We have been accumulating bits of 
documentation in various places about what the preferred ordering is for 
a given script. But having it in documentation *alone* means that 
inevitably programmers will overlook it in their implementations.

Here is a strawman proposal for post-5.0:

Define a new numeric property called 'Preferred_Ordering'. This assigns 
a number, similar to the canonical combining class (CCC), to each 
Unicode character. The preferred order for a Unicode string is found by 
applying the canonical ordering algorithm but using this property 
instead of CCC. The goal is to match the most common typing order in 
complex scripts, for the sequences that are actually used in a given 
script. Unlike the CCC, this property would never be required to 
be stable; it could be adjusted as new information comes in.
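
As a rough illustration of the algorithm described above, here is a 
minimal Python sketch. The table PREFERRED_ORDERING and its values (10 
for below marks, 20 for above marks) are invented for this example; no 
such property or values exist in the UCD. As with CCC, a value of 0 is 
assumed to mark a character that is never reordered.

```python
# Hypothetical Preferred_Ordering values, invented for illustration only.
# Characters absent from the table are treated as 0 (never reordered).
PREFERRED_ORDERING = {
    "\u0E38": 10,  # THAI CHARACTER SARA U (below mark)
    "\u0E39": 10,  # THAI CHARACTER SARA UU (below mark)
    "\u0E48": 20,  # THAI CHARACTER MAI EK (above mark)
    "\u0E49": 20,  # THAI CHARACTER MAI THO (above mark)
}


def preferred_order(text, table=PREFERRED_ORDERING):
    """Apply canonical-ordering-style sorting, keyed on `table`, not CCC.

    Each maximal run of characters with a non-zero value is sorted
    stably by that value; characters with value 0 stay where they are.
    """
    out = []
    run = []  # current run of characters with non-zero values
    for ch in text:
        if table.get(ch, 0) == 0:
            # A zero-valued character ends the run: flush it in sorted order.
            out.extend(sorted(run, key=lambda c: table[c]))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    # Flush any run left at the end of the string.
    out.extend(sorted(run, key=lambda c: table[c]))
    return "".join(out)


# KO KAI + MAI EK + SARA U is reordered to KO KAI + SARA U + MAI EK.
reordered = preferred_order("\u0E01\u0E48\u0E38")
```

Because Python's sorted() is stable, marks with equal values keep their 
original relative order, which mirrors how canonical ordering treats 
characters with equal CCC.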

ISSUES

1. Is this algorithm adequate? If a script had rules in which a class 
repeats, as <vowel> does in the following BNF, a single number per 
character would not suffice and one would need a more complicated 
algorithm.

<consonant> <vowel>* <tone_type1>* <vowel>* <tone_type2>*

2. Is the preferred typing order dependent on language? For 
example, if the Thai preferred ordering is as above, but a minority 
language using the Thai script had the reverse, then the ordering would 
be language-dependent. If so, then the information would more 
properly belong in CLDR than in the UCD.
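
To make issue 1 concrete, the following Python sketch compares the BNF 
with a single-number sort. The one-letter class codes (C, V, 1, 2) and 
the numbers in VALUE are hypothetical, chosen only for this example. It 
also assumes the two <vowel>* positions in the BNF are meant to be 
distinct slots.

```python
import re

# Hypothetical class codes: C = consonant, V = vowel,
# 1 = tone_type1, 2 = tone_type2.
# The regex mirrors: <consonant> <vowel>* <tone_type1>* <vowel>* <tone_type2>*
PATTERN = re.compile(r"CV*1*V*2*")

# Hypothetical per-class numbers for a single-number sort.
VALUE = {"C": 0, "V": 10, "1": 20, "2": 30}


def matches_bnf(classes):
    """Return True if the whole class string is accepted by the BNF."""
    return PATTERN.fullmatch(classes) is not None


def sort_by_value(classes):
    """Stably sort the marks after the leading consonant by VALUE.

    Assumes `classes` is one cluster that starts with its consonant.
    """
    head, marks = classes[0], classes[1:]
    return head + "".join(sorted(marks, key=VALUE.__getitem__))


sample = "C1V2"                    # tone_type1, then a vowel, then tone_type2
accepted = matches_bnf(sample)     # the BNF accepts this sequence as written
resorted = sort_by_value(sample)   # the sort moves the vowel before the tone
```

Here the BNF already accepts "C1V2", yet the sort rewrites it to "CV12". 
With one number per character, the vowel cannot be kept in the second 
slot, so a sort of this kind alters sequences that the grammar treats 
as correctly ordered.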