L2/99-314

US comments to the ballot of the 3rd FCD 14651 – International string ordering

October 10, 1999

The US votes NO on the 3rd FCD 14651, but will gladly change its vote to YES if the comments below are accommodated.

1. Technical Comments

p. 1, NOTE 2. This note references the Unicode Standard Version 2.1, but

the appropriate reference occurs neither in the Normative References

nor in the Bibliography. We suggest that the appropriate reference for

the Unicode Standard, Version 2.1, be added to the Bibliography.

p. 4, definition 4.16. This definition is incomplete in the text and must

be fixed.

p. 5, NOTE 1. This note refers to Unicode normalization, but the appropriate

reference occurs neither in the Normative References nor in the

Bibliography. We suggest that the appropriate reference for

Unicode Technical Report #15, Unicode Normalization, be added to

the Bibliography, and a more complete reference be added at this

note.

p. 9, BNF syntax. The "line_completion" tokens in the production rules

for order_start, order_end, reorder_section_after, reorder_after,

and reorder_end should be removed. They are redundant with the

line_completion token in the production rule for tailoring_line.

p. 14, NOTE. This note refers to the Unicode collation algorithm, but the

reference occurs neither in the Normative References nor in the

Bibliography. We suggest that the appropriate reference for

Unicode Technical Report #10, Unicode Collation Algorithm, be added to

the Bibliography, and a more complete reference be added at this

note.

1.1. Technical Changes to Annex A -- Common Template Table

1.1.1. Fixes for Thai

To match cultural expectations for a correct Thai sort, the

following changes should be made to the Thai entries in the

Common Template Table. Incidentally, these changes will put

the Common Template Table in synch with the principles explained

in Annex B.4.

a. The Thai vowel indicator U+0E47 THAI CHARACTER MAITAIKHU

should be treated exactly like the Thai tone marks, rather than

being given a primary weight as for other Thai vowels. This implies

that:

i. collating symbol <D0E47> for THAI CHARACTER MAITAIKHU be

added just before the collating symbol <D0E46>.

ii. a weight entry for THAI CHARACTER MAITAIKHU be added:

<U0E47> IGNORE;<D0E47>;<MIN>;<U0E47> just before <U0E46>.

iii. the current weight entry for THAI CHARACTER MAITAIKHU be

removed from the table.

b. U+0E33 THAI CHARACTER SARA AM and U+0EB3 LAO VOWEL SIGN AM should

be treated as units, rather than as combinations of the weights for

the NIKHAHIT and the vowel SARA AA. This implies that:

i. the current weight entry for THAI CHARACTER SARA AM be changed to

<U0E33> <SE20>;<BASE>;<MIN>;<U0E33> % THAI CHARACTER SARA AM

ii. the current weight entry for LAO VOWEL SIGN AM be changed to

<U0EB3> <SE4F>;<BASE>;<MIN>;<U0EB3> % LAO VOWEL SIGN AM

c. The change for MAITAIKHU impacts the autogenerated primary weight

symbols, so the table should be regenerated to correct the resulting

sequence of primary weight symbols.
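As a reading aid for the entries proposed in (a) and (b) above, the following Python sketch splits a weight entry into its four levels. It assumes a simplified form of the table syntax (one character symbol, whitespace, semicolon-separated levels, and an optional % comment). The name parse_weight_entry is hypothetical and is not part of 14651.

```python
import re

# A symbol is anything of the form <...>; this is a simplification of the table syntax.
SYMBOL = re.compile(r"<[^<>]+>")


def parse_weight_entry(line):
    """Split one weight entry into (character, levels, comment).

    Each level is a list of symbols; a level given as IGNORE becomes an empty list.
    """
    body, _, comment = line.partition("%")
    parts = body.split(None, 1)
    if len(parts) != 2:
        raise ValueError("not a weight entry: %r" % line)
    character, weights = parts
    levels = []
    for level in weights.strip().split(";"):
        if level.strip() == "IGNORE":
            levels.append([])
        else:
            levels.append(SYMBOL.findall(level))
    return character, levels, comment.strip()


# Proposed MAITAIKHU entry: ignored at level 1, like the Thai tone marks.
maitaikhu = parse_weight_entry("<U0E47> IGNORE;<D0E47>;<MIN>;<U0E47>")

# Proposed SARA AM entry: a single primary weight rather than a combination.
sara_am = parse_weight_entry(
    "<U0E33> <SE20>;<BASE>;<MIN>;<U0E33> % THAI CHARACTER SARA AM"
)
```

In this representation the proposed MAITAIKHU entry has an empty first level, while the proposed SARA AM entry carries exactly one primary weight symbol.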

1.1.2. Fixes for archaic Greek letter case

The third-level weights for several archaic Greek letters

that have no case pairs in the Unicode 2.1 repertoire were misassigned

to <MIN> instead of <CAP>. Those should be corrected. (Note that the

lowercase correspondents of those letters were added by 10646

Amendment 30, and will appear, appropriately weighted, in future revisions

to the 14651 Common Template Table, so the uppercase forms currently in

the table should be correctly weighted.)

Affected characters are:

<U03DC> GREEK LETTER DIGAMMA

<U03DA> GREEK LETTER STIGMA

<U03DE> GREEK LETTER KOPPA

<U03E0> GREEK LETTER SAMPI

1.1.3. Case fix for Palochka

As with the four Greek characters above, one Cyrillic character with no case pair

should have its third-level weight corrected from <MIN> to <CAP>:

<U04C0> CYRILLIC LETTER PALOCHKA
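The effect of the corrections in 1.1.2 and 1.1.3 can be sketched in Python as follows. The sketch assumes that a future lowercase letter will share the first- and second-level weights of its capital; the primary symbol <S_DIGAMMA> is a hypothetical placeholder, not a symbol from the table, and U+03DD is used here as the lowercase counterpart of U+03DC.

```python
def first_difference(key_a, key_b):
    """Return the first level (1-based) at which two four-level keys differ, or None."""
    for level, (a, b) in enumerate(zip(key_a, key_b), start=1):
        if a != b:
            return level
    return None


# Hypothetical primary weight symbol, assumed shared by both case forms.
PRIMARY = "<S_DIGAMMA>"

# Capital DIGAMMA as currently weighted in the table (third level <MIN>).
capital_as_balloted = ((PRIMARY,), ("<BASE>",), ("<MIN>",), ("<U03DC>",))

# Capital DIGAMMA with the requested correction (third level <CAP>).
capital_corrected = ((PRIMARY,), ("<BASE>",), ("<CAP>",), ("<U03DC>",))

# A future lowercase DIGAMMA, assumed to be weighted <MIN> at the third level.
small_future = ((PRIMARY,), ("<BASE>",), ("<MIN>",), ("<U03DD>",))

level_without_fix = first_difference(capital_as_balloted, small_future)
level_with_fix = first_difference(capital_corrected, small_future)
```

Under these assumptions, the uncorrected capital is distinguished from its lowercase form only at the fourth level, whereas the corrected entry distinguishes the two at the third level, as for other cased letters.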

1.1.4. Misuse of symbol <BLANK>

The following two lines at the end of the table:

<U4E00>..<U9FA5> <S4E00>..<S9FA5>;<BLANK>;<MIN>;<U4E00>..<U9FA5> % Han

% <UAC00>..<UD7A3> <SAC00>..<SD7A3>;<BLANK>;<MIN>;<UAC00>..<UD7A3> % Hangul

have an undefined symbol <BLANK> in them. That should be corrected to

use the symbol <BASE>, which is otherwise used in that position in the

table:

<U4E00>..<U9FA5> <S4E00>..<S9FA5>;<BASE>;<MIN>;<U4E00>..<U9FA5> % Han

% <UAC00>..<UD7A3> <SAC00>..<SD7A3>;<BASE>;<MIN>;<UAC00>..<UD7A3> % Hangul
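A check of the following kind would catch this class of error mechanically. This is a minimal Python sketch that assumes a simplified table layout: symbols are declared on collating-symbol lines, weight entries have the form "character weights", and fourth-level <Uxxxx> symbols need no declaration. The function name undefined_symbols and the sample tables are hypothetical.

```python
import re

SYMBOL = re.compile(r"<[^<>]+>")
# Code-point symbols such as <U4E00> are assumed not to need a declaration.
CODE_POINT = re.compile(r"<U[0-9A-F]{4,8}>")


def undefined_symbols(table_text):
    """Return weight symbols that are used in entries but never declared."""
    # Strip % comments and blank lines first.
    lines = [raw.split("%", 1)[0].strip() for raw in table_text.splitlines()]
    lines = [line for line in lines if line]

    # Pass 1: collect every symbol declared with collating-symbol.
    declared = set()
    for line in lines:
        if line.startswith("collating-symbol"):
            declared.update(SYMBOL.findall(line))

    # Pass 2: scan the weights of each entry for undeclared symbols.
    missing = []
    for line in lines:
        if line.startswith("collating-symbol"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        for symbol in SYMBOL.findall(parts[1]):
            if CODE_POINT.fullmatch(symbol):
                continue
            if symbol not in declared and symbol not in missing:
                missing.append(symbol)
    return missing


DECLARATIONS = """\
collating-symbol <BASE>
collating-symbol <MIN>
collating-symbol <S4E00>
"""

as_balloted = DECLARATIONS + "<U4E00> <S4E00>;<BLANK>;<MIN>;<U4E00> % Han\n"
corrected = DECLARATIONS + "<U4E00> <S4E00>;<BASE>;<MIN>;<U4E00> % Han\n"
```

With these sample inputs, the balloted form reports <BLANK> as undeclared and the corrected form reports nothing.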

1.2. Technical Issue, Annex B.5 Cyrillic

The U.S. would strongly object to the inclusion of the B.5 tailorings

for Cyrillic into the Common Template Table for the following

reasons:

1. To do so would very significantly complicate the autogeneration

of the Common Template Table, which will be a maintenance and

quality problem for future editions of 14651 that add more

characters.

2. Adding this material to the Common Template Table would

introduce baseform + combining mark weightings into the

CTT, something that is currently not required, but which

would significantly increase the complexity of implementations of the

table before tailorings. (That would be an additional

implementation penalty to be carried around by all implementations,

including those which are not primarily concerned with Cyrillic.)

3. The actual tailorings required for Russian are considerably

fewer than those indicated in Annex B.5. Common

Cyrillic requires only slightly more. Only a full tailoring

for all Cyrillic extensions requires addition of all

the information of Annex B.5.

Our preferred solution for this issue is to retain B.5 as an annex

describing Cyrillic tailoring, but to divide it up into three

parts, to show the Russian, the Common Cyrillic (i.e. Serbian,

Macedonian, Bulgarian, Byelo-Russian, Ukrainian) tailoring, and

the extended Cyrillic tailoring. This will make it clear that

the tailoring required for Russian, for example, is no more

formidable than the Canadian tailoring of Annex B.1.

1.3. Technical Issue, Annex E

The U.S. objects to the inclusion of this Annex, which is an

attempt to reinject a dependency between 14651 and PDTR 14652,

from which most of the text for Annex E derives.

The inappropriateness of the addition of this material here is

illustrated by the fact that it includes a number of editorial

and other errors that the U.S. committee has commented on in

the context of ballot comments on PDTR 14652. By replicating

that material into an Annex in 14651, those errors would need

to be corrected once again in this text, with allowances

for the edited down version of the text that appears in Annex E.

Furthermore, the suggestions made in Annex E change the

syntax of at least one keyword in ways incompatible with

that described in the normative BNF of Section 6.3 of 14651

(viz. order_start). This might be appropriate in PDTR 14652, but

is not appropriate in an informative annex to 14651 itself, since

it is more likely to confuse than to elucidate there.

This problem is not fixed simply by labelling Annex E

"informative". Annex E should be removed entirely, with the

focus being on the correction of its corresponding content in

PDTR 14652, rather than on trying once again to hitch 14652's

wagon to 14651.

If WG20 cannot reach consensus regarding the removal of

Annex E, the U.S. delegation will provide a long list of

suggested editorial changes to make its inclusion less

objectionable in the context of 14651.

2. Editorial Comments

p. iv, 2nd paragraph. result ==> resultant

p. 1, 2nd paragraph. "two characters strings" ==> "two strings"

p. 4, definition 4.8. remove extraneous "-" in definition

p. 4, section 5, first paragraph. "(followed by exact location of

syntax)" is apparently incomplete. This should presumably

constitute a reference to Amendment 9, which should then also

be included in the normative references for 14651.

p. 5, 1st paragraph. Remove extra quotation mark at end of the

paragraph.

p. 7, section 6.2.2.1. Correct the line break and style for this

section header.

p. 13, NOTE to I6. I1 and I2 should be corrected to I4 and I5,

respectively.

p. 15, NOTE. "too long comments" ==> "long line lengths"