Accumulated Feedback on PRI #223

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Fri Jul 6 19:19:36 CDT 2012
Contact: kenw@sybase.com
Name: Ken Whistler
Report Type: Public Review Issue
Opt Subject: PRI #223


UCA does not require specific behavior for when the algorithm 
encounters ill-formed data (e.g., isolated surrogates in UTF-16 
strings). A conformant implementation may, for example, throw 
an exception when it encounters ill-formed input. However, the 
conformance test data files include isolated surrogates in some 
of the test cases. In order to pass the conformance tests as 
written, an implementation *must* adopt a particular strategy 
and return particular values for ill-formed strings. (It can 
pass by weighting an isolated surrogate as it would an unassigned 
code point.) This anomaly should be documented in the test 
documentation, as a conformance test should not force a 
requirement on an implementation that the conformance 
requirements for the algorithm do not actually state. At least 
implementers using the conformance tests should be put on fair 
notice about this situation in the test data.
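The strategy mentioned above (weighting an isolated surrogate as an unassigned code point) can be sketched as follows. This is a minimal illustration of the UTS #10 implicit-weight fallback, not text from the report; the Han-specific bases are deliberately ignored.

```python
def implicit_weights(cp: int):
    """Sketch of the UCA implicit-weight fallback for a code point with
    no explicit mapping. Applying the same derivation to an isolated
    surrogate (e.g. U+D834) is the strategy that lets an implementation
    pass the conformance tests as written."""
    # 0xFBC0 is the UCA base for unassigned code points; Han characters
    # use different bases (0xFB40 / 0xFB80), ignored in this sketch.
    aaaa = 0xFBC0 + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    # Two collation elements: [AAAA.0020.0002] [BBBB.0000.0000]
    return [(aaaa, 0x0020, 0x0002), (bbbb, 0x0000, 0x0000)]
```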

Date/Time: Sat Jul 7 17:06:02 CDT 2012
Contact: richard.wordingham@ntlworld.com
Name: Richard Wordingham
Report Type: Public Review Issue
Opt Subject: PRI #223: Proposed Update UTS #10: Unicode Collation Algorithm


Section 3.1:

Because D1 to D5 run contrary to normal English transformational
grammar, thereby impeding understanding, there should be a warning such as:

"Note that although a level 3 ignorable is ignorable at level 2, it is
not a level 2 ignorable."

Section 3.3.2:

In the new text in Section 3.3.2, I regret that 'completely ignorable'
should be replaced by 'sufficiently ignorable'.  The inserted character
will give a different result from the plain characters with the
contraction removed, be it only at the semistable level.  The problem
with the text as it stands is that CGJ maps to a quaternary element in
DUCET.

Section 3.6.1:

The statement "the UCA does not use this fourth level of data" is
wrong.  The UCA uses whatever levels are provided to it that are within
the implementation's capability (C2 requires at least 3) and are not
otherwise disabled.  I suggest, "the UCA does not require the use of
this fourth level of data".

One cannot state that the values in the fourth level are "not
consistent with a well-formed collation element data table" until
well-formedness condition 2 is strengthened.  By the definition of
6.2.0 Draft 3, DUCET is well-formed even with the level 4 weights.  

The statement "If the first three levels are zero, the fourth level is
also set to zero" is false.  The simplest repair I can think of is to
substitute, "For further details, see Section 7.3, Fourth-Level Weight
Assignments".

Section 3.6.2 (or revised allkeys.txt):

If IgnoreSP is selected, is U+10A7F OLD SOUTH ARABIAN NUMERIC INDICATOR
variable or not?  It is variable under DUCET.  It is ordered within the
variably weighted numbers among the symbols, but its general category is
Po.  This issue applies both to Unicode 6.1.0 and to
allkeys-6.2.0d2.txt with UnicodeData-6.2.0d1.txt.  U+10A7F is the only
character for which this quandary arises.

Section 4.2 S2.1

The draft dated 17 May 2012 of the Minutes of the UTC 131 / L2 228
Joint Meeting, San Jose, CA, May 7-11, 2012, records no agreement to the
change.  The two relevant paragraphs from L2/112 are:

[131-C10] Consensus: Adopt the recommendation for requiring prefix
contractions as in document L2/12-131R, with a change to 2A that it
only applies to contractions ending with a non-starter. For Unicode
version 6.2.

[131-A34] Action Item for Mark Davis, Editorial Committee: Add text to
proposed update of UTS #10 with a review note with some text from
L2/12-131R.

However, Mark Davis reports receiving a substantially different version
of the action item.

One of the arguments for simplifying the processing was that so doing
avoided the need to start processing from a buffer of characters and
then continue with the input string, which could be coming from a
data stream.  However, as the proposal to require prefixes for all
contractions was rejected, similar processing is required even when
there are no non-starters.  For example, consider the processing for a
string "abcdgh" when there are contractions for "ab", "abcde" and "dgh".
As a simple example, sorting Russian transliterated to English
conventions according to Russian sorting rules would require
contractions for "sh" and "shch", but not "shc".
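The back-up behaviour described above can be sketched with a toy longest-match segmenter, using the hypothetical contraction set from the "abcdgh" example (none of these mappings are real DUCET data):

```python
CONTRACTIONS = {"ab", "abcde", "dgh"}  # hypothetical set from the example

def segment(text: str):
    """Longest-match segmentation of a string into collating elements.
    While a longer contraction ('abcde') is still a viable prefix, a
    streaming matcher must buffer input; when the long match fails, it
    falls back to the last confirmed match ('ab') and re-processes the
    buffered characters -- even though no non-starters are involved."""
    out, i = [], 0
    while i < len(text):
        best = text[i]  # a lone character always maps to something
        for j in range(i + 2, len(text) + 1):
            if text[i:j] in CONTRACTIONS:
                best = text[i:j]
        out.append(best)
        i += len(best)
    return out
```

For "abcdgh" this yields ['ab', 'c', 'dgh']: the matcher has to abandon the candidate "abcde" after buffering "abcdg" and resume from 'c'.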

In real text, examples where the algorithm change would cause problems
are few and far between, but there are some potential examples in the
Tibetan and Tai Tham scripts.  They would be occasioned by U+0F39
TIBETAN MARK TSA -PHRU, a consonant modifier which gets positioned
after the vowel(s), and by U+1A60 TAI THAM SIGN SAKOT, which can be
separated from the following consonant by a tone mark.  I am currently
seeking evidence of actual rather than potential problems.

Section 4.5

Well-formedness conditions 3 to 5 are not essential to the UCA; they
serve only to allow certain code optimisations.

To comment well on condition 2 I need a technical term, taken from
ISO 14651, which if adopted could be added to the end of Section 3.1 as:

"D9. The character or sequence of characters mapped by a collation
element mapping is a _collating element_."

One could replace paragraph 1 "Only well-formed weights are allowed..."
by the following:

"The process of forming a sort key includes mapping the
string into a sequence of collating elements and then into a sequence of
collation elements, discarding collation elements that are ignorable at
the relevant level.  The process of discarding zero weights when
forming the sort key threatens to break this correspondence.
Well-formedness conditions 1 and 2 are the conditions necessary to
preserve this correspondence."

The example given for well-formedness condition 2 is wrong:

(a) By well-formedness condition 2, 'b' shall have a secondary weight
that is less than the secondary weight of a non-spacing grave!

(b) For the stated ordering to hold, it is necessary that the
secondary weight of non-spacing grave be less than the secondary weight
of 'c'.  This is the violation of well-formedness condition 2!

Well-formedness condition 4 is explained at the end of Section 3.6.2 as
a storage optimisation.  Well-formedness condition 3 is presumably
similarly a storage optimisation.  What is not explained is why
such an optimisation does not cause problems.  Presumably
well-formedness condition 4 works because the characters to be
(partially) ignored are similar in effect to having nothing, and adding
material at the end of a string makes it sort later.  It might be worth
adding words to this effect.

Well-formedness condition 5 should also be explained, for the
restriction to contractions ending in non-starters is peculiar (and
greatly weakens the benefit of the condition).  I suggest adding:

"Step S2.1 may be implemented by considering both a definite
initial substring S1 which has a match and a longer initial substring
S2 which is the initial substring of a string with a match.
Characters are added to S2 while possible, with S1 becoming S2
whenever it has a match.  In the logic of steps S2.1.1 to S2.1.3, only
substrings that have matches are considered.  When such an
implementation is used, well-formedness condition 5 allows a program to
move from the logic of Step S2.1 on encountering a non-starter rather
than waiting until encountering a character which cannot be added to
S2."

Sections 5 (intro) and 6.5.1

We seem to need three levels of normalisation support if we are to have
proper separation of the collation capability:

off: Proper behaviour requires NFD input.
FCD: Proper behaviour requires FCD input.  (An implementation of
collation will have to examine the collation element tables to
determine what partial normalisation it needs to do to satisfy the
requirement.)
on: Proper behaviour whatever the input.

Depending on UTS#35, implementations claiming compliant
parametric normalisation tailoring are prohibited from offering such a
split via the normalisation parameter!
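For reference, one common formulation of the FCD check (the term comes from ICU) looks at the canonical combining classes at the boundaries of each character's decomposition. A sketch, not part of the original feedback:

```python
import unicodedata

def _lead_ccc(ch: str) -> int:
    # Combining class of the first character of ch's canonical decomposition.
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[0])

def _trail_ccc(ch: str) -> int:
    # Combining class of the last character of ch's canonical decomposition.
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[-1])

def is_fcd(s: str) -> bool:
    """A string is FCD if, for each adjacent pair of characters, the
    leading combining class of the second is zero or not less than the
    trailing combining class of the first; such strings normalize to
    NFD without reordering across character boundaries."""
    for a, b in zip(s, s[1:]):
        lead = _lead_ccc(b)
        if lead != 0 and _trail_ccc(a) > lead:
            return False
    return True
```

For example, "a\u0308" (a + combining diaeresis) is FCD, while "\u00E4\u0323" (precomposed ä followed by combining dot below, i.e. ccc 220 after ccc 230) is not.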


Date/Time: Thu Jul 19 14:43:18 CDT 2012
Contact: richard.wordinghan@ntlworld.com
Name: Richard Wordingham
Report Type: Public Review Issue
Opt Subject: PRI #223: Proposed Update to UTS#35 LDML


Section 5.14.3 Numeric:

It is good to have a definition of the location of the primary weights of the
decimal digit sequences.
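The behaviour being defined (decimal-digit sequences compared by numeric value) can be illustrated with a toy key function; this illustrates the observable ordering only, not CLDR's actual weight assignment:

```python
import re

def numeric_key(s: str):
    """Toy sort key in which each run of decimal digits compares by its
    numeric value, so 'a2' sorts before 'a10'.  Real CLDR numeric
    collation instead assigns primary weights to the digit sequences."""
    parts = re.split(r"([0-9]+)", s)
    # Tag digit runs with 1 and other fragments with 0 so the tuple
    # comparison never mixes int and str.
    return tuple((1, int(p)) if p.isdigit() else (0, p)
                 for p in parts if p)
```

For example, `sorted(["a10", "a2", "b1"], key=numeric_key)` gives `["a2", "a10", "b1"]`.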

Section 5.14.3 Alternate:

The new text says that "shifted" and "ignoresp" are synonymous.  Is that
intended?  I presume the difference between UCA "shifted" and "ignoresp" is to
be handled by the variableTop setting.  Or is "ignoresp" meant to handle
discontiguous ranges of variable weights?  Discontiguity arises from the
ordering of the punctuation character U+10A7F OLD SOUTH ARABIAN NUMERIC
INDICATOR, which in the "ducet" collation is ordered between numbers which are
not 'decimal digits'.
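For context, the behaviour these options control is the shifting of variable elements: under "shifted", a variable element's primary weight moves to the quaternary level. A toy sketch with invented weights (not DUCET values):

```python
# Invented single-level weights; '-' and ' ' stand in for the variable
# (punctuation) elements.
PRIMARY = {c: 0x2000 + i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
VARIABLE = {"-": 0x0220, " ": 0x0209}

def sort_key(s: str, shifted: bool = True):
    """Under 'shifted', a variable element contributes nothing at the
    primary level; its old primary weight reappears as a quaternary
    weight, so punctuation still breaks ties between otherwise equal
    strings."""
    primaries, quaternaries = [], []
    for c in s:
        if c in VARIABLE and shifted:
            quaternaries.append(VARIABLE[c])   # shifted to level 4
        elif c in VARIABLE:
            primaries.append(VARIABLE[c])
            quaternaries.append(0xFFFF)
        else:
            primaries.append(PRIMARY[c])
            quaternaries.append(0xFFFF)
    return (tuple(primaries), tuple(quaternaries))
```

With shifting, "de-luge" and "deluge" tie at the primary level and the hyphen's quaternary weight decides the order.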

Section 5.14.13 Case Level:

At present, cases for characters are derived from the tertiary
weights using the information in UTS#10 Section 7.3.  Weights are treated as
upper case if recorded as upper case or normal or narrow kana.  It is only in
the case of contractions created by tailorings that derivation rules are
missing.  Thus an application can currently support the case tailorings
(though not 'rules') on the basis of UnicodeData.txt and one of allkeys.txt,
allkeys_CLDR.txt and FractionalUCA.txt.  Fuller support of Unicode rules will
now be required for the implementation of case tailorings.

If the procedure is only intended to apply to contractions created in the
'rules' (by <p>, <s>, <t>,<q> and their derivatives), then the process is
clear enough, but such a restriction should be stated.  In this case, it
should also be stated whether it applies to <i>, or whether <i> preserves case
modifications.

If the procedure is to be applied more widely, then presumably it applies to
all mappings, including contractions and formal expansions.  Does it apply to
expansions in tailorings for the expansion part, or do the characters added in
the expansion retain their original case mapping properties?  For example,
would &c <<< k/H result in 'k' having two mixed case collation elements or a
lower case and an upper case collation element?  Would &h <<< C | hh result in
Chh having two mixed case collation elements or an upper and a lower case
collation element?

The list of upper exceptions should be given in terms of code points just as
the list of lower exceptions is.

At present, U+00D8 LATIN CAPITAL LETTER O WITH STROKE is collated, to the
first three levels, identically to the sequence <U+004F LATIN CAPITAL LETTER O
U+0338 COMBINING LONG SOLIDUS OVERLAY>.  If the change applies to formal
expansions, they will no longer be collated identically when case ordering is
enabled or a case layer is inserted, for both collation elements of U+00D8
will be upper case but <U+004F, U+0338> will have one upper and one lower case
collation element.  This would appear to be unintended.  It may be possible to
fix this problem by changing the derived collation elements for secondary
elements from 0.s.ct and 0.s.c.t to 0.s.1t and 0.s.1.t, but this seems very ad
hoc.

If the procedure is to be applied to mappings already in the UCA collation
tables, it will change the casing of circled and squared katakana, such as
U+32D0 CIRCLED KATAKANA A and U+1F213 SQUARED KATAKANA DE, which are currently
treated as lower case.  After the change, they will be treated as upper case.
Is this change intended?

The weights given for tertiary elements produce an ill-formed collation
element table.  Note that the normal DUCET tertiary weights cannot be applied
to tertiary elements, for so doing would produce an ill-formed collation
element table.  (DUCET has no tertiary elements, while the CLDR root locale
collation has exactly one if one believes allkeys_CLDR.txt is wrong.)  The
modification to 0.0.ct needs to be changed to a modification to 0.0.(c+3)t –
the third weight must be greater than cu for any u that is the third weight of
a primary or a secondary element.

The modification to 0.0.c.t needs to be replaced for the same reason.  To
motivate the replacement, I considered a tailoring &\u0000 <<< ch | h and the
strings 'chan', 'chhaN', 'chhan' and 'chaN'.  (The motivation for the example
is that Indic CHA is sometimes transliterated as 'chh'.)  If we insert a case
level and allow the 0.0.c.t weight, we get the ordering 'chan', 'chhan',
'chhaN', 'chaN', typical of an ill-conditioned table.

If we replace 0.0.c.t by 0.0.(c+3).t, we get 'chan', 'chaN', 'chhan', 'chhaN'.
If we replace 0.0.c.t by 0.0.0.t, we get 'chan', 'chhan', 'chaN', 'chhaN'.

Another tailoring with a similar effect, &h <<< c | hh, would also
give 'chan', 'chhan', 'chaN', 'chhaN'.  I therefore recommend replacing 0.0.c.t
by 0.0.0.t.

Date/Time: Wed Jul 25 14:48:29 CDT 2012
Contact: markus.icu@gmail.com
Name: Markus Scherer
Report Type: Public Review Issue
Opt Subject: PRI #223, UCA 6.2: diffs DUCET-CLDR


At the end of UTS #10 (UCA) section 3.6 DUCET there is some text describing
how the CLDR root collation differs from the DUCET. It would be cleaner to
move that text elsewhere, probably into CollationAuxiliary.html.

The current draft has this text:

"Note also that [CLDR] tailors general symbols to be classified with the regular
groups, not the variable groups, using the IgnoreSP option. CLDR also adds
tailorings of two special values:

The code point U+FFFF is tailored to have a primary weight higher than all
other characters. This allows the reliable specification of a range, such as
“Sch” ≤ X ≤ “Sch\uFFFF” to include all strings starting with "sch" or
equivalent. The code point U+FFFE produces a CE with special minimal weights
on all levels, regardless of alternate handling. This allows for Merging Sort
Keys within code point space. For example, when sorting names in a database, a
sortable string can be formed with last_name + '\uFFFE' + first_name. These
strings would sort properly, without ever comparing the last part of a last
name with the first part of another first name. So as to maintain the highest
and lowest status, in CLDR these values are not further tailorable, and
nothing can be tailored to have the same primary weights."
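The merged-key behaviour described in the quoted text can be illustrated with a toy weight function in which the separator gets the minimal weight (a real implementation derives weights from the collation table):

```python
SEP = "\uFFFE"

def toy_weight(c: str) -> int:
    """Toy weighting: the merge separator U+FFFE gets the minimal
    weight; every other character uses its code point as its weight."""
    return 0 if c == SEP else ord(c)

def merged_key(*fields: str):
    """Join fields with U+FFFE so that comparison proceeds field by
    field: the separator's minimal weight makes a shorter field sort
    before any longer field it is a prefix of, regardless of what the
    following field contains."""
    return tuple(toy_weight(c) for c in SEP.join(fields))
```

With last names 'di' < 'diaz', `merged_key('di', 'sol') < merged_key('diaz', 'ana')` as desired, whereas plain concatenation compares 'disol' against 'diazana' and gets the order of the last names wrong.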

We should keep the last line of section 3.6 where it is:

"For most languages, some degree of tailoring is required to match user
expectations. For more information, see Section 5, Tailoring."

Date/Time: Fri Jul 27 16:39:06 CDT 2012
Contact: kenw@sybase.com
Name: Ken Whistler
Report Type: Public Review Issue
Opt Subject: PRI #223 CollationTest.html format issues


The script generating the UCA CollationTest data file apparently
has bugs in it.

Item #1:

Look, for example, at entries like:

1D1BB 0334 1D16F

The script then generates the UTF-8 for the display characters wrongly,
ending up with a '\uD834' entry, and also ends up with the wrong
character name, a code point label: <surrogate-D834>

I'm guessing there is some bad interaction between whatever the
script may be doing to eliminate the repetitive listing of second
elements that get repeated a zillion times, like the question marks
and exclamation points, and what is happening for the entries which
aren't just the repetitive cp 003F, cp 0334, cp 0021 type.

Item #2:

Also, for the SHIFTED files, there aren't any quaternary
weights in the sort key representation, which seems incorrect to me.

Item #3:

CollationTest.html also doesn't show that the code point (or code
point sequence, if not abbreviated) is listed in parentheses between
the "#" and the character name(s), or what the conventions are
for abbreviation of sequences. (This is not a bug in the script for 
generating the CollationTest files, but is related to the format issues.)


Feedback received after closing date:

Date/Time: Tue Aug 21 14:15:53 CDT 2012
Contact: cdutro@twitter.com
Name: Cameron Dutro
Report Type: Other Question, Problem, or Feedback
Opt Subject: Clarifying French Backwards Accent Sorting in TR-10


The TR-10 document is written as though French backwards accent sorting 
applies to all French dialects, when in reality it only applies to Canadian 
French.  Can the document be updated to mention this fact?  Relevant tickets: 
http://unicode.org/cldr/trac/ticket/2905 and http://unicode.org/cldr/trac/ticket/2984.
Thanks!