Public Review Issues

Accumulated Feedback on PRI #235

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Mon Nov 26 13:17:14 CST 2012
Contact: yoshito_umaoka@us.ibm.com
Name: Yoshito Umaoka
Report Type: Error Report
Opt Subject: UTS#10 - 1.3 Contextual Sensitivity


UTS#10 Unicode Collation Algorithm - 1.3 Contextual Sensitivity
[http://www.unicode.org/reports/tr10/#Contextual_Sensitivity] contains
the following description:

--------------------
Both contractions and expansions can be combined:
that is, two (or more) characters may sort as if they were a different
sequence of two (or more) characters. In the third example, for
Japanese, a length mark sorts like the vowel of the previous syllable:
as an A after KA and as an I after KI.
--------------------

Also, Table 4. Context Sensitivity contains an example of the above:

--------------------
カー < カイ, but
キー > キイ
--------------------

The description "as an A after KA and as an I after KI." is correct,
but the example is bad. "カー < カイ" is correct, but "キー > キイ" is incorrect.

カー is a contraction and equivalent to カア at level 1, therefore カー < カイ is
correct. キー is also a contraction and its primary weight is equivalent to that
of キイ, but it is still less than キイ at a lower level.

To illustrate the nature of contraction followed by expansion, 
the example should be like:

--------------------
カー < カア < カイ, but
キア < キー < キイ
--------------------

or, simply

--------------------
カー < カア, but
キア < キー
--------------------
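
The ordering proposed in this report can be reproduced with a minimal sketch. All weights below are made up for illustration and are not DUCET values. The sketch assumes, as the report does, that the length mark takes the primary weight of the preceding syllable's vowel and differs from the plain vowel only at a lower level (here level 3, with the mark sorting first).

```python
# Toy collation elements of the form (primary, secondary, tertiary).
# Every weight here is hypothetical; only the relative order matters.
KA, KI = 0x10, 0x11      # primaries for the syllables
A, I = 0x01, 0x02        # primaries for the vowels, with A < I
PLAIN = 0x02             # tertiary weight of an ordinary kana
MARK = 0x01              # tertiary weight of a vowel produced by the length
                         # mark; assumed lower so the mark sorts first

SYLLABLES = {"カ": (KA, A), "キ": (KI, I)}   # syllable -> (primary, its vowel)
VOWELS = {"ア": A, "イ": I}

def elements(word):
    """Map a two-character word to collation elements, resolving the
    length mark from the preceding syllable."""
    first, second = word
    primary, vowel = SYLLABLES[first]
    ces = [(primary, 0x20, PLAIN)]
    if second == "ー":
        ces.append((vowel, 0x20, MARK))       # sorts like the previous vowel
    else:
        ces.append((VOWELS[second], 0x20, PLAIN))
    return ces

def sort_key(word):
    """All primaries first, then all secondaries, then all tertiaries."""
    ces = elements(word)
    return tuple(tuple(ce[level] for ce in ces) for level in range(3))

words = ["キイ", "カイ", "キー", "カア", "キア", "カー"]
ordered = sorted(words, key=sort_key)
print(" < ".join(ordered))
```

Under these assumptions the length mark lands before カア in the KA group and after キア in the KI group, which is the contrast the corrected example is meant to show.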

Feedback above this line was reviewed at UTC #134, Feb 2013.

Date/Time: Wed Feb 13 15:49:25 CST 2013
Contact: richard.wordingham@ntlworld.com
Name: Richard Wordingham
Report Type: Error Report
Opt Subject: UCA conformance test invalid


The UCA conformance test is fundamentally invalid!

The conformance test purports to test an implementation of the UCA by
checking its results when using DUCET.  The validity of this test is
presumably based on Conformance requirement C1 Paragraph 1: 'Given a
well-formed Unicode Collation Element Table, a conformant
implementation shall replicate the same comparisons of strings as those
produced by Section 4, Main Algorithm.'  However, as stated in Section
3.6.4, DUCET is not well formed.  Therefore, a conformant
implementation of the UCA is free to use whatever collation it likes
when instructed to use DUCET!

These problems also apply to Draft 1 of DUCET 6.3.0 and UCA Version
6.3.0 Draft 2.

So far as I am aware, the only well-formedness requirements needed for
the collation algorithm to work are WF3 and WF4.  I see three options:

1) Fix DUCET to make it satisfy WF5 (already rejected by the UTC).
2) Remove WF5.
3) Reword requirement C1 Paragraph 1 to explain what well-formednesses
may not be relied upon by a compliant implementation.

Date/Time: Tue Mar 12 17:19:18 CDT 2013
Contact: richard.wordingham@ntlworld.com
Name: Richard Wordingham
Report Type: Public Review Issue
Opt Subject: PRI 235 / Problems with UTS#10, UCA


Only the first comment is strictly related to PRI 235, except in so far as the
other comments could be taken as referring to missed opportunities to correct
Version 6.2.0.

1. If the LDML specification is to be updated at the same time as this
document, the reference from Section 5.1 to LDML Section 5.14.3 will break,
for that section will be moved to a different file.

--------

2. Last sentence of 3.6 'All collation elements with primary weights from 1 to
that maximum are variables; all other collation elements are not' is wrong.
This is false for the CLDR root collation and all tailorings thereof
conforming to CLDR rules.  Suggest rephrasing paragraph as,

'Primaries for variable collation elements are *barely* interleaved with other
primary weights. This allows for more compact storage of memory tables.
Rather than using a bit per collation element to determine whether the
collation element is variable, the implementation only needs to store the
*minimum and* maximum primary value for all the variable elements. All
collation elements with primary weights from *that minimum*  to that maximum
are variables; all other collation elements are not.'

Changes are marked by *...*.
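
The rephrased paragraph amounts to a two-sided range test on the primary weight. A minimal sketch, in which the class name and the bounds are hypothetical and not taken from any real collation element table:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VariableRange:
    """Stores only the minimum and maximum variable primary weights."""
    min_primary: int
    max_primary: int

    def is_variable(self, primary: int) -> bool:
        # A collation element is variable iff its primary lies in the range.
        return self.min_primary <= primary <= self.max_primary

# Hypothetical bounds; non-variable primaries exist below the minimum.
variable = VariableRange(min_primary=0x0201, max_primary=0x0400)

for primary in (0x0100, 0x0201, 0x0300, 0x0400, 0x0401):
    print(f"{primary:04X} variable={variable.is_variable(primary)}")
```

With only a maximum stored, the primary 0100 in this sketch would be misclassified as variable, which is the situation the comment describes for tables that place other primaries below the variable range.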

------

3. In proposed 9.1 / old 3.6.1, the text says that there may be lines for
'parametric attributes'.  However, the BNF provides no productions for them,
and it is not clear to me what these attributes may be.  Are they just the
forward/reverse ordering at level 2 and the form of variable weighting
employed, or do they include the full parametric tailoring?  Section 3 lists
things that an abstract 'collation element table' contains, but is it
exhaustive and is it meant to include level 2 ordering and variable weighting
scheme?

---------

4. A significant tailoring worth mentioning in Section 5 is removing specific
contractions.  It is an extremely common optimisation in the search collations
of the CLDR.

---------

5. In Section 5, changing the secondary level direction and variable weighting
options should only be described as changing the collation element table if
these settings are part of the abstract collation element table.  The former
does not change the mapping from NFD strings of codepoints to strings of
collation elements, and if the latter does, it then creates mappings for the
infinite number of sequences of spaces and punctuation followed by one or more
Latin script diacritics.

---------

6. Contrary to Section 5, turning off normalisation does not transform UCETs.
In so far as it changes the collation, the collation has ceased to be an
application of the UCA, which requires that canonically equivalent strings
collate as equal.

---------

7. If applying the LDML parametric tailoring numeric="true" is indeed a
tailoring in the UCA sense, then the mapping part of a well-formed UCET may
be infinite.  This has implications for Requirement C1.  When is a conformant
implementation of the UCA required to handle infinite UCETs?  I'd suggest it
only be required to do so when the UCET is an LDML parametric tailoring of
DUCET or the root CLDR collation.  (Obviously it must handle a well-formed
infinite UCET correctly or simply fail to handle it.)

Date/Time: Wed Mar 13 14:20:55 CDT 2013
Contact: richard.wordingham@ntlworld.com
Name: Richard Wordingham
Report Type: Error Report
Opt Subject: UTS#10 - Additional contractions in DUCET


At the start of paragraph 5 of Section 3.8 of Draft 4 of Version 6.3.0 of
UTS#10, the Unicode Collation Algorithm, it says 'Those are the only classes
of contractions allowed in the Default Unicode Collation Element Table.'.  It
has said this for some time.  However, it is not a true statement.  There are
also 6 NFD and 2 non-NFD contractions for the compatibility decompositions of

U+013F LATIN CAPITAL LETTER L WITH MIDDLE DOT,
U+0140 LATIN SMALL LETTER L WITH MIDDLE DOT,
U+0E33 THAI CHARACTER SARA AM,
U+0EB3 LAO VOWEL SIGN AM,
U+0F77 TIBETAN VOWEL SIGN VOCALIC RR, and
U+0F79 TIBETAN VOWEL SIGN VOCALIC LL.

A simple remedy would be to change 'contractions are necessary where a
canonical decomposable character requires a distinct primary weight in the
table' in paragraph 3 to 'contractions are necessary or desirable where a
decomposable character requires a distinct primary weight in the table'.

Date/Time: Wed Mar 13 17:57:57 CDT 2013
Contact: richard.wordingham@ntlworld.com
Name: Richard Wordingham
Report Type: Public Review Issue
Opt Subject: PRI 235 - UTS#10 (for Unicode 6.3.0)


These comments are made on Draft 4 of Version 6.3.0.

1) There is no upper bound on the values of weights in a collation element table.  
Formerly (Version 6.1.0) it was FFFF, though this was breached by the now discarded 
fourth weight in DUCET.  Consequently, the following changes are needed:

a) In Section 3.6 'Variable Weighting', it should be noted that FFFF denotes an 
arbitrarily selected value greater than the primary weight of any variable 
collation element.

b) In Section 6.1.2 'L2/L3 in 8 bits', there should be a note that the same 
number of bytes should be used for each L1 weight.

c) In Section 6.2 'Large Weight Values', the starting premise is no longer 
necessarily true.  The opening sentence could be reworded to:

'Some old implementations may not support more than 65,535 weight values
(or 65,024 values where zero bytes are avoided).  A need for more weight values
can still be accommodated...'

The example from DUCET may need to be reworded if the suggestion in point 
(e) is taken up.

d) In 6.4 'Avoiding Zero Bytes' it should be noted that analogous preprocessing 
is available for weight ranges larger than 16 bits.

The following opportunities present themselves:

e) Implicit weights (Section 7.1.3) no longer need to be generated as two 
elements.  With suitably revised values of WEIGHT,  a single primary weight 
of BASE + CP could be chosen instead.

f) The principle of a fixed range for trailing weights (Section 7.1.4)
makes no sense if there are no bounds on primary weights.  Indeed, it makes 
sense for the implicit weights, trailing weights and weights reserved for 
special collation elements to be aspects of the collation element, especially 
if 2nd level order and variable weighting are part of the collation element table.

2) There are a few faults in the code example in Section 6.10:

a) The code assumes that primary values FFE0..FFFF are not allowed.  
However, U+FFFD has primary weight FFFD and the CLDR assigns primary 
weight FFFF to U+FFFF.

b) Variable weights have an upper and a lower limit, not just an upper limit.
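
Suggestion 1(e) above can be illustrated with a minimal sketch. It assumes the two-element derivation in Section 7.1.3 is AAAA = BASE + (CP >> 15) and BBBB = (CP & 0x7FFF) | 0x8000, which should be checked against the specification text. The BASE value used for the two-element form and the wide base used for the single-weight form are illustrative assumptions, and a single BASE is applied to all code points for simplicity.

```python
def implicit_two_elements(cp: int, base: int) -> tuple:
    """Primaries (AAAA, BBBB) of the two implicit collation elements,
    following the assumed Section 7.1.3 derivation."""
    aaaa = base + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    return aaaa, bbbb

def implicit_single(cp: int, wide_base: int) -> int:
    """Hypothetical single primary weight BASE + CP, possible only when
    primary weights are not capped at FFFF."""
    return wide_base + cp

BASE = 0xFB40          # illustrative 16-bit base (assumption)
WIDE_BASE = 0x100000   # hypothetical base above all explicit primaries

code_points = [0x2A6D6, 0x4E01, 0x20000, 0x9FFF, 0x4E00]

by_two = sorted(code_points, key=lambda cp: implicit_two_elements(cp, BASE))
by_one = sorted(code_points, key=lambda cp: implicit_single(cp, WIDE_BASE))

for cp in by_two:
    aaaa, bbbb = implicit_two_elements(cp, BASE)
    print(f"U+{cp:04X}: [{aaaa:04X}][{bbbb:04X}] vs {implicit_single(cp, WIDE_BASE):X}")
```

Both forms order these code points identically; the single-weight form simply needs a primary weight wider than 16 bits.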

Date/Time: Wed Mar 13 18:06:54 CDT 2013
Contact: richard.wordingham@ntlworld.com
Name: Richard Wordingham
Report Type: Error Report
Opt Subject: UTS#10 - Error in 'Escape Hatch'


The fault reported below has been present in the UCA since at least Version
6.1.0.

The section entitled 'Escape Hatch' suggests one can encode a large number of
secondary differences by varying z and n in 2-element collation keys of the
form

 [.yyyy.00zz.00ww] [.0000.00nn.0001]

However, the relative orderings of single collating elements would not be
stable if the French accents (= level 2 ordering) setting were toggled.  To
provide stability, it is necessary to use a palindromic 3-element collation
key of the form

 [.yyyy.00zz.00ww] [.0000.00nn.0001] [.0000.00zz.0001]
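
The instability can be checked with a minimal sketch. The weights are hypothetical, and backward (French) level 2 ordering is modelled simply as reversing the sequence of non-zero secondary weights.

```python
def secondary_key(elements, backwards=False):
    """Level-2 weights of a collation element sequence; reversed when the
    French accents (backward secondary) setting is on."""
    weights = [sec for (_pri, sec, _ter) in elements if sec != 0]
    return tuple(reversed(weights)) if backwards else tuple(weights)

def two_element(y, z, w, n):
    """[.yyyy.00zz.00ww] [.0000.00nn.0001]"""
    return [(y, z, w), (0, n, 1)]

def palindromic(y, z, w, n):
    """[.yyyy.00zz.00ww] [.0000.00nn.0001] [.0000.00zz.0001]"""
    return [(y, z, w), (0, n, 1), (0, z, 1)]

# Two single collating elements with the same primary (hypothetical weights):
# a has the smaller z but the larger n.
a = dict(y=0x1000, z=0x21, w=0x02, n=0x30)
b = dict(y=0x1000, z=0x22, w=0x02, n=0x25)

def a_before_b(form, backwards):
    return secondary_key(form(**a), backwards) < secondary_key(form(**b), backwards)

for form in (two_element, palindromic):
    print(form.__name__,
          "forward:", a_before_b(form, False),
          "backward:", a_before_b(form, True))
```

In this sketch the 2-element form flips the relative order of a and b when the setting is toggled, while the palindromic 3-element form reads the same in both directions and keeps the order fixed.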

Date/Time: Sat Mar 16 07:51:47 CDT 2013
Contact: richard.wordingham@ntlworld.com
Name: Richard Wordingham
Report Type: Error Report
Opt Subject: UCA (UTS#10) Conformance Requirements


The definition of the 'standard UCA parametric tailoring' is subcontracted to
the LDML specification (UTS#35).  An implementation claiming conformance to it
should therefore specify the relevant version numbers – LDML and possibly CLDR
for data held in FractionalUCA.txt, which is no longer identified by UCA
version.  For example, there is a proposal to change the meaning of the
variableTop parameter (http://unicode.org/cldr/trac/ticket/5016 Comment 8),
and I believe it will be accepted.  I'm not sure whether this belongs in
Requirement C4 or C6.

The standard UCA parametric tailoring parameter 'reorder' is in general
undefined – see last paragraph of
http://unicode.org/repos/cldr/trunk/specs/ldml/tr35-collation.html Section
3.12 or, for the latest released version, Section 5.14.12 of UTS#35 Version
22.1.  The text of the LDML specification says, 'The reordering groups for the
DUCET are not specified here'.  For the LDML, this has also been raised as
http://unicode.org/cldr/trac/ticket/5813 .