L2/06-262

Subject: Comments on draft-newman-i18n-comparator-12.txt
From: Mark Davis
Date: 2006-07-28

The following are my substantive comments on draft-newman-i18n-comparator-12.txt. These, plus editorial comments, have been sent to the authors.

In addition, I recommend that we add a section to UCA (UTS#10) explaining collation string canonicalization.
 

...
3.5. Naming Guidelines 

While this specification makes no absolute requirements on the
structure of collation identifiers, naming consistency is important,
so the following initial guidelines are provided.

Collation identifiers with an international audience typically begin with "i;".

I'd recommend against this. It makes the name a bit clunkier, and doesn't add much value.

Collation identifiers intended for a particular language
or locale typically begin with a language tag [5] followed by a ";". 

This is really problematic. CLDR has about 400 language versions. Each of those can vary by version of CLDR (1.0, 1.1, 1.2, 1.3, 1.4, so far). They can also vary by versions of UCA, the underlying ordering, which can be 4.0, 4.1, 5.0, so far.
Now, we might not need to register all of the historical combinations of versions, so that might be only say 20 combinations, but multiply that by 400 (and still growing), and you get thousands of registrations.
The language really needs to be a *parameter*, NOT part of the name.

The document talks in this section in terms of segments separated by ";". This really should be reflected
in the syntax. That is, it should be something like:

collation-char = ALPHA / DIGIT / "-" / "."
collation-id = *1("+" / "-") name *(";" options)
options = variable *1("=" value)
name = 1*collation-char
variable = 1*collation-char
value = 1*collation-char

The limit of 253 bytes can be specified in the text (see, for example, RFC3066bis).

Note: I think the term "parameter" would be better than variable.
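As a rough illustration of the syntax proposed above, the following Python sketch splits an identifier into a sign, a name and a set of parameters. It assumes that name, parameter and value each contain one or more collation-chars, and that a leading "+" or "-" is always read as the optional sign. The function name parse_collation_id, the "lang" parameter in the usage note, and the placement of the 253 limit check are hypothetical and are not taken from the draft.

```python
import re

# collation-char = ALPHA / DIGIT / "-" / "."
_TOKEN = re.compile(r"[A-Za-z0-9.-]+")


def parse_collation_id(identifier):
    """Split 'name;param=value;flag' into (sign, name, params).

    Raises ValueError if the identifier does not fit the sketched syntax.
    """
    # Assumption: identifiers are ASCII, so characters and bytes coincide.
    if len(identifier) > 253:
        raise ValueError("identifier longer than 253 bytes")

    # Assumption: a leading "+" or "-" is always treated as the sign,
    # even though "-" is also a legal collation-char.
    sign = ""
    if identifier[:1] in ("+", "-"):
        sign, identifier = identifier[0], identifier[1:]

    name, *options = identifier.split(";")
    if not _TOKEN.fullmatch(name):
        raise ValueError("bad collation name: %r" % name)

    params = {}
    for option in options:
        variable, sep, value = option.partition("=")
        if not _TOKEN.fullmatch(variable):
            raise ValueError("bad parameter name: %r" % variable)
        if sep and not _TOKEN.fullmatch(value):
            raise ValueError("bad parameter value: %r" % value)
        # A parameter without "=" is stored as a bare flag (None).
        params[variable] = value if sep else None
    return sign, name, params
```

With this sketch, parse_collation_id("uca;version=u5.0.c1.4;lang=de") yields an empty sign, the name "uca", and the two parameters, so the language travels as a parameter rather than as part of the registered name.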

...
After the first ";" is normally the name of the general collation
algorithm, followed by a series of algorithm modifications separated
by the ";" delimiter. Parameterized modifications will use "=" to
delimit the parameter from the value. The version numbers of any
lookup tables used by the algorithm SHOULD be present as
parameterized modifications. Collation identifiers of the form
*;vnd-domain.com;* are reserved for vendor-specific collations
created by the owner of the domain name following the "vnd-" prefix
(e.g. vnd-example.com for the vendor example.com). Registration of
such collations (or the name space as a whole) with intended use of
"Vendor" is encouraged when a public specification or open-source
implementation is available, but is not

I'd recommend that this use the option syntax, e.g. vnd=domain.com

...
4.2.2. Equality

The equality test always returns "match" or "no-match" when supplied 
valid input, and MAY return "undefined" if one or both input strings
are not valid.

MUST return "undefined" if either input string is not valid.
(These have to be in sync or the model fails.)
...

Application protocols MAY return position information for substring
matches. If this is done, the position information SHOULD include
both the starting offset and the ending offset for each match. This
is important because more sophisticated collations can match strings
of unequal length (for example, a pre-composed accented character can
match a decomposed accented character). All matching substrings
should be reported, even overlapping matches (as when "ana" occurs
twice within "banana").

This is very problematic. It is common for characters to be ignored in matching. Let's suppose that "-" is ignored. Then given "ana" and "-----ana-----" the document is saying that an implementation MUST return
(0,8), (1,8), (2,8), (3,8),...(5,8)
(0,9), (1,9)...
...
(0,13), ... (5,13)
Nobody wants these results. Suggested minimal replacement:

However, there are circumstances where a collation may ignore characters. (See Unicode Collation Algorithm [8].) In such cases multiple overlapping matches may be suppressed, as specified in the registration.
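The blow-up described above can be reproduced with a short Python sketch. It is a brute-force illustration, not a real collation: it assumes "-" is the only ignorable character, that offsets are zero-based code point indexes, and that the ending offset is exclusive. The function name matches_ignoring is hypothetical.

```python
def matches_ignoring(needle, haystack, ignorable="-"):
    """Return every (start, end) span of haystack that equals needle
    once ignorable characters are dropped from both strings."""

    def strip(s):
        return "".join(ch for ch in s if ch not in ignorable)

    target = strip(needle)
    spans = []
    # Brute force over all non-empty substrings; fine for a demonstration.
    for start in range(len(haystack)):
        for end in range(start + 1, len(haystack) + 1):
            if strip(haystack[start:end]) == target:
                spans.append((start, end))
    return spans


spans = matches_ignoring("ana", "-----ana-----")
print(len(spans))            # 36
print(spans[0], spans[-1])   # (0, 8) (5, 13)
```

Six possible starting offsets times six possible ending offsets give 36 reported matches for what a user would regard as a single occurrence.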

Note: the document doesn't say whether the indexing is zero-based or one-based, nor whether the indexing is on character boundaries or storage (byte or 16-bit unit) boundaries. This needs to be addressed, at least to say that that information must be supplied. There is an appropriate place below.

...

The ordering operation determines how two strings are ordered. It
MUST be trichotomous and reflexive. For valid input, it MUST be 
transitive.

This needs changing. For valid input it must be trichotomous. Otherwise it is actually tetrachotomous, because of the "undefined" value.

...
4.3. Sort Keys

This section needs to be either removed, or changed significantly. (Comments sent to authors.)

...
4.4. Use of Lookup Tables

Some collations use customizable lookup tables, e.g. because the
tables depend on locale and may be modified after shipping the
software. Collations which use more than one customizable lookup
table in a documented format MUST assign numbers to the tables they

...must assign versions

(versions may be of the format "1.3.5a", for example, and may not be numbers).

...
Returning just the starting offset is not
acceptable. This rule is necessary because advanced collations can
treat strings of different lengths as equal (for example,
pre-composed and decomposed accented characters).

Here it should say that a specification returning positioning information MUST specify the interpretation of the positions: whether numbers are zero or one based and whether the offsets are character offsets or storage offsets.
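To show why the interpretation has to be stated, the following Python sketch reports the same match position under three storage models. The helper name offsets and the sample text are made up for illustration; the sketch assumes zero-based offsets throughout.

```python
def offsets(text, cp_index):
    """Convert a zero-based code point index into the equivalent
    UTF-8 byte offset and UTF-16 code unit offset."""
    prefix = text[:cp_index]
    return {
        "code_points": cp_index,
        "utf8_bytes": len(prefix.encode("utf-8")),
        "utf16_units": len(prefix.encode("utf-16-le")) // 2,
    }


# U+1D11E (musical G clef) lies outside the BMP; U+00E9 is a pre-composed e-acute.
text = "\U0001D11E caf\u00e9 au"
start = text.index("au")
print(offsets(text, start))
# {'code_points': 7, 'utf8_bytes': 11, 'utf16_units': 8}
```

The same match starts at 7, 11 or 8 depending on the unit, and a one-based convention would shift each of those by one, so a bare number is not usable without the specification saying which is meant.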
 

...

5.4. Canonicalization Function

Canonicalization is a different function, which has not been mentioned. If a collation specification is also to offer canonicalization, then that has to be defined on the same level as matching, ordering, etc. That is, it needs a section 4.2.5.
In particular, it must be clear that:
- canonicalization may return "undefined" or a string value.
- canonicalization is coordinated with equality:
-- if either string is invalid the result is "undefined"
-- for valid input the equality operation returns "match" for two strings if and only if their canonicalizations produce the same strings.
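The coordination between canonicalization and equality listed above can be sketched in Python as follows. This is a toy collation chosen only for illustration: it assumes that a string is invalid if it contains a lone surrogate code point, and that the canonical form is simply Unicode NFC. Neither assumption comes from the draft, and the function names are hypothetical.

```python
import unicodedata


def canonicalize(s):
    """Return the canonical form of s, or None (standing for "undefined")
    if s is not valid input for this toy collation."""
    # Assumption: lone surrogates make a string invalid.
    if any(0xD800 <= ord(ch) <= 0xDFFF for ch in s):
        return None
    # Assumption: the canonical form is NFC.
    return unicodedata.normalize("NFC", s)


def equality(a, b):
    """Equality defined through canonicalization, so the two stay in sync."""
    ca, cb = canonicalize(a), canonicalize(b)
    if ca is None or cb is None:
        return "undefined"
    return "match" if ca == cb else "no-match"


print(equality("\u00e9", "e\u0301"))  # match
print(equality("a", "b"))             # no-match
print(equality("a", "\ud800"))        # undefined
```

Because equality is computed from the canonical forms, "match" is returned exactly when the canonicalizations are identical, and any invalid input yields "undefined" from both operations.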

...
Other collation registrations are owned by the individual(s) listed
in the contact field of the registration and IANA will preserve this
information. Changes to a registration MUST be approved by the
owner. In the event the owner cannot be contacted for a period of
one month and a change is deemed necessary, the IESG MAY re-assign
ownership to an appropriate party.

It should also allow ownership by an organization, not just individuals. Case in point: UCA.

...

7.4. Example Initial Registry Summary

The following is an example of how IANA might structure the initial
registry summary.txt file:

Collation                   Functions  Scope  Reference
---------                   ---------  -----  ---------
Common Use Collations:
i;nameprep;v=1;uv=3.2       e, o, s    i18n   [RFC XXXX]
i;basic;uca=3.1.1;uv=3.2    e, o, s    i18n   [RFC XXXX]

=> i;basic;version=5.0 e, o, s i18n [Unicode]

The Unicode Consortium would be registering UCA with CLDR versions. This is assuming the above description is changed to NOT put in language (otherwise we get an explosion). I guess the table would look like the following:

i;uca;version=u5.0.c1.4 e, o, s i18n [Unicode]
i;uca;version=u5.0.c1.3 e, o, s i18n [Unicode]
...
i;uca;version=u4.1.c1.4 e, o, s i18n [Unicode]
i;uca;version=u4.1.c1.3 e, o, s i18n [Unicode]
...

I'd also recommend that there be a column to indicate that the collation does take parameters, and whether or not the language can be specified in a parameter.

The registration form should be checked against the summary, to make sure that all of the information in the summary can be (clearly and mechanically) derived from the registration form. For example, it is unclear where the Scope comes from.

...

The following table includes
some example reasons to reject a registration with cause:
...
o The collation identifier fails to precisely identify the version
numbers of relevant tables to use.

There is an over-emphasis on the version number. Fundamentally the version is yet another parameter; the main difference is in matching: if it is not specified, the server should use the latest version that it has.
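A minimal Python sketch of that matching rule follows. It assumes the server keeps its supported versions in a list ordered from oldest to newest, which avoids having to compare version strings that may not be numeric. The list contents reuse the example identifiers from the table above; the names AVAILABLE and resolve_version are hypothetical.

```python
# Assumption: the server lists its supported versions oldest to newest.
AVAILABLE = ["u4.1.c1.3", "u4.1.c1.4", "u5.0.c1.3", "u5.0.c1.4"]


def resolve_version(params, available=AVAILABLE):
    """Pick the table version for a request.

    params is a dict of collation parameters. If no "version" parameter
    is given, the latest available version is used. If a version is
    given but not supported, None is returned.
    """
    requested = params.get("version")
    if requested is None:
        return available[-1]
    return requested if requested in available else None


print(resolve_version({}))                        # u5.0.c1.4
print(resolve_version({"version": "u4.1.c1.4"}))  # u4.1.c1.4
print(resolve_version({"version": "u3.0.c1.0"}))  # None
```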
 

...
9.4.1. Basic Collation Description 

The basic collation is intended to provide tolerable results for a
number of languages for all three operations (equality, substring and
ordering) so it is suitable as a mandatory-to-implement collation for
protocols which include ordering support. The ordering operation of
the basic collation is the Unicode Collation Algorithm [8] version 14
(UCAv14).

The latest version should be used, which is 5.0 (http://www.unicode.org/reports/tr10/). What the draft has there is the revision number; the version number should be used. (I know there is an oddity in versions before 4.0, but that shouldn't apply here.)

The equality and substring operations are created as described in
UCAv14 section 8. While that section is informative to UCAv14, it is
normative to this collation specification.

Delete the following section. It is wrong, and the full specification is in the UCA Spec. (See other document for options).

This collation is based on Unicode version 3.2, with the following
tables relevant: