L2/06-262

3.5. Naming Guidelines

While this specification makes no absolute requirements on the structure of collation identifiers, naming consistency is important, so the following initial guidelines are provided.

Collation identifiers with an international audience typically begin with "i;".

After the first ";" is normally the name of the general collation algorithm, followed by a series of algorithm modifications separated by the ";" delimiter. Parameterized modifications will use "=" to delimit the parameter from the value. The version numbers of any lookup tables used by the algorithm SHOULD be present as parameterized modifications.

Collation identifiers of the form *;vnd-domain.com;* are reserved for vendor-specific collations created by the owner of the domain name following the "vnd-" prefix (e.g. vnd-example.com for the vendor example.com). Registration of such collations (or the name space as a whole) with intended use of "Vendor" is encouraged when a public specification or open-source implementation is available, but is not
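
As an illustration of the naming guidelines quoted above, the following is a minimal Python sketch that splits an identifier into its audience prefix, algorithm name, and modifications. The names CollationId, parse_collation_id, and vendor_domain are hypothetical and do not come from the draft; treating any component that starts with "vnd-" as the vendor marker is likewise an assumption about how the *;vnd-domain.com;* form would be checked.

```python
from typing import NamedTuple, Optional


class CollationId(NamedTuple):
    scope: str       # first component, e.g. "i" for an international audience
    algorithm: str   # name of the general collation algorithm
    params: dict     # parameterized modifications, e.g. {"uv": "3.2"}
    flags: list      # modifications that carry no "=" value


def parse_collation_id(identifier: str) -> CollationId:
    """Split an identifier such as "i;basic;uca=3.1.1;uv=3.2" on ";" and "="."""
    parts = identifier.split(";")
    if len(parts) < 2 or not all(parts):
        raise ValueError(f"malformed collation identifier: {identifier!r}")
    scope, algorithm, *mods = parts
    params, flags = {}, []
    for mod in mods:
        name, sep, value = mod.partition("=")
        if sep:
            params[name] = value
        else:
            flags.append(mod)
    return CollationId(scope, algorithm, params, flags)


def vendor_domain(identifier: str) -> Optional[str]:
    """Return the domain after a "vnd-" prefix, or None if there is none.

    Assumption: the vendor marker may appear in any component.
    """
    for part in identifier.split(";"):
        if part.startswith("vnd-"):
            return part[len("vnd-"):]
    return None
```

For example, parse_collation_id("i;basic;uca=3.1.1;uv=3.2") yields the scope "i", the algorithm "basic", and the parameters uca=3.1.1 and uv=3.2.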

... Application protocols MAY return position information for substring matches. If this is done, the position information SHOULD include both the starting offset and the ending offset for each match. This
is important because more sophisticated collations can match strings of unequal length (for example, a pre-composed accented character can match a decomposed accented character). All matching substrings
should be reported, even overlapping matches (as when "ana" occurs twice within "banana").
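
To make the point about offsets concrete, here is a small Python sketch that reports both the starting and the ending offset of every match, including overlapping ones. It is not a collation implementation: canonical decomposition (NFD, via the standard unicodedata module) stands in for a real collation key, the search is a deliberately naive quadratic scan, and the names match_key and find_matches are hypothetical.

```python
import unicodedata


def match_key(s: str) -> str:
    # Stand-in for a real collation key: canonical decomposition (NFD) only.
    return unicodedata.normalize("NFD", s)


def find_matches(text: str, pattern: str) -> list:
    """Return (start, end) offsets, end exclusive, of every substring of
    text whose key equals the key of pattern, overlapping ones included."""
    target = match_key(pattern)
    matches = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if match_key(text[start:end]) == target:
                matches.append((start, end))
    return matches


# "ana" occurs twice within "banana", and the two matches overlap.
print(find_matches("banana", "ana"))            # [(1, 4), (3, 6)]

# A five-character decomposed pattern matches a four-character
# pre-composed span, so the end offset cannot be derived from the start.
print(find_matches("caf\u00e9s", "cafe\u0301"))  # [(0, 4)]
```

In the second call the pattern is five code points long but the matched span is only four, which is why a starting offset alone would not tell the client where the match ends.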

...

4.4. Use of Lookup Tables

Some collations use customizable lookup tables, e.g. because the tables depend on locale and may be modified after shipping the software. Collations which use more than one customizable lookup table in a documented format MUST assign numbers to the tables they

... Returning just the starting offset is not acceptable. This rule is necessary because advanced collations can treat strings of different lengths as equal (for example, pre-composed and decomposed accented characters).

Other collation registrations are owned by the individual(s) listed
in the contact field of the registration and IANA will preserve this
information. Changes to a registration MUST be approved by the
owner. In the event the owner cannot be contacted for a period of
one month and a change is deemed necessary, the IESG MAY re-assign
ownership to an appropriate party.

The draft should also allow ownership by an organization, not just by individuals. Case in point: UCA.

...

7.4. Example Initial Registry Summary

The following is an example of how IANA might structure the initial
registry summary.txt file:

Collation                  Functions  Scope  Reference
---------                  ---------  -----  ---------
Common Use Collations:
i;nameprep;v=1;uv=3.2      e, o, s    i18n   [RFC XXXX]
i;basic;uca=3.1.1;uv=3.2   e, o, s    i18n   [RFC XXXX]

=> i;basic;version=5.0     e, o, s    i18n   [Unicode]

The Unicode Consortium would be registering UCA with CLDR versions. This assumes the above description is changed so that the language is NOT put in the identifier (otherwise we get an explosion of registrations). I guess the table would look like the following:

i;uca;version=u5.0.c1.4    e, o, s    i18n   [Unicode]
i;uca;version=u5.0.c1.3    e, o, s    i18n   [Unicode]
...
i;uca;version=u4.1.c1.4    e, o, s    i18n   [Unicode]
i;uca;version=u4.1.c1.3    e, o, s    i18n   [Unicode]
...

I'd also recommend adding a column that indicates whether the collation takes parameters, and whether the language can be specified in a parameter.

The registration form should be checked against the summary, to make sure that all of the information in the summary can be (clearly and mechanically) derived from the registration form. For example, it is unclear where the Scope comes from.

...

The following table includes some example reasons to reject a
registration with cause:
...
o The collation identifier fails to precisely identify the version
  numbers of relevant tables to use.

There is an over-emphasis on the version number. Fundamentally, the version is just another parameter; the main difference is in matching: if no version is specified, the server should use the latest version that it has.
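
The matching behavior suggested here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names split_version, version_key, and resolve are hypothetical, and reading a version string such as "u5.0.c1.4" as a UCA version followed by a CLDR version is inferred from the example table above, not defined anywhere.

```python
import re
from typing import Optional

# Assumed shape of the version value: "u<major>.<minor>.c<major>.<minor>".
_VERSION_RE = re.compile(r"u(\d+)\.(\d+)\.c(\d+)\.(\d+)")


def version_key(version: str) -> tuple:
    """Turn "u5.0.c1.4" into (5, 0, 1, 4) so versions compare numerically."""
    m = _VERSION_RE.fullmatch(version)
    if m is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return tuple(int(g) for g in m.groups())


def split_version(identifier: str) -> tuple:
    """Return (identifier without its version parameter, version or None)."""
    kept, version = [], None
    for part in identifier.split(";"):
        if part.startswith("version="):
            version = part[len("version="):]
        else:
            kept.append(part)
    return ";".join(kept), version


def resolve(requested: str, available: list) -> Optional[str]:
    """Pick the collation a server would use for a requested identifier."""
    base, version = split_version(requested)
    if version is not None:
        # An explicit version must match exactly.
        return requested if requested in available else None
    # No version given: use the latest version the server has.
    candidates = []
    for ident in available:
        other_base, other_version = split_version(ident)
        if other_base == base and other_version is not None:
            candidates.append((version_key(other_version), ident))
    return max(candidates)[1] if candidates else None


available = [
    "i;uca;version=u4.1.c1.4",
    "i;uca;version=u5.0.c1.3",
    "i;uca;version=u5.0.c1.4",
    "i;uca;version=u4.1.c1.3",
]
print(resolve("i;uca", available))  # i;uca;version=u5.0.c1.4
```

A request that names a version the server does not have returns None in this sketch; whether a server should instead fall back to another version is left open.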

...
9.4.1. Basic Collation Description

The basic collation is intended to provide tolerable results for a
number of languages for all three operations (equality, substring and
ordering) so it is suitable as a mandatory-to-implement collation for
protocols which include ordering support. The ordering operation of
the basic collation is the Unicode Collation Algorithm [8] version 14
(UCAv14).

The latest version should be used, which is 5.0 (http://www.unicode.org/reports/tr10/). What the draft has there is the revision number; the version number should be used instead. (I know there is an oddity in versions before 4.0, but that shouldn't apply here.)

The equality and substring operations are created as described in
UCAv14 section 8. While that section is informative to UCAv14, it is
normative to this collation specification.

Delete the following section. It is wrong, and the full specification is in the UCA Spec. (See other document for options).

Subject:	Comments on draft-newman-i18n-comparator-12.txt
From:	Mark Davis
Date:	2006-07-28