L2/06-186

Subject: Re: draft-newman-i18n-collation-09.txt just posted
Date: 	Wed, 10 May 2006 15:08:53 +0200
From: 	Arnt Gulbrandsen <arnt@gulbrandsen.priv.no>
To: 	ietf-imapext@imc.org, ietf-mta-filters@imc.org


Mark Davis writes:
> The release of this is timely (we didn't get notified of a 07 or 08
> draft), since the Unicode Technical Committee is meeting next week,
> and can discuss it.
>
> Could you indicate which of the items raised in the email of
> 2006-02-21 from the Unicode Technical Committee have been addressed
> in this release (and if not accepted then why)? That would help
> greatly with the review. (I couldn't find any archive for discussion
> of draft-newman-i18n-comparator where that email could be publicly
> linked from, so I am appending it at the end of this message.) At a
> quick glance, it appears that a number of comments have been
> incorporated.

Lots. Some not. See below.

It is possible that some of my changes don't satisfy you. I had
conflicting requests for many things. Feel free to repeat, rephrase or
add arguments.

> Mark
>
> BTW, despite the subject of the message, the document is at
> http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-09.txt.
> It helps to send out a link, especially if the name (comparator vs
> collation) is wrong ;-)

Mea culpa. My apologies.

...
>> To:   Network Working Group
>> Re:   draft-newman-i18n-comparator
>> Date:         2006-02-21
>> From:         Unicode Technical Committee
>>
>> The Unicode Technical Committee has reviewed the document
>> http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-06.txt.
>> While UTC is in favor of the goal, there are a number of problems
>> with the document. The main problems are outlined below. Once these
>> are addressed, then further review can continue.
>>
>>     Details
>>
>>       > 2.1 Definitions
>>
>>         Content
>>
>> The document needs to include the definitions of the technical terms
>> used in the document,  including all those that may not be familiar
>> to implementers, such as "trichotomous" and "collation identifiers".
>> In particular, the notion of a substring is /prima facie/ quite
>> simple, but there are complications that require a clear definition.
>> The text in the document does not make clear that there may be more
>> than one match for a substring in a string, and that the matches can
>> overlap. It says "the starting offset", for example, when there may
>> be multiple ones.

Changed.
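
To illustrate the point about multiple and overlapping matches, here is
a minimal sketch. It assumes plain code-point comparison (roughly what
an octet-style collation would do), not a language-sensitive match, and
the function name find_all_matches is made up for this example; it is
not taken from the draft.

```python
def find_all_matches(haystack: str, needle: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of every occurrence of needle,
    including occurrences that overlap each other."""
    if not needle:
        return []
    matches = []
    start = haystack.find(needle)
    while start != -1:
        matches.append((start, start + len(needle)))
        # Advance by one position, not by len(needle), so that
        # overlapping occurrences are also reported.
        start = haystack.find(needle, start + 1)
    return matches


# "aa" occurs three times in "aaaa", at overlapping offsets.
matches = find_all_matches("aaaa", "aa")
```

With more than one entry in the result, "the starting offset" is
ambiguous unless the text says which match is meant.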

>> Moreover, language sensitive matches have additional complications
>> which need to be called out. For more information, see
>> http://www.unicode.org/reports/tr10/#Searching

Not really changed. As I recall, I added a little bit of text.

>>         Format
>>
>> If there is a "Definitions" section, readers have a reasonable
>> expectation that that section should contain all the required
>> definitions. However, a number of definitions are scattered within
>> the text. One of two approaches should be taken
>>
>>    1. Move all the definitions into this section.
>>    2. Remove the definitions section, but clearly call out in the text
>>       the definitions of each term on its own line.
>>
>> Mixing these two styles is needlessly confusing for readers.

Not changed; I'm going by what confuses reviewers.

>>       > 2.4 Sort Keys
>>
>> The use of the term "collation canonicalization" to refer to sort
>> keys is very misleading. ...

Changed; the text now speaks of sort keys. I'm afraid there are still
instances of the old term around; I found one today.

>> > 3.2
>>
>> This specifies that clients that support disconnected operation
>> should not use wildcards while clients that provide collation
>> operations only when connected to the server may use wildcards.

This restriction has been lifted.

>> The EBNF syntax shown in section 3.2 says that the collation-wild
>> must not exceed 255 characters total while the section 3.1 specifies
>> that the collation name must not exceed 254 characters.

Brought into sync.

>> It seems having the same maximum possible length for both collation
>> name and wildcard string would be desirable for actual
>> implementations.

I picked 254, not 255, but I confess I cannot remember why.

>>       > 4.2.1 Equality
>>
>> It needs to be made clear that the return values are not physically
>> the strings "match", etc. but enumerated values such as /equal/ and
>> /not_equal/.

Changed. Also other similar changes.

>> One extremely important point is that for a given comparator, the
>> equality function must be synchronized with the ordering function.

I've done this and all the other equivalences/connections/implications I
could see.
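
One way to keep the two functions synchronized is to derive both from
the same sort key. The sketch below is only an illustration under
assumptions: str.casefold() stands in for a real sort-key function, and
the names Equality, Ordering, sort_key, order and equal are
hypothetical, not the draft's. It also returns enumerated values rather
than literal strings.

```python
from enum import Enum


class Ordering(Enum):
    LESS = "less"
    EQUAL = "equal"
    GREATER = "greater"


class Equality(Enum):
    EQUAL = "equal"
    NOT_EQUAL = "not_equal"


def sort_key(s: str) -> str:
    # Assumption: a caseless collation, with casefold() standing in
    # for a real sort-key function.
    return s.casefold()


def order(a: str, b: str) -> Ordering:
    ka, kb = sort_key(a), sort_key(b)
    if ka < kb:
        return Ordering.LESS
    if ka > kb:
        return Ordering.GREATER
    return Ordering.EQUAL


def equal(a: str, b: str) -> Equality:
    # Equality is defined through the ordering function, so the two
    # cannot disagree.
    if order(a, b) is Ordering.EQUAL:
        return Equality.EQUAL
    return Equality.NOT_EQUAL
```

Under this caseless sketch, equal("Mark", "mark") is Equality.EQUAL
even though the two strings are not identical.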

>> The term 'error' is also problematic, since what is really at issue
>> is a question of domain. For all those strings in the domain, either
>> 'equal' or 'not_equal' should be returned from the equality
>> function. For any string not in the domain, 'undefined' should be
>> returned.

Not changed. Back in February, I agreed that "error" was not ideal, but
did not see "undefined" as better, and could not find a really apt
term. The collations were a little too well-defined in the "undefined"
cases then.

However, in -10, I think they really will be undefined outside their
domain, so I'll change to using "undefined" instead of "error". (I'm
removing the bits that fall back to i;octet.)

>> There is a typo at the 4'th line of the second paragraph of the
>> section 4.2 saying "... For example, an collation" which should be
>> changed to "... For example, a collation" instead.

Fixed.

>>       > 4.2.2 Substring
>>
>> Prefix and suffix matching are not fully spelled out.

I think they are now.

>> The operations and their results must be clarified. And as noted
>> before, it is very important to precisely define the substring
>> operations, especially the starting offset and ending offset. It
>> also must be clarified whether what is being asked for is the first
>> possible matching location in the string, the last, or the nth one.

Partly changed. I didn't do the bits you ask for in the last sentence. I
can add an open issue.

>>       > 4.3.3 Ordering
>>
>> > It MUST be transitive and trichotomous.
>>
>> As above, these should be defined.

I did not, since I think this document is the wrong place to define
these terms.

>> The exposition in this section would be simpler if you also defined
>> "reversible", whereby f(a,b) = less iff f(b,a) = greater.

The exposition changed enough as a result of other comments that I
disregarded this comment.
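
For reference, the "reversible" property quoted above can be checked
mechanically over a set of sample strings. This is a rough sketch, not
text from the draft: the comparison function uses casefold() as a
stand-in collation, and the names Ordering, compare and is_reversible
are invented for the example.

```python
from enum import Enum
from itertools import product


class Ordering(Enum):
    LESS = "less"
    EQUAL = "equal"
    GREATER = "greater"


def compare(a: str, b: str) -> Ordering:
    # Assumption: caseless comparison via casefold().
    ka, kb = a.casefold(), b.casefold()
    if ka < kb:
        return Ordering.LESS
    if ka > kb:
        return Ordering.GREATER
    return Ordering.EQUAL


def is_reversible(f, samples) -> bool:
    """Check f(a, b) == LESS iff f(b, a) == GREATER for all sample pairs."""
    for a, b in product(samples, repeat=2):
        if (f(a, b) is Ordering.LESS) != (f(b, a) is Ordering.GREATER):
            return False
    return True


ok = is_reversible(compare, ["a", "B", "b", "c"])
```

A check like this only covers the samples it is given; it does not
prove the property for all strings.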

>> An 'undefined' value can be allowed if, as per equality above, it
>> means that at least one of the operands is outside of the domain.
>> The function then imposes a total order on all strings in the
>> domain; moreover, a wrapper can easily convert the function to a
>> total order over all strings by putting all items outside the domain
>> either below or above the ones in the domain -- or even excluding
>> them, /at its choice/.

I'm doing something like this in -10. (Removing the fallback to i;octet.)
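
The wrapper idea in the quoted comment can be sketched as follows. All
of it is illustrative and rests on assumptions: the domain is taken to
be ASCII-only strings, None plays the role of 'undefined', items outside
the domain are placed above those inside it, and two out-of-domain
items are ordered by code point purely so that the wrapper stays a
total order. None of the names come from the draft.

```python
from functools import cmp_to_key
from typing import Optional


def in_domain(s: str) -> bool:
    # Assumption: this hypothetical collation only handles ASCII.
    return s.isascii()


def compare(a: str, b: str) -> Optional[int]:
    """Return -1, 0 or 1, or None ('undefined') if either operand
    is outside the collation's domain."""
    if not (in_domain(a) and in_domain(b)):
        return None
    ka, kb = a.casefold(), b.casefold()
    return (ka > kb) - (ka < kb)


def total_compare(a: str, b: str) -> int:
    """Wrap compare() so that it orders all strings."""
    result = compare(a, b)
    if result is not None:
        return result
    a_in, b_in = in_domain(a), in_domain(b)
    if a_in != b_in:
        # In-domain strings sort below out-of-domain strings.
        return -1 if a_in else 1
    # Both operands are outside the domain: order them by code point.
    return (a > b) - (a < b)


items = ["pear", "über", "Apple", "café"]
sorted_items = sorted(items, key=cmp_to_key(total_compare))
```

The caller, not the collation, decides where the out-of-domain items
go; excluding them before sorting would be an equally valid choice.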

>> [Note: it is very important to avoid the confusion between
>> "identical" and "equal". According to a caseless compare, "Mark" and
>> "mark" are equal; however, the strings are not identical.]

Changed all over the place.

>> [Either 'ordering function' or 'comparison function' should be used
>> consistently, not sometimes 'collations'].

Changed.

>>       > 4.3.  Internal Canonicalization Algorithm
>>
>> This section is difficult to understand.

Changed; I hope the new text is better.

>>       > 4.4.  Use of Lookup Tables
>>
>> It is not at all clear what is meant by "customizable lookup tables".

Clarified and partly removed.

>>       > 4.5.  Multi-Value Attributes
>>
>> This is very unclear.

Deleted.

>> This is a very important feature that needs to be spelled out in
>> detail, and clearly reflected in the template for registration. In
>> particular, the template should have provision for multiple
>> attributes, with the ability to specify the acceptable operands for
>> that attribute. (See below). The specification of the operands could
>> be either a list of values, or a regular expression (with the former
>> preferred). Suggested regular expression syntax would be Perl or XML
>> Schema.

I asked Martin Drst and you to provide a new DTD. Martin said okay, I
don't remember whether you answered. I think the DTD should come before
this.

>>       > 5.1 Character Encoding
>>
>>    The protocol specification has to make sure that it is clear on which
>>    characters (rather than just octets) the collations are used.  This
>>    can be done by specifying the protocol itself in terms of characters
>>    (e.g. in the case of a query language), by specifying a single
>>    character encoding for the protocol (e.g.  UTF-8 [3]), or by
>>    carefully describing the relevant issues of character encoding
>>    labeling and conversion.  In the later case, details to consider
>>    include how to handle unknown charsets, any charsets which are
>>    mandatory-to-implement, any issues with byte-order that might apply,
>>    and any transfer encodings which need to be supported.
>>
>> If a collation is able to advertise itself as being able to handle,
>> say, SJIS and UTF-8, then there should be a required description of a
>> protocol for indicating that and for communicating which encodings
>> are handled, and how it handles error conditions (such as a charset
>> outside of those it can handle). Otherwise, it is difficult to
>> understand how this paragraph would be applied in practice.
>>
>>       > 5.3
>>
>> The section 5.3 specifies:
>>
>>     The protocol MUST specify how comparisons behave in the absence of
>>     explicit collation negotiation or when a collation of "*" is
>>     requested. The protocol MAY specify that the default collation
>>     used in such circumstances is sensitive to server configuration.
>>
>> and the section 3.2 specifies:
>>
>>     ... If the wildcard string matches multiple collations, the server
>>     SHOULD select the collation with the broadest scope (preferably
>>     international scope), the most recent table versions and the
>>     greatest number of supported operations. A single wildcard
>>     character ("*") refers to the application protocol collation
>>     behavior that would occur if no explicit negotiation were used.
>>
>> These appear inconsistent.

Changed.

>>       7.5.  Example Initial Registry Summary
>>
>> The sample registry would suffer a combinatorial explosion if
>> parameters are not handled differently.
...

This is the DTD issue.

>> > 11.  Security Considerations
>>
>> This is insufficient. It should at least point to the problems
>> related in UCA and in
>> http://www.unicode.org/reports/tr36/tr36-4.html (note that that
>> document has been approved by the UTC and will be posted as an
>> approved version soon.)

It now refers to both.

>>     General
>>
>> One of the real problems with the IANA character registry is that the
>> entries are underspecified. It quite often occurs that two vendors
>> implement the same IANA charset conversion different ways, leading
>> to significant interoperability problems and text corruption. See,
>> for example,
>> http://www.w3.org/Submission/japanese-xml/#ambiguity_of_yen.
>>
>> We have the real concern that this registry could lead down the same path.

Noted.

>> > collation, it has to say so
>>
>> There are places where the text should be clarified, as to whether a
>> MUST or SHOULD is implied; this is just an example.
>>
>> > "comparator" vs "collator"
>>
>> Either one term or the other should be used consistently.

Collator, now.

>> > Unicode 3.2
>>
>> Unicode 3.2 is obsolete; the reference versions for the Collation
>> Registry should be Unicode 5.0 and UCA 5.0, since those will be
>> approved and published by the time the Internet Application Protocol
>> Collation Registry has completed its review and been approved.

I'll update to the then-current versions immediately before submitting
the final draft as an RFC.

>> Because of the use of NamePrep, it is probably the case that Unicode
>> 3.2 also needs to be included, but strongly recommended for usage
>> only by protocols or systems dependent on NamePrep. Note that as of
>> UCA 4.0 and beyond, the version number of UCA is guaranteed to be
>> identical with the version number of Unicode that it is defined for.
>>
>> > Versioning
>>
>> This is tricky, and should be clarified. In many instances, it is
>> sufficient to use an unversioned collator, such as simply "UCA". In
>> other cases, there are requirements to use a specific version, or a
>> version of at least X. This needs to be described.

IETF documents should have only immutable references. Thus, I can
reference "UCAv14", but not "UCA", because the latter moves to v15, v16
and onwards.

Arnt