L2/06-186
Subject: Re: draft-newman-i18n-collation-09.txt just posted
Date: Wed, 10 May 2006 15:08:53 +0200
From: Arnt Gulbrandsen
To: ietf-imapext@imc.org, ietf-mta-filters@imc.org

Mark Davis writes:
> The release of this is timely (we didn't get notified of a 07 or 08
> draft), since the Unicode Technical Committee is meeting next week,
> and can discuss it.
>
> Could you indicate which of the items raised in the email of
> 2006-02-21 from the Unicode Technical Committee have been addressed
> in this release (and if not accepted then why)? That would help
> greatly with the review. (I couldn't find any archive for discussion
> of draft-newman-i18n-comparator where that email could be publicly
> linked from, so I am appending it at the end of this message.) At a
> quick glance, it appears that a number of comments have been
> incorporated.

Lots. Some not. See below.

It is possible that some of my changes don't satisfy you. I had
conflicting requests for many things. Feel free to repeat, rephrase or
add arguments.

> Mark
>
> BTW, despite the subject of the message, the document is at
> http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-09.txt.
> It helps to send out a link, especially if the name (comparator vs
> collation) is wrong ;-)

Mea culpa. My apologies.

...

>> To: Network Working Group
>> Re: draft-newman-i18n-comparator
>> Date: 2006-02-21
>> From: Unicode Technical Committee
>>
>> The Unicode Technical Committee has reviewed the document
>> http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-06.txt.
>> While UTC is in favor of the goal, there are a number of problems
>> with the document. The main problems are outlined below. Once these
>> are addressed, then further review can continue.
>>
>> Details
>>
>> > 2.1 Definitions
>>
>> Content
>>
>> The document needs to include the definitions of the technical terms
>> used in the document, including all those that may not be familiar
>> to implementers, such as "trichotomous" and "collation identifiers".
>> In particular, the notion of a substring is /prima facie/ quite
>> simple, but there are complications that require a clear definition.
>> The text in the document does not make clear that there may be more
>> than one match for a substring in a string, and that the matches can
>> overlap. It says "the starting offset", for example, when there may
>> be multiple ones.

Changed.

>> Moreover, language sensitive matches have additional complications
>> which need to be called out. For more information, see
>> http://www.unicode.org/reports/tr10/#Searching

Not really changed. As I recall, I added a little bit of text.

>> Format
>>
>> If there is a "Definitions" section, readers have a reasonable
>> expectation that that section should contain all the required
>> definitions. However, a number of definitions are scattered within
>> the text. One of two approaches should be taken
>>
>> 1. Move all the definitions into this section.
>> 2. Remove the definitions section, but clearly call out in the text
>>    the definition of each term on its own line.
>>
>> Mixing these two styles is needlessly confusing for readers.

Not changed; I'm going by what confuses reviewers.

>> > 2.4 Sort Keys
>>
>> The use of the term "collation canonicalization" to refer to sort
>> keys is very misleading. ...

Changed; the text now speaks of sort keys. I'm afraid there are still
instances of the old term around; I found one today.

>> > 3.2
>>
>> This specifies that clients that support disconnected operation
>> should not use wildcards while clients that provide collation
>> operations only when connected to the server may use wildcards.

This restriction has been lifted.

>> The EBNF syntax shown in section 3.2 says that the collation-wild
>> must not exceed 255 characters total while the section 3.1 specifies
>> that the collation name must not exceed 254 characters.

Brought into sync.

>> It seems having the same maximum possible length for both collation
>> name and wildcard string would be desirable for actual
>> implementations.

I picked 254, not 255, but I confess I cannot remember why.

>> > 4.2.1 Equality
>>
>> It needs to be made clear that the return values are not physically
>> the strings "match", etc. but enumerated values such as /equal/ and
>> /not_equal/.

Changed. Also other similar changes.

>> One extremely important point is that for a given comparator, the
>> equality function must be synchronized with the ordering function.

I've done this and all the other equivalences/connections/implications
I could see.

>> The term 'error' is also problematic, since what is really at issue
>> is a question of domain. For all those strings in the domain, either
>> 'equal' or 'not_equal' should be returned from the equality
>> function. For any string not in the domain, 'undefined' should be
>> returned.

Not changed. Back in February, I agreed that "error" was not ideal, but
did not see "undefined" as better, and could not find a really apt
term. The collations were a little too well-defined in the "undefined"
cases then.

However, in -10, I think they really will be undefined outside their
domain, so I'll change to using "undefined" instead of "error". (I'm
removing the bits that fall back to i;octet.)

>> There is a typo at the 4'th line of the second paragraph of the
>> section 4.2 saying "... For example, an collation" which should be
>> changed to "... For example, a collation" instead.

Fixed.

>> > 4.2.2 Substring
>>
>> Prefix and suffix matching are not fully spelled out.

I think they are now.

>> The operations and their results must be clarified.
>> And as noted before, it is very important to precisely define the
>> substring operations, especially the starting offset and ending
>> offset. It also must be clarified whether what is being asked for is
>> the first possible matching location in the string, the last, or the
>> nth one.

Partly changed. I didn't do the bits you ask for in the last sentence.
I can add an open issue.

>> > 4.3.3 Ordering
>>
>> > It MUST be transitive and trichotomous.
>>
>> As above, these should be defined.

I did not, since I think this document is the wrong place to define
these terms.

>> The exposition in this section would be simpler if you also defined
>> "reversible", whereby f(a,b) = less iff f(b,a) = greater.

The exposition changed enough as a result of other comments that I
disregarded this comment.

>> An 'undefined' value can be allowed if, as per equality above, it
>> means that at least one of the operands is outside of the domain.
>> The function then imposes a total order on all strings in the
>> domain; moreover, a wrapper can easily convert the function to a
>> total order over all strings by putting all items outside the domain
>> either below or above the ones in the domain -- or even excluding
>> them, /at its choice/.

I'm doing something like this in -10. (Removing the fallback to
i;octet.)

>> [Note: it is very important to avoid the confusion between
>> "identical" and "equal". According to a caseless compare, "Mark" and
>> "mark" are equal; however, the strings are not identical.]

Changed all over the place.

>> [Either 'ordering function' or 'comparison function' should be used
>> consistently, not sometimes 'collations'].

Changed.

>> > 4.3. Internal Canonicalization Algorithm
>>
>> This section is difficult to understand.

Changed; I hope the new text is better.

>> > 4.4. Use of Lookup Tables
>>
>> It is not at all clear what is meant by "customizable lookup tables".

Clarified and partly removed.

>> > 4.5. Multi-Value Attributes
>>
>> This is very unclear.

Deleted.

>> This is a very important feature that needs to be spelled out in
>> detail, and clearly reflected in the template for registration. In
>> particular, the template should have provision for multiple
>> attributes, with the ability to specify the acceptable operands for
>> that attribute. (See below). The specification of the operands could
>> be either a list of values, or a regular expression (with the former
>> preferred). Suggested regular expression syntax would be Perl or XML
>> Schema.

I asked Martin Dürst and you to provide a new DTD. Martin said okay; I
don't remember whether you answered. I think the DTD should come before
this.

>> > 5.1 Character Encoding
>>
>> The protocol specification has to make sure that it is clear on which
>> characters (rather than just octets) the collations are used. This
>> can be done by specifying the protocol itself in terms of characters
>> (e.g. in the case of a query language), by specifying a single
>> character encoding for the protocol (e.g. UTF-8 [3]), or by
>> carefully describing the relevant issues of character encoding
>> labeling and conversion. In the latter case, details to consider
>> include how to handle unknown charsets, any charsets which are
>> mandatory-to-implement, any issues with byte-order that might apply,
>> and any transfer encodings which need to be supported.
>>
>> If a collation is able to advertise itself as being able to handle,
>> say, SJIS and UTF-8, then there should be a required description of a
>> protocol for indicating that and for communicating which encodings
>> are handled, and how it handles error conditions (such as a charset
>> outside of those it can handle). Otherwise, it is difficult to
>> understand how this paragraph would be applied in practice.
>>
>> > 5.3
>>
>> The section 5.3 specifies:
>>
>>    The protocol MUST specify how comparisons behave in the absence of
>>    explicit collation negotiation or when a collation of "*" is
>>    requested.
>>    The protocol MAY specify that the default collation
>>    used in such circumstances is sensitive to server configuration.
>>
>> and the section 3.2 specifies:
>>
>>    ... If the wildcard string matches multiple collations, the server
>>    SHOULD select the collation with the broadest scope (preferably
>>    international scope), the most recent table versions and the
>>    greatest number of supported operations. A single wildcard
>>    character ("*") refers to the application protocol collation
>>    behavior that would occur if no explicit negotiation were used.
>>
>> These appear inconsistent.

Changed.

>> 7.5. Example Initial Registry Summary
>>
>> The sample registry would suffer a combinatorial explosion if
>> parameters are not handled differently. ...

This is the DTD issue.

>> > 11. Security Considerations
>>
>> This is insufficient. It should at least point to the problems
>> related in UCA and in
>> http://www.unicode.org/reports/tr36/tr36-4.html (note that that
>> document has been approved by the UTC and will be posted as an
>> approved version soon.)

It now refers.

>> General
>>
>> One of the real problems with the IANA character registry is that the
>> entries are underspecified. It quite often occurs that two vendors
>> implement the same IANA charset conversion different ways, leading
>> to significant interoperability problems and text corruption. See,
>> for example,
>> http://www.w3.org/Submission/japanese-xml/#ambiguity_of_yen.
>>
>> We have the real concern that this registry could lead down the same
>> path.

Noted.

>> > collation, it has to say so
>>
>> There are places where the text should be clarified, as to whether a
>> MUST or SHOULD is implied; this is just an example.
>>
>> > "comparator" vs "collator"
>>
>> Either one term or the other should be used consistently.

Collator, now.

>> > Unicode 3.2
>>
>> Unicode 3.2 is obsolete; the reference versions for the Collation
>> Registry should be Unicode 5.0 and UCA 5.0, since those will be
>> approved and published by the time the Internet Application Protocol
>> Collation Registry has completed its review and been approved.

I'll update to the then-current versions immediately before submitting
the final draft as an RFC.

>> Because of the use of NamePrep, it is probably the case that Unicode
>> 3.2 also needs to be included, but strongly recommended for usage
>> only by protocols or systems dependent on NamePrep. Note that as of
>> UCA 4.0 and beyond, the version number of UCA is guaranteed to be
>> identical with the version number of Unicode that it is defined for.
>>
>> > Versioning
>>
>> This is tricky, and should be clarified. In many instances, it is
>> sufficient to use an unversioned collator, such as simply "UCA". In
>> other cases, there are requirements to use a specific version, or a
>> version of at least X. This needs to be described.

IETF documents should have only immutable references. Thus, I can
reference "UCAv14", but not "UCA", because the latter moves to v15, v16
and onwards.

Arnt