Subject: Comments on IETF NextSteps
From: Mark Davis
There is a new version of the NextSteps document (at http://www.ietf.org/internet-drafts/draft-iab-idn-nextsteps-06.txt ), and it is now in the RFC-Editor queue to be published as an RFC.
There is much that the Unicode Consortium is in agreement with. In particular, we have for some time subscribed to the view that the character set should be limited to identifier characters: those characters required for use in words, plus decimal digits, excluding punctuation, symbols, etc. This is the recommended approach in the Unicode security guidelines (http://www.unicode.org/reports/tr36/), and the identifier characters are as defined in http://www.unicode.org/reports/tr31/tr31-6.html.
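As a rough illustration of that restriction (a sketch, not a normative check: it uses Python's str.isidentifier(), whose rules are derived from the UAX #31 XID_Start/XID_Continue properties), word characters and trailing digits are accepted while punctuation and symbols are rejected:

```python
# Sketch: Python's identifier test is derived from the UAX #31
# XID_Start/XID_Continue properties, so it approximates the
# "identifier characters" restriction described above.
candidates = ["caf\u00e9", "a3", "\u03b4x", "a-b", "a!", "\u263a"]
for s in candidates:
    print(f"{s!r}: {s.isidentifier()}")
# Word characters (including non-ASCII letters) pass; the hyphen,
# the exclamation mark, and the symbol U+263A do not.
```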
However, there are many parts of this document that are extremely problematic, and we have been quite disappointed by the IAB's unwillingness to enter into effective discussion of the issues. Some errors have been corrected in successive versions of the document in response to feedback, but others remain without any reason given for their retention, and each version introduces yet more (but different) errors.
For example, the latest draft contains a new paragraph insinuating that NFKC is not well-defined ("...Is U+0131 U+0307 U+0307 (dotless i and two combining dot-above characters) equivalent to U+00EF or U+0069, or neither? NFKC does not appear to tell us..."). The definition of NFKC is completely deterministic and widely implemented, with sample code and extensive test data provided. The authors could have asked any of a number of people what the actual result is (in this case, neither), written simple test programs themselves using freely available software, or used any of a number of online tools that would answer the questions they pose.
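The determinism is easy to verify; as a quick sketch using Python's standard unicodedata module (one of many freely available implementations of the Unicode normalization forms):

```python
import unicodedata

s = "\u0131\u0307\u0307"  # dotless i + two combining dot-above characters
nfkc = unicodedata.normalize("NFKC", s)
# No precomposed character has this canonical decomposition, so NFKC
# leaves the sequence unchanged: it equals neither U+00EF nor U+0069.
print([f"U+{ord(c):04X}" for c in nfkc])
print(nfkc == "\u00ef", nfkc == "\u0069")  # prints: False False
```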
Other statements, like "Consequently, conversion from [another encoding] to Unicode may potentially lose information.", are simply misleading to the reader. No significant encoding loses information when converted to Unicode. That is why, for example, the interpretation of all XML documents -- even those in other encodings -- is defined in terms of what they would be if converted to Unicode. That is why companies like Google, Yahoo, Microsoft, Verisign, Apple, ... convert all incoming data to Unicode internally for processing, without any worries about losing data. For this and many other cases, the authors have been asked repeatedly to supply evidence for their purported problems, but none has ever been forthcoming.
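The lossless direction is trivial to demonstrate; as a minimal sketch (assuming ISO 8859-1 as the source encoding), a round trip through Unicode preserves every byte:

```python
# Minimal round-trip sketch: every Latin-1 (ISO 8859-1) byte value maps
# to a distinct Unicode code point, so conversion to Unicode and back
# loses nothing.
data = bytes(range(256))                  # all 256 possible byte values
text = data.decode("iso-8859-1")          # convert to Unicode
assert text.encode("iso-8859-1") == data  # lossless round trip
print("round trip lossless:", text.encode("iso-8859-1") == data)
```

The same round-trip property holds for the other widely deployed legacy encodings, which is what makes Unicode a safe internal processing form.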
It is really unfortunate that there is so much of value in the NextSteps document, yet so many clear errors and ill-founded conclusions -- with so little effort made by the authors to involve experts in the field -- that the casual reader will not be able to separate truth from falsehood. Most importantly, these errors are used to justify the conclusion that they "...may leave us "stuck" at Unicode 3.2...".
The committee needs to discuss possible approaches to take in light of the impending publication of this document.