L2/06-167
From: Mark Davis
Subject: Unicode security subcommittee feedback on iab-idn-nextsteps
Date: Wed, 05 Apr 2006

On behalf of the Unicode security subcommittee, I'm conveying some additional points that should be clarified or fixed in http://www.iab.org/documents/drafts/draft-iab-idn-nextsteps-04.txt.

* Section 2.2.3: "characters that are essentially identical will not match" What is meant by "essentially identical"? Does this mean identical in appearance, identical in internal representation, identical in semantics, canonically equivalent (same NFC forms), or compatibility equivalent (same NFKC forms)? The intent needs to be clarified; otherwise the statement is subject to misinterpretation.

* Section 2.2.3: "This Unicode normalization process [does not account for] equivalences that are language or script dependent" What is meant by "script-dependent equivalences"? Can you provide an example?

* Section 2.2.3: "U+00F8 [...] and U+00F6 [...] are considered to match in Swedish" "Match" needs some clarification. In accordance with Swedish standards, when collating with a Swedish locale, all major implementations match these characters at the first and second level, but not at a lower level. Thus they are not exact matches: this might be better phrased in terms of equivalence.

* Section 2.2.3: "Even if the language is known and language-specific rules can be defined, dependencies on the language do not disappear" It is unclear what this means. Could you give an example?

* Section 2.2.1: "Those characters are not treated as equivalent according to the Unicode consortium while...". This is somewhat ad hominem. It should rather be "...according to the Unicode Standard while..."

* Section 2.2.1: "..confusion in Germany, where the U+00F8 character is never used in the language". That is not true; there are entries with that character in the Duden dictionary.

* Section 2.2.4: "This is because [...] some glyphs [...]
have been assigned different codepoints in Unicode". This is incorrect: glyphs are not assigned to codepoints; characters are.

* Section 2.2.6: "Is the answer the same for words two [sic] different languages that translate into each other?". This is completely orthogonal to IDNs (cf. "Is 'cat' the same as 'gato' or the same as 'katze'?").

* Section 2.2.7: "the IESG statement [...] that a registry should have a policy about the scripts, languages, codepoints and text directions". This does not appear to be an accurate paraphrase of the statement (http://www.ietf.org/IESG/STATEMENTS/IDNstatement.txt). That document rather says a registry "MIGHT want to prevent particular characters", "MIGHT want to automatically generate a list of (...) strings and suggest that they also be registered" and lastly "it is suggested that a registry act conservatively". There is no "SHOULD" wording and, for instance, text direction is not mentioned.

* Section 2.2.8: "This maybe [...] because many other applications are internally sensitive only to the appearance of characters and not to their representation". This is reversed. The vast majority of applications are internally sensitive only to the representation, not to the appearance. OCR would be one exception, for example.

* Section 2.2.8: "A change in a code point assignment (...) may be extremely disruptive". This suggests that the Consortium capriciously changes code points. After the merger with ISO 10646 there was only one point at which the Unicode Consortium changed code points: in Unicode 2.0.0 (July 1996), the characters in the Korean Hangul block were moved to be part of a new, larger block with all 11,172 Hangul syllables. As a result of the disruption that this caused, the Unicode Consortium and ISO/IEC SC2 resolved never to change code points in the future, and no changes have been made since.

* Section 3.1.1: "...such as code points assigned to font variations...". Which characters are these referring to?
Is it just to characters that are resolved by NFKC normalization, or does it refer to others?

* Section 4.5: "the whois protocol itself (...) is ASCII-only". This appears to be inaccurate. The Whois protocol (http://www.ietf.org/rfc/rfc3912.txt?number=3912) has no mechanism to indicate which character encoding is in use, but the protocol is 8-bit clean and is indeed used that way by many (for instance, DENIC has a UTF-8 implementation up and running).
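
As an illustration of the distinctions asked about in the Section 2.2.3 and 3.1.1 comments above, here is a minimal Python sketch using the standard `unicodedata` module. It is not taken from the draft or from any Unicode specification; the function name `classify` and the sample pairs are assumptions chosen only for illustration. It separates strings that are identical in representation, canonically equivalent (same NFC form), compatibility equivalent (same NFKC form), or none of these. It does not attempt to model visual similarity or locale-specific collation.

```python
import unicodedata


def classify(a: str, b: str) -> str:
    """Return the strongest normalization-based relation between two strings.

    'classify' is a hypothetical helper for illustration only.
    """
    if a == b:
        return "identical representation"
    if unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b):
        return "canonically equivalent (same NFC form)"
    if unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b):
        return "compatibility equivalent (same NFKC form)"
    return "not equivalent under normalization"


# Sample pairs, written with escapes so the code points are unambiguous.
samples = [
    # Precomposed o-diaeresis vs. 'o' followed by a combining diaeresis.
    ("\u00F6", "o\u0308"),
    # A font-variant character (MATHEMATICAL BOLD CAPITAL A) vs. plain 'A'.
    ("\U0001D400", "A"),
    # Latin 'a' vs. Cyrillic 'a': similar appearance, unrelated by normalization.
    ("a", "\u0430"),
    # U+00F8 vs. U+00F6: related only by locale-specific collation, which
    # normalization does not model.
    ("\u00F8", "\u00F6"),
]

for left, right in samples:
    names = " / ".join(
        "+".join(f"U+{ord(ch):04X}" for ch in s) for s in (left, right)
    )
    print(f"{names}: {classify(left, right)}")
```

The last two pairs show why "essentially identical" needs to be pinned down: neither appearance nor Swedish collation behaviour is captured by NFC or NFKC.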