L2/06-167

From: Mark Davis
Subject: 	Unicode security subcommittee feedback on iab-idn-nextsteps
Date: 	Wed, 05 Apr 2006

On behalf of the Unicode security subcommittee, I'm conveying some 
additional points that should be clarified or fixed in 
http://www.iab.org/documents/drafts/draft-iab-idn-nextsteps-04.txt.

* Section 2.2.3: "characters that are essentially identical will not match"
What is meant by "essentially identical"? Does this mean identical in 
appearance, identical in internal representation, identical in 
semantics, canonically equivalent (same NFC forms), or compatibility 
equivalent (same NFKC forms)? The intent needs to be clarified; 
otherwise the statement is subject to misinterpretation.
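
To illustrate how much the candidate readings differ, here is a minimal 
sketch using Python's standard unicodedata module. The helper names 
(same_nfc, same_nfkc) and the sample strings are our own choices for 
illustration and do not come from the draft.

```python
import unicodedata

def same_nfc(a: str, b: str) -> bool:
    """True if a and b are canonically equivalent (same NFC form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def same_nfkc(a: str, b: str) -> bool:
    """True if a and b are compatibility equivalent (same NFKC form)."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

# Illustrative samples (assumptions, not taken from the draft).
composed = "\u00F6"      # o with diaeresis, precomposed
decomposed = "o\u0308"   # o followed by combining diaeresis
ligature = "\uFB01"      # Latin small ligature fi
o_stroke = "\u00F8"      # o with stroke

print(composed == decomposed)         # differ in internal representation
print(same_nfc(composed, decomposed)) # canonically equivalent
print(same_nfc(ligature, "fi"))       # not canonically equivalent
print(same_nfkc(ligature, "fi"))      # compatibility equivalent
print(same_nfkc(o_stroke, composed))  # not equivalent under either form
```

Each reading of "essentially identical" gives a different answer for at 
least one of these pairs, which is why the intended sense matters.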

* Section 2.2.3: "This Unicode normalization process [does not account 
for] equivalences that are language or script dependent"
What is meant by "script-dependent equivalences"? Can you provide 
an example?

* Section 2.2.3: "U+00F8 [...] and U+00F6 [...] are considered to match 
in Swedish"
"Match" needs some clarification. In accordance with Swedish standards, 
when collating with a Swedish locale, all major implementations match 
these characters at the first and second level, but not at a lower 
level. Thus they are not exact matches: this might be better phrased in 
terms of equivalence.

* Section 2.2.3: "Even if the language is known and language-specific 
rules can be defined, dependencies on the language do not disappear"
It is unclear what this means. Could you give an example?

* Section 2.2.1: "Those characters are not treated as equivalent 
according to the Unicode consortium while...".
This is somewhat ad hominem. It should rather be "...according to the 
Unicode Standard while..."

* Section 2.2.1: "..confusion in Germany, where the U+00F8 character is 
never used in the language".
That is not true: there are entries with that character in the Duden 
dictionary.

* Section 2.2.4: "This is because [...] some glyphs [...] have been 
assigned different codepoints in Unicode".
This is incorrect: glyphs are not assigned to codepoints; characters are.

* Section 2.2.6: "Is the answer the same for words two [sic] different 
languages that translate into each other?".
This is completely orthogonal to IDNs (cf "Is 'cat' the same as 'gato' 
or the same as 'katze'?").

* Section 2.2.7: "the IESG statement [...] that a registry should have a 
policy about the scripts, languages, codepoints and text directions".
This does not appear to be an accurate paraphrase of the statement 
(http://www.ietf.org/IESG/STATEMENTS/IDNstatement.txt). That document 
rather says a registry "MIGHT want to prevent particular characters", 
"MIGHT want to automatically generate a list of (...) strings and 
suggest that they also be registered" and lastly "it is suggested that a 
registry act conservatively". It contains no "SHOULD" wording and, for 
instance, does not mention text direction.

* Section 2.2.8: "This maybe [...] because many other applications are 
internally sensitive only to the appearance of characters and not to 
their representation".
This is reversed. The vast majority of applications are internally 
sensitive only to the representation, not to the appearance. Exceptions 
would be OCR, for example.
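
As a small illustration of this point, the sketch below compares two 
strings that are typically rendered alike but differ in representation. 
The describe helper and the choice of Latin and Cyrillic "a" are our own 
assumptions for the example.

```python
import unicodedata

def describe(s: str) -> list[str]:
    """List each character as 'U+XXXX NAME' (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in s]

latin = "a"          # U+0061
cyrillic = "\u0430"  # U+0430, usually drawn with the same glyph shape

# An ordinary comparison looks only at code points, not at appearance.
print(latin == cyrillic)
print(describe(latin), describe(cyrillic))
print(latin.encode("utf-8"), cyrillic.encode("utf-8"))
```

An ordinary string comparison, hash, or lookup behaves like the first 
print statement: it sees two different characters, however similar they 
may look on screen.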

* Section 2.2.8: "A change in a code point assignment (...) may be 
extremely disruptive".
This suggests that the Consortium capriciously changes code points. 
After the merger with ISO 10646 there was only one point at which the 
Unicode Consortium changed code points: Unicode 2.0.0 (July 1996), when 
the characters in the Korean Hangul block were moved to be part of a 
new, larger block with all 11,172 Hangul syllables.

As a result of the disruption that this caused, the Unicode Consortium 
and ISO/IEC SC2 resolved never to change code points in the future, and 
no such changes have been made since.

* Section 3.1.1: "...such as code points assigned to font variations...".
Which characters does this refer to? Only those characters that are 
resolved by NFKC normalization, or others as well?
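
For reference, one possible reading is the set of characters whose 
compatibility decomposition carries the <font> tag in the Unicode 
Character Database. The sketch below shows how such characters can be 
identified with Python's unicodedata module; the compat_tag helper and 
the sample characters are our own assumptions, and the draft may well 
intend a different set.

```python
import unicodedata

def compat_tag(ch: str):
    """Return the compatibility tag (e.g. 'font', 'wide') of ch's
    decomposition mapping, or None if it has no tagged decomposition."""
    mapping = unicodedata.decomposition(ch)
    if mapping.startswith("<"):
        return mapping[1:mapping.index(">")]
    return None

samples = [
    "\U0001D400",  # MATHEMATICAL BOLD CAPITAL A
    "\u2102",      # DOUBLE-STRUCK CAPITAL C
    "\uFF21",      # FULLWIDTH LATIN CAPITAL LETTER A
    "\u00F6",      # LATIN SMALL LETTER O WITH DIAERESIS (canonical only)
]

for ch in samples:
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X}", compat_tag(ch), repr(folded))
```

Characters tagged <font> are folded by NFKC, but so are characters with 
other tags such as <wide>, so the two readings are not the same set.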

* Section 4.5: "the whois protocol itself (...) is ASCII-only".
This appears to be inaccurate. The Whois protocol 
(http://www.ietf.org/rfc/rfc3912.txt?number=3912) has no mechanism to 
indicate which character encoding is being used, but the protocol is 
8-bit clean and is in fact used that way by many (for instance, DENIC 
has a UTF-8 implementation up and running).
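
To make the point concrete, here is a minimal sketch of a Whois client 
that sends and receives non-ASCII bytes. It relies only on what RFC 3912 
specifies (a query line terminated by CR LF over TCP port 43, with the 
server closing the connection after its reply). The function names, the 
choice of UTF-8 for the query, and the Latin-1 fallback when decoding 
are our own assumptions, since the protocol itself does not signal an 
encoding.

```python
import socket

def build_query(name: str, encoding: str = "utf-8") -> bytes:
    """Encode a query line; the encoding is a guess by the client."""
    return name.encode(encoding) + b"\r\n"

def decode_response(raw: bytes) -> str:
    """Try UTF-8 first, then fall back to Latin-1 (assumed heuristic)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

def whois(server: str, name: str, port: int = 43,
          timeout: float = 10.0) -> str:
    """Send one query and read until the server closes the connection."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(build_query(name))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return decode_response(b"".join(chunks))
```

Nothing in the exchange restricts the bytes to ASCII; the only open 
question for a client is which encoding a given server expects and 
returns, which must be known out of band.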