L2/06-038
Source: Mark Davis, Ken Whistler
Subject: UAX #15 Clarifications and FAQ
Date: 2006/01/26
=======

Ken had a few points in the attached message to the IETF that should really be in either UAX #15 or in a FAQ. These are:

1. UAX #15 should also point out that a viable approach to dealing with the differences in normalization in identifiers is to disallow any identifiers containing characters or sequences that are handled differently:
   a. any characters in corrigenda (this would be 6 in the case of IDN)
   b. any of the problem sequences.

2. The text pointing out how rare the 6 problematic characters are should be in a FAQ.

==========
>
>
> Corrigendum #2, which corrected an oversight in not including
> U+FB1D in CompositionExclusions.txt (but did not change decomposition
> in any way) went in for Unicode 3.1, and thus
> was grandfathered in for IDNA.
>
> What remains are:
>
> Corrigendum #3: U+F951
>
> Corrigendum #4: U+2F868, U+2F874, U+2F91F, U+2F95F, U+2F9BF
>
> Those are all rare Han characters. They are not even rare
> *unified* Han characters -- they are compatibility characters
> included only for compatibility with a single East Asian
> character set standard each: U+F951 for a duplicate character
> in KS X 1001, and the other 5 for CNS 11643-1992. That's why
> they have canonical decomposition mappings in the first place.
>
> The duplicate character in KS X 1001 is a rare character in
> Korean to begin with -- in a South Korean standard, where the
> use of Han characters at all is uncommon in actual data. And
> this is a *pronunciation* duplicate, which even if it were used
> would very likely be inconsistently and incorrectly used by
> an end user, because there is no visual way for them to make
> the correct distinctions.
>
> The 5 characters from CNS 11643-1992 are of even less utility.
> They are minor glyphic variants of unified characters -- the
> kinds of distinctions which are subsumed *already* within all
> the unified Han ideographs in the Unicode Standard. *Furthermore*,
> they are from the Extension B in 10646/Unicode -- which covers
> Han characters that weren't important enough to be included in
> the standard for *years*. They are from planes 4 - 15 of
> CNS 11643-1992, which *never* saw any commercial implementation
> in Taiwan. The IT systems in Taiwan almost all implemented
> Big Five, instead, which was a slight variant on Planes 1 and 2
> of CNS 11643-1986, and which included *none* of the 5 glyph
> variants in question here. I would challenge anyone to even
> find exemplars of them in any data from Taiwan other than
> Unicode test data or people building tables for CNS 11643-1992.
>
>
>>>> > and do not affect any
>>>> > strings once they are normalized (that is, on any past or future
>>>> > system they still remain normalized). So all of the IDNs that are
>>>> > registered are completely unaffected by these corrigenda.
>>>
>>> I recently caught up with this thread, and have one suggestion (to
>>> everyone, and not just Mark) that may modify how you think about the
>>> impact of normalization modifications:
>>>
>>> NFKC in StringPrep is not only used for IDN. It is also used to
>>> normalize username and passwords (SASLprep).
>>>
>>> The argument that the strings are rare enough to never be noticed may
>>> hold for "opt-in" services like IDN, but it is less clear that it
>>> holds for security applications.
>>>
>>> If you can create a string that is modified in one way by one SASLprep
>>> application, and in another way in another implementation, you have
>>> potential security compromises as a result.
>
>
> The way to address this, to prevent even the whiff of a problem
> of incompatible results, is to simply disallow the 6 (not 7)
> characters in question for the input set, period.
> I guarandamntee you no end user is ever going to miss them.
>
> At that point the mountain reverts back to the molehill it properly
> is, and you can simply lightly step on that little molehill
> to flatten it back down nice and smooth.
>
> --Ken
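
The approach in point 1a, and Ken's suggestion to disallow the 6 characters for the input set, could be sketched roughly as follows. This is only an illustrative sketch in Python, not text from UAX #15 or from any IETF specification. The names CORRIGENDA_CODE_POINTS, contains_corrigenda_character and check_identifier are hypothetical. The problem sequences of point 1b are not covered, because the message does not list them. The NFKC step uses whatever Unicode version the local Python build ships with.

```python
import unicodedata

# The six code points named in the message:
# Corrigendum #3 (U+F951) and Corrigendum #4 (the five CNS 11643-1992 variants).
CORRIGENDA_CODE_POINTS = frozenset({
    0xF951,
    0x2F868, 0x2F874, 0x2F91F, 0x2F95F, 0x2F9BF,
})


def contains_corrigenda_character(text: str) -> bool:
    """Return True if any character of text is one of the six code points."""
    return any(ord(ch) in CORRIGENDA_CODE_POINTS for ch in text)


def check_identifier(text: str) -> str:
    """Reject input containing an affected character, else return its NFKC form.

    Hypothetical helper: rejecting before normalizing means the result does
    not depend on whether a given implementation applied the corrigenda.
    """
    if contains_corrigenda_character(text):
        raise ValueError("identifier contains a character affected by a corrigendum")
    return unicodedata.normalize("NFKC", text)
```

The check runs before normalization on purpose: once the affected characters are excluded from the input set, implementations that differ only in how they treat those characters should no longer see inputs on which they disagree.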