L2/06-038

Source: Mark Davis, Ken Whistler
Subject: UAX #15 Clarifications and FAQ
Date: 2006/01/26

=======

Ken made a few points in the attached message to the IETF that
should really be in either UAX #15 or a FAQ. These are:

1. UAX #15 should also point out that a viable approach to dealing with
the differences in normalization in identifiers is to disallow any
identifiers containing characters or sequences that are handled differently:
a. any characters in corrigenda (this would be 6 in the case of IDN)
b. any of the problem sequences.

2. The text pointing out how rare the 6 problematic characters are
should be in a FAQ.
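
As an illustration of point 1a, the check could be sketched roughly as
follows. This is only a minimal sketch: the names CORRIGENDA_CHARS,
find_disallowed and is_allowed are hypothetical, the code point list is
taken from the attached message (Corrigenda #3 and #4), and the problem
sequences of point 1b are not handled here.

```python
# Hypothetical sketch: reject identifiers containing any of the six
# characters affected by Corrigenda #3 and #4 (list taken from the
# attached message). Problem sequences (point 1b) are NOT covered.
CORRIGENDA_CHARS = frozenset({
    0xF951,                                      # Corrigendum #3
    0x2F868, 0x2F874, 0x2F91F, 0x2F95F, 0x2F9BF, # Corrigendum #4
})


def find_disallowed(identifier):
    """Return U+XXXX labels for each affected character in the identifier."""
    return [
        "U+{:04X}".format(ord(ch))
        for ch in identifier
        if ord(ch) in CORRIGENDA_CHARS
    ]


def is_allowed(identifier):
    """True if the identifier contains none of the affected characters."""
    return not find_disallowed(identifier)
```

A caller would run is_allowed on the raw input before normalizing it, so
that the result cannot depend on which version of the normalization data
an implementation happens to use.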

==========


>
>
> Corrigendum #2, which corrected an oversight in not including
> U+FB1D in CompositionExclusions.txt (but did not change decomposition
> in any way), went in for Unicode 3.1, and thus
> was grandfathered in for IDNA.
>
> What remains are:
>
> Corrigendum #3: U+F951
>
> Corrigendum #4: U+2F868, U+2F874, U+2F91F, U+2F95F, U+2F9BF
>
> Those are all rare Han characters. They are not even rare
> *unified* Han characters -- they are compatibility characters
> included only for compatibility with a single East Asian
> character set standard each: U+F951 for a duplicate character
> in KS X 1001, and the other 5 for CNS 11643-1992. That's why
> they have canonical decomposition mappings in the first place.
>
> The duplicate character in KS X 1001 is a rare character in
> Korean to begin with -- in a South Korean standard, where the
> use of Han characters at all is uncommon in actual data. And
> this is a *pronunciation* duplicate, which even if it were used
> would very likely be inconsistently and incorrectly used by
> an end user, because there is no visual way for them to make
> the correct distinctions.
>
> The 5 characters from CNS 11643-1992 are of even less utility.
> They are minor glyphic variants of unified characters -- the
> kinds of distinctions which are subsumed *already* within all
> the unified Han ideographs in the Unicode Standard. *Furthermore*,
> they are from the Extension B in 10646/Unicode -- which covers
> Han characters that weren't important enough to be included in
> the standard for *years*. They are from planes 4 - 15 of
> CNS 11643-1992, which *never* saw any commercial implementation
> in Taiwan. The IT systems in Taiwan almost all implemented
> Big Five, instead, which was a slight variant on Planes 1 and 2
> of CNS 11643-1986, and which included *none* of the 5 glyph
> variants in question here. I would challenge anyone to even
> find exemplars of them in any data from Taiwan other than
> Unicode test data or people building tables for CNS 11643-1992.
>
>
>>>> > and do not affect any
>>>> > strings once they are normalized (that is, on any past or future
>>>> > system they still remain normalized). So all of the IDNs that are
>>>> > registered are completely unaffected by these corrigenda.
>>>    
>>> I recently caught up with this thread, and have one suggestion (to
>>> everyone, and not just Mark) that may modify how you think about the
>>> impact of normalization modifications:
>>>
>>> NFKC in StringPrep is not only used for IDN.  It is also used to
>>> normalize usernames and passwords (SASLprep).
>>>
>>> The argument that the strings are rare enough to never be noticed may
>>> hold for "opt-in" services like IDN, but it is less clear that it
>>> holds for security applications.
>>>
>>> If you can create a string that is modified in one way by one SASLprep
>>> application, and in another way in another implementation, you have
>>> potential security compromises as a result.
>
>
> The way to address this, to prevent even the whiff of a problem
> of incompatible results, is to simply disallow the 6 (not 7)
> characters in question for the input set, period. I guarandamntee
> you no end user is ever going to miss them.
>
> At that point the mountain reverts back to the molehill it properly
> is, and you can simply lightly step on that little molehill
> to flatten it back down nice and smooth.
>
> --Ken