From: Shawn Steele (email@example.com)
Date: Thu Jan 27 2005 - 11:57:02 CST
> "Simon" said:
> By referencing Unicode 3.2, StringPrep uses the old interpretation.
> Clarifying this would be good.
So StringPrep has the identical ambiguity here because it doesn't
reference a particular interpretation of the standard, but rather it
references a document that is, obviously :-), a bit ambiguous.
> Let's say x = U+1100 U+0300 U+1161. ToUnicode(x) = x by definition
> (see 4.2 of RFC 3490). ToAscii(ToUnicode(x)) =
> xn--ksa1467f, with the fix (i.e., how IDN is specified to work). You
> then get ToUnicode(ToAscii(ToUnicode(x))) = U+AC00 U+0300, which
> according to PR29 would be "wrong". With the proposed fix you would
> get U+1100 U+0300 U+1161 instead. There is nothing invalid about
> these IDN strings, although they supposedly do not occur naturally.
I think your example's mixed up. U+1161 is blocked from combining with
U+1100 by the U+0300 under either interpretation. Since U+1100 and U+1161
are both starters (combining class 0), the change shouldn't affect this
case: both interpretations should consistently normalize U+1100 U+0300
U+1161 to U+1100 U+0300 U+1161 (unchanged) and U+AC00 U+0300 to U+AC00
U+0300 (also unchanged).
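As a quick check of the "unchanged" claim, here is a minimal Python sketch.
It assumes a Python 3 interpreter whose unicodedata module follows the
corrected composition rule, so it only shows what that one implementation
does, not what an implementation of the alternate reading would produce.

```python
import unicodedata

def fmt(s):
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# The two sequences discussed above.
samples = ["\u1100\u0300\u1161", "\uAC00\u0300"]

for s in samples:
    # Canonical combining class of each code point (0 means starter).
    classes = [unicodedata.combining(c) for c in s]
    print(f"{fmt(s)}  ccc={classes}")
    for form in ("NFC", "NFKC"):
        out = unicodedata.normalize(form, s)
        print(f"  {form} -> {fmt(out)}  unchanged={out == s}")
```

The combining classes are printed alongside each result so the starter
and non-starter positions are visible when reading the output.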
The problem would involve mixed-up combining classes. If I have one of
these messed-up strings u (for Unicode), and do ToAscii(u) on it, then
I'll get an ASCII form a:
a = ToAscii(u)
Then if we do
u2 = ToUnicode(a)
a2 = ToAscii(u2)
With the "fixed" normalization, a == a2 && u == u2. If we instead used
the alternate interpretation of the old UAX doc, then u != u2 and
NFKC(u) != NFKC(u2), and therefore a != a2. This is because u has
effectively been normalized twice by the time we get to a2.
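The round trip above can be sketched with Python's standard library. This
is only an illustration under assumptions: encodings.idna.ToASCII and
ToUnicode are Python's RFC 3490 helpers, they normalize against the
Unicode 3.2 database using Python's own composition behavior, and the
sample label is the one from the quoted message rather than a string with
mixed-up combining classes.

```python
from encodings import idna

# Sample label taken from the quoted message (an assumption for
# illustration; it is not one of the problematic strings).
u = "\u1100\u0300\u1161"

a = idna.ToASCII(u)     # bytes: ACE form of the label
u2 = idna.ToUnicode(a)  # str: label decoded back from the ACE form
a2 = idna.ToASCII(u2)   # bytes: ACE form of the decoded label

# Report whether the round trip is stable in this implementation.
print("a == a2:", a == a2)
print("u == u2:", u == u2)
```

Swapping in a different normalizer is where the two interpretations would
diverge; this sketch only exercises the one that ships with Python.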
The IDN/StringPrep RFCs require that ToAscii(ToUnicode(a)) == a. In
this example that would not hold, which is why I'm saying that this
string would pretty much be illegal according to IDN. IDN needs this fix
as badly as UAX 15 does.
Software Design Engineer
(Normalization & IDN APIs)