From: Simon Josefsson (jas@extundo.com)
Date: Thu Jan 27 2005 - 12:55:09 CST
"Shawn Steele" <shawnste@winse.microsoft.com> writes:
>> "Simon" said:
>
>> By referencing Unicode 3.2, StringPrep uses the old interpretation.
>> Clarifying this would be good.
>
> So StringPrep has the identical ambiguity here because it doesn't
> reference a particular interpretation of the standard, but rather it
> references a document that is, obviously :-), a bit ambiguous.
We seem to disagree on this.
I believe the old document is unambiguous. It is possible to follow
the old normative text and end up with an implementation that works
fine for all practically occurring strings. People have done exactly
this, and deployed the code.
It is unfortunate that such an implementation would behave badly for a
select few corner cases, but it is not the end of the world.
>> Let's say x = U+1100 U+0300 U+1161. ToUnicode(x) = x by definition
>> (see 4.2 of RFC 3490). ToAscii(ToUnicode(x)) =
>> xn--ksa1467f, with the fix (i.e., how IDN is specified to work). You
>> then get ToUnicode(ToAscii(ToUnicode(x))) = U+AC00 U+0300, which
>> according to PR29 would be "wrong". With the proposed fix you would
>> get U+1100 U+0300 U+1161 instead. There is nothing invalid about
>> these IDN strings, although they supposedly do not occur naturally.
>
> I think your example's mixed up. U+1161 is blocked from combining with
> U+1100 by the U+0300 in either form. Since U+1100 and U+1161 are start
> characters the change shouldn't impact this case, both interpretations
> should consistently normalize U+1100 U+0300 U+1161 to U+1100 U+0300
> U+1161 (unchanged) and U+AC00 U+0300 to U+AC00 U+0300 (also unchanged).
I disagree.
According to <http://www.unicode.org/review/pr-29.html>, with the old
wording, U+1100 U+0300 U+1161 would normalize into U+AC00 U+0300.
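To make the difference concrete, here is a minimal sketch of the two readings of the "blocked" definition as I understand them from the PR-29 page: under the old wording an intervening character B blocks C only if B is a starter or has the same combining class as C, while the corrected wording also blocks when B has a higher class. The helper names (is_blocked, compose_lv, compose_example) are made up for illustration, and compose_example handles only the three-character case discussed here; it is not a full NFC implementation.

```python
import unicodedata

def is_blocked(chars, s, c, corrected):
    """True if chars[c] is blocked from the starter chars[s]."""
    ccc_c = unicodedata.combining(chars[c])
    for b in chars[s + 1:c]:
        ccc_b = unicodedata.combining(b)
        if ccc_b == 0 or ccc_b == ccc_c:
            return True   # blocked under both wordings
        if corrected and ccc_b > ccc_c:
            return True   # blocked only under the corrected wording
    return False

def compose_lv(l, v):
    """Algorithmic Hangul composition of a leading consonant and a vowel."""
    return chr(0xAC00 + ((ord(l) - 0x1100) * 21 + (ord(v) - 0x1161)) * 28)

def compose_example(x, corrected):
    """Compose only the L, mark, V case from this thread (assumption)."""
    l, mark, v = x
    if is_blocked(x, 0, 2, corrected):
        return x
    return compose_lv(l, v) + mark

x = "\u1100\u0300\u1161"
old = compose_example(x, corrected=False)   # U+AC00 U+0300
new = compose_example(x, corrected=True)    # U+1100 U+0300 U+1161
print([f"U+{ord(ch):04X}" for ch in old])
print([f"U+{ord(ch):04X}" for ch in new])
```

U+0300 has combining class 230 and U+1161 has class 0, so only the corrected wording treats U+1161 as blocked.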
> The problem would involve mixed up combining classes. If I have one of
> these messed up strings u (for Unicode), and do ToAscii(u) on it, then
> I'll get an ascii form a:
> a = ToAscii(u)
I'm with you so far.
However, keep in mind that what 'a' looks like depends on how NFKC
was implemented in the IDN implementation.
If you use the old NFC language, for u = U+1100 U+0300 U+1161, you
will get a = xn--ksa1467f. You will get another output value if your
NFKC implementation, contrary to the StringPrep specification,
implements NFKC with the proposed modification.
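As a rough check of the label above, the following sketch Punycode-encodes the two candidate normalization results directly with Python's built-in "punycode" codec. It deliberately skips the rest of nameprep (an assumption made for brevity), so it only shows which code point sequence corresponds to which ACE label, not what any particular IDN library would produce.

```python
# The two possible outcomes of normalizing u = U+1100 U+0300 U+1161.
composed = "\uac00\u0300"          # result under the old wording
unchanged = "\u1100\u0300\u1161"   # result under the proposed modification

def to_ace(label):
    """Prefix the Punycode form with the IDNA ACE prefix (no nameprep here)."""
    return "xn--" + label.encode("punycode").decode("ascii")

ace_composed = to_ace(composed)    # xn--ksa1467f
ace_unchanged = to_ace(unchanged)  # a different label

print(ace_composed)
print(ace_unchanged)
```

The two labels differ, which is the point: the value of 'a' depends on which normalization the implementation applied before encoding.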
> Then if we do
> u2 = ToUnicode(a)
> a2 = ToAscii(u2)
>
> Now, using the "fixed" normalization, a == a2 && u == u2, however if we
> used the alternate interpretation of the old UAX doc, then u != u2 and
> NFKC(u) != NFKC(u2) so therefore a != a2. This is because u was
> eventually normalized twice by the time we get to a2.
>
> The IDN/StringPrep RFCs require that ToAscii(ToUnicode(a)) == a. In
> this example it would not, which is why I'm saying that this string
> would pretty much be illegal according to IDN.
I don't follow this part. Presumably you meant something else, much
like I mixed up the symbol language earlier. Specifically:
The invariant 'ToAscii(ToUnicode(a)) == a' doesn't hold in general.
Consider the string a = ß. ToUnicode(ß) = ß. ToAscii(ß) = ss. ß != ss.
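The ß case can be reproduced with Python's built-in "idna" codec, which implements the RFC 3490 flavour of ToASCII. This is only a sketch: the ToUnicode step is not called here, because RFC 3490 says ToUnicode returns its input unchanged when the label has no ACE prefix, so ToUnicode(ß) is simply taken to be ß.

```python
a = "\u00df"                      # LATIN SMALL LETTER SHARP S

# Per RFC 3490, ToUnicode leaves a label without the ACE prefix unchanged,
# so we use 'a' itself as ToUnicode(a) (assumption stated above).
to_unicode_a = a

# ToASCII runs nameprep, whose case-folding map turns sharp s into "ss".
to_ascii = to_unicode_a.encode("idna").decode("ascii")

print(to_ascii)                   # ss
print(to_ascii == a)              # False
```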
> IDN needs this fix as badly as UAX 15 does.
I would agree that IDN and UAX15 need _a_ fix, but not necessarily the
proposed one.
Regards,
Simon