Re: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms

From: Simon Josefsson (jas@extundo.com)
Date: Thu Jan 27 2005 - 15:02:19 CST

Next message: Simon Josefsson: "Re: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms"

Previous message: Adam Twardoch: "Re: The Yoruba under-diacritic"
In reply to: Shawn Steele: "RE: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms"
Next in thread: Markus Scherer: "Re: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms"
Reply: Markus Scherer: "Re: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms"

"Shawn Steele" <shawnste@winse.microsoft.com> writes:

>> If you use the old NFC language, for u = U+1100 U+0300 U+1161, you
>> will get a = xn--ksa1467f.
>
> The xn-- is confusing about the Unicode code points represented, so
> I'm going to describe them in a different way. (also 'cause I don't
> have a 'broken' implementation to generate the bad code points
> quickly ;-))

There is an online interface to one such implementation at
<http://josefsson.org/idn.php>, although I would argue that it is
correct, and not broken, at least until StringPrep/IDN is updated to
handle this issue.

> RFC 3490 4.2 clearly states that ToAscii(ToUnicode(a)), where a is an
> xn-- format string, must round trip; otherwise it's not a valid
> punycode string, and the API is supposed to return the original
> input, not the badly decoded input.
>
> Using the OLD strict interpretation of D2 that ignores the rest of the document:
>
> A) ToAscii(U+1100 U+0300 U+1161 U+0323) becomes
>    xn--(punycode of U+AC00 U+0300 U+0323)
>
> B) ToUnicode(xn--(punycode of U+AC00 U+0300 U+0323)) becomes
>    U+AC00 U+0300 U+0323
>
> C) ToAscii(U+AC00 U+0300 U+0323) becomes
>    xn--(punycode of U+AC00 U+0323 U+0300)
>
> RFC 3490 4.2 specifies that ToAscii(ToUnicode(x)) == x, but in this
> case that does not hold, so the xn-- string from B is invalid and
> ToUnicode would be required to return the xn-- string, NOT the
> Unicode version.
>
> This should fairly clearly prove that IDN is broken anyway with the old D2, so fixing this shouldn't be an issue for IDN. This change solves IDN problems, it doesn't introduce them.

I understand what you mean now.

Your argument works well for the subset of problem sequences that are
unstable under NFKC.
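
As a rough illustration of what "unstable under NFKC" means for the
quoted steps B and C, here is a minimal Python sketch. It assumes the
standard unicodedata module, and it only applies NFKC, not the full
Nameprep profile that ToAscii uses. The helper names code_points and
is_nfkc_stable are made up for this example.

```python
import unicodedata

def code_points(s: str) -> str:
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

def is_nfkc_stable(s: str) -> bool:
    """True if applying NFKC leaves the string unchanged."""
    return unicodedata.normalize("NFKC", s) == s

# Output of step B in the quoted argument.
step_b = "\uAC00\u0300\u0323"

# Step C: normalizing again reorders the two combining marks,
# because U+0323 (class 220) sorts before U+0300 (class 230).
step_c = unicodedata.normalize("NFKC", step_b)

print(code_points(step_b), "stable:", is_nfkc_stable(step_b))
print(code_points(step_c), "stable:", is_nfkc_stable(step_c))
```

The step B string changes when normalized again, which is why the
round-trip check in RFC 3490 4.2 rejects it.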

However, the argument does not work for all problem sequences, and in
particular it does not work for the example in PR29 I quoted.

As far as I can tell, the claim that all the PR29 problem sequences
are invalid IDN strings is false.
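
To make that concrete for a = xn--ksa1467f, the following Python
sketch decodes the label and checks whether the result survives
another round of NFKC. It is only an approximation under stated
assumptions: it uses Python's built-in "punycode" codec and plain
NFKC, and skips the other Nameprep steps (mapping, prohibited and
bidi checks) that a real ToAscii performs. The variable names are
illustrative only.

```python
import unicodedata

ACE_PREFIX = "xn--"
label = "xn--ksa1467f"

# Rough stand-in for ToUnicode: strip the prefix and decode Punycode.
decoded = label[len(ACE_PREFIX):].encode("ascii").decode("punycode")

# Rough stand-in for ToAscii: NFKC only, then Punycode-encode again.
renormalized = unicodedata.normalize("NFKC", decoded)
reencoded = ACE_PREFIX + renormalized.encode("punycode").decode("ascii")

print(" ".join(f"U+{ord(c):04X}" for c in decoded))
print("round trips:", reencoded == label)
```

Here the decoded string is U+AC00 U+0300, which NFKC leaves alone, so
the label round trips and the RFC 3490 4.2 check does not reject it.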

It would be interesting to find out what percentage of the problem
sequences are unstable under NFKC.

Thanks.

