Re: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms

From: Simon Josefsson (jas@extundo.com)
Date: Thu Jan 27 2005 - 03:46:22 CST

Next message: Radovan Garabik: "Greek sigma with acute accent"

Previous message: Jony Rosenne: "RE: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms"
In reply to: Shawn Steele: "RE: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms"
Next in thread: Shawn Steele: "RE: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms"

"Shawn Steele" <shawnste@winse.microsoft.com> writes:

> "Simon" said:
>
>> There is deployed code and standards that use the old interpretation.
> There is deployed code that use both of the interpretations.

Right, so there is a practical problem.

>> StringPrep, and IDN, will continue to use the old interpretation,
>> until they are updated to reference this update. There are no draft
>> documents on that, as far as I know.
> As far as I know (I could be wrong), StringPrep & IDN don't specify
> which interpretation of the UAX are "correct" for those RFCs.

Those specifications were published before the problem was discovered,
so they couldn't have specified what to do.

By referencing Unicode 3.2, StringPrep uses the old interpretation.
Clarifying this would be good, because it is not universally accepted,
but sadly this doesn't seem to be happening, leaving implementations
with an interoperability problem.

> Besides, these are not linguistically correct code points so names
> shouldn't really contain them. Additionally IDN requires that
> ToAscii(ToUnicode(x)) == x, which pretty much causes NFKC(x) == x
> (ToAscii does the NFKC step and x should already be NFKC.) So any
> name that would be broken by this clarification would be illegal
> anyway in IDN.

No, that is false. Let's say x = U+1100 U+0300 U+1161. ToUnicode(x)
= x by definition (see section 4.2 of RFC 3490). ToAscii(ToUnicode(x))
= xn--ksa1467f without the fix (i.e., how IDN is specified to work).
You then get ToUnicode(ToAscii(ToUnicode(x))) = U+AC00 U+0300, which
according to PR29 would be "wrong". With the proposed fix you would
get U+1100 U+0300 U+1161 instead. There is nothing invalid about
these IDN strings, although they supposedly do not occur naturally.
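
To make the difference concrete, here is a minimal Python sketch of the
two readings of "blocked". It is an illustration only, not the UAX #15
reference code: it handles nothing but Hangul L+V composition, the
function names are made up for this example, and it assumes the old
reading is "an intervening mark blocks only if it has the same combining
class" while the corrected reading is "the same or higher combining
class".

```python
import unicodedata

# Hangul syllable composition constants from the Unicode Standard.
L_BASE, V_BASE, S_BASE = 0x1100, 0x1161, 0xAC00
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

def compose_lv(l, v):
    """Return the precomposed LV syllable, or None if (l, v) is not an L+V pair."""
    li, vi = ord(l) - L_BASE, ord(v) - V_BASE
    if 0 <= li < L_COUNT and 0 <= vi < V_COUNT:
        return chr(S_BASE + (li * V_COUNT + vi) * T_COUNT)
    return None

def is_blocked(between, c, corrected):
    """Is c blocked from the last starter by any character in `between`?"""
    ccc_c = unicodedata.combining(c)
    for b in between:
        ccc_b = unicodedata.combining(b)
        if ccc_b == 0:
            return True                      # an intervening starter always blocks
        if corrected and ccc_b >= ccc_c:
            return True                      # assumed corrected rule: same or higher
        if not corrected and ccc_b == ccc_c:
            return True                      # assumed old rule: same class only
    return False

def compose(text, corrected):
    """Simplified composition pass that only knows about Hangul L+V pairs."""
    out, starter = [], None                  # starter = index in out of last starter
    for ch in text:
        if starter is not None:
            lv = compose_lv(out[starter], ch)
            if lv is not None and not is_blocked(out[starter + 1:], ch, corrected):
                out[starter] = lv            # combine with the starter, drop ch
                continue
        out.append(ch)
        if unicodedata.combining(ch) == 0:
            starter = len(out) - 1
    return "".join(out)

x = "\u1100\u0300\u1161"
old = compose(x, corrected=False)
new = compose(x, corrected=True)
print([f"U+{ord(c):04X}" for c in old])      # ['U+AC00', 'U+0300']
print([f"U+{ord(c):04X}" for c in new])      # ['U+1100', 'U+0300', 'U+1161']
print(old.encode("punycode"))                # b'ksa1467f'; ToAscii adds "xn--"
```

The last line uses Python's built-in "punycode" codec only to show where
the label above comes from; it does not perform the full ToAscii
operation.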

>> I'd wish that this was only about punishing people that came to the
>> "wrong conclusion". I believe the previous situation was perfectly
>> clear, even if that situation is problematic, in that the introduction
>> text and example code were buggy. It seems to me that one problematic
>> situation is solved by creating other problems.
>
> Its obvious that the text disagreed with itself and the sample. Where
> the bug is seems to be somewhat subjective, however the NFKC(NFKC(x)) ==
> NFKC(x) is obviously desirable and was explicitly stated in the text.
> It is unfortunate that this test case wasn't included in the test file
> :-)

Indeed.
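
For what it's worth, a check along the following lines could serve as
such a test case. This is only a sketch: it uses Python's
unicodedata.normalize as a stand-in for whatever implementation is under
test, the helper name is made up, and the sample strings are my own
choice (the x from above, its composed form, and one more
starter + mark + starter sequence). It asserts idempotency only, not
which interpretation the implementation follows.

```python
import unicodedata

def is_idempotent(form, s):
    """True if normalizing s twice gives the same result as normalizing once."""
    once = unicodedata.normalize(form, s)
    return unicodedata.normalize(form, once) == once

samples = [
    "\u1100\u0300\u1161",   # Hangul L, combining grave, Hangul V
    "\uac00\u0300",         # precomposed syllable followed by the mark
    "\u0b47\u0300\u0b3e",   # Oriya vowel signs with an intervening mark
]

for s in samples:
    label = " ".join(f"U+{ord(c):04X}" for c in s)
    assert is_idempotent("NFKC", s), f"NFKC not idempotent for {label}"
```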

> Anyway, this has been well discussed already, and either way would
> require some people to fix their code, so I wouldn't try to argue
> against the update :-)

It is not merely about fixing code. When/if StringPrep uses Unicode
4.1 or later, with the fix, there will be an upgrade problem with
interoperability and, at worst, security implications.

Thanks,
Simon


This archive was generated by hypermail 2.1.5 : Thu Jan 27 2005 - 03:49:04 CST