[idn] Re: Last Call: Preparation of Internationalized Strings

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Jun 02 2002 - 11:27:50 EDT


There have been two corrections to normalization since Unicode 3.0. One
involved a Chinese (Han) compatibility character that was mapped to the
wrong "normal" character by error. The other involved a Yiddish
(Hebrew) compatibility character that should have had a compatibility
mapping, but did not, also by error.

Both corrections were made to characters that are supposedly "very rare"
in actual use, so that the real-world impact would be minimal. Neither
one has anything to do with transcoding tables.

I know you are very concerned that Unicode has "broken its promise" by
making changes to the normalization tables after claiming they would not
do so. I think if the corrections had not been made, there would have
been an equal but opposite reaction that Unicode was too stubborn to
correct its own mistakes, and that NFKC was rendered "useless" because
of these two incorrect mappings.

The pages explaining the corrigenda include lengthy, detailed
explanations of why the Technical Committee felt they were necessary and
justified. As someone already mentioned, one of the justifications
given for the Yiddish change was that no normative references existed
*yet* for the Unicode normalization tables (i.e. from IDN). This
implies that once such normative references *do* exist, a similar
decision to correct an error might not be made.

I imagine these were very difficult decisions for the UTC, who knew that
someone would jump on the changes immediately as evidence that
normalization is inherently unstable and Unicode is therefore "not

It's true that we are relying on "enlightened statesmen" to make the
right decisions and not, say, decide one day to add a compatibility
mapping for U+00C6 LATIN CAPITAL LETTER AE that would break everything.
The UTC has tried to assure us that such a thing will not happen, but in
the end, all we can do is trust.
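
For anyone who wants to check the U+00C6 case on their own system, here is a minimal sketch. It assumes Python's standard unicodedata module, which reflects whatever Unicode version that interpreter was built with (not necessarily 3.0 through 3.2). The helper names are my own, for illustration only.

```python
import unicodedata

AE = "\u00C6"  # LATIN CAPITAL LETTER AE

def has_decomposition(ch: str) -> bool:
    """True if the character database lists any decomposition mapping
    (canonical or compatibility) for this single character."""
    return unicodedata.decomposition(ch) != ""

def is_stable_under_all_forms(ch: str) -> bool:
    """True if none of the four normalization forms alters the character."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return all(unicodedata.normalize(form, ch) == ch for form in forms)

if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    print("U+00C6 has a decomposition:", has_decomposition(AE))
    print("U+00C6 unchanged by all forms:", is_stable_under_all_forms(AE))
```

If a future version ever did add a mapping for U+00C6, the first check would start returning True and the second False, which is exactly the kind of break described above.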

-Doug Ewell
 Fullerton, California

----- Original Message -----
From: <simon+idn@josefsson.org>
To: <idn@ops.ietf.org>
Sent: Sunday, June 02, 2002 5:44 am
Subject: [idn] Re: :Re: Last Call: Preparation of Internationalized

[Resending with different From: address.]

Patrik Fältström <paf@cisco.com> writes:

> --On 2002-05-30 12.16 +0200 Simon Josefsson <simon+idn@josefsson.org>
>> This is interesting -- has the Unicode consortium promised to always
>> update the CK normalization tables in a forward compatible way?
> Yes.

The reference for that statement seems to be (correct me if I'm wrong)

| Normalization. Once a character is encoded, its canonical combining
| class and decomposition mapping will not be changed in a way that will
| destabilize normalization.

Which looks good. However, reading on:

| The decomposition mapping may only be changed at all in the following
| _exceptional_ set of circumstances:
| + when there is a clear and evident mistake identified in the Unicode
| Character Database (such as a typographic mistake), and
| + when that error constitutes a clear violation of the identity
| stability policy (#4), and
| + when the correction of such an error does not violate constraints

So it appears as if the statement isn't strictly true?

A further security consideration of IDNA could be that whenever such
modifications are made to the Unicode standard, they may be exploited,
and it should be an operational consideration never to register
domains, issue PKIX certificates for domains, create Kerberos realms,
create PGP keys, etc., for IDNs that contain characters whose
decomposition mapping has been changed by the Unicode Consortium.
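
One possible way to apply such an operational check is sketched below. It is only a sketch: it assumes Python's unicodedata module, which bundles a frozen Unicode 3.2.0 snapshot (unicodedata.ucd_3_2_0) next to the interpreter's current tables, and the function name is hypothetical. It flags characters whose NFKC result differs between the two snapshots. That is broader than "decomposition mapping changed", since a character unassigned in 3.2 that later gained a mapping would also be flagged.

```python
import unicodedata

# Frozen Unicode 3.2.0 tables shipped with Python, and the current tables.
OLD_UCD = unicodedata.ucd_3_2_0
NEW_UCD = unicodedata

def chars_with_changed_nfkc(label: str) -> list[str]:
    """Return the characters of `label` whose NFKC form differs between
    the Unicode 3.2.0 snapshot and the interpreter's current tables."""
    changed = []
    for ch in label:
        old_form = OLD_UCD.normalize("NFKC", ch)
        new_form = NEW_UCD.normalize("NFKC", ch)
        if old_form != new_form:
            changed.append(ch)
    return changed

if __name__ == "__main__":
    for label in ("example", "r\u00E4ksm\u00F6rg\u00E5s"):
        suspicious = chars_with_changed_nfkc(label)
        status = "refuse" if suspicious else "ok"
        print(label, status, [f"U+{ord(c):04X}" for c in suspicious])
```

A registry or CA could refuse any label for which the returned list is non-empty, at the cost of also refusing some characters that are merely new since 3.2.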

It seems as if a modification of this kind occurred between Unicode 3.0
and 3.1: http://www.unicode.org/versions/corrigendum2.html.

The conclusion here is that this isn't a practical problem -- only one
character changed normalization between 3.0 and 3.1 and none between
3.1 and 3.2 AFAIK. I am more worried about the transcoding problem.
