Re: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Thu Jan 27 2005 - 08:35:26 CST

Next message: Peter Kirk: "Re: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms"

Previous message: Bernard Desgraupes: "Error in default collation element computation (?)"
In reply to: Simon Josefsson: "Re: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms"
Next in thread: Peter Kirk: "Re: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms"

Simon Josefsson <jas@extundo.com> writes:

>> The point is that with the old definition the concept of
>> "normalized" is not well defined.
>
> Oh, but it is, if you ignore the non-normative example code and
> introduction.

It is not. The normative part defines a "normalize" function which
transforms a string to another string. Such a function automatically
yields a "normalized" predicate only if it's idempotent.

A "normalized" predicate derived from a "normalize" function can be
formulated in two ways:
- x is normalized iff there exists y such that normalize(y) = x
- x is normalized iff normalize(x) = x
and they are equivalent only if normalization is idempotent.
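
To make the difference concrete, here is a minimal sketch in Python. The
function toy_normalize is a hypothetical stand-in chosen only because it is
not idempotent; it is not Unicode normalization, and the names
is_fixed_point and is_in_image are made up for this illustration. The
"exists y" predicate is only approximated by searching a small list of
candidates.

```python
def toy_normalize(s: str) -> str:
    # Hypothetical, deliberately non-idempotent function:
    # a single left-to-right pass that collapses each "aa" into "a".
    return s.replace("aa", "a")


def is_fixed_point(x: str) -> bool:
    # Second formulation: x is normalized iff normalize(x) = x.
    return toy_normalize(x) == x


def is_in_image(x: str, candidates) -> bool:
    # First formulation: x is normalized iff some y has normalize(y) = x.
    # Only the given candidates are searched, so this is an approximation.
    return any(toy_normalize(y) == x for y in candidates)


candidates = ["a", "aa", "aaa", "aaaa"]
x = toy_normalize("aaa")  # one pass gives "aa"

# "aa" is the output of normalize, yet normalize changes it again.
print(is_in_image(x, candidates))  # True
print(is_fixed_point(x))           # False
```

With an idempotent function the two predicates would agree on every
string; here they disagree on "aa".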

Which predicate is implied by old implementations which followed
the text rather than the example code? Since they were written under
the assumption that these two definitions are equivalent, they
don't necessarily behave consistently at all.

It's not that the old definition yields a different "normalized"
predicate than was intended, but that it's meaningless to obtain
the "normalized" predicate derived from a non-idempotent function.

Let's consider a lookup table which normalizes keys before storing
them in the table, and doesn't keep the original unnormalized forms
used to create the entries - obtaining the set of keys returns the
normalized keys. This is meaningful if the function is idempotent.
Let's say someone performs an equivalent of this Perl code:

   foreach my $key (keys %table) {
      my $value = $table{$key};
      # ... do something with $key and $value
   }

Since it looks up keys which are extracted from the table, it has the
right to assume that $table{$key} will always find the key. If it's
not found, well, there is a bug in the table or in the normalization
function it uses - not in this code. The code has the right to abort,
or it may just misbehave if written in a language with poor recovery
from errors. A set which doesn't contain the elements it contains
doesn't make sense.
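
The failure described above can be sketched in Python. Everything here is
an assumption made for illustration: NormalizingTable is a hypothetical
class, and toy_normalize is a deliberately non-idempotent stand-in, not
Unicode normalization.

```python
def toy_normalize(s: str) -> str:
    # Hypothetical, deliberately non-idempotent function:
    # a single pass that collapses each "aa" into "a".
    return s.replace("aa", "a")


class NormalizingTable:
    """Lookup table that normalizes keys and keeps only the normalized form."""

    def __init__(self, normalize):
        self._normalize = normalize
        self._data = {}

    def __setitem__(self, key, value):
        self._data[self._normalize(key)] = value

    def __getitem__(self, key):
        return self._data[self._normalize(key)]

    def keys(self):
        # Returns the stored (normalized) keys, not the original ones.
        return list(self._data)


table = NormalizingTable(toy_normalize)
table["aaa"] = 1  # stored under "aa"

missing = []
for key in table.keys():
    try:
        table[key]  # normalizes "aa" again, giving "a"
    except KeyError:
        missing.append(key)

print(missing)  # ['aa']: a key the table returned cannot be found in it
```

If the same table is built with an idempotent function such as
str.casefold, the loop finds every key it was given.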

> I'm asking whether this theoretical improvement is worth creating
> problems for IDN/StringPrep implementations that need to worry about
> normalization stability between Unicode versions.

It already has a problem: it relies on a contradictory definition.

If it's fixed now, there might be some problems at transition time
(extremely unlikely).

If it's not fixed now, there might be problems forever (also unlikely,
but more likely because the time is much longer).

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/


This archive was generated by hypermail 2.1.5 : Thu Jan 27 2005 - 08:36:12 CST