Re: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Thu Jan 27 2005 - 08:35:26 CST

    Simon Josefsson <jas@extundo.com> writes:

    >> The point is that with the old definition the concept of
    >> "normalized" is not well defined.
    >
    > Oh, but it is, if you ignore the non-normative example code and
    > introduction.

    It is not. The normative part defines a "normalize" function which
    transforms a string to another string. Such a function automatically
    yields a "normalized" predicate only if it's idempotent.

    A "normalized" predicate derived from an idempotent "normalize"
    function can be formulated in two ways:
    - x is normalized iff there exists y such that normalize(y) = x
    - x is normalized iff normalize(x) = x
    and they are equivalent only if normalization is idempotent.
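
    To make the gap between the two formulations concrete, here is a
    minimal Python sketch. It is not taken from UAX #15: toy_normalize
    is a made-up, deliberately non-idempotent function (it drops at most
    one leading space), and the three-string universe is only there so
    that "there exists y" can be checked by brute force.

```python
def toy_normalize(s: str) -> str:
    """Hypothetical non-idempotent 'normalize': drops at most one leading space."""
    return s[1:] if s.startswith(" ") else s


def is_in_image(x: str, universe: list[str]) -> bool:
    # First formulation: x is normalized iff some y has normalize(y) == x.
    return any(toy_normalize(y) == x for y in universe)


def is_fixed_point(x: str) -> bool:
    # Second formulation: x is normalized iff normalize(x) == x.
    return toy_normalize(x) == x


# Small finite universe of candidate strings (an assumption of this sketch).
universe = ["x", " x", "  x"]

# Strings on which the two formulations give different answers.
disagreements = [
    x for x in universe if is_in_image(x, universe) != is_fixed_point(x)
]
print(disagreements)
```

    Here " x" is the output of toy_normalize("  x"), so the first
    formulation calls it normalized, yet normalizing it again changes
    it, so the second formulation does not.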

    Which predicate is implied by old implementations which followed
    the text rather than the example code? Since they were written under
    the assumption that these two definitions are equivalent, they
    don't necessarily behave consistently at all.

    It's not that the old definition yields a different "normalized"
    predicate than was intended; rather, it's meaningless to derive
    a "normalized" predicate from a non-idempotent function at all.

    Let's consider a lookup table which normalizes keys before storing
    them in the table, and doesn't keep the original unnormalized forms
    used to create the entries - obtaining the set of keys returns the
    normalized keys. This is meaningful if the function is idempotent.
    Let's say someone performs an equivalent of this Perl code:

       foreach my $key (keys %table) {
          my $value = $table{$key};
          # ... do something with $key and $value
       }

    Since it looks up keys which are extracted from the table, it has the
    right to assume that $table{$key} will always find the key. If it's
    not found, well, there is a bug in the table or in the normalization
    function it uses - not in this code. The code has the right to abort,
    or it may just misbehave if written in a language with poor recovery
    from errors. A set which doesn't contain the elements it contains
    doesn't make sense.
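
    The same failure can be sketched in Python rather than Perl. The
    NormalizingTable class, the unfindable_keys helper and both
    normalizers below are hypothetical illustrations, not code from any
    real implementation: the table stores only normalized keys and
    normalizes again on lookup, which is harmless exactly when the
    normalizer is idempotent.

```python
class NormalizingTable:
    """Hypothetical table that keeps only the normalized form of each key."""

    def __init__(self, normalize):
        self._normalize = normalize
        self._data = {}

    def __setitem__(self, key, value):
        self._data[self._normalize(key)] = value

    def __getitem__(self, key):
        # Lookups normalize the key again before searching.
        return self._data[self._normalize(key)]

    def keys(self):
        return list(self._data)


def unfindable_keys(table):
    """Return the keys reported by the table that the table cannot look up."""
    missing = []
    for key in table.keys():
        try:
            table[key]
        except KeyError:
            missing.append(key)
    return missing


def drop_one_space(s):
    # Made-up non-idempotent normalizer: removes at most one leading space.
    return s[1:] if s.startswith(" ") else s


def drop_all_spaces(s):
    # Idempotent counterpart: removes every leading space.
    return s.lstrip(" ")


broken = NormalizingTable(drop_one_space)
broken["  x"] = 1          # stored under " x"

sound = NormalizingTable(drop_all_spaces)
sound["  x"] = 1           # stored under "x"

print(unfindable_keys(broken), unfindable_keys(sound))
```

    With the non-idempotent normalizer, the table reports the key " x"
    but looks it up as "x", so the loop over its own keys fails. With
    the idempotent one, every reported key is found.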

    > I'm asking whether this theoretical improvement is worth creating
    > problems for IDN/StringPrep implementations that need to worry about
    > normalization stability between Unicode versions.

    It already has a problem: it relies on a contradictory definition.

    If it's fixed now, there might be some problems at transition time
    (extremely unlikely).

    If it's not fixed now, there might be problems forever (also unlikely,
    but more likely because the time is much longer).

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    

