Re: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms

From: Simon Josefsson (jas@extundo.com)
Date: Wed Jan 26 2005 - 18:02:14 CST


    "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl> writes:

    > Simon Josefsson <jas@extundo.com> writes:
    >
    >> The first, "internal-idempotency", is that NFKC(NFKC(x)) = x.
    >
    > You surely meant NFKC(NFKC(x)) = NFKC(x).

    Yes, sorry!
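
    As a concrete illustration of the corrected property, here is a minimal
    Python sketch of an internal-idempotency check using the standard
    unicodedata module (which reflects whatever Unicode version the
    interpreter ships, not the 3.2 tables that StringPrep pins to):

        import unicodedata

        def is_nfkc_idempotent(s):
            """Check the internal-idempotency property NFKC(NFKC(s)) == NFKC(s)."""
            once = unicodedata.normalize('NFKC', s)
            twice = unicodedata.normalize('NFKC', once)
            return once == twice

        # e.g. U+FB01 (LATIN SMALL LIGATURE FI) normalizes to "fi", which is stable
        print(is_nfkc_idempotent('\ufb01'))   # True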

    >> The second, "version-idempotency", is that NFKC3.2(NFKC4.0(x)) = x.
    >
    > I don't know how to fix this equation to make your claim meaningful.
    > As stated, the equation is often false.

    I think I really meant NFKC3.2(x) = NFKC4.1(x), for strings containing
    only code points assigned in Unicode 3.2, and assuming 4.1 will contain
    the proposed fix.

    > If you mean NFKC3.2(NFKC4.0(x)) = NFKC3.2(x), then it's true after the
    > fix (I think) but would not be true if NFKC4.0 was equal to NFKC3.2,
    > so you are proposing to make things worse.

    I believe NFKC3.2(NFKC4.1(x)) = NFKC3.2(x) would not hold for all x,
    if the fix is incorporated in 4.1. NFKC4.1 would "fix" the problem
    sequences, but NFKC3.2 would not.
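
    To make the two version-related properties concrete, here is a hedged
    sketch; nfkc_3_2 and nfkc_4_1 are placeholder callables standing for
    NFKC implementations pinned to the Unicode 3.2 and (proposed) 4.1 data
    tables, since no single library exposes both:

        def version_stable(s, nfkc_3_2, nfkc_4_1):
            """NFKC3.2(s) == NFKC4.1(s): both table versions give the same result."""
            return nfkc_3_2(s) == nfkc_4_1(s)

        def round_trip_stable(s, nfkc_3_2, nfkc_4_1):
            """NFKC3.2(NFKC4.1(s)) == NFKC3.2(s): renormalizing under 3.2 is a no-op."""
            return nfkc_3_2(nfkc_4_1(s)) == nfkc_3_2(s)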

    >> It is crucial that normalization forms remain stable over time. That
    >> is, if a string that does not have any unassigned characters is
    >> normalized under one version of Unicode, it must remain normalized
    >> under all future versions of Unicode. This is the backwards
    >> compatibility requirement.
    >
    > The point is that with the old definition the concept of "normalized"
    > is not well defined.

    Oh, but it is, if you ignore the non-normative example code and
    introduction.

    The fact that this "normalized" is not idempotent for a select few
    strings that do not occur naturally does not change the fact that it is
    well defined.

    > Do you mean "a result of some normalization" or "such that
    > normalizing it further doesn't change it"? They should be the same,
    > but they were not. Now they will be.

    That requirement was not stated, only implied, and (without the fix) it
    turned out not to hold.

    >> Nowhere in the current document can I find any text that say that
    >> internal-idempotency was a design goal or even a requirement.
    >
    > This was so obvious that it was not stated explicitly. It's implied by
    > the name "normalization".

    Code has been deployed based on what was explicitly stated.

    >>> It happens that it affected my implementation of normalization that
    >>> I've made for my language. I already fixed it. Are you saying that I
    >>> should break it again?
    >>
    >> What are you using normalization for?
    >
    > I just provide it to users of my language, I'm not using it internally.

    Then I believe users will need a way to find out which normalization
    procedure you implement. If users want StringPrep/IDN NFKC, they need
    the old behaviour; if they want NFKC for other purposes, they will
    presumably want the new one.
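
    One way a library could surface that choice (purely a hypothetical API
    sketch, not anything the implementation under discussion actually
    provides) is an explicit profile parameter:

        import unicodedata

        def normalize_nfkc(s, profile="corrected", legacy_nfkc=None):
            """Hypothetical wrapper letting callers pick the NFKC flavour they need.

            "corrected" uses the interpreter's current Unicode tables;
            "stringprep-3.2" delegates to a caller-supplied normalizer pinned
            to the Unicode 3.2 tables, since the standard library ships only
            one set of tables.
            """
            if profile == "corrected":
                return unicodedata.normalize("NFKC", s)
            if profile == "stringprep-3.2":
                if legacy_nfkc is None:
                    raise ValueError("a 3.2-pinned normalizer must be supplied")
                return legacy_nfkc(s)
            raise ValueError("unknown NFKC profile: %r" % (profile,))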

    >>> If this particular change can have practical consequence, it's more
    >>> probable that something will break with the old definition (because
    >>> a subsystem relied on idempotency) than with the new one.
    >>
    >> This is a conclusion that I have failed to reach.
    >
    > The old definition is internally inconsistent. In addition to not
    > having a well-defined concept of the property of being normalized,
    > it yields different equivalence relations from "NFC(x) = NFC(y)"
    > and "NFD(x) = NFD(y)", which should be the same.

    For a small set of strings that do not occur naturally, yes.
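
    The quoted inconsistency can be stated as a testable property; here is a
    minimal sketch with Python's unicodedata (which uses the current tables,
    so it will not reproduce the pre-fix behaviour, it only shows what is
    being checked):

        import unicodedata

        def equivalence_agrees(x, y):
            """The relations NFC(x) == NFC(y) and NFD(x) == NFD(y) should coincide."""
            by_nfc = unicodedata.normalize('NFC', x) == unicodedata.normalize('NFC', y)
            by_nfd = unicodedata.normalize('NFD', x) == unicodedata.normalize('NFD', y)
            return by_nfc == by_nfd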

    >> Several IETF protocols are being modified to use StringPrep today,
    >> which use the old normalization. When/if StringPrep is updated to use
    >> the new normalization, those protocols appear to be faced with an
    >> upgrade problem.
    >
    > It's not a problem in practice, because the probability of meeting
    > problematic sequences is extremely small.

    That argument works both ways. If the strings don't occur naturally,
    there isn't a practical need to update the normalization procedure,
    except for internal nicety or theoretical robustness.

    I'm asking whether this theoretical improvement is worth creating
    problems for IDN/StringPrep implementations that need to worry about
    normalization stability between Unicode versions.

    The problem needs to be addressed somewhere; why not address it in the
    Unicode specification instead of moving it down one layer into
    StringPrep/IDN, and possibly other standards as well?

    I can see that I'm fighting a losing battle here, but I think this
    discussion is useful nonetheless.

    Thanks,
    Simon
