Re: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Wed Jan 26 2005 - 16:40:33 CST


    Simon Josefsson <jas@extundo.com> writes:

    > The first, "internal-idempotency", is that NFKC(NFKC(x)) = x.

    You surely meant NFKC(NFKC(x)) = NFKC(x).
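
    A minimal sketch of the corrected property, assuming Python's standard
    `unicodedata` module as the normalizer. The helper name `is_idempotent`
    and the sample strings are illustrative, not taken from UAX #15.

```python
import unicodedata


def is_idempotent(form: str, text: str) -> bool:
    """Return True if normalizing twice gives the same result as normalizing once."""
    once = unicodedata.normalize(form, text)
    twice = unicodedata.normalize(form, once)
    return twice == once


# Hypothetical sample inputs, chosen only for illustration.
SAMPLES = [
    "\ufb01",              # LATIN SMALL LIGATURE FI (has a compatibility mapping)
    "e\u0301",             # e + COMBINING ACUTE ACCENT
    "\u212b",              # ANGSTROM SIGN
    "\u0b47\u0300\u0b3e",  # starter, non-starter, starter
]

for s in SAMPLES:
    code_points = " ".join(f"U+{ord(c):04X}" for c in s)
    print(code_points, is_idempotent("NFKC", s))
```

    Note that the equation as originally written, NFKC(NFKC(x)) = x, already
    fails for any x that is not itself in NFKC, such as U+FB01.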

    > The second, "version-idempotency", is that NFKC3.2(NFKC4.0(x)) = x.

    I don't know how to fix this equation to make your claim meaningful.
    As stated, the equation is often false.

    If you mean NFKC3.2(NFKC4.0(x)) = NFKC3.2(x), then it's true after the
    fix (I think) but would not be true if NFKC4.0 were equal to NFKC3.2,
    so you are proposing to make things worse.

    If you mean NFKC3.2(NFKC4.0(x)) = NFKC4.0(x), then it's not true after
    the fix, but it would not be true anyway if NFKC4.0 were equal to NFKC3.2.
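
    The three readings can be written down as predicates. This is only a
    sketch: it assumes Python's `unicodedata.ucd_3_2_0` object as a stand-in
    for NFKC3.2 and the interpreter's current database as a stand-in for
    NFKC4.0 (it is in fact much newer than 4.0). The function names are
    hypothetical, and the sketch does not rely on how either database
    treats the problematic sequences.

```python
import unicodedata


def nfkc_old(text: str) -> str:
    """NFKC using the Unicode 3.2.0 database bundled with Python."""
    return unicodedata.ucd_3_2_0.normalize("NFKC", text)


def nfkc_new(text: str) -> str:
    """NFKC using the interpreter's current database (stand-in for 4.0)."""
    return unicodedata.normalize("NFKC", text)


def as_stated(x: str) -> bool:
    # NFKC3.2(NFKC4.0(x)) = x
    return nfkc_old(nfkc_new(x)) == x


def first_reading(x: str) -> bool:
    # NFKC3.2(NFKC4.0(x)) = NFKC3.2(x)
    return nfkc_old(nfkc_new(x)) == nfkc_old(x)


def second_reading(x: str) -> bool:
    # NFKC3.2(NFKC4.0(x)) = NFKC4.0(x)
    return nfkc_old(nfkc_new(x)) == nfkc_new(x)


x = "\ufb01"  # LATIN SMALL LIGATURE FI, not in NFKC
print(as_stated(x), first_reading(x), second_reading(x))
```

    For an ordinary unnormalized input such as U+FB01 the equation as stated
    fails while both readings hold; the readings can only diverge on inputs
    where the two versions actually normalize differently.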

    > It is crucial that normalization forms remain stable over time. That
    > is, if a string that does not have any unassigned characters is
    > normalized under one version of Unicode, it must remain normalized
    > under all future versions of Unicode. This is the backwards
    > compatibility requirement.

    The point is that with the old definition the concept of "normalized"
    is not well defined. Do you mean "a result of some normalization" or
    "such that normalizing it further doesn't change it"? They should be
    the same, but they were not. Now they will be.

    > Nowhere in the current document can I find any text that says that
    > internal-idempotency was a design goal or even a requirement.

    This was so obvious that it was not stated explicitly. It's implied by
    the name "normalization".

    >> It happens that it affected my implementation of normalization that
    >> I've made for my language. I already fixed it. Are you saying that I
    >> should break it again?
    >
    > What are you using normalization for?

    I just provide it to users of my language; I'm not using it internally.

    >> If this particular change can have practical consequences, it's more
    >> probable that something will break with the old definition (because
    >> a subsystem relied on idempotency) than with the new one.
    >
    > This is a conclusion that I have failed to reach.

    The old definition is internally inconsistent. In addition to lacking
    a well-defined notion of what it means to be normalized, it makes the
    equivalence relations "NFC(x) = NFC(y)" and "NFD(x) = NFD(y)" differ,
    although they should be the same relation.
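
    A small sketch of that consistency check, assuming Python's `unicodedata`
    module, which to my knowledge follows the corrected definition. The
    helper names and sample strings are illustrative; the Oriya sequences
    are of the starter, non-starter, starter shape that the fix is about.

```python
import unicodedata
from itertools import combinations


def same_nfc(x: str, y: str) -> bool:
    """Equivalence relation induced by NFC."""
    return unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y)


def same_nfd(x: str, y: str) -> bool:
    """Equivalence relation induced by NFD."""
    return unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)


def disagreements(samples):
    """Return the pairs on which the two relations give different answers."""
    return [
        (x, y)
        for x, y in combinations(samples, 2)
        if same_nfc(x, y) != same_nfd(x, y)
    ]


# Hypothetical sample inputs, chosen only for illustration.
SAMPLES = [
    "\u00e9",              # precomposed e with acute
    "e\u0301",             # e + COMBINING ACUTE ACCENT
    "\u0b4b\u0300",        # ORIYA VOWEL SIGN O + COMBINING GRAVE ACCENT
    "\u0b47\u0300\u0b3e",  # ORIYA E, COMBINING GRAVE, ORIYA AA
    "\u0b47\u0b3e\u0300",  # ORIYA E, ORIYA AA, COMBINING GRAVE
]

print(disagreements(SAMPLES))
```

    With a normalizer that follows the corrected definition, the list of
    disagreeing pairs should be empty for any sample set.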

    > Several IETF protocols are being modified to use StringPrep today,
    > which use the old normalization. When/if StringPrep is updated to use
    > the new normalization, those protocols appear to be faced with an
    > upgrade problem.

    It's not a problem in practice, because the probability of meeting
    problematic sequences is extremely small.

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    

