Re: Normalization in panlingual application

From: Asmus Freytag
Date: Thu Sep 20 2007 - 06:49:22 CDT


    On 9/19/2007 7:01 PM, Philippe Verdy wrote:
    > Asmus Freytag wrote:
    >> The "K" series of normalization forms, by default, just "k"orrupt the
    >> data.
    > I do agree with that,
    Glad to hear that.
    > but I wonder why NFKC/NFKD have been integrated within
    > the standard for conformance, given that it causes many well-known problems.
    Some of the problems, I'm sure, were not well known in advance. Like the
    compatibility decompositions that these forms are based on, their main
    field of applicability was seen in identifier matching, a realm that
    traditionally supports only a subset of ordinary language.
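    A quick illustration of the "corruption" being discussed (my example, using
    Python's standard unicodedata module): compatibility decomposition flattens
    distinctions that are meaningful in ordinary text, while canonical
    normalization leaves them alone.

```python
import unicodedata

# NFKC erases compatibility distinctions: the ligature "fi" (U+FB01)
# and the superscript two (U+00B2) are flattened to plain "fi" and "2".
s = "\ufb01le size: 2\u00b2 MB"
print(unicodedata.normalize("NFKC", s))   # "file size: 22 MB"

# Canonical normalization (NFC) preserves both characters unchanged.
print(unicodedata.normalize("NFC", s) == s)   # True
```

    Acceptable for identifier matching, where "file" and "ﬁle" should collide;
    destructive for running text, where 2² and 22 are different quantities.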
    > It should at best have been just a non-mandatory recommendation, allowing
    > tailoring (even IDN no longer refers to it directly, and needed to redefine
    > its own foldings).
    That's because IDN is morphing beyond simple identifiers as
    traditionally understood for programming languages and the like. IDN is
    attempting to be closer to ordinary language, and that's why the
    limitations of NFKD/NFKC become apparent.
    > Anyway, why is NFKD/NFKC frozen as well as the compatibility mappings in the
    > UCD? Making these unmutable with the stability policy was not necessary. For
    > me, such mappings in the UCD are just informative, to document why these
    > compatibility characters were also encoded separately and how they differ
    > from the other characters referenced by the mapping.
    If you offer a specification, it's always useful to not allow options.
    Every option multiplies the set of legal, or valid mappings between
    input and output. Multiple options exponentially increase that set. With
    it, you not only increase the implementation and testing effort, but you
    increase the chance that two parties in an interchange do not support a
    compatible set of options.

    Normalization is about making interchange more reliable, by removing
    options. For example, applying NFD removes precomposed characters,
    reducing the number of ways in which the same information can be
    encoded. Adding options to the normalization forms undoes one of their
    major benefits for reliable interchange.
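    Concretely, the "same information, multiple encodings" problem looks like
    this (again Python's unicodedata, as one illustration):

```python
import unicodedata

# "é" can be encoded two ways: precomposed U+00E9, or
# "e" followed by a combining acute accent (U+0301).
a = "\u00e9"
b = "e\u0301"
print(a == b)   # False: different code point sequences

# NFD maps both spellings to the same decomposed sequence,
# so comparison becomes reliable.
print(unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b))   # True
```

    Precisely because there is exactly one answer, with no options, both
    parties in an interchange arrive at the same sequence.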
    > NFKC/NFKD forms should have been specified like other foldings, even less
    > normative than case mappings/foldings (that cause much less problems). It's
    > not even good, as it does not preserve linguistic differences, generates
    > severe corruption of texts, and is not tailorable.
    A better way to say this is that for many implementations, trying to use
    *normalization* to deal with the issue of compatibility characters and
    other ignorable, though real, distinctions is not the best way. The
    Unicode Consortium realized this several years ago when the work on
    UTS#30 on character foldings was begun. That is the direction in which
    implementers should turn.
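    To make the distinction concrete, here is a hedged sketch of what the
    folding approach looks like in practice (my own illustrative tailoring,
    not the actual UTS #30 specification): the application composes exactly
    the foldings it wants, rather than accepting one frozen normalization
    form.

```python
import unicodedata

def loose_match_key(s: str) -> str:
    """One possible application-chosen folding: case plus accents.

    This particular combination is an assumption for illustration;
    another application might fold only case, or also fold width.
    """
    s = s.casefold()                      # fold case distinctions
    s = unicodedata.normalize("NFD", s)   # split accents into combining marks
    # Drop the combining marks, folding away accent distinctions.
    return "".join(c for c in s if not unicodedata.combining(c))

print(loose_match_key("Caf\u00e9") == loose_match_key("CAFE"))   # True
```

    Note that case folding slots in naturally here, which is exactly the
    point made below: a folding framework can integrate it, normalization
    cannot.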

    The existing NFKC/NFKD should be limited to specific uses in the context
    of identifier matching, for which they were originally intended.

    It would be very counterproductive to pursue the discussion as if the
    goal was to improve these forms in the context of *normalization*. That
    would be a giant step backwards; in fact, it would negate the more
    recent development of a framework that sees *character foldings* as the
    core aspect. Such a framework, for the first time, is also able to
    seamlessly integrate case folding, something *normalization* cannot, and
    should not, do.

    Insofar as your current message is framed as if it were arguing for an
    improvement of NFKC/NFKD, it therefore does a disservice by distracting
    from the more productive, and more recent, developments.


    This archive was generated by hypermail 2.1.5 : Thu Sep 20 2007 - 06:52:26 CDT