Re: Does Unicode 4.1 change NFC?

From: Kenneth Whistler (
Date: Mon Apr 04 2005 - 17:22:51 CST

    John Burger asked:

    > >> The problem will of course come when new UCD data is fed into an old
    > >> normaliser.
    > >
    > > Actually, it will not. If a Unicode normalizer was a Unicode 4.0
    > > normalizer, it will *stay* a Unicode 4.0 normalizer.
    > Even if it is fed new UCD data?

    It depends on what Peter Kirk meant by a "normaliser" and
    by "UCD data".

    If by "normaliser" he means a normalizer generator that takes
    UCD data files as input and generates a normalizer process that
    corresponds to the version of UCD data files, then of course
    what you input matters.

    If by "normaliser" he means an already implemented normalizer
    process and by "new UCD data" he means text data corresponding
    to the new version of Unicode, then the behavior of the
    normalizer should not change.
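
    To make the two readings concrete, here is a toy sketch in Python. It is
    not a real normalizer: it does a single-level table lookup only, with no
    recursion, reordering, or composition. The names generate_decomposer,
    OLD_DATA, and NEW_DATA are invented for illustration, and the split of
    the two mappings into "old" and "new" data is purely hypothetical (both
    decompositions have been in the standard for a long time).

```python
from typing import Callable, Dict


def generate_decomposer(table: Dict[str, str]) -> Callable[[str], str]:
    """Toy 'normalizer generator': bakes a snapshot of the data into a function."""
    frozen = dict(table)  # copy, so later changes to `table` cannot leak in

    def decompose(text: str) -> str:
        # Single-level lookup only; characters not in the table pass through.
        return "".join(frozen.get(ch, ch) for ch in text)

    return decompose


# Hypothetical "data file" contents for two versions (illustrative split only).
OLD_DATA = {"\u1e0a": "D\u0307"}
NEW_DATA = {**OLD_DATA, "\u00c5": "A\u030a"}

# Reading 1: the generator's output depends on the data files it is fed.
old_normalizer = generate_decomposer(OLD_DATA)
new_normalizer = generate_decomposer(NEW_DATA)

# Reading 2: an already generated normalizer behaves the same no matter
# what text it is later given; it simply does not know the newer mapping.
text = "\u1e0a\u00c5"
old_result = old_normalizer(text)
new_result = new_normalizer(text)
```

    Here old_normalizer leaves U+00C5 untouched because that mapping was not
    in the data it was generated from; only regenerating from NEW_DATA
    changes the behavior.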

    I wouldn't be surprised if a normalizer *generator* were broken
    by a new version of the UCD data files corresponding to a new
    version of Unicode. After all, most of them were broken by
    Unicode 3.1 in the first place, if you recall.

    But I consider a tool generator in a different class than
    a final application that an end user interacts with. Anybody
    who uses a tool generator and who then doesn't test the tool
    (in this case a normalizer process) that it outputs for
    conformance to the version of the standard it supposedly
    supports -- again deserves what they get. And if the tool
    generator breaks on a new generation of UCD data files, that
    should be a pretty good sign that they've got some work to
    do before it is going to produce a conformant tool.
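
    A minimal sketch of that kind of conformance check, in Python, assuming
    the five-column line format of NormalizationTest.txt (source; NFC; NFD;
    NFKC; NFKD). The three sample lines are hand-picked for illustration
    rather than copied from any particular version of the file, and
    check_line and run_conformance are hypothetical helper names. The
    normalizer under test is passed in as a function; Python's own
    unicodedata.normalize stands in for it here.

```python
import unicodedata
from typing import Callable, Iterable, List

# Hand-picked sample lines in the NormalizationTest.txt column layout.
SAMPLE_LINES = """\
1E0A;1E0A;0044 0307;1E0A;0044 0307; # illustrative sample
212B;00C5;0041 030A;00C5;0041 030A; # illustrative sample
FB01;FB01;FB01;0066 0069;0066 0069; # illustrative sample
"""


def parse_field(field: str) -> str:
    """Turn a field such as '0044 0307' into the corresponding string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def check_line(line: str, normalize: Callable[[str, str], str]) -> List[str]:
    """Return a list of failure messages for one test line (empty if it passes)."""
    data = line.split("#", 1)[0].strip()
    if not data or data.startswith("@"):
        return []  # comment, blank line, or part header
    c1, c2, c3, c4, c5 = (parse_field(f) for f in data.split(";")[:5])
    expectations = [
        ("NFC", c2, (c1, c2, c3)),
        ("NFC", c4, (c4, c5)),
        ("NFD", c3, (c1, c2, c3)),
        ("NFD", c5, (c4, c5)),
        ("NFKC", c4, (c1, c2, c3, c4, c5)),
        ("NFKD", c5, (c1, c2, c3, c4, c5)),
    ]
    failures = []
    for form, expected, inputs in expectations:
        for source in inputs:
            if normalize(form, source) != expected:
                failures.append(f"{form} failed on line: {data}")
    return failures


def run_conformance(lines: Iterable[str],
                    normalize: Callable[[str, str], str]) -> List[str]:
    """Check every line and collect all failures."""
    failures: List[str] = []
    for line in lines:
        failures.extend(check_line(line, normalize))
    return failures


report = run_conformance(SAMPLE_LINES.splitlines(), unicodedata.normalize)
```

    An empty report means the normalizer passed these lines. A real check
    would run the full test file for the version of the standard the
    generated normalizer claims to support.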

    Note, for example, that anyone who tried to implement a
    fully generalized normalization generator based on UCD
    data files would have had it broken by the introduction
    of NormalizationCorrections.txt as a UCD data file in
    Unicode 3.2.0. That was a rather more serious departure in
    input than depending on the assumption that any character
    on the BMP would normalize completely to characters in the

