Re: Does Unicode 4.1 change NFC?

From: Andrew C. West
Date: Tue Apr 05 2005 - 05:25:20 CST


    I have been listening with increasing incredulity to Peter's claims that the
    Unicode standard should be constrained by *theoretical* problems resulting from
    invalid assumptions on the part of bad programmers. By the same reasoning no
    international standard should mandate four-digit years, as "bad programmers"
    used to be in the habit of storing years as two digits only, on the mistaken
    assumption that the end of the world was more likely than humankind managing to
    survive to the next century.

    On Tue, 05 Apr 2005 10:33:26 +0100, Peter Kirk wrote:
    > What I mean is a program which makes a proper separation between program
    > and data, which implements the Unicode normalisation *algorithm* (for a
    > particular version of Unicode) but uses the Unicode character *data*, as
    > well as the text data to be normalised, as part of its input. I don't
    > know of any normalisation program which works in this way, and in this
    > case efficiency may override good programming practice

    As any "good" software engineer knows, bad programming practice is never
    justifiable, even in the name of efficiency.

    > - although it
    > should be possible to compile the UCD normalisation data in a way which
    > can be used efficiently. But I do know of other programs which
    > effectively update themselves automatically with the latest version of
    > the UCD.
    > Of course if the algorithm is changed from one version of Unicode to
    > another, as it was when NormalizationCorrections.txt was added to the
    > standard, then the program needs to be updated, and the results of using
    > the new UCD data with the old algorithm are unlikely to be correct. But
    > from 4.0.0 to 4.1.0 there has not, I think, been an advertised change to
    > the algorithm, and so people might expect the normalisation program to
    > continue to work.
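    The separation Peter describes, with the normalization algorithm in the
    program and the character data supplied as input, could be sketched like
    this in Python. This is a minimal illustration under stated assumptions,
    not a full normalizer: it performs only recursive canonical decomposition
    (no canonical reordering or recomposition), and the UnicodeData.txt sample
    lines are inlined rather than read from a downloaded file.

```python
# Sketch of a data-driven normalizer: the canonical decompositions come
# from UnicodeData.txt content at run time rather than being baked into
# the code. Only recursive canonical decomposition is shown; canonical
# reordering and recomposition are omitted, so this is not full NFD/NFC.

# A few lines in UnicodeData.txt format (field 5 is the decomposition
# mapping; mappings in angle brackets are compatibility, not canonical).
SAMPLE_UCD = """\
00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A;;;;N;LATIN CAPITAL LETTER A RING;;;00E5;
212B;ANGSTROM SIGN;Lu;0;L;00C5;;;;N;ANGSTROM UNIT;;;;
FB01;LATIN SMALL LIGATURE FI;Ll;0;L;<compat> 0066 0069;;;;N;;;;;
"""

def load_canonical_decompositions(ucd_text):
    """Build {code point -> tuple of code points} from UnicodeData.txt lines."""
    decomp = {}
    for line in ucd_text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        mapping = fields[5]
        # Skip empty mappings and compatibility mappings like "<compat> ...".
        if mapping and not mapping.startswith("<"):
            decomp[int(fields[0], 16)] = tuple(int(cp, 16) for cp in mapping.split())
    return decomp

def decompose(text, decomp):
    """Recursively apply canonical decompositions to each character."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in decomp:
            out.append(decompose("".join(map(chr, decomp[cp])), decomp))
        else:
            out.append(ch)
    return "".join(out)

decomp = load_canonical_decompositions(SAMPLE_UCD)
# U+212B ANGSTROM SIGN decomposes via U+00C5 to U+0041 U+030A.
print([hex(ord(c)) for c in decompose("\u212b", decomp)])  # ['0x41', '0x30a']
```

    Feeding such a program a new UCD file changes its behaviour without
    recompilation, which is exactly why it would still need retesting against
    each new Unicode version.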

    As a closet software engineer I have some experience in both writing and testing
    software, and I would suggest that any normalization software that is not fully
    retested when it is updated to a new version of Unicode should be avoided like
    the plague. In my implementation of normalization I do not assume 16-bit
    characters will be normalized to 16-bit values, and I did not expect my
    implementation to be broken by the new version of Unicode, yet for the sake of
    good programming practice I did fully retest against the normalization test data
    ... which was a good thing, as it did unexpectedly fail the first time round,
    but that was due to PRI-29 -- which is an advertised change to the normalization
    algorithm that I had not been aware of.
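    For what it's worth, the kind of retest described above can be sketched
    with Python's unicodedata module. A couple of rows in NormalizationTest.txt's
    field format (source;NFC;NFD;NFKC;NFKD, each field a space-separated code
    point sequence) are inlined here instead of being read from the file, and
    only the invariants on the source column are checked, so this is a sketch
    of the idea rather than the full conformance test.

```python
import unicodedata

# Two rows in NormalizationTest.txt format: source;NFC;NFD;NFKC;NFKD.
SAMPLE_TESTS = """\
00C5;00C5;0041 030A;00C5;0041 030A;
212B;00C5;0041 030A;00C5;0041 030A;
"""

def to_str(field):
    """Turn a space-separated code point field into a Python string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())

failures = 0
for line in SAMPLE_TESTS.splitlines():
    source, nfc, nfd, nfkc, nfkd = (to_str(f) for f in line.split(";")[:5])
    # Check the normalization of the source column against each expected form.
    if unicodedata.normalize("NFC", source) != nfc:
        failures += 1
    if unicodedata.normalize("NFD", source) != nfd:
        failures += 1
    if unicodedata.normalize("NFKC", source) != nfkc:
        failures += 1
    if unicodedata.normalize("NFKD", source) != nfkd:
        failures += 1

print(f"{failures} failures")  # 0 failures
```

    Running the real test file end to end after every UCD update is cheap
    insurance against exactly the kind of silent breakage discussed here.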

    > I agree that they should test it before use with a new
    > version of Unicode, but I don't believe that all programmers are as
    > careful as Doug and Jill in such matters.

    Then you should be careful not to buy any software from such people.

    > There is a particular danger with the new fashion of programs
    > automatically updating themselves over the Internet - and sometimes
    > breaking themselves in the process, as I have discovered to my cost.

    I know of applications that automatically update simple lists of Unicode data
    over the internet (e.g. Microsoft's Keyboard Layout Creator, which can update
    the Unicode name list from the Unicode site), but I suspect that at present no
    application that does complex Unicode processing such as normalization simply
    downloads a new copy of the UCD data. In the future it is quite possible that
    applications will be able to download new versions of Unicode data files, and
    rewrite themselves to store the new data internally, but any such application
    would need to be tested even more thoroughly than an ordinary application, and
    the programmers had better be damned sure that they are not making any
    unwarranted assumptions.

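    A hypothetical sketch of the sort of defensive check such a self-updating
    application might run before accepting a downloaded UnicodeData.txt. The
    two checks shown (exactly 15 semicolon-separated fields per line, and
    strictly increasing code points) are illustrative assumptions about what
    a sanity check could look like, not an exhaustive validation.

```python
# Hypothetical sketch: before an application swaps in a freshly downloaded
# UnicodeData.txt, structural sanity checks can reject a truncated or
# corrupt download rather than letting it silently break normalization.

def looks_like_unicode_data(text):
    """Return True if the text passes basic UnicodeData.txt shape checks."""
    last_cp = -1
    lines = [l for l in text.splitlines() if l.strip()]
    if not lines:
        return False
    for line in lines:
        fields = line.split(";")
        if len(fields) != 15:          # UnicodeData.txt has exactly 15 fields
            return False
        try:
            cp = int(fields[0], 16)    # first field must be a hex code point
        except ValueError:
            return False
        if cp <= last_cp:              # code points must be strictly increasing
            return False
        last_cp = cp
    return True

good = ("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;\n"
        "0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;\n")
bad = "0042;truncated line\n"
print(looks_like_unicode_data(good), looks_like_unicode_data(bad))  # True False
```

    Checks like these are no substitute for rerunning the normalization test
    suite against the new data, but they catch the cheapest failures first.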

    This archive was generated by hypermail 2.1.5 : Tue Apr 05 2005 - 05:27:14 CST