Re: Does Unicode 4.1 change NFC?

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Apr 04 2005 - 16:05:13 CST

    > >In any conformant Unicode 4.0.1 (or earlier) version of normalization,
    > >U+FACF normalizes to (tada!) U+FACF. If it doesn't, the normalizer
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    > >isn't conformant. If sending U+FACF to such a normalizer crashes
    > >an application, then shame on the programmer.
    >
    > The problem will of course come when new UCD data is fed into an old
    > normaliser.

    Actually, it will not. If a normalizer is a Unicode 4.0
    normalizer, it will *stay* a Unicode 4.0 normalizer.

    By the Unicode 4.0 spec, U+FACF (an unassigned code point)
    normalized to U+FACF. (See above, emphasized now, to, well,
    *emphasize* the point.)

    If such a normalizer is fed Unicode 4.1 data, it will *still*
    proceed to normalize conformantly, according to the Unicode 4.0
    spec. The fact that there is an unassigned code point in the
    data, and that the normalizer is not up to the 4.1 spec, is
    basically no issue for it. It just doesn't support any version
    past Unicode 4.0, just as advertised (presumably) on the box.
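    To make that pass-through concrete, here is a minimal sketch in
    Python, using the standard unicodedata module as a stand-in for
    any versioned normalizer (its tables are frozen at whatever UCD
    version the interpreter was built with; U+40000 is used only
    because it is still unassigned at the time of writing):

        import unicodedata

        # The module's tables correspond to one fixed UCD version,
        # just like the Unicode 4.0 normalizer described above.
        print(unicodedata.unidata_version)

        # An unassigned code point has default properties: combining
        # class 0 and no decomposition. NFC leaves it alone.
        ch = "\U00040000"                        # unassigned (category Cn)
        assert unicodedata.category(ch) == "Cn"
        assert unicodedata.normalize("NFC", ch) == ch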

    > You have made much in the past of the need not to change the
    > normalisation algorithm,

    The normalization *algorithm* has not changed.

    Stability of normalized data has not been disturbed. Any
    data consisting of characters assigned in Unicode 4.0 that is
    in a normalization form by the 4.0 spec is *still* in that
    normalized form by the Unicode 4.1 spec.
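    For data made of assigned characters, that stability is easy to
    spot-check on any later implementation; for instance, with
    Python 3.8 or later (whose unicodedata.is_normalized is assumed
    here), U+00E9 has been NFC-stable since long before Unicode 4.0:

        import unicodedata

        # "e with acute" was in NFC form under Unicode 4.0 and, per the
        # normalization stability policy, remains so in later versions.
        assert unicodedata.is_normalized("NFC", "\u00E9")

        # Its NFD form is equally stable: base letter plus combining acute.
        assert unicodedata.normalize("NFD", "\u00E9") == "e\u0301"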

    That does not mean, and never has meant, that an implementation
    need not change to support extensions to the standard.

    > not to add new classes of exceptions etc so
    > that programs don't have to be rewritten for each new version, only the
    > data needs to be updated.

    In principle, one could write a normalization implementation to
    be completely data driven, so that once it were written, it
    could simply be handed the next version's UCD data files, and it
    would do the "right thing" with them. In practice, most implementations
    predigest all the data and perform various internal optimizations
    for table size, speed, or both. Such implementations need
    to be updated when the standard is updated, and the implementers
    generally understand the maintenance versus performance tradeoffs
    they are making here.
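    As a sketch of what "completely data driven" could look like,
    the fragment below derives canonical decomposition (NFD only;
    Hangul syllables, recomposition, and the exclusion table are
    deliberately omitted, so this is not a conformant normalizer)
    straight from UnicodeData.txt; the file path and function names
    are illustrative:

        def load_ucd(path="UnicodeData.txt"):
            """Read combining classes and canonical decompositions."""
            ccc, decomp = {}, {}
            with open(path, encoding="utf-8") as f:
                for line in f:
                    fields = line.split(";")
                    cp = int(fields[0], 16)
                    if fields[3] != "0":
                        ccc[cp] = int(fields[3])     # combining class
                    d = fields[5]
                    if d and not d.startswith("<"):  # "<tag>" = compatibility
                        decomp[cp] = [int(x, 16) for x in d.split()]
            return ccc, decomp

        def nfd(text, ccc, decomp):
            # Step 1: recursive canonical decomposition. Unassigned code
            # points are simply absent from the tables and pass through,
            # which is the Unicode 4.0 treatment of U+FACF described above.
            out, stack = [], [ord(c) for c in reversed(text)]
            while stack:
                cp = stack.pop()
                if cp in decomp:
                    stack.extend(reversed(decomp[cp]))
                else:
                    out.append(cp)
            # Step 2: canonical ordering; swap adjacent marks with
            # nonzero combining classes into nondecreasing order.
            i = 1
            while i < len(out):
                c = ccc.get(out[i], 0)
                if c and ccc.get(out[i - 1], 0) > c:
                    out[i - 1], out[i] = out[i], out[i - 1]
                    i = i - 1 if i > 1 else i + 1
                else:
                    i += 1
            return "".join(map(chr, out))

    Handed a 4.0 UnicodeData.txt, this maps U+FACF to itself; handed
    a 4.1 file, it maps U+FACF to U+2284A, with no change to the code.
    That is precisely the tradeoff against predigested tables noted above.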

    > The sort of outcome I might well expect to see
    > from this is a normaliser emitting surrogate pairs in UTF-8 or UTF-32.

    Well, if so, it is badly written, and probably non-conformant to
    begin with.

    What you *should* expect is:

    A Unicode 4.0 implementation will normalize U+FACF to U+FACF.

    A Unicode 4.0 implementation, if tested against a Unicode 4.1
    test data suite, will issue an exception (fail a test, whatever) for
    U+FACF.

    A Unicode 4.1 implementation will normalize U+FACF to U+2284A.

    Anything more or less than that is just bad software engineering.
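    On any Python whose unicodedata tables are at Unicode 4.1 or
    later (unicodedata.unidata_version says which), that last
    expectation can be checked directly:

        import unicodedata

        # U+FACF gained a singleton canonical decomposition to U+2284A
        # in Unicode 4.1, so a 4.1-level normalizer maps it there.
        assert unicodedata.normalize("NFC", "\uFACF") == "\U0002284A"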

    --Ken


