From: Kenneth Whistler (firstname.lastname@example.org)
Date: Mon Apr 04 2005 - 16:05:13 CST
> >In any conformant Unicode 4.0.1 (or earlier) version of normalization,
> >U+FACF normalizes to (tada!) U+FACF. If it doesn't, the normalizer
> >isn't conformant. If sending U+FACF to such a normalizer crashes
> >an application, then shame on the programmer.
> The problem will of course come when new UCD data is fed into an old normalizer.
Actually, it will not. If a Unicode normalizer was a Unicode 4.0
normalizer, it will *stay* a Unicode 4.0 normalizer.
By the Unicode 4.0 spec, U+FACF (an unassigned code point)
normalized to U+FACF. (See above, emphasized now to, well,
*emphasize* the point.)
If such a normalizer is fed Unicode 4.1 data, it will *still*
proceed to normalize conformantly, according to the Unicode 4.0
spec. The fact that there is an unassigned code point in the
data, and that the normalizer is not up to the 4.1 spec is
basically no issue to it. It just doesn't support any version
past Unicode 4.0, just as advertised (presumably) on the box.
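To make that concrete, here is a minimal sketch (in Python, and emphatically not a full normalizer) of why a 4.0-era implementation passes U+FACF through: per the spec, a starter code point with no entry in the decomposition tables maps to itself under NFC, and the 4.0 tables simply had no entry for the then-unassigned U+FACF. `DECOMPOSITIONS_4_0` below is a hypothetical stand-in for the real 4.0 data.

```python
# Hypothetical slice of a UCD 4.0 canonical decomposition table.
# U+FACF was unassigned in Unicode 4.0, so it has no entry.
DECOMPOSITIONS_4_0 = {}

def nfc_4_0_sketch(cp: int) -> list[int]:
    # Code points absent from the table (including unassigned ones)
    # have no decomposition and pass through NFC unchanged.
    return DECOMPOSITIONS_4_0.get(cp, [cp])

# A conformant 4.0 normalizer emits U+FACF as-is, no exception raised.
assert nfc_4_0_sketch(0xFACF) == [0xFACF]
```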
> You have made much in the past of the need not to change the
> normalisation algorithm,
The normalization *algorithm* has not changed.
Stability of normalized data has not been disturbed. Any
data in a normalization form by the Unicode 4.0 spec is *still*
in that normalized form by the Unicode 4.1 spec.
That does not mean, and never has meant, that an implementation
need not change to support extensions to the standard.
> not to add new classes of exceptions etc so
> that programs don't have to be rewritten for each new version, only the
> data needs to be updated.
In principle, one could write a normalization implementation to
be completely data driven, so that once it were written, it
could simply be handed the next version UCD data files, and it
would do the "right thing" with them. In practice, most implementations
predigest all the data and perform various internal optimizations
for either table size or speed or both. Such implementations need
to be updated when the standard is updated, and the implementers
generally understand the maintenance versus performance tradeoffs
they are making here.
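As a toy illustration of the fully data-driven approach, the sketch below parses singleton canonical decompositions straight out of UnicodeData.txt-format lines, so that handing it a newer data file changes behavior with no code change. The one sample line is abbreviated from the real 4.1 file; field 5 carries the decomposition mapping, and a leading "<" marks a compatibility (non-canonical) mapping.

```python
# One UnicodeData.txt-style record (fields: code;name;category;ccc;
# bidi;decomposition;...). In UCD 4.1, U+FACF gained a singleton
# canonical decomposition to U+2284A.
SAMPLE_UCD_LINES = [
    "FACF;CJK COMPATIBILITY IDEOGRAPH-FACF;Lo;0;L;2284A;;;;N;;;;;",
]

def load_decompositions(lines):
    # Build {code point: [decomposition]} from the raw data lines,
    # keeping canonical mappings only (compatibility ones start with "<").
    table = {}
    for line in lines:
        fields = line.split(";")
        cp, decomp = int(fields[0], 16), fields[5]
        if decomp and not decomp.startswith("<"):
            table[cp] = [int(f, 16) for f in decomp.split()]
    return table

table = load_decompositions(SAMPLE_UCD_LINES)
assert table[0xFACF] == [0x2284A]
```

Feed this loader the 4.0 file and U+FACF is absent from the table; feed it the 4.1 file and the mapping appears, which is precisely the update-the-data-not-the-code scenario described above.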
> The sort of outcome I might well expect to see
> from this is a normaliser emitting surrogate pairs in UTF-8 or UTF-32.
Well, if so, it is badly written, and probably non-conformant.
What you *should* expect is:
A Unicode 4.0 implementation will normalize U+FACF to U+FACF.
A Unicode 4.0 implementation, if tested against a Unicode 4.1
test data suite, will issue an exception (fail a test, whatever) for U+FACF.
A Unicode 4.1 implementation will normalize U+FACF to U+2284A.
Anything more or less than that is just bad software engineering.
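You can see the 4.1 behavior directly in any normalizer built against UCD 4.1 or later; Python's `unicodedata` module serves as one such implementation, since its tables are frozen at whatever UCD version the interpreter shipped with (well past 4.1 in any modern build):

```python
import unicodedata

# The UCD version this particular normalizer implements.
print(unicodedata.unidata_version)

# U+FACF carries a singleton canonical decomposition to U+2284A, and
# singletons are never recomposed, so every normalization form maps
# U+FACF to U+2284A.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, "\uFACF") == "\U0002284A"
```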
This archive was generated by hypermail 2.1.5 : Mon Apr 04 2005 - 16:06:24 CST