Re: The normalization form of the result of a dyadic operation.

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Feb 09 2001 - 15:22:20 EST


Mike Sykes asked:

> When a standard-conforming SQL implementation concatenates two normalized
> UCS strings, then it is required that the result be normalized (noting
> Unicode Standard Annex #15 Unicode Normalization Forms, Concatenation).
>
> My question is, supposing the NF of the two operands to be different, what
> should be the NF of the result?

My answer to this is: undefined.

Like Mark, I have my doubts about the notion of having strings carry around
normalization tags. And I think it would be a bad direction to have
something like the SQL standard even try to establish a bunch of
tables to indicate the NF of various string operations applied to
string parameters of varying NF.

I understand that this is a very SQL-like thing to do, since the SQL
standard has to define all kinds of datatype results for database
operations applied to mixed data types.

But I think it is a fundamental mistake to treat NFC, NFD, NFKC, and NFKD
as if they were datatypes.

UNISTRING + UNISTRING ==> UNISTRING
STRING (i.e. VARCHAR) + UNISTRING ==> UNISTRING

etc., are o.k.

But:

NFC + NFD ==> ??
NFKC + NFD ==> ??

This way lies madness, when normalization forms of strings are treated
*as if* they were datatypes.

As Peter pointed out, Unicode Normalization Forms are not closed even
under simple operations like concatenation. So there is a reasonable
amount of overhead required to maintain even a single normalization
form consistent across arbitrary collections of string operations.
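
To make the non-closure concrete, here is a minimal sketch in Python, assuming only the standard library's unicodedata module. The helper names is_normalized and concat_normalized are hypothetical, introduced just for this illustration. It shows two operands that are each in a given form whose plain concatenation is not, and the renormalization step that an implementation would need after every concatenation.

```python
import unicodedata


def is_normalized(form: str, s: str) -> bool:
    """Return True if s is already in the given normalization form."""
    return unicodedata.normalize(form, s) == s


def concat_normalized(form: str, left: str, right: str) -> str:
    """Concatenate, then renormalize so the result is in `form`.

    Hypothetical helper: renormalizing the whole result is the simplest
    correct approach; a real implementation could limit the work to the
    characters around the join point.
    """
    return unicodedata.normalize(form, left + right)


# NFC case: "e" and a lone combining acute accent are each in NFC,
# but their concatenation composes to U+00E9 under NFC.
nfc_left, nfc_right = "e", "\u0301"
nfc_raw = nfc_left + nfc_right
nfc_fixed = concat_normalized("NFC", nfc_left, nfc_right)

# NFD case: combining dot below (class 220) must sort before combining
# acute (class 230), so appending it to "a" + acute breaks canonical order.
nfd_left, nfd_right = "a\u0301", "\u0323"
nfd_raw = nfd_left + nfd_right
nfd_fixed = concat_normalized("NFD", nfd_left, nfd_right)

print(is_normalized("NFC", nfc_raw))   # False
print(is_normalized("NFD", nfd_raw))   # False
```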

In a database context, I think the only reasonable implementation
approach is to either enforce a single normalization form in the
database (and for that NFC would be the preference, with NFD the
other alternative), or to do no normalization whatsoever. The latter
choice would make sense primarily for exact preservation of data,
as in TEXT fields, but would disallow accurate indexing. For most
data, the customer preference is going to be for data to be treated
as if it is normalized, even if what they hand off isn't, in fact,
normalized, and for that a database-wide dictum to maintain all
Unicode data in either NFC or NFD is the way to go, as I see it.
Enforcing NFC on input to the database ensures that optimizations
and indexing will work correctly, and that, in turn, ensures that
database performance will be maintained. And data accuracy and
database performance are what the customers pay us for.
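
As a rough illustration of why enforcing one form on input keeps indexing accurate, the sketch below uses a Python dict as a stand-in for a database index. The class NfcTable and its methods are hypothetical, not any real database API; the only real library call is unicodedata.normalize.

```python
import unicodedata


class NfcTable:
    """Toy in-memory 'table' that keeps every key in NFC.

    Hypothetical stand-in for a database that enforces a single
    normalization form on input; the dict plays the role of an index.
    """

    def __init__(self) -> None:
        self._index: dict[str, int] = {}

    @staticmethod
    def _to_nfc(s: str) -> str:
        return unicodedata.normalize("NFC", s)

    def insert(self, key: str, value: int) -> None:
        # Normalize on the way in, whatever form the client handed off.
        self._index[self._to_nfc(key)] = value

    def lookup(self, key: str):
        # Normalize the probe the same way, so exact matching suffices.
        return self._index.get(self._to_nfc(key))


decomposed = "Cafe\u0301"    # "e" followed by combining acute accent
precomposed = "Caf\u00e9"    # single precomposed character U+00E9

# Without normalization, a plain index misses the equivalent spelling.
raw_index = {decomposed: 1}
raw_hit = precomposed in raw_index

# With NFC enforced on input, both spellings reach the same entry.
table = NfcTable()
table.insert(decomposed, 1)
found = table.lookup(precomposed)

print(raw_hit)   # False
print(found)     # 1
```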

Cluttering up SQL with a bunch of requirements for some tabular
matrix of implicit conversions between different normalization
forms will, in my opinion, just lead to another feature of SQL that
is not implemented in actual commercial implementations.

Note that not even the source document for the Unicode normalization
forms, i.e. UAX #15, ventures off into the territory of trying to
define closures across the various normalization forms for various
arbitrary collections of string operations.

Misha Wolf said:

>
> I am very keen that SQL and XML Query move in the same direction,
> based on the:
> Character Model for the World Wide Web 1.0
> http://www.w3.org/TR/charmod
> which is, in turn, based on NFC.
>
> Misha Wolf
> W3C I18N WG Chair
>

I concur with this sentiment. I think that SQL should first explore
the approach of simply recommending that *all* SQL data of a UNICHAR
data type be maintained in NFC, and then carefully explore the
rationales for considering alternatives and relaxations of that
rule in particular well-defined cases, rather than jumping right
into mandating some complex implicit datatype conversion table for
normalization forms.

That said, I can see an argument for considering the possibility of
maintaining connections with clients that have other than NFC data.
But I think those should be considered as an extension of the concept
of transcoding character sets. If I maintain data in NFC inside a
database, but have a connection to a client that requests NFD, it
should be relatively straightforward, in a Server-makes-right model,
for the server to do the normalization on the way out and way in,
just as it does now for, say, a Shift-JIS client talking to an
EUC-JP server, or an 8859-2 client talking to a UTF-8 server, and
so on. And in a Client-makes-right model, the same normalization
on the way in and normalization on the way out can also be treated
as a conversion in the client.
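
A minimal sketch of the Server-makes-right idea described above, assuming Python's standard unicodedata module. The names SERVER_FORM, from_client, and to_client are hypothetical; the point is only that normalization can sit at the connection boundary, in the same place a character set conversion would.

```python
import unicodedata

# Assumption for this sketch: the server stores everything in NFC.
SERVER_FORM = "NFC"


def from_client(text: str) -> str:
    """Convert incoming client text to the server's storage form."""
    return unicodedata.normalize(SERVER_FORM, text)


def to_client(text: str, client_form: str) -> str:
    """Convert stored text to the form the client connection requested."""
    return unicodedata.normalize(client_form, text)


# A client that works in NFD sends decomposed text.
incoming = "re\u0301sume\u0301"

# Way in: the server normalizes to NFC before storing.
stored = from_client(incoming)

# Way out: the server converts back to the client's requested form.
outgoing = to_client(stored, "NFD")

print(stored == "r\u00e9sum\u00e9")   # True
print(outgoing == incoming)           # True
```

In a Client-makes-right arrangement the same two functions would simply run on the client side of the connection instead.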

--Ken


