Re: The normalization form of the result of a dyadic operation.

From: J M Sykes (mike.sykes@acm.org)
Date: Mon Feb 12 2001 - 12:10:56 EST

Next message: Mark Leisher: "RE: extracting words"
Previous message: jarkko.hietaniemi@nokia.com: "RE: extracting words"
Maybe in reply to: J M Sykes: "The normalization form of the result of a dyadic operation."

We may need to come back to this, but for the moment I can summarise by
saying that I'm hoping we can agree that SQL should, like W3C Charmod, deal
only with NFC.

I interpolate a few more detailed comments below.

----- Original Message -----
From: "Kenneth Whistler" <kenw@sybase.com>
To: "Unicode List" <unicode@unicode.org>
Cc: <unicode@unicode.org>; <kenw@sybase.com>
Sent: Friday, February 09, 2001 8:00 PM
Subject: Re: The normalization form of the result of a dyadic operation.

> Mike Sykes asked:
>
> > When a standard-conforming SQL-implementation concatenates two normalized
> > UCS strings, then it is required that the result be normalized (noting
> > Unicode Standard Annex #15 Unicode Normalization Forms, Concatenation).
> >
> > My question is, supposing the NF of the two operands to be different, what
> > should be the NF of the result?
>
> My answer to this is: undefined.
>
> Like Mark, I have my doubts about the notion of having strings carry around
> normalization tags. And I think it would be a bad direction to have
> something like the SQL standard even try to establish a bunch of
> tables to indicate the NF of various string operations applied to
> string parameters of varying NF.
>
> I understand that this is a very SQL-like thing to do, since the SQL
> standard has to define all kinds of datatype results for database
> operations applied to mixed data types.

This is, of course, the root of the problem.

> But I think it is a fundamental mistake to treat NFC, NFD, NFKC, and NFKD
> as if they were datatypes.
>
> UNISTRING + UNISTRING ==> UNISTRING
> STRING (i.e. VARCHAR) + UNISTRING ==> UNISTRING
>
> etc., are o.k.
>
> But:
>
> NFC + NFD ==> ??
> NFKC + NFD ==> ??
>
> This way lies madness, when normalization forms of strings are treated
> *as if* they were datatypes.

I think 'madness' is a bit strong, but would certainly agree with
'undesirable and hopefully unnecessary complexity'.
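
To make the "NFC + NFD ==> ??" case concrete, here is a minimal Python sketch. It assumes Python's standard unicodedata module and plain code-point concatenation; the helper name in_form is hypothetical and is not taken from SQL or from UAX #15. It suggests that the raw result need not be in either operand's form, so the only well-defined answer is to renormalize to a single agreed form.

```python
import unicodedata


def in_form(form: str, s: str) -> bool:
    """Return True if s is already in the given normalization form."""
    return unicodedata.normalize(form, s) == s


# Left operand in NFC: precomposed e-acute (U+00E9).
left = unicodedata.normalize("NFC", "\u00e9")
# Right operand in NFD: "u" followed by combining diaeresis (U+0308).
right = unicodedata.normalize("NFD", "\u00fc")

# Plain concatenation with no renormalization step.
mixed = left + right

mixed_is_nfc = in_form("NFC", mixed)  # right half is still decomposed
mixed_is_nfd = in_form("NFD", mixed)  # left half is still precomposed

# Renormalizing to one agreed form gives a well-defined result.
result_nfc = unicodedata.normalize("NFC", mixed)

print(mixed_is_nfc, mixed_is_nfd, in_form("NFC", result_nfc))
```

If every operand is brought to NFC before or after the operation, no table of pairwise result forms is needed.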

> As Peter pointed out, Unicode Normalization Forms are not closed even
> under simple operations like concatenation. So there is a reasonable
> amount of overhead required to maintain even a single normalization
> form consistent across arbitrary collections of string operations.
>
> In a database context, I think the only reasonable implementation
> approach is to either enforce a single normalization form in the
> database (and for that NFC would be the preference, with NFD the
> other alternative), or to do no normalization whatsoever. The latter
> choice would make sense primarily for exact preservation of data,
> as in TEXT fields, but would disallow accurate indexing. For most
> data, the customer preference is going to be for data to be treated
> as if it is normalized, even if what they hand off isn't, in fact,
> normalized, and for that a database-wide dictum to maintain all
> Unicode data in either NFC or NFD is the way to go, as I see it.
> Enforcing NFC on input to the database ensures that optimizations
> and indexing will work correctly, and that, in turn, ensures that
> database performance will be maintained. And data accuracy and
> database performance are what the customers pay us for.
>
> Cluttering up SQL with a bunch of requirements for some tabular
> matrix of implicit conversions between different normalization
> forms will, in my opinion, just lead to another feature of SQL that
> is not implemented in actual commercial implementations.

OK, OK!
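
The two points quoted above (that normalization forms are not closed under concatenation, and that enforcing NFC on input keeps indexing accurate) can be illustrated with a short Python sketch. It assumes the standard unicodedata module; the names nfc, concat_nfc, and index are hypothetical, and the dict merely stands in for a database index.

```python
import unicodedata


def nfc(s: str) -> str:
    """Normalize s to NFC."""
    return unicodedata.normalize("NFC", s)


# Two strings that are each already in NFC ...
a = "e"
b = "\u0301"  # a lone combining acute accent is unchanged by NFC
both_nfc = nfc(a) == a and nfc(b) == b

# ... whose plain concatenation is not, because the pair composes to U+00E9.
raw = a + b
raw_is_nfc = nfc(raw) == raw


def concat_nfc(x: str, y: str) -> str:
    """Concatenate and renormalize so the result is NFC again."""
    return nfc(x + y)


# A dict standing in for an index whose keys were normalized on input.
index = {nfc("caf\u00e9"): 1}

probe = "cafe\u0301"               # same text, decomposed spelling
raw_hit = probe in index           # misses: different code point sequence
normalized_hit = nfc(probe) in index

print(both_nfc, raw_is_nfc, raw_hit, normalized_hit)
```

The renormalization in concat_nfc is the overhead referred to above; in practice only the boundary between the two operands needs re-examination, but this sketch renormalizes the whole result for simplicity.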
>
> Note that not even the source document for the Unicode normalization
> forms, i.e. UAX #15, ventures off into the territory of trying to
> define closures across the various normalization forms for various
> arbitrary collections of string operations.
>
> Misha Wolf said:
>
> >
> > I am very keen that SQL and XML Query move in the same direction,
> > based on the:
> > Character Model for the World Wide Web 1.0
> > http://www.w3.org/TR/charmod
> > which is, in turn, based on NFC.
> >
> > Misha Wolf
> > W3C I18N WG Chair
> > >
> I concur with this sentiment. I think that SQL should first explore
> the approach of simply recommending that *all* SQL data of a UNICHAR
> data type be maintained in NFC, and then carefully explore the
> rationales for considering alternatives and relaxations of that
> rule in particular well-defined cases, rather than jumping right
> into mandating some complex implicit datatype conversion table for
> normalization forms.
>
I very much agree with that, as well.

> That said, I can see an argument for considering the possibility of
> maintaining connections with clients that have other than NFC data.
> But I think those should be considered as an extension of the concept
> of transcoding character sets. If I maintain data in NFC inside a
> database, but have a connection to a client that requests NFD, it
> should be relatively straightforward, in a Server-makes-right model,
> for the server to do the normalization on the way out and way in,
> just as it does now for, say, a Shift-JIS client talking to an
> EUC-JP server, or an 8859-2 client talking to a UTF-8 server, and
> so on. And in a Client-makes-right model, the same normalization
> on the way in and normalization on the way out can also be treated
> as a conversion in the client.

I should probably agree also with that, if I understood it ;-?
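
For what it is worth, the Server-makes-right idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Server class, its put and get methods, and the STORAGE_FORM constant are all hypothetical, and a dict stands in for the database. The server normalizes to NFC on the way in and converts to whatever form the client requested on the way out, much as it would transcode between character sets.

```python
import unicodedata

STORAGE_FORM = "NFC"  # assumption: the database holds all Unicode data in NFC


class Server:
    """Hypothetical server that normalizes at the connection boundary."""

    def __init__(self) -> None:
        self._rows = {}

    def put(self, key: str, text: str) -> None:
        # Way in: whatever the client sends is stored in the storage form.
        stored_key = unicodedata.normalize(STORAGE_FORM, key)
        self._rows[stored_key] = unicodedata.normalize(STORAGE_FORM, text)

    def get(self, key: str, client_form: str) -> str:
        # Way out: convert stored data to the form this client asked for.
        stored_key = unicodedata.normalize(STORAGE_FORM, key)
        return unicodedata.normalize(client_form, self._rows[stored_key])


server = Server()
server.put("name", "Ame\u0301lie")          # an NFD client writes

as_nfc = server.get("name", "NFC")          # an NFC client reads
as_nfd = server.get("name", "NFD")          # an NFD client reads

print(len(as_nfc), len(as_nfd))
```

In a Client-makes-right model the same two normalize calls would simply move into the client, leaving the server to assume NFC throughout.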

Mike.


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:18 EDT