Re: The normalization form of the result of a dyadic operation.

From: Mark Davis (markdavis34@home.com)
Date: Fri Feb 09 2001 - 13:12:31 EST


The whole principle of tagging individual strings with NF* is a bit odd to me; not sure I like it. The K forms in particular are really a folding operation, much like casing. I would not expect to find a model where someone tagged every string in a database with its Case, and then had some elaborate system in every function involving strings so that the result of any operation could be successfully tagged with Upper or Lower. Seems not well motivated.
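
To illustrate the folding point, here is a minimal Python sketch using the standard unicodedata module. The sample characters are picked purely for illustration; the sketch only shows that NFKC collapses distinctions that NFC keeps, in much the way lowercasing collapses case.

```python
import unicodedata

# Illustrative samples: fi ligature, superscript two, circled digit one.
samples = ["\ufb01", "\u00b2", "\u2460"]

for s in samples:
    nfc = unicodedata.normalize("NFC", s)    # canonical form: unchanged for these
    nfkc = unicodedata.normalize("NFKC", s)  # compatibility form: folded
    print(f"U+{ord(s):04X}  NFC={nfc!r}  NFKC={nfkc!r}")

# Casing folds in the same lossy way: two distinct inputs, one output.
print("Unicode".lower() == "UNICODE".lower())
```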

As for D vs C, I don't know that there is a huge advantage to tagging, vs. just picking one of them consistently all the time. And I can see many drawbacks in having to maintain the tags all the time, and handle mixed operations. There are advantages to the W3C approach; just always keep the data in one form.

Given the model you have, however, I think Peter's A.c table is well thought out. There are implications for any string operation, not just concatenation. Operations on a single string (uppercasing, lowercasing, and so on) are fairly straightforward: the result stays in the same form. The same goes for substringing, although that may require some fixup at the ends.

With multiple strings in a function (not just two), you have to have a consistent output. Given the constraints you have, I think Peter's rules are good, and can be easily extended:

1. If all are K, retain the K in the output; otherwise don't.
2. If all are D, retain the D in the output; otherwise convert to C.
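
As a rough sketch of the two rules above, assuming each string carries a tag that is one of "NFC", "NFD", "NFKC" or "NFKD": the names result_form and concat are hypothetical, invented for this illustration, and the normalization itself is delegated to Python's standard unicodedata module.

```python
import unicodedata

VALID_FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def result_form(*forms):
    """Pick the output form for an operation over strings tagged with `forms`."""
    if not forms:
        raise ValueError("at least one form is required")
    for f in forms:
        if f not in VALID_FORMS:
            raise ValueError(f"unknown normalization form: {f!r}")
    keep_k = all("K" in f for f in forms)         # rule 1: K only if all are K
    keep_d = all(f.endswith("D") for f in forms)  # rule 2: D only if all are D
    return "NF" + ("K" if keep_k else "") + ("D" if keep_d else "C")

def concat(tagged):
    """Concatenate (text, form) pairs and return a (text, form) pair.

    The joined text is explicitly renormalized, since concatenating
    normalized strings does not guarantee a normalized result.
    """
    form = result_form(*(f for _, f in tagged))
    joined = "".join(t for t, _ in tagged)
    return unicodedata.normalize(form, joined), form

# Two operands reproduce the cells of Table A.c; more operands work the same way.
print(result_form("NFKC", "NFD"))
print(result_form("NFKD", "NFKD", "NFKC"))
print(concat([("a", "NFC"), ("\u0301", "NFC")]))
```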

However, there are very important exceptions; look at binary comparison of strings. You do not need to worry about differences in Ks because an NFKC string *is* NFC; an NFKD string *is* NFD; no conversion necessary. But to preserve transitivity you ALWAYS have to pick either a C or a D; no matter what the input. So you have to (logically at least) choose a single common form, C or D. That is, if C is the common form, then comparing an NF*D to an NF*D -- even though they are the same form -- you *have* to map both to NF*C.
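
A small Python sketch of why a single common form is needed, assuming C is the form chosen. The function names are hypothetical; naive_compare stands for a comparison that looks at the stored code points without mapping to a common form first.

```python
import unicodedata

COMMON_FORM = "NFC"  # assumption: C is picked as the one common form

def naive_compare(a, b):
    """Compare stored code points as-is (no mapping to a common form)."""
    return (a > b) - (a < b)

def binary_compare(a, b):
    """Compare code points after mapping BOTH operands to COMMON_FORM."""
    na = unicodedata.normalize(COMMON_FORM, a)
    nb = unicodedata.normalize(COMMON_FORM, b)
    return (na > nb) - (na < nb)

x = "e\u0301"   # e + combining acute: an NFD string
y = "\u00e9"    # precomposed e-acute: the NFC equivalent of x
z = "f"         # already in both NFC and NFD

# x and y are canonically equivalent, yet compared as stored they fall
# on opposite sides of z, so no consistent ordering exists.
print(naive_compare(x, z), naive_compare(y, z))

# Mapping everything to the common form, even x against z where both
# are already NFD, keeps the ordering consistent.
print(binary_compare(x, z), binary_compare(y, z), binary_compare(x, y))

# The K distinction needs no extra step: an NFKC string is already NFC.
s = unicodedata.normalize("NFKC", "\ufb01e\u0301")
print(unicodedata.normalize("NFC", s) == s)
```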

For binary comparison, C is probably the best choice, since it matches more data and thus requires less processing. Even though D produces somewhat better results, binary comparison will simply not match user expectations anyway -- it is more for internal structures, where you need *some* fast, consistent ordering but it does not need to be end-user oriented.
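
The "less processing" point could look something like the following sketch, which assumes Python 3.8 or later for unicodedata.is_normalized. The helper name to_nfc is hypothetical; the idea is only that text already in C is passed through untouched.

```python
import unicodedata

def to_nfc(s):
    """Return s in NFC, skipping the conversion when s is already NFC."""
    if unicodedata.is_normalized("NFC", s):
        return s  # fast path: nothing to convert
    return unicodedata.normalize("NFC", s)

already_c = "caf\u00e9"    # precomposed: returned as-is
decomposed = "cafe\u0301"  # e + combining acute: gets converted
print(to_nfc(already_c) is already_c)
print(to_nfc(decomposed) == already_c)
```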

Mark

----- Original Message -----
From: <Peter_Constable@sil.org>
To: "Unicode List" <unicode@unicode.org>
Sent: Thursday, February 08, 2001 14:32
Subject: Re: The normalization form of the result of a dyadic operation.

>
> On 02/08/2001 11:20:27 AM "J M Sykes" wrote:
>
> >When a standard-conforming SQL implementation concatenates two normalized
> >UCS strings, then it is required that the result be normalized (noting
> >Unicode Standard Annex #15 Unicode Normalization Forms, Concatenation).
>
> Yes. It must be understood that a concatenated string is not guaranteed to
> be normalised until it is explicitly normalised, regardless of the state of
> the operand strings.
>
>
>
> >My question is, supposing the NF of the two operands to be different, what
> >should be the NF of the result?
> >
> >In its present state, our proposal specifies the result by referring to the
> >following table:
> >
> >Table A
> >=======
> >           |Operand 2
> > Operand 1 |NFKD  NFKC  NFD   NFC
> > ----------+----------------------
> > NFKD      |NFKD  NFKC  NFD   NFC
> > NFKC      |NFKC  NFKC  NFD   NFC
> > NFD       |NFD   NFD   NFD   NFC
> > NFC       |NFC   NFC   NFC   NFC
> >
> >It has been suggested that the following would be preferable:
> >
> >
> >Table B
> >=======
> >           |Operand 2
> > Operand 1 |NFKD  NFKC  NFD   NFC
> > ----------+----------------------
> > NFKD      |NFKD  NFKC  NFKD  NFKC
> > NFKC      |NFKC  NFKC  NFKD  NFKC
> > NFD       |NFKD  NFKD  NFD   NFC
> > NFC       |NFKC  NFKC  NFC   NFC
>
>
>
>
> I'm trying to make sense of these tables. Apparently, Table A consistently
> applies a precedence of NFC > NFD > NFKC > NFKD. (I.e. the form for the
> result should be the same as that of the operand with the highest form
> according to this ordering.) Apparently, Table B gives a precedence to K
> forms (K > ~K), and a precedence to C over D (C > D), but the first
> ordering (K > ~K) is given higher priority over the second ordering (C >
> D).
>
> Actually, I don't think I'd go for either. Certainly, table B has a
> concern: precedence given to the compatibility decompositions that occur in
> NFKD and NFKC -- this results in removing distinctions that, in certain
> situations, might be important. Table B should only be used with caution.
>
> Both tables have an anomalous characteristic: if one operand is NFC, then
> the result is always to be composed, but if one operand is NFKC and the
> other is decomposed, then the result goes in two directions depending upon
> the K or ~K property of the other operand. Why? That seems rather strange
> to me. If the "Kompatibility" issue is orthogonal to the (de)composition
> issue (which these tables follow, and which I think makes sense), then I
> would think either C should always take precedence over D, or vice versa.
> If we extract a portion from each table (and simplify because the operation
> is commutative), we find
>
> Sub-table A
> =======
>           |Operand 2
> Operand 1 |NFKD  NFD
> ----------+----------
> NFKC      |NFKC  NFD
>
> Sub-table B
> =======
>           |Operand 2
> Operand 1 |NFKD  NFD
> ----------+----------
> NFKC      |NFKC  NFKD
>
>
> Tables A and B could have just as readily had
>
> Sub-table A.a
> =======
>           |Operand 2
> Operand 1 |NFKD  NFD
> ----------+----------
> NFKC      |NFKD  NFC
>
> Sub-table B.a
> =======
>           |Operand 2
> Operand 1 |NFKD  NFD
> ----------+----------
> NFKC      |NFKD  NFKC
>
> and I think that wouldn't have been any more or less motivated. It still
> wouldn't make sense to me, though: I would have expected D to always have
> precedence over C, as in Tables A.b and B.b:
>
> Table A.b
> =======
>           |Operand 2
> Operand 1 |NFKD  NFKC  NFD   NFC
> ----------+----------------------
> NFKD      |NFKD  NFKD  NFD   NFD
> NFKC      |NFKD  NFKC  NFD   NFC
> NFD       |NFD   NFD   NFD   NFD
> NFC       |NFD   NFC   NFD   NFC
>
> Table B.b
> =======
>           |Operand 2
> Operand 1 |NFKD  NFKC  NFD   NFC
> ----------+----------------------
> NFKD      |NFKD  NFKD  NFKD  NFKD
> NFKC      |NFKD  NFKC  NFKD  NFKC
> NFD       |NFKD  NFKD  NFD   NFD
> NFC       |NFKD  NFKC  NFD   NFC
>
> or for C to always take precedence over D, as in Tables A.c and B.c:
>
> Table A.c
> =======
>           |Operand 2
> Operand 1 |NFKD  NFKC  NFD   NFC
> ----------+----------------------
> NFKD      |NFKD  NFKC  NFD   NFC
> NFKC      |NFKC  NFKC  NFC   NFC
> NFD       |NFD   NFC   NFD   NFC
> NFC       |NFC   NFC   NFC   NFC
>
> Table B.c
> =======
>           |Operand 2
> Operand 1 |NFKD  NFKC  NFD   NFC
> ----------+----------------------
> NFKD      |NFKD  NFKC  NFKD  NFKC
> NFKC      |NFKC  NFKC  NFKC  NFKC
> NFD       |NFKD  NFKC  NFD   NFC
> NFC       |NFKC  NFKC  NFC   NFC
>
>
> (What a lot of alternatives!)
>
> For the reason described above, I think compatibility decomposition should
> be avoided if either operand did not use it (i.e. ~K > K). As for C vs. D,
> I have a personal preference for D over C, but W3C has (with not invalid
> reasons) chosen NFC as the preferred and recommended normalisation form in
> any protocols that they create. As a result, I'd be inclined from all these
> options to select Table A.c. It gives precedence to C over D, and it avoids
> K unless both operands conform to K.
>
>
>
> - Peter
>
>
> ---------------------------------------------------------------------------
> Peter Constable
>
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <peter_constable@sil.org>
>
>
>
>


