Re: The normalization form of the result of a dyadic operation.

From: Peter_Constable@sil.org
Date: Thu Feb 08 2001 - 17:40:26 EST


On 02/08/2001 11:20:27 AM "J M Sykes" wrote:

>When a standard-conforming SQL-implementation concatenates two normalized
>UCS strings, then it is required that the result be normalized (noting
>Unicode Standard Annex #15 Unicode Normalization Forms, Concatenation).

Yes. It must be understood that a concatenated string is not guaranteed to
be normalised until it is explicitly normalised, regardless of the state of
the operand strings.
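
To illustrate, here is a minimal Python sketch using the standard
unicodedata module. The helper name is_normalized and the sample strings
are my own choices for illustration, not anything from the proposal. It
shows two operands that are each normalised, yet whose concatenation is not.

```python
import unicodedata

def is_normalized(form: str, text: str) -> bool:
    """Return True if text is unchanged by normalising it to the given form."""
    return unicodedata.normalize(form, text) == text

# NFC case: the right operand begins with a combining mark that composes
# with the last character of the left operand.
left_c, right_c = "e", "\u0301"            # U+0301 COMBINING ACUTE ACCENT
joined_c = left_c + right_c

# NFD case: after concatenation the combining marks are no longer in
# canonical order (dot below, class 220, must precede acute, class 230).
left_d, right_d = "a\u0301", "\u0323"      # U+0323 COMBINING DOT BELOW
joined_d = left_d + right_d

for form, left, right, joined in [
    ("NFC", left_c, right_c, joined_c),
    ("NFD", left_d, right_d, joined_d),
]:
    print(form,
          "operands normalised:", is_normalized(form, left) and is_normalized(form, right),
          "| concatenation normalised:", is_normalized(form, joined))
```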

>My question is, supposing the NF of the two operands to be different, what
>should be the NF of the result?
>
>In its present state, our proposal specifies the result by referring to the
>following table:
>
>Table A
>=======
>           | Operand 2
> Operand 1 | NFKD  NFKC  NFD   NFC
> ----------+----------------------
> NFKD      | NFKD  NFKC  NFD   NFC
> NFKC      | NFKC  NFKC  NFD   NFC
> NFD       | NFD   NFD   NFD   NFC
> NFC       | NFC   NFC   NFC   NFC
>
>It has been suggested that the following would be preferable:
>
>
>Table B
>=======
>           | Operand 2
> Operand 1 | NFKD  NFKC  NFD   NFC
> ----------+----------------------
> NFKD      | NFKD  NFKC  NFKD  NFKC
> NFKC      | NFKC  NFKC  NFKD  NFKC
> NFD       | NFKD  NFKD  NFD   NFC
> NFC       | NFKC  NFKC  NFC   NFC

I'm trying to make sense of these tables. Apparently, Table A consistently
applies a precedence of NFC > NFD > NFKC > NFKD (i.e. the form of the
result is that of the operand ranked highest in this ordering).
Apparently, Table B gives precedence to K forms (K > ~K) and to C over D
(C > D), with the first ordering (K > ~K) given higher priority than the
second (C > D).
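
As a quick check of my reading of Table A, here is a small Python sketch.
The dictionary is transcribed from the quoted Table A; the names TABLE_A,
RANK and highest_form are mine, purely for illustration. It compares every
cell with the form of the higher-ranked operand.

```python
FORMS = ["NFKD", "NFKC", "NFD", "NFC"]

# Table A as quoted: TABLE_A[operand1][operand2] gives the result form.
TABLE_A = {
    "NFKD": dict(zip(FORMS, ["NFKD", "NFKC", "NFD", "NFC"])),
    "NFKC": dict(zip(FORMS, ["NFKC", "NFKC", "NFD", "NFC"])),
    "NFD":  dict(zip(FORMS, ["NFD",  "NFD",  "NFD", "NFC"])),
    "NFC":  dict(zip(FORMS, ["NFC",  "NFC",  "NFC", "NFC"])),
}

# Precedence NFC > NFD > NFKC > NFKD, expressed as a rank (higher wins).
RANK = {"NFKD": 0, "NFKC": 1, "NFD": 2, "NFC": 3}

def highest_form(a: str, b: str) -> str:
    """Return whichever operand form ranks higher in the ordering."""
    return max(a, b, key=RANK.__getitem__)

# Collect every cell where the table disagrees with the ordering.
mismatches = [
    (a, b) for a in FORMS for b in FORMS
    if TABLE_A[a][b] != highest_form(a, b)
]
print("cells not explained by the ordering:", mismatches)
```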

Actually, I don't think I'd go for either. Table B certainly raises a
concern: it gives precedence to the compatibility decompositions that occur
in NFKD and NFKC, which removes distinctions that, in certain situations,
might be important. Table B should only be used with caution.

Both tables have an anomalous characteristic: if one operand is NFC, then
the result is always to be composed, but if one operand is NFKC and the
other is decomposed, then the result goes in two directions depending upon
the K or ~K property of the other operand. Why? That seems rather strange
to me. If the "Kompatibility" issue is orthogonal to the (de)composition
issue (which these tables follow, and which I think makes sense), then I
would think either C should always take precedence over D, or vice versa.
If we extract a portion from each table (and simplify, because the operation
is commutative), we find

Sub-table A
===========
          | Operand 2
Operand 1 | NFKD  NFD
----------+----------
NFKC      | NFKC  NFD

Sub-table B
===========
          | Operand 2
Operand 1 | NFKD  NFD
----------+----------
NFKC      | NFKC  NFKD
Tables A and B could have just as readily had

Sub-table A.a
=============
          | Operand 2
Operand 1 | NFKD  NFD
----------+----------
NFKC      | NFKD  NFC

Sub-table B.a
=============
          | Operand 2
Operand 1 | NFKD  NFD
----------+----------
NFKC      | NFKD  NFKC

and I think that wouldn't have been any more or less motivated. It still
wouldn't make sense to me, though: I would have expected D to always have
precedence over C, as in Tables A.b and B.b:

Table A.b
=========
          | Operand 2
Operand 1 | NFKD  NFKC  NFD   NFC
----------+----------------------
NFKD      | NFKD  NFKD  NFD   NFD
NFKC      | NFKD  NFKC  NFD   NFC
NFD       | NFD   NFD   NFD   NFD
NFC       | NFD   NFC   NFD   NFC

Table B.b
=========
          | Operand 2
Operand 1 | NFKD  NFKC  NFD   NFC
----------+----------------------
NFKD      | NFKD  NFKD  NFKD  NFKD
NFKC      | NFKD  NFKC  NFKD  NFKC
NFD       | NFKD  NFKD  NFD   NFD
NFC       | NFKD  NFKC  NFD   NFC

or for C to always take precedence over D, as in Tables A.c and B.c:

Table A.c
=========
          | Operand 2
Operand 1 | NFKD  NFKC  NFD   NFC
----------+----------------------
NFKD      | NFKD  NFKC  NFD   NFC
NFKC      | NFKC  NFKC  NFC   NFC
NFD       | NFD   NFC   NFD   NFC
NFC       | NFC   NFC   NFC   NFC

Table B.c
=========
          | Operand 2
Operand 1 | NFKD  NFKC  NFD   NFC
----------+----------------------
NFKD      | NFKD  NFKC  NFKD  NFKC
NFKC      | NFKC  NFKC  NFKC  NFKC
NFD       | NFKD  NFKC  NFD   NFC
NFC       | NFKC  NFKC  NFC   NFC

(What a lot of alternatives!)
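
Since the four tables A.b, B.b, A.c and B.c each follow from two independent
yes/no choices (does K win over ~K, and does C win over D), they can be
generated rather than written out by hand. A rough Python sketch follows;
the function and parameter names (combine, k_wins, c_wins, print_table) are
hypothetical, and the output layout only approximates the tables above.

```python
FORMS = ["NFKD", "NFKC", "NFD", "NFC"]

def combine(a: str, b: str, k_wins: bool, c_wins: bool) -> str:
    """Pick the result form by treating K/~K and C/D as independent axes."""
    a_k, b_k = "K" in a, "K" in b
    a_c, b_c = a.endswith("C"), b.endswith("C")
    # If K wins, one K operand is enough; otherwise both must be K.
    k = (a_k or b_k) if k_wins else (a_k and b_k)
    # If C wins, one composed operand is enough; otherwise both must be C.
    c = (a_c or b_c) if c_wins else (a_c and b_c)
    return "NF" + ("K" if k else "") + ("C" if c else "D")

def print_table(name: str, k_wins: bool, c_wins: bool) -> None:
    """Print one 4x4 table of result forms for the given pair of choices."""
    print(name)
    print("Operand 1 | " + "  ".join(f"{f:<4}" for f in FORMS))
    for a in FORMS:
        row = "  ".join(f"{combine(a, b, k_wins, c_wins):<4}" for b in FORMS)
        print(f"{a:<9} | {row}")
    print()

print_table("Table A.b (~K > K, D > C)", k_wins=False, c_wins=False)
print_table("Table B.b (K > ~K, D > C)", k_wins=True,  c_wins=False)
print_table("Table A.c (~K > K, C > D)", k_wins=False, c_wins=True)
print_table("Table B.c (K > ~K, C > D)", k_wins=True,  c_wins=True)
```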

For the reason described above, I think compatibility decomposition should
be avoided if either operand did not use it (i.e. ~K > K). As for C vs. D,
I have a personal preference for D over C, but the W3C has (for not invalid
reasons) chosen NFC as the preferred and recommended normalisation form in
any protocols that they create. As a result, of all these options I'd be
inclined to select Table A.c. It gives precedence to C over D, and it avoids
K unless both operands conform to K.
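
For what it's worth, here is how the Table A.c choice might look when
applied to actual strings. This is only a sketch in Python using the
standard unicodedata module; result_form and concat are hypothetical names,
and it assumes the caller already knows the form of each operand.

```python
import unicodedata

def result_form(form1: str, form2: str) -> str:
    """Table A.c rule: K only if both operands are K; C if either is C."""
    k = "K" in form1 and "K" in form2
    c = form1.endswith("C") or form2.endswith("C")
    return "NF" + ("K" if k else "") + ("C" if c else "D")

def concat(s1: str, form1: str, s2: str, form2: str):
    """Concatenate, then explicitly normalise to the form chosen by A.c."""
    form = result_form(form1, form2)
    return unicodedata.normalize(form, s1 + s2), form

# An NFD operand joined to an NFC operand yields an NFC result.
text, form = concat("cafe", "NFD", "\u0301", "NFC")
print(form, [hex(ord(ch)) for ch in text])

# An NFC operand containing U+FB01 (the fi ligature) joined to an NFKC
# operand: the result is NFC, so the ligature is not folded away.
text2, form2 = concat("\ufb01", "NFC", "n", "NFKC")
print(form2, [hex(ord(ch)) for ch in text2])
```

The second call is the case that worries me about Table B: under a K-first
rule the ligature would be replaced by its compatibility decomposition.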

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>



