Re: Merging combining classes, was: New contribution N2676

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Oct 27 2003 - 19:56:34 CST


From: "Peter Kirk" <peterkirk@qaya.org>

> On 27/10/2003 10:31, Philippe Verdy wrote:
>
> > ...
> >
> >The bad thing is that there's no way to say that a superfluous
> >CGJ character can be "safely" removed if CC(char1) <= CC(char2),
> >so that it will preserve the semantic of the encoded text even
> >though such filtered text would not be canonically equivalent.
> >
> >
> Philippe, you have some interesting ideas here and in your previous
> posting.
>
> I wonder if it would be possible to define a character with combining
> class zero which is automatically removed during normalisation when it
> is superfluous, in the sense that you define here. Of course that means
> a change to the normalisation algorithm, but one which does not cause
> backward compatibility issues.
>
> I guess what is more likely to be acceptable, as it doesn't require but
> only suggests a change to the algorithm, is a character which can
> optionally be removed, when superfluous, as a matter of canonical or
> compatibility equivalence. If we call this character CCO, we can define
> that a sequence <c1, CCO, c2> is canonically or compatibly equivalent to
> <c1, c2> if cc(c1) <= cc(c2), or if either cc(c1) or cc(c2) = 0. I am
> deliberately now not using CGJ as this behaviour might destabilise the
> normalisation of current text using CGJ. But there would be no stability
> impact if this is a new character.
>
> The advantage of doing this is that a text could be generated with lots
> of CCOs which could then be removed automatically if they are superfluous.
>
> I am half feeling that there must be some objections to this, but it's
> too late at night here to put my finger on them, so I will send this out
> and see what response it generates.

After rereading TR15, I think it will be difficult to define even a
compatibility equivalence between <c1, CCO, c2> and <c1, c2>:

[quote]
Canonical decomposition is the process of taking a string, recursively
replacing composite characters using the Unicode canonical decomposition
mappings (including the algorithmic Hangul canonical decomposition mappings,
see Annex 10: Hangul), and putting the result in canonical order.
Compatibility decomposition is the process of taking a string, replacing
composite characters using both the Unicode canonical decomposition mappings
and the Unicode compatibility decomposition mappings, and putting the result
in canonical order.
[/quote]
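
To make the quoted definitions concrete, here is a tiny Python
illustration (the example string is mine; unicodedata is just a
convenient stand-in for the UCD data):

    import unicodedata

    # "q" followed by dot above (cc=230) and dot below (cc=220):
    # canonical ordering puts the lower combining class first.
    s = "q\u0307\u0323"
    print([hex(ord(c)) for c in unicodedata.normalize("NFD", s)])
    # -> ['0x71', '0x323', '0x307']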

The problem is that we would like the NFKC form to require a composition
step in which superfluous CCOs are removed. But, as the algorithm is
currently standardized to compose characters pairwise, this would not work
directly from the list of decomposition mappings in the UCD.

So the solution would have to be algorithmic (like the Hangul syllable
decompositions, which are not listed in the UCD table even though they are
much simpler, since they work on pairs).
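
For comparison, here is a minimal sketch (Python) of the algorithmic
Hangul syllable decomposition from TR15; the constants are the standard
ones, only the function name is mine:

    SBase, LBase, VBase, TBase = 0xAC00, 0x1100, 0x1161, 0x11A7
    VCount, TCount = 21, 28
    NCount = VCount * TCount          # 588

    def decompose_hangul_syllable(ch):
        # Maps one precomposed syllable to its <L, V[, T]> jamo sequence.
        index = ord(ch) - SBase
        L = LBase + index // NCount
        V = VBase + (index % NCount) // TCount
        T = TBase + index % TCount
        jamo = [chr(L), chr(V)]
        if T != TBase:                # TBase means no trailing consonant
            jamo.append(chr(T))
        return "".join(jamo)

    # decompose_hangul_syllable("\uAC01") -> "\u1100\u1161\u11A8"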

This would require a revision of the normative standard UTS#15.

We would also like to avoid reinserting superfluous CCOs in NFKD forms.
This would require a new concept: decomposition exclusions (for now these
exist only implicitly, for non-decomposable characters that have singleton
canonical equivalents decomposing to them). The existing concept is the
composition exclusion; some of these are listed separately because they
are script-specific, or exist because of a post-composition version (to
maintain the stability of the normalized forms).

However, UTS#15 requires that even the NFK* forms be stable across
versions. As the new CCO character did not exist before, this requirement
is not a problem for us, and it would have the benefit of forbidding the
re-decomposition of existing <c1, c2> sequences into <c1, CCO, c2> in the
NFKD form (the composition of <c1, CCO, c2> into <c1, c2> does not have to
be stable with previous versions, as it did not exist before).

This means that superfluous CCOs would need to be removed from _BOTH_ the
NFKD and NFKC forms, and so we will need:

for all c1, c2 in _assigned and valid_ Unicode code points (in any version),
    if CC(c1) <= CC(c2), or CC(c1) = 0, or CC(c2) = 0 (CCO is superfluous):
        NFKD(<c1, CCO, c2>) = NFKD(<c1, c2>)
        NFKC(<c1, CCO, c2>) = NFKC(<c1, c2>)
    else (CC(c1) > CC(c2) > 0, so the CCO blocks a reordering and must stay):
        NFKD(<c1, CCO, c2>) = NFKD(<c1>)+<CCO>+NFKD(<c2>)
        NFKC(<c1, CCO, c2>) = NFKC(<c1>)+<CCO>+NFKC(<c2>)
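
As a rough sketch of what the removal rule means in code (Python; the
code point used for CCO is purely hypothetical, since no such character
exists, and unicodedata.combining() stands in for CC()):

    import unicodedata

    CCO = "\uE000"    # hypothetical placeholder for the proposed CCO

    def drop_superfluous_cco(text):
        # Removes a CCO between two characters when it does nothing,
        # i.e. when no canonical reordering would happen anyway.
        out = []
        for i, ch in enumerate(text):
            if ch == CCO and 0 < i < len(text) - 1:
                cc1 = unicodedata.combining(text[i - 1])
                cc2 = unicodedata.combining(text[i + 1])
                if not (cc1 > cc2 > 0):    # CC(c1) <= CC(c2), or either is 0
                    continue               # superfluous: drop it
            out.append(ch)
        return "".join(out)

    # drop_superfluous_cco("a" + CCO + "\u0301")      -> "a\u0301"  (dropped)
    # drop_superfluous_cco("\u0301" + CCO + "\u0323") is unchanged (230 > 220 > 0)

This only checks the immediate neighbours, which is enough to illustrate
the rule; a real implementation would also have to handle runs of CCOs
and unassigned code points.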

So there is no way to specify any such "compatibility decomposition
mapping" in the UCD: it would require an entry keyed on the source pair
<c1, c2>, whereas UCD decomposition mappings are keyed on a single
character, and the three-character mapping would be invalid as a
decomposition (it is only valid, and required, for the composition).

It would not fit well in an external table either, so it would need to be
defined algorithmically, and thus as an amendment to UTS#15 covering both
the compatibility composition _and_ the compatibility decomposition
algorithms. That is not impossible, as it does not contradict the
stability pact, but it would immediately cause some problems to implement
in a simple revision of Unicode 4.0.x.

That is why it is too soon to define the CCO now: an amended UTS#15 that
includes the CCO rules for the NFK* forms first needs to be approved. In
the interim, we can only recommend using CGJ, even though we cannot
enforce the "safe" removal of superfluous CGJ occurrences within the NFK*
forms. As all documents using CGJ will have to remain unchanged by the new
algorithm implementation, they would need to be re-encoded with CCO after
manual testing, the CGJ method being deprecated later in favor of CCO.

If we have to do all this, then it seems simpler to mandate the same
algorithm at the same time for the NFC and NFD forms as well (here too,
the NFD form must _never_ insert any CCO into a source string that does
not contain it, due to the stability pact). The work involved is the same.


