Re: Merging combining classes, was: New contribution N2676

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Oct 27 2003 - 18:16:18 CST


From: "Peter Kirk" <peterkirk@qaya.org>

> On 27/10/2003 10:31, Philippe Verdy wrote:
>
> > ...
> >
> >The bad thing is that there's no way to say that a superfluous
> >CGJ character can be "safely" removed if CC(char1) <= CC(char2),
> >so that it will preserve the semantic of the encoded text even
> >though such filtered text would not be canonically equivalent.
> >
> >
> Philippe, you have some interesting ideas here and in your previous
> posting.
>
> I wonder if it would be possible to define a character with combining
> class zero which is automatically removed during normalisation when it
> is superfluous, in the sense that you define here. Of course that means
> a change to the normalisation algorithm, but one which does not cause
> backward compatibility issues.
>
> I guess what is more likely to be acceptable, as it doesn't require but
> only suggests a change to the algorithm, is a character which can
> optionally be removed, when superfluous, as a matter of canonical or
> compatibility equivalence. If we call this character CCO, we can define
> that a sequence <c1, CCO, c2> is canonically or compatibly equivalent to
> <c1, c2> if cc(c1) <= cc(c2), or if either cc(c1) or cc(c2) = 0. I am
> deliberately now not using CGJ as this behaviour might destabilise the
> normalisation of current text using CGJ. But there would be no stability
> impact if this is a new character.
>
> The advantage of doing this is that a text could be generated with lots
> of CCOs which could then be removed automatically if they are superfluous.

That's exactly the idea: existing uses of CGJ in current texts may not
adhere strictly to this rule, so there would be objections to removing it
automatically, even where it is not necessary.
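A minimal sketch of the removal rule quoted above, in Python. CCO is not an
encoded character, so the private-use code point U+E000 (combining class 0)
stands in for it here; the names CCO, is_superfluous and
strip_superfluous_cco are hypothetical and only serve this illustration.

```python
import unicodedata

# Hypothetical stand-in for the proposed CCO character. No such character
# is encoded; a private-use code point (combining class 0) is assumed here.
CCO = "\uE000"


def is_superfluous(c1: str, c2: str) -> bool:
    """Rule quoted above: a CCO between c1 and c2 can be dropped if
    cc(c1) <= cc(c2), or if either combining class is 0."""
    cc1 = unicodedata.combining(c1)
    cc2 = unicodedata.combining(c2)
    return cc1 <= cc2 or cc1 == 0 or cc2 == 0


def strip_superfluous_cco(text: str) -> str:
    """Drop every CCO that does not block a canonical reordering."""
    out = []
    for i, ch in enumerate(text):
        # Compare the last character actually kept with the next one,
        # so that a run of several CCOs still leaves one in place when needed.
        if ch == CCO and out and i + 1 < len(text):
            if is_superfluous(out[-1], text[i + 1]):
                continue
        out.append(ch)
    return "".join(out)
```

For example, with U+0301 (class 230) and U+0323 (class 220), the CCO in
<a, U+0301, CCO, U+0323> is kept, because without it the two marks would be
reordered, while the CCO in <a, U+0323, CCO, U+0301> is dropped.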

One note however: under the "stability pact", a canonical decomposition
cannot contain more than two characters. This means that <c1, CCO, c2>
cannot be made canonically equivalent (with the current definition) to
<c1, c2>, also because this equivalence is contextual and depends on two
characters instead of just one.

A canonical decomposition must be either a single character, or a single
base character (possibly precomposed) followed by a combining character.
Likewise, the result of the composition must be a single character.
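The length limit can be checked against the character database that ships
with Python. This is only a sketch: canonical_mapping and
max_canonical_mapping_length are hypothetical helper names, and the result
reflects whatever Unicode version the local unicodedata module carries.

```python
import sys
import unicodedata


def canonical_mapping(ch: str) -> list[int]:
    """Return the canonical decomposition mapping of ch as code points,
    or [] if ch has no mapping or only a compatibility (<tag>) mapping."""
    fields = unicodedata.decomposition(ch).split()
    if not fields or fields[0].startswith("<"):
        return []
    return [int(field, 16) for field in fields]


def max_canonical_mapping_length() -> int:
    """Scan every code point and report the longest canonical mapping."""
    return max(
        len(canonical_mapping(chr(cp))) for cp in range(sys.maxunicode + 1)
    )
```

For instance, U+00C5 maps to the pair <U+0041, U+030A>, and U+212B maps to
the single character U+00C5; no canonical mapping has three characters, so
a three-character sequence such as <c1, CCO, c2> does not fit this scheme.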

So, all we can do is to define compatibility equivalence between:
    <c1, CCO, c2>
and:
    <c1, c2>
if and only if CCO is superfluous, i.e. unless:
    CC(c1) > CC(c2) > 0.

This won't affect the NFC and NFD conversion algorithms, but it can affect
the NFKC and NFKD conversion algorithms. This means that XML, SGML and
HTML, as far as they rely only on NFC, are not affected by this change
[ and the W3C is happy :-> ].
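As an analogy only, an existing character with a compatibility mapping shows
the same split between the four forms. The contextual CCO rule itself is not
something current data can express; the sketch below just uses U+FB01
(LATIN SMALL LIGATURE FI), and all_forms is a hypothetical helper name.

```python
import unicodedata

# U+FB01 has only a compatibility mapping (to "f" + "i"), no canonical one.
LIGATURE_FI = "\uFB01"


def all_forms(text: str) -> dict[str, str]:
    """Return the text in each of the four normalization forms."""
    return {
        form: unicodedata.normalize(form, text)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }
```

Calling all_forms(LIGATURE_FI) leaves the ligature untouched under NFC and
NFD and replaces it with "fi" only under NFKC and NFKD, which is the
behaviour a compatibility-only equivalence for CCO would share.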
