Re: Regarding canonical combining class value for U+0F76 and similar characters (Unicode 6.2.0)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sat, 18 May 2013 02:33:07 +0100

On Sat, 18 May 2013 02:02:07 +0200
Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> Yes, it is expected. In fact it has long been very common in Unicode
> (there are many "Mn" marks with combining class 0, and this is
> not limited to one script).
>
> A combining class 0 DOES NOT mean that the character will not be a
> non-spacing mark (or that it will be spacing), but just that it blocks
> reorderings under standard normalizations and for recognizing
> canonical equivalences.
>
> (see for example CGJ which also has combining class 0 and which is
> used mostly to insert such blocking behavior, without having any real
> semantic meaning by itself ; once the normalization step has been
> done, it can be discarded from the input stream in renderers or
> collators, except for special purposes like rendering CGJ with its
> own visible glyph in some "visible controls" edit mode)

It cannot be discarded when collation is used for sorting.

> Technically CGJ is not "Mn" (not a combining mark) but a "formatting
> control",

Wrong! Its general category is still Mn.
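
This is easy to check against the character database. A minimal sketch,
assuming Python's standard unicodedata module (the Unicode version it
reports depends on the Python build); the helper name describe() is
made up for illustration:

```python
import unicodedata

CGJ = "\u034f"                # COMBINING GRAPHEME JOINER
TIBETAN_VOCALIC_R = "\u0f76"  # TIBETAN VOWEL SIGN VOCALIC R


def describe(ch):
    """Return (general category, canonical combining class) for one character."""
    return unicodedata.category(ch), unicodedata.combining(ch)


for ch in (CGJ, TIBETAN_VOCALIC_R):
    gc, ccc = describe(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: gc={gc}, ccc={ccc}")
```

Both characters should come back as general category Mn with canonical
combining class 0.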

> but it still participates in the grouping of "default
> grapheme clusters", as if it were a combining mark -- and for the most
> part, it is an artefact of the encoding in the UCS, and considered
> foreign to the script by native writers, but it is also needed for
> compatibility reasons.

As far as I am aware, there are no 'compatibility' reasons for
preserving CGJ. It was initially convenient to assign CGJ the general
category Mn, and this has been found to be a happy coincidence, for it
can serve a useful role in disrupting various processes. The earliest
such role was in disrupting contractions in collation, for experience
has shown that it is more natural to treat a potential digraph as such
unless there is a mark to the contrary, rather than to require a
marker to show that it is a digraph.
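
To illustrate the contraction case, here is a toy sketch in Python. It
is not the UCA and uses no real collation library: the weight table,
the "ch" contraction (sorting after "h", as in traditional Czech or
Slovak ordering) and the function toy_sort_key() are all assumptions
made up for this example, and it handles lowercase ASCII only. CGJ is
treated as contributing no weight, but because it is seen during
matching it stops "c" and "h" from being taken as a digraph:

```python
import string

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

# Hypothetical primary weights: a=0, b=10, c=20, ... plus one contraction.
PRIMARY = {c: 10 * i for i, c in enumerate(string.ascii_lowercase)}
PRIMARY["ch"] = PRIMARY["h"] + 5  # the digraph "ch" sorts just after "h"


def toy_sort_key(text):
    """Build a list of primary weights, preferring two-letter contractions."""
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in PRIMARY:
            key.append(PRIMARY[pair])  # contraction matched
            i += 2
        elif text[i] == CGJ:
            i += 1  # no weight of its own, but it has already split the pair
        else:
            key.append(PRIMARY[text[i]])
            i += 1
    return key


words = ["chata", "hrad", "c" + CGJ + "hata", "dom"]
ordered = sorted(words, key=toy_sort_key)
```

In this sketch "chata" sorts after "hrad", while the spelling with CGJ
sorts under "c", before "dom". Stripping CGJ before building the key
would make the two spellings indistinguishable, which is why it cannot
simply be discarded ahead of sorting.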

Secondly, CGJ has been found useful for preserving the arrangement of
combining marks: because its combining class is 0, canonical reordering
cannot move marks across it.
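
A small sketch of that effect, assuming Python's standard unicodedata
module; the choice of acute (ccc 230) and dot below (ccc 220) is just
for illustration:

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER, ccc 0
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, ccc 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, ccc 220

plain = "a" + ACUTE + DOT_BELOW
guarded = "a" + ACUTE + CGJ + DOT_BELOW

# Canonical ordering sorts adjacent non-zero classes, so the marks swap.
nfd_plain = unicodedata.normalize("NFD", plain)

# CGJ (ccc 0) sits between the marks, so neither can move past it.
nfd_guarded = unicodedata.normalize("NFD", guarded)

print([f"U+{ord(c):04X}" for c in nfd_plain])
print([f"U+{ord(c):04X}" for c in nfd_guarded])
```

The unguarded sequence comes out as a, dot below, acute; the guarded
one is returned unchanged.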

> However, in many scripts, there exist true
> combining marks (Mn) that have combining class 0 (i.e. whose relative
> ordering in the encoded stream is semantically significant when they
> are used in conjunction with other reorderable combining marks).

It is usually the case that one order is right and the others are
wrong. The only situation I can think of where more than one order
might be right is where a base character is omitted, and even then I
have no clear candidates.

Received on Fri May 17 2013 - 20:38:36 CDT
