Re: Normalization Form KC for Linux

From: Asmus Freytag (
Date: Thu Aug 19 1999 - 22:06:30 EDT

>... The W3C "Character Model for
>the WWW" also says that compatibility characters (i.e. things that are
>removed when applying compatibility decomposition) are discouraged.
>Something similar is most probably appropriate for Linux. There are
>definitely various ways and places to 'implement' this

I am still disturbed by seeing blanket statements like that. They imply
that there is a precise procedure that was used to unequivocally identify
compatibility characters, and furthermore that the desirability of
maintaining distinctions between characters that have a compatibility
mapping to each other is always zero (0).

The way compatibility mappings were introduced to (essentially 'sold to')
the technical committee was as an 'aid to implementers', not as a looming
sword of deprecation. Therefore, one must be careful not to imply that they
represent a consensus of the Unicode Technical Committee to discourage ALL
characters that have compatibility mappings -- in fact the set of these
mappings has never been reviewed with that aim.

The letterlike symbols block is full of characters that are mapped to their
corresponding letter with a compatibility mapping, without regard to the
fact that 'normalizing' a 'Hilbert space' (BLACK LETTER CAPITAL H) into a
regular H seriously distorts not just the appearance, but the semantics of
the text.
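
To make the point concrete, here is a minimal sketch using Python's standard
unicodedata module (my choice of tool for illustration, not something used in
this thread). It shows that the compatibility mapping for BLACK LETTER CAPITAL
H is only a <font> variant of H, and that form KC folds it into a plain H
while form C leaves it alone. The helper name describe() is hypothetical.

```python
import unicodedata

# BLACK LETTER CAPITAL H, the character named in the paragraph above.
BLACK_LETTER_H = "\u210c"


def describe(ch):
    """Return (name, raw decomposition field, NFC result, NFKC result)."""
    return (
        unicodedata.name(ch),
        unicodedata.decomposition(ch),
        unicodedata.normalize("NFC", ch),
        unicodedata.normalize("NFKC", ch),
    )


name, decomp, nfc, nfkc = describe(BLACK_LETTER_H)
print(name)                      # character name from the Unicode database
print(decomp)                    # tagged compatibility mapping
print(nfc == BLACK_LETTER_H)     # canonical normalization keeps the symbol
print(nfkc)                      # compatibility normalization yields a letter
```

After form KC nothing in the text records that the H was ever a distinct
symbol, which is exactly the loss of semantics described above.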

>In charlint [],
>a perl program to do Normalization and other checks, my plan is to
>make KC available, but also to allow various variants, e.g. only
>normalize away superscripts/subscripts, and so on.

In some cases, the semantic difference can faithfully be captured in rich
text via a formatting attribute. Super/sub are prime candidates. In order
to do this right, however, it takes more than running the normalization
form algorithm; you also have to insert the formatting codes into the
stream. For HTML, where super and sub can be expressed, that is easy, and
may ultimately work for a larger number of compatibility variations. In
terminal emulators (where this thread originated) that may not be the case
at all.
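
As a rough sketch of what "inserting the formatting codes" could look like
for HTML, the following Python fragment inspects the compatibility tag of
each character and emits <sup>/<sub> markup instead of silently dropping the
distinction. This is an illustration under my own assumptions, not how
charlint works; the names TAG_TO_ELEMENT and super_sub_to_html are made up
for this example.

```python
import html
import unicodedata

# Compatibility tags that HTML can express directly as elements.
TAG_TO_ELEMENT = {"<super>": "sup", "<sub>": "sub"}


def super_sub_to_html(text):
    """Replace super/subscript compatibility characters with HTML markup.

    Characters whose decomposition is tagged <super> or <sub> are replaced
    by their base character(s) wrapped in the matching element. Everything
    else is passed through (HTML-escaped) without normalization.
    """
    out = []
    for ch in text:
        fields = unicodedata.decomposition(ch).split()
        element = TAG_TO_ELEMENT.get(fields[0]) if fields else None
        if element is None:
            out.append(html.escape(ch))
        else:
            base = "".join(chr(int(cp, 16)) for cp in fields[1:])
            out.append(f"<{element}>{html.escape(base)}</{element}>")
    return "".join(out)


print(super_sub_to_html("x\u00b2 and H\u2082O"))
```

Note that this treats every character tagged <super> the same way, including
modifier letters used in phonetic notation, so a real tool would probably
want a narrower, reviewed list. A terminal emulator has no equivalent of the
<sup> element to fall back on.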

An easy-to-understand case is that of the spaces with typographic width. In
TeX, each of these Unicode characters corresponds to a primitive, e.g. \quad.
Therefore, if converting text to TeX input, it may in fact be appropriate
to convert EM QUAD to \quad. It's certainly NOT appropriate to convert it
to SPACE (and yet that is the compatibility mapping for it). In the case of
these spaces the misunderstanding arises from the confusion of their two
uses in hot-metal typography. On the one hand, they were used to expand and
fill spaces, the operation that modern software handles algorithmically,
without the need to insert space characters. On the other hand, they were
used to achieve specific and yet conventional spacings in certain contexts;
see e.g. the uses for \quad described in the TeXbook. Modern software
normally addresses this question neither algorithmically nor with
formatting styles. (In the case of TeX, all characters are commands, and
all macros can be mere input conventions for characters not on the
keyboard; mapping \quad to a character code in Unicode is therefore quite
justified.) Therefore, the use of the typographical spaces should be
*en*couraged for the second use, while *dis*couraged for the first. This
is something that no simpleminded cry like "let's use form KC" can ever do
for you.
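
A small sketch of the difference, again in Python for illustration only. The
table SPACE_TO_TEX and the function to_tex are hypothetical; only the EM QUAD
to \quad pairing comes from the discussion above, and the EM SPACE entry is
my own assumption, added because EM QUAD is canonically equivalent to EM
SPACE.

```python
import unicodedata

EM_QUAD = "\u2001"

# Hypothetical conversion table for generating TeX input.
# A trailing space is emitted after the control word so that a following
# letter is not absorbed into the macro name (\quadb would be a different
# command); TeX skips spaces after a control word.
SPACE_TO_TEX = {
    "\u2001": "\\quad ",  # EM QUAD (the pairing mentioned above)
    "\u2003": "\\quad ",  # EM SPACE (assumption, see lead-in)
}


def to_tex(text):
    """Convert typographic spaces to TeX commands, leaving other text as is."""
    return "".join(SPACE_TO_TEX.get(ch, ch) for ch in text)


sample = "a" + EM_QUAD + "b"
print(repr(to_tex(sample)))                            # spacing intent kept
print(repr(unicodedata.normalize("NFKC", sample)))     # collapsed to SPACE
```

The targeted conversion keeps the author's spacing intent; form KC reduces
the same input to an ordinary word space.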

There are some bad duplicates in the standard where settling on a single
character for an all-Unicode environment makes a lot of sense.


and similar. A RING is already a canonical equivalence for this reason. Mu
might have become one, had the legacy character set for MICRO SIGN not
been 8859-1. There's probably a good chance that the committee could agree
on what the list of these direct duplicates is.
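
The asymmetry between the two cases can be checked with a few lines of
Python (standard unicodedata module, used here purely as an illustration).
I am assuming that "A RING" refers to the pair ANGSTROM SIGN / LATIN CAPITAL
LETTER A WITH RING ABOVE, and "Mu" to MICRO SIGN / GREEK SMALL LETTER MU.

```python
import unicodedata

ANGSTROM_SIGN = "\u212b"
A_WITH_RING = "\u00c5"
MICRO_SIGN = "\u00b5"
GREEK_SMALL_MU = "\u03bc"


def nfc(s):
    """Canonical composition (form C)."""
    return unicodedata.normalize("NFC", s)


def nfkc(s):
    """Compatibility composition (form KC)."""
    return unicodedata.normalize("NFKC", s)


# Canonical equivalence: form C alone already merges the duplicate.
print(nfc(ANGSTROM_SIGN) == A_WITH_RING)

# Compatibility mapping only: form C keeps MICRO SIGN, form KC folds it.
print(nfc(MICRO_SIGN) == MICRO_SIGN)
print(nfkc(MICRO_SIGN) == GREEK_SMALL_MU)
```

So the first duplicate disappears under any normalization form, while the
second is only removed by the much broader form KC.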
