Normalization Form KC considered harmful

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Thu Aug 19 1999 - 22:06:03 EDT


Blindly "normalizing" away the distinctions between all 'compatibility'
variants of characters is a very dangerous thing to do. Compatibility
variants have distinct properties, and in many cases the compatibility
relation is not one of proper 'equivalence' but more of a 'based-on'
relationship.
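
To make the loss concrete, here is a minimal sketch, assuming Python's
standard unicodedata module as a stand-in for an implementation of the
TR #15 forms. The helper name forms() and the sample strings are
illustrative only. Form C leaves each compatibility character alone, while
form KC replaces it with the character it is 'based on'.

```python
import unicodedata

def forms(text):
    """Return (NFC, NFKC) for text. Hypothetical helper for illustration."""
    return (unicodedata.normalize("NFC", text),
            unicodedata.normalize("NFKC", text))

# Sample compatibility characters (chosen for illustration):
# U+00B2 SUPERSCRIPT TWO, U+FB01 LATIN SMALL LIGATURE FI,
# U+2460 CIRCLED DIGIT ONE.
samples = ["E=mc\u00b2", "\ufb01n", "\u2460"]

for s in samples:
    nfc, nfkc = forms(s)
    # NFC keeps the variant; NFKC folds it into the base character(s).
    print(ascii(s), "NFC unchanged:", nfc == s, "NFKC:", ascii(nfkc))
```

After form KC the three samples become "E=mc2", "fin" and "1"; nothing in
the resulting character codes records that a superscript, a ligature or a
circled digit was ever there.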

In many cases, even if one chooses to normalize the *character codes* with
form KC, one would still need to preserve the property (appearance)
distinction by other means (font tags, rich text attributes), and all
compatibility mappings carry suggestive property relations with them.

There truly is a difference between E=mc2 and E=mc<super>2 ;-)
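
One way to keep that distinction while still folding the character codes
is sketched below. It assumes Python's standard unicodedata module, whose
decomposition() call returns the compatibility tag (for example "<super>")
together with the mapping. The function name nfkc_with_tags and the
angle-bracket markup are hypothetical, and folding one character at a time
is a simplification of full form KC, not a replacement for it.

```python
import unicodedata

def nfkc_with_tags(text):
    """Fold compatibility characters but keep their tag as markup.

    Hypothetical helper: the markup syntax is an assumption made for
    illustration, not anything defined by TR #15.
    """
    out = []
    for ch in unicodedata.normalize("NFC", text):
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith("<"):
            # Compatibility mapping, e.g. "<super> 0032" for U+00B2.
            tag = decomp[1:decomp.index(">")]
            folded = unicodedata.normalize("NFKC", ch)
            out.append(f"<{tag}>{folded}</{tag}>")
        else:
            # No mapping, or a canonical one: leave the character as is.
            out.append(ch)
    return "".join(out)

print(nfkc_with_tags("E=mc\u00b2"))  # E=mc<super>2</super>
```

The tag recovered here is exactly the 'suggestive property relation' that
the compatibility mapping carries; plain form KC discards it.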

A./

At 02:03 PM 8/18/99 -0700, you wrote:
>
>
>from the context i take it you wanted to say "...good reasons to choose C
>instead of KC.", right?
>
>markus
>
>
>Francois Yergeau <yergeau@alis.com> on 99-08-18 13:22:36
>
>To: Unicode List <unicode@unicode.org>
>cc:
>Subject: Re: Normalization Form KC for Linux
>
>
>
>
>
>À 12:20 1999-08-18 -0700, Kenneth Whistler a écrit :
>>Markus Kuhn wrote:
>>> encoding text in Unicode under Linux should be Normalization Form KC as
>>> defined in Unicode Technical Report #15
>>> <http://www.unicode.org/unicode/reports/tr15/>.
>>
>>My only concern is that Normalization Form C (rather than KC) might
>>be more appropriate.
>
>Form C is in fact the form chosen by the W3C "Character Model for the WWW"
>(http://www.w3.org/TR/WD-charmod). This is not final (still a WD - working
>draft) but is likely to stick, IMHO. I think that Linux should have good
>reasons to choose KC instead of C.
>
>>In my opinion, Form C is the more appropriate for general use on the
>>Internet (and in Linux).
>
>Yep.
>
>
>--
>François Yergeau
>
>


