Re: Normalization Form KC for Linux

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Aug 18 1999 - 15:05:52 EDT


Markus Kuhn wrote:

> I was never too happy with the UCS implementation levels, and after
> reading Unicode Tech Report #15, I think I have now seen the light and I
> have just added in
>
> http://www.cl.cam.ac.uk/~mgk25/unicode.html
>
> in section "How should Unicode be used under Linux?" the following
> paragraph:
>
> One day, combining characters will surely be supported under Linux, but
> even then the precomposed characters should be preferred over combining
> character sequences where available. More formally, the preferred way of
> encoding text in Unicode under Linux should be Normalization Form KC as
> defined in Unicode Technical Report #15
> <http://www.unicode.org/unicode/reports/tr15/>.
>
> I hope this recommendation meets general approval.

My only concern is that Normalization Form C (rather than KC) might
be more appropriate.

The examples in UTR #15 tend to make Form KC look cleaner (compatibility
ligatures get decomposed into letter sequences, Roman numeral
compatibility characters end up normalized to regular letters, halfwidth
katakana sequences get composed), but you also need to consider what you
normalize away in Form KC. When you do the compatibility decomposition
for Form KC, you lose the distinction between fullwidth and halfwidth
characters (which may cause problems for interoperating with legacy
CJK systems). Circled symbols in the 32XX block get their circles
normalized away, which might not be the intention. Letterlike symbols
in the 21XX block get normalized away, which might also not be the
intention. Spaces of different widths all get normalized to SPACE. And
so on.
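For illustration, the kinds of distinctions Form KC erases can be seen
with Python's unicodedata module (a sketch of my own; the example
characters are my choices, not from the mail above):

```python
import unicodedata

# Each character below survives Form C unchanged but is folded
# by the compatibility decomposition in Form KC.
examples = [
    "\uFF21",  # FULLWIDTH LATIN CAPITAL LETTER A -> "A"
    "\uFF76",  # HALFWIDTH KATAKANA LETTER KA -> fullwidth KA
    "\u2460",  # CIRCLED DIGIT ONE -> "1" (circle normalized away)
    "\u2122",  # TRADE MARK SIGN (21XX letterlike) -> "TM"
    "\u2003",  # EM SPACE -> ordinary SPACE
    "\u2160",  # ROMAN NUMERAL ONE -> "I"
]
for ch in examples:
    print("U+%04X  NFC=%r  NFKC=%r"
          % (ord(ch),
             unicodedata.normalize("NFC", ch),
             unicodedata.normalize("NFKC", ch)))
```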

In my opinion, Form C is the more appropriate form for general use on
the Internet (and in Linux). It normalizes canonical equivalences but
leaves compatibility characters alone. Since most compatibility
characters are there for compatible interoperability with some
legacy character set, this might be less problematic than trying
to normalize them all away.
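Concretely (again a Python sketch, with characters of my own choosing):
Form C composes a canonical combining sequence into its precomposed
form, but leaves a compatibility character such as a fullwidth letter
untouched:

```python
import unicodedata

# Canonical equivalence: "e" + COMBINING ACUTE ACCENT composes
# to the precomposed U+00E9 under Form C.
decomposed = "e\u0301"
assert unicodedata.normalize("NFC", decomposed) == "\u00E9"

# Compatibility character: FULLWIDTH LATIN CAPITAL LETTER A is
# left alone by Form C (preserving legacy CJK round-trips),
# while Form KC folds it to plain "A".
fullwidth_a = "\uFF21"
assert unicodedata.normalize("NFC", fullwidth_a) == "\uFF21"
assert unicodedata.normalize("NFKC", fullwidth_a) == "A"
```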

> I would even suggest
> that programs such as less and ls could be extended to replace
> characters on output by \xx hex escape sequences if they find in file
> names or text files characters that are not conforming to Normalization
> Form KC, such that these potential trouble-makers can be spotted more
> easily by users.
>
> It might be a very nice idea to have all the Unicode Normalization forms
> added to GNU recode or iconv.

Yes.
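Such a normalization mode could be quite small. Here is a rough sketch
of my own (the filter name and command-line shape are hypothetical, not
an actual recode or iconv interface), again using Python's unicodedata:

```python
import sys
import unicodedata

def normalize_lines(lines, form="NFC"):
    """Yield each input line normalized to the given Unicode form."""
    for line in lines:
        yield unicodedata.normalize(form, line)

# Hypothetical stdin-to-stdout filter usage:
#   python3 normalize.py NFC < in.txt > out.txt
def main(argv=sys.argv):
    form = argv[1] if len(argv) > 1 else "NFC"
    sys.stdout.writelines(normalize_lines(sys.stdin, form))
```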

--Ken

>
> Markus



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:51 EDT