Re: Normalization Form KC for Linux

From: Dan Oscarsson (Dan.Oscarsson@trab.se)
Date: Fri Aug 27 1999 - 05:08:40 EDT


>As Rick was saying elsewhere in that message, there is a lot of confusion
>about the various forms a text can take during its life. We have to be able
>to tell the difference between internal data structures inside running
>programs, storage formats, publishing formats, input, rendering,...
>
>I believe that the sense of the meeting has been that properly decomposed
>characters are easiest to analyze and transform, and composed **glyphs**
>give the best results for rendering, when available. I don't see anything
>in those two ideas that requires coding of precomposed characters in
>internal data structures or storage. If this is being proposed for
>publishing, I think it will help in some cases while we all wait for more
>capable software written by people who believe in the difference between
>characters and glyphs.
>

What is a character?

In Swedish we have a character which in ISO 10646 has the code 0xC5.
This is a vowel which, to English speakers, looks like
an A with a small ring above.
It is a single glyph, it is a single character, it is one letter; it
cannot be decomposed.
Likewise, for an English speaker, "i" is one character even though
the glyph looks as if it is composed of a small vertical bar and a dot
above.
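
To make the two representations of this letter concrete, here is a
minimal sketch. It assumes Python 3 and its standard unicodedata module,
neither of which is part of the original discussion. At the encoding
level Unicode does define a canonical decomposition for 0xC5, whatever
one thinks of it linguistically, and the sketch only shows what the
composed and decomposed forms look like as code values.

```python
import unicodedata

# U+00C5 is the single code value discussed above.
a_ring = "\u00c5"

# Canonical decomposition (NFD) splits it into a base letter plus a
# combining mark; canonical composition (NFC) joins them again.
decomposed = unicodedata.normalize("NFD", a_ring)
recomposed = unicodedata.normalize("NFC", decomposed)

print(unicodedata.name(a_ring))
print([f"U+{ord(c):04X}" for c in decomposed])
print(recomposed == a_ring)
```

The composed form is one code value, while the decomposed form is two:
the letter A followed by a combining ring above.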

I very much prefer to have letters represented in my software as
one code value, instead of long compose sequences that need a lot
of software to manage.

If you use Unicode Normalization Form C for transmission
and storage, you get a compact form that is easy to handle because
you know the order of the data. If you prefer to have everything decomposed
in your software, it is easy to decompose the data. On the other hand,
if everything is stored in an unnormalized format, you need a lot
of software just to normalize the data before you can work on it, and
it will take up more storage and bandwidth during transmission.
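
The storage argument can be checked with a small sketch, again assuming
Python 3 and its standard unicodedata module. The sample word and the
choice of UTF-8 as the storage encoding are illustrative assumptions,
not something taken from the discussion; other encodings give different
absolute sizes.

```python
import unicodedata

# Hypothetical unnormalized input: "ö" arrives precomposed, while "å"
# arrives as the letter "a" followed by a combining ring above.
text_mixed = "Sm\u00f6rga\u030asbord"

nfc = unicodedata.normalize("NFC", text_mixed)  # composed form
nfd = unicodedata.normalize("NFD", text_mixed)  # decomposed form

for label, s in (("mixed", text_mixed), ("NFC", nfc), ("NFD", nfd)):
    print(label, len(s), "code points,", len(s.encode("utf-8")), "bytes in UTF-8")

# Data stored as NFC is easy to decompose when software prefers that.
print(unicodedata.normalize("NFD", nfc) == nfd)
```

For this word the composed form is the shortest of the three in both
code points and bytes, and converting it to the decomposed form is a
single call.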

I am sure there are languages with alphabets for which a decomposed form
is natural, but I think there are more for which a composed and normalized
form is much better for handling in software.

   Dan


