Re: Normalization Form KC for Linux

From: Rick McGowan (rmcgowan@apple.com)
Date: Fri Aug 27 1999 - 17:09:11 EDT


Juliusz Chroboczek <jec@dcs.ed.ac.uk>...:

> It is expected that simple applications will only be able to accept
> precomposed forms

I'd have to ask Why?

> Complex applications are still expected to accept arbitrary combining
> characters; they just should avoid producing them whenever possible.

Why? That's about the opposite of what I'd argue. In my experience most of
the drudgery and complexity of display processing for Unicode is in dealing
with the multiple spellings, not with just decomposed or composed sequences.
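
To make "multiple spellings" concrete, here is a minimal sketch, assuming
Python's standard unicodedata module. The helper name canonically_equal is
made up for illustration. It shows one user-visible character stored two
ways, and a comparison that ignores which spelling was used.

```python
import unicodedata

# Two spellings of the same user-visible text, "e with acute accent".
composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same canonical form (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# A naive code point comparison treats the two spellings as different text.
naive_result = composed == decomposed
# Comparing after normalization treats them as the same text.
normalized_result = canonically_equal(composed, decomposed)
```

Any code that compares, searches or measures such strings code point by code
point has to cope with both spellings, whichever one happens to arrive.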

I guess maybe I should just shut up because my argument is really about
something different than normalization itself, it's about architectures that
require applications to care about particular details of data normalization.

What really appears to be going on in the world of Unix is that generally in
these systems the "legacy" or existing methods of string & character
handling are being bolstered to deal with this new kind of data for which
they are an inappropriate level of API. Instead of architecting them to
remove the need for application programmers to worry about all this detail,
the detail is being exported to the programmer in the same way that it was
when the encodings were "simpler". I think it's the wrong way to go about
the architecture.

As I see it, systems that require "all" applications to mess around with the
low-level details of what is or is not stored as a combining sequence in
some string that's passing through some process are mis-architected from the
start. Only the lowest level of data-streaming and I/O of file formats
should be dealing with that.

GUI & UI systems that sit on top of Unix foundations appear in general to be
architected in ways that expose details, like composition/decomposition
normalization of the data, excruciating details of codesets and data formats,
that should be of no concern to "applications" written on those platforms.
Unfortunately, the architects tend to get hung up on how to expose these
details by extensive APIs, and argue a lot about details that should be of
as little concern to "application programs" as assembly language is to Java
programs.

If one is going to re-write the set of typical Unix foundation-level tools,
I think there are better ways to write them and different kinds of API that
are more appropriate for better abstraction away from the minutiae of
character encodings and normalization. That would free the application
programmer from such details, instead of causing the application programmer
to be acutely concerned with such details.
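
As a rough illustration of that kind of abstraction, here is a minimal
sketch in Python. The class Text and its method encode_for are hypothetical,
not an existing API. It assumes the stored form is an internal choice (NFC
here, picked arbitrarily) and that a specific normalization form is applied
only at the I/O boundary, where a protocol or file format asks for one.

```python
import unicodedata


class Text:
    """Hypothetical text type that hides the stored normalization form."""

    def __init__(self, raw: str):
        # Internal storage form is an implementation detail (NFC is an
        # arbitrary choice for this sketch); callers never depend on it.
        self._data = unicodedata.normalize("NFC", raw)

    def __eq__(self, other):
        if not isinstance(other, Text):
            return NotImplemented
        return self._data == other._data

    def __hash__(self):
        return hash(self._data)

    def encode_for(self, form: str = "NFC", encoding: str = "utf-8") -> bytes:
        """Serialize in whatever form a given protocol or file format requires."""
        return unicodedata.normalize(form, self._data).encode(encoding)


# Application code compares text without knowing how either value was spelled.
same = Text("\u00e9") == Text("e\u0301")
# Only the I/O layer names a normalization form.
wire_nfd = Text("\u00e9").encode_for("NFD")
wire_nfc = Text("e\u0301").encode_for("NFC")
```

The application-level comparison never mentions composition or
decomposition; the form is named only where bytes leave the process.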

So when I see something like this:

> One day, combining characters will surely be supported under Linux,
>...
>> More formally, the preferred way of encoding text in Unicode under
>> Linux should be Normalization Form KC as defined in Unicode
>> Technical Report #15

It makes me cringe. <rant> This is saying that for everything written on
this entire OS -- all the UI, the tools, protocols, applications, etc. --
that should be the "preferred" way of encoding, simply because the display
model is broken and the architects have been going in the wrong direction
for years and wish to continue down that path because of the overwhelming
weight of their legacy code. </rant>

I think it's more appropriate to leave the specification of normalization
requirements up to particular protocols or functional groups, not "Unicode
under Linux" as a whole. In the long run, Linux would be much better off
going the opposite direction for most string & display handling. In my
opinion.

Enough ranting for the day...

        Rick


