Re: Normalization Form KC for Linux

From: Frank da Cruz (fdc@watsun.cc.columbia.edu)
Date: Fri Aug 27 1999 - 17:32:56 EDT


> Rick McGowan <rmcgowan@apple.com>:
>
> >> More formally, the preferred way of encoding text in Unicode under
> >> Linux should be Normalization Form KC as defined in Unicode
> >> Technical Report #15
>
> RM> Gosh, I don't approve. And I've been using Unix systems for many
> RM> years. The most flexible kind of implementation would prefer
> RM> decomposed sequences. In any case, enlightened systems would
> RM> accept anything and massage as needed to fit the particular
> RM> application instead of forcing (or "suggesting") the user to run
> RM> everything through the meat grinder first...
>
> As I understand it, Markus was speaking about the interchange formats,
> including, but not limited to, file formats and IPC formats.
>
And, I think, xterm and the Linux console driver.

> It is
> expected that simple applications will only be able to accept
> precomposed forms, while enlightened ones (I like the term) will
> accept anything. Therefore, requesting that applications *write*
> precomposed forms in preference to combining characters maximises the
> chances of interchange between simple and complex applications.
> Complex applications are still expected to accept arbitrary combining
> characters; they just should avoid producing them whenever possible.
>
This raises issues discussed long ago. The model of base character
followed by nonspacing diacritics (as opposed to the other way around)
does not mesh well with terminal/host communication, where incoming
characters must be displayed in realtime. A letter A arrives and the
terminal displays it in the current position and moves the cursor to the
next position, which might be on a new line due to screen wrap, and
this, in turn, might cause scrolling, in some cases even off the screen
due to narrow vertical margins. Then a nonspacing acute accent arrives.
At this point, the terminal has to find where it left the A and change
it to something else, but this time avoid the wrap since it was done
previously, and then put the cursor back where it was before. (The
situation might be even more confusing when the host controls wrapping.)
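The bookkeeping described above can be sketched with a toy terminal model (a modern Python illustration, not anything from the original discussion; the screen width, class, and method names are all invented for the sketch):

```python
import unicodedata

COLS = 4  # tiny screen width so line wrap happens quickly

class Terminal:
    """Toy model of a terminal that wraps the cursor as soon as a
    line fills, then must backtrack when a combining mark arrives."""

    def __init__(self):
        self.rows = [[]]   # each cell holds a base char plus any marks
        self.cur = 0       # index of the row the cursor is on

    def receive(self, ch):
        if unicodedata.combining(ch):
            row = self.rows[self.cur]
            if not row:
                # The base character triggered a wrap: it sits at the
                # end of the previous line, so go back up to find it.
                row = self.rows[self.cur - 1]
            row[-1] += ch  # redraw the cell in place -- and take care
                           # not to wrap or advance the cursor again
        else:
            self.rows[self.cur].append(ch)
            if len(self.rows[self.cur]) == COLS:  # wrap (maybe scroll)
                self.rows.append([])
                self.cur += 1

    def render(self):
        return ["".join(r) for r in self.rows]

t = Terminal()
for ch in "abcA\u0301e":   # 'A' wraps the line; the acute arrives late
    t.receive(ch)
print(t.render())          # ['abcÁ', 'e']
```

Even in this stripped-down model, the combining-mark branch has to special-case the wrap; a real terminal would also have to worry about scrolling and host-controlled wrapping.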

This is an awful lot of work and screen changing for little benefit when
precomposed characters are already available. In any case, changing a
character after having already drawn it is not the best "human
engineering" -- terminal users are not accustomed to having to reread
text already read in case it changed, and this will be especially
noticeable when a congested network delays the arrival of a diacritic.

At the very least the redrawing of terminal-screen characters is likely
to introduce unwanted and perhaps harmful flicker (I'm sure you've all
read about the dangers of the "critical flicker frequency", e.g. to
those who drive along those picturesque tree-lined French country roads
at sunset :-).

There are no such problems when reading Unicode data from a file, since
we can always look ahead and collect all the diacritics before deciding
which character to show, with no delays or deadlocks.
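That look-ahead strategy takes only a few lines (again a modern Python sketch; the function name is invented):

```python
import unicodedata

def clusters(text):
    """Buffer each base character together with the combining marks
    that follow it, emitting the precomposed (NFC) form only once the
    whole cluster has been seen."""
    buf = ""
    for ch in text:
        if buf and not unicodedata.combining(ch):
            yield unicodedata.normalize("NFC", buf)  # cluster complete
            buf = ""
        buf += ch
    if buf:
        yield unicodedata.normalize("NFC", buf)

print(list(clusters("A\u0301bc\u0327")))   # ['Á', 'b', 'ç']
```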

Very few terminals are designed to allow composition of characters, and
the few that are do so for good reason (e.g. the ALA bibliographic
character set, APL). I can't say whether composition is accomplished
by having the combining characters come before or after the base
character in these cases, but I suspect it's before.

This is not to grumble about the final Unicode / ISO 10646 design, but
to suggest that there can be valid reasons for preferring precomposed
versions of characters to decompositions when there is a choice, and
perhaps even for requiring it in certain applications such as terminal
sessions.
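Where a conversion step is possible, the precomposed preference is cheap to apply; a minimal sketch using Python's unicodedata module (an anachronism relative to this 1999 thread, shown purely for illustration):

```python
import unicodedata

s = "A\u0301 \ufb01"   # decomposed A-acute, then the 'fi' ligature
print(unicodedata.normalize("NFC", s))    # 'Á ﬁ' -- composes, keeps the ligature
print(unicodedata.normalize("NFKC", s))   # 'Á fi' -- also folds compatibility chars
```

NFC captures the precomposed-character preference argued for here; NFKC, the form proposed in the quoted text, additionally replaces compatibility characters such as ligatures per Unicode Technical Report #15.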

There are countless (not as in "infinite" but as in "who can count?")
versions of UNIX and lots of other non-UNIX platforms that use
traditional character sets where we'd like to see Unicode make some
headway. But there is little chance we can expect the keepers of
all these diverse platforms to rip them apart from the bottom up to
replace the character handling model at every level to accommodate
composition of characters -- even if that's the right thing to do --
without breaking access to their platforms for users of
traditional methods. It might be better to let Unicode get its
foot in the door without upsetting everything and then grow of its
own accord -- i.e. according to user demand.

- Frank



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:51 EDT