Re: Normalization Form KC for Linux

From: Frank da Cruz (fdc@watsun.cc.columbia.edu)
Date: Mon Aug 30 1999 - 15:22:49 EDT


> Maybe I really should shut up... I guess I'm bitterly disappointed in how
> the Unix and Posix community has not grasped the Unicode textual concepts
> and progressed or led the way in all of this. The community seems so
> insular and fossilized, when there are so many good things about Unix that
> have been poorly imitated by other popular platforms. These days the
> industry is moving right along doing all kinds of interesting display and
> many scripts & languages, while the academic Unix (and Posix) folks are
> complaining that it's too hard or can't be done at all.[1]
>
I think this is reflective of the overall situation with computing today.
You can only change what you can control. In a monolithic environment like
Windows or the Macintosh, a single company has control and can do what it
likes, but perhaps more to the point, these are closed boxes in which the
application has more or less direct access to the keyboard, screen, fonts,
and font info -- all the pieces of the puzzle.

Contrast this with Unix. First of all (obviously) there is not just one
Unix, but many of them (the UNIX C-Kermit makefile alone currently contains
about 500 targets). Nobody controls all this. Each vendor goes their own
way at their own pace. The many well-known utilities (command-line or
"video") have long since "forked". The existing code base is staggering,
and most of it is nondisclosed (Linux, *BSD, etc., are the exception to
"nondisclosed", if not to "forked").

Makers of third-party applications for Unix (and VMS, etc), if they want to
move forward, can't (in most cases) depend on the underlying platforms for
assistance. Even when they can, such assistance is inconsistent, forcing
them to develop their own portable tools and libraries, which tend to meet
their immediate needs but fall short of Nirvana.

Perhaps more to the point, however, is the fact that Unix (like VMS and other
"traditional" platforms) is open to many kinds of access: the workstation
console, usually some sort of GUI (also on the console), X (on the console
or from a remote X server), and then plain old character-mode remote access
via modem, Telnet, Rlogin, X.25 PAD, and the like. The latter mode, which
is branded "legacy" as if it had no value or place in the modern world, is
(I like to maintain, and with good reason) seeing wider use now than ever
before. Although many wish it would go away, others would like to stay
active in this area and serve the people who depend on it, not only for
old time's sake, but also because it is a legitimate, viable, and open form
of access that everybody should be able to fall back upon as the more
advanced and "interesting" forms change out from under them with bewildering
speed.

When access is this open -- which is a *good* thing -- no particular entity
has control over the user interface. It is a matter of coordinating the
behavior of intrinsically unrelated processes. So questions come up here
that never bother us when we are writing (say) a word processor. Which
end handles bidirectionality of Hebrew? Which end is responsible for the
detailed appearance of the screen? And now the questions of pre- and
decomposition.
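The composition question is concrete: the same accented letter can reach the
terminal either precomposed, as one code point, or decomposed, as a base
letter plus a combining mark. A minimal Python sketch (the code points are
standard Unicode; the example is illustrative, not part of the original
message):

```python
# "e with acute" in its two canonically equivalent encodings.
precomposed = "\u00e9"      # one code point: U+00E9
decomposed = "e\u0301"      # two code points: "e" + combining acute U+0301

# They render identically, yet compare unequal as raw code point sequences,
# which is exactly why the two ends must agree on a canonical form.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 1 2
```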

Makers of third party applications only control one piece. The underlying
platform is likely not to have any Unicode support at all (VMS, most UNIXes,
IBM mainframes, etc), so the extent to which we support Unicode in our
applications depends on the hosts that we access with them. In the case of
terminal emulation (xterm, Kermit, etc), the host may not be executing any
form of BIDI algorithm, or ensuring some canonical form for composed
characters, simply because it is totally ignorant of such matters. It does
not necessarily follow that the terminal can compensate on its own: for
applications where the screen is treated as a matrix of boxes in which the
location of different items must be known and fixed (and this can include
dumb scrolling applications that display text in columns), the host and
terminal must cooperate.
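The screen-as-matrix problem can be made concrete: a host program that pads
columns by counting code points will misalign the display if the text is
decomposed, because a combining sequence is several code points in one screen
cell. A hedged Python sketch of that mismatch (my illustration, not from the
original message):

```python
import unicodedata

# "résumé" sent in decomposed form: each acute is a separate combining mark.
decomposed = "re\u0301sume\u0301"
nfc = unicodedata.normalize("NFC", decomposed)

# A naive host counting code points sees 8 "characters"; the terminal
# displays only 6 cells. After NFC, the counts agree for this string.
print(len(decomposed))  # 8
print(len(nfc))         # 6
```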

ISO 10646 includes the concepts of levels of compliance, including
Implementation Level 1 in which combining characters are not allowed.
Unicode Normalization Form C tends to amount to the same thing. If these
"subsets" were never meant to be used, they should not have been defined. But in
fact, I believe they are useful in open-access environments where control is
distributed among "loosely cooperating" processes. Perhaps there is indeed
a tradeoff between open access and the ability to support complex scripts --
if not in theory, then almost certainly in practice.
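To see why Normalization Form C only *tends* to amount to Implementation
Level 1: NFC replaces a combining sequence with a single code point wherever
a precomposed character exists, but where none exists the combining mark
survives. A short Python sketch (illustrative, not part of the original
message):

```python
import unicodedata

# "A" + combining ring above composes to the single code point U+00C5.
print(unicodedata.normalize("NFC", "A\u030a") == "\u00c5")  # True

# "q" + combining acute has no precomposed form, so NFC leaves the
# two-code-point sequence intact -- NFC is not strictly Level 1.
print(len(unicodedata.normalize("NFC", "q\u0301")))  # 2
```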

Of course, we do have one example of Unix taken to the next level: Plan 9.
But even there -- where all text, even internally, is UTF-8 -- we still see
no provision for BIDI or combining sequences: Implementation Level 1 in
action. Everyone agrees it would be better to have no restrictions, but so
far I don't think anybody has considered the plain-text terminal-host
access model sufficiently to find a way around them.

- Frank



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:51 EDT