Re: Normalization Form KC for Linux

From: Rick McGowan (rmcgowan@apple.com)
Date: Fri Aug 27 1999 - 22:36:04 EDT


Taking up another bit of the puzzle... There are two types of "tool"
processes that are used in Unix environments. One is the kind of program
that processes some data and produces "lines" of "plain text" in some
encoding -- examples are shells and other command line interpreters, ls, cat,
echo, etc. The other type is the cursor-positioning intensive application,
like "vi", "emacs" and others.

There isn't any reason why you can't have line-oriented text display
applications, like terminal emulators without cursor positioning, which can run
a shell, display directory listings, etc... all perfectly well able to deal
with Unicode text in composed or decomposed or UTF-8 or whatever. The
display software that transforms those lines into some kind of long scrolling
text object will have to worry about how to do the proper char->glyph
transformations for the lines. The command line tools shouldn't care -- if
they do care, then the implementors haven't understood something fundamental.

The cursor-positioning types of programs really are more difficult to deal
with precisely because they depend on this one-to-one character-in-box
concept. But if you look at the problem slightly differently from the
traditional "text editor" visual tty application, it isn't that hard to come
up with something that works as a reasonable replacement for the text editors
used with terminal emulators.

I'm not saying this replaces the Traditional Terminal Emulator that is still
needed for communication with "legacy" applications in other encodings
on large hosts, etc., etc. But those aren't the focus of where environments
are going.

In the environment I work in, which is BSD 4.4-based, we have display
systems that can handle the shell-type interactions, and the text-editor-type
interactions are written on top of a text processing/display platform. Both
kinds of tools are available to the user, dealing with Unicode data. (I'll
admit we don't have such a great Unicode story with the terminal emulator we
supply -- just your basic UTF-8 stuff, and all the usual problems that Markus
& Frank keep talking about.) If you want to run "emacs" as-is, using a
terminal emulator, you get just what emacs is -- and you have those
limitations. That's OK too, as long as you don't EXPECT to be able to use
any arbitrary script with your emacs. We don't -- we do that elsewhere with
other editors.

So Markus says...

> we can make a large number of scripts accessible under Unix in
> the old non-layered model rather easily, and there is no reason for not
> doing it.

Of course you can make lots of scripts available in the non-layered model.
It's easy to do. But where do you go from there? This "new" layered model
has been around for years and years, and there don't appear to be many
Unix/Posix systems that are even making an attempt to build a realistic "new
model" infrastructure to go with the command-line type of tool environment.
I just think it's kind of weird that the Unix crowd seems so unwilling or
unable to figure out how to really move the good parts of Unix forward into
the brave new world of non-Latin scripts and sophisticated text displays.

> The current font infrastructure does not provide the glyph annotations
> necessary for automatic good placement of combining characters.

Well, that's a lame excuse. I've given two poorly attended talks at Unicode
conferences and discussed that very issue, explaining how we have been doing
it around here with PostScript & TrueType on Unix since about 1991 -- in the
absence of "necessary" annotations in the fonts. OK, it's not perfect, but
it's not so bad either; and if you have better metrics, you get better
rendering.

> - Many applications have to count characters in strings. This
> is trivial with both ISO 8859-1 (count bytes) and UTF-8
> (count bytes, except those in the range 0x80-0xBF), but it
> becomes more complicated and requires table lookup with the
> introduction of combining characters.

Many processes and libraries in Unix environments are unnecessarily hung up
on counting characters. I guess that's because C programmers didn't take
hints from Smalltalk or other environments that use "string" objects with
powerful APIs, even though that technology has been available for what, 25
years?
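
(For what it's worth, the byte-skipping rule quoted above really is a
one-liner. Here is a minimal sketch in C -- my own illustration, nobody's
library code -- which also shows what it does NOT do, namely treat a base
plus combining marks as one unit:

    #include <stddef.h>

    /* Count code points in a UTF-8 string by skipping continuation
     * bytes (those in the range 0x80-0xBF), per the rule quoted above.
     * Assumes valid UTF-8; a combining sequence still counts as
     * several "characters". */
    size_t utf8_count(const char *s)
    {
        size_t n = 0;
        for (; *s != '\0'; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                n++;
        return n;
    }

That is the whole table-free part; it's the combining sequences that push
you up into a real string API.)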

> We can't expect all applications to change overnight to more sophisticated
> UI access techniques

Well, it hasn't been overnight by any means! I've been working on Unix
systems for 20 years, and at least the last 10 years have been devoted to
more sophisticated UI access techniques. What has the rest of the Unix/Posix
community been doing? They have apparently been arguing about LC_CTYPE and
isupper() or trying to shoe-horn all writing systems into one-to-one
correspondence with terminal cells.

> There is no immediate advantage from using combining characters.

Certainly there is, if you're doing anything with linguistics, some bits of
trivial math, an "unusual" language, etc. The presence of even simple
combining mark capability in the display opens up possibilities for plenty of
languages without any extra support being required beyond a typical Latin
font.
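
To make that concrete, here is a toy sketch (mine, purely illustrative) of
what such a tool might emit: decomposed base-plus-mark sequences as plain
UTF-8. A display layer with even simple mark placement can show these with
nothing beyond an ordinary Latin font:

    #include <stdio.h>

    /* Emit "e" + COMBINING ACUTE ACCENT (U+0301) and "n" + COMBINING
     * TILDE (U+0303), decomposed, as raw UTF-8 bytes on stdout. */
    int main(void)
    {
        fputs("e\xCC\x81 n\xCC\x83\n", stdout);
        return 0;
    }

The tool just writes lines; the char->glyph work stays in the display
layer, which is exactly the layering argued for above.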

So Markus goes on...

> - I also fail to see why a decomposed form should in any way
> be more natural.
> It would be just fair to decompose ASCII before
> you start treating the ä as a second-class citizen. :)

It's not going to do us any good to go into a rathole arguing about
decomposing the dot on the i or tail on the Q. If you have been around even
a year you must have been exposed to a lot of reasons why certain characters
end up in the encoding. An international standard has to be able to cater to
a certain amount of existing data or nobody is going to use it. That's
pretty obvious. Of course there are compromises. But that's beside the
point... If Unicode had been done with only purity and light in 1990, there
wouldn't be any precomposed Latin, Greek, Cyrillic, or Hangul -- but
probably that would not have been acceptable to nearly as many people around
the world. And part of the whole point of this exercise is to have a single
character encoding standard. Right?
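
Just so the ä isn't left abstract: the precomposed and decomposed spellings
are canonically equivalent, yet the bytes differ, which is precisely what a
normalization form is supposed to paper over. A minimal sketch (again my own
illustration) of the two UTF-8 spellings:

    #include <stdio.h>
    #include <string.h>

    /* Precomposed U+00E4 versus decomposed U+0061 U+0308 -- the same
     * abstract character, different byte sequences. */
    int main(void)
    {
        const char *precomposed = "\xC3\xA4";   /* U+00E4 */
        const char *decomposed  = "a\xCC\x88";  /* U+0061 U+0308 */
        printf("precomposed: %u bytes, decomposed: %u bytes\n",
               (unsigned) strlen(precomposed),
               (unsigned) strlen(decomposed));
        return 0;
    }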

I've used Unix since 1978, when it was an unknown maverick OS used by a
bunch of long-hairs at Berkeley, and it was such a mysterious and elite
environment that you couldn't even BUY it; you had to get it on one-off mag
tapes from Bell Labs. And then, suddenly it became the fashionable vanguard
-- the standard to which all environments should aspire...

Maybe I really should shut up... I guess I'm bitterly disappointed in how
the Unix and Posix community has not grasped the Unicode textual concepts and
progressed or led the way in all of this. The community seems so insular
and fossilized, when there are so many good things about Unix that have been
poorly imitated by other popular platforms. These days the industry is
moving right along, doing all kinds of interesting display work with many
scripts & languages, while the academic Unix (and Posix) folks are
complaining that it's too hard or can't be done at all.[1]

Oh sigh...

        Rick

-- -- -- --

[1] Except for Cap'n Leisher and his Merry Band of Pirates in New Mexico.


