Re: Normalization Form KC for Linux

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Aug 27 1999 - 18:33:45 EDT


Frank commented:

> This raises issues discussed long ago. The model of base character
> followed by nonspacing diacritics (as opposed to the other way around)
> does not mesh well with terminal/host communication, where incoming
> characters must be displayed in realtime. A letter A arrives and the
> terminal displays it in the current position and moves the cursor to the
> next position, which might be on a new line due to screen wrap, and
> this, in turn might cause scrolling, in some cases even off the screen
> due to narrow vertical margins. Then a nonspacing acute accent arrives.
> At this point, the terminal has to find where it left the A and change
> it to something else, but this time avoid the wrap since it was done
> previously, and then put the cursor back where it was before. (The
> situation might be even more confusing when the host controls wrapping.)
>
> This is an awful lot of work and screen changing for little benefit when
> precomposed characters are already available. In any case, changing a
> character after having already drawn it is not the best "human
> engineering" -- terminal users are not accustomed to having to reread
> text already read in case it changed, and this will be especially
> noticeable when a congested network delays the arrival of a diacritic.
>

I tried to make a point about this terminal business a couple weeks
ago when bidi was the issue. Terminal drivers do not deal with
plain text in the sense that Unicode means it -- they embed characters
in control streams designed to put glyphs and cursors into designated
positions on the screen.

The basic fallacy displayed here is the confusion
of characters and glyphs. In the old 8-bit world, everything tended to
be designed so that character equalled glyph, so you could talk about
displaying the "characters" on the terminal, and could look into the
data stream and find the characters there, neatly aligned with their
display. The non-spacing character first model for handling composition
on terminals was just an epicycle on the basic model.

But in the Unicode world this does not work. You have to architect
the layers:

Layer 1: Map the plain text characters into a rendering space (implies
         smarts about scripts, a character-to-glyph mapping that is not
         one-to-one, information about the font metrics, and bidi layout).

Layer 2: Embed the glyph vectors into the control code framework for
         terminal control and cursor positioning.

Layer 1 is host business entirely. Only there do you have access to
the plain text store and a sufficient model of the text to do the
right thing.

Layer 2 can be modeled on the current terminal control protocols. You
just need to be aware of the fact that you are dealing with glyph
codes that map into the terminal display fonts -- *not* with characters.
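
The two layers above can be illustrated with a minimal Python sketch. It
is not a real rendering engine: the cluster rule (a base character plus
any following characters of general category M) is a simplification, and
the glyph table, glyph codes, and function names are hypothetical. Only
the cursor-positioning sequence (ESC [ row ; col H) is a standard
terminal control.

```python
import unicodedata

def to_clusters(text):
    """Layer 1 (host): group each base character with the marks that
    follow it, so that one cluster corresponds to one display unit."""
    clusters = []
    for ch in text:
        # General categories Mn, Mc, Me are combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch       # attach the mark to its base
        else:
            clusters.append(ch)      # start a new cluster
    return clusters

def to_glyph_codes(clusters, glyph_table):
    """Layer 1, continued: map clusters to glyph codes of a
    (hypothetical) terminal display font."""
    return [glyph_table[c] for c in clusters]

def position_glyphs(glyph_codes, row, col):
    """Layer 2: pair each glyph code with a cursor-positioning control,
    one display cell per glyph (an assumption of this sketch)."""
    return [(f"\x1b[{row};{col + i}H", g) for i, g in enumerate(glyph_codes)]

# Hypothetical font: glyph 0x10 draws A-acute, glyph 0x11 draws b.
GLYPHS = {"A\u0301": 0x10, "b": 0x11}

stream = position_glyphs(to_glyph_codes(to_clusters("A\u0301b"), GLYPHS), 1, 1)
```

What the terminal receives in this sketch is the list of (control, glyph
code) pairs. It never sees the acute accent as a separate item, so it
never has to go back and redraw the A.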

Instead of arguing incessantly about how bad combining marks are for
Latin, and how Unix, xterm, or whatever *must* adopt
Normalization Form C (or KC), try to consider what you
would have to do to make xterm work for Devanagari or Tibetan. If you
cannot figure out how to input and display Figure 2-2 from the
Unicode Standard (including intermediate renderings on input) then
the display model for xterm (or whatever) is just broken.
Acknowledge the limitation, and don't try to use it as a hammer to
keep beating on combining marks.
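
As a small illustration of why normalization alone does not settle the
matter, the following Python sketch (using the standard unicodedata
module) compares a Latin and a Devanagari sequence under Normalization
Form C. The variable names are illustrative only.

```python
import unicodedata

latin = "A\u0301"        # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT
deva = "\u0915\u093F"    # DEVANAGARI LETTER KA + DEVANAGARI VOWEL SIGN I

nfc_latin = unicodedata.normalize("NFC", latin)
nfc_deva = unicodedata.normalize("NFC", deva)

# Latin composes to a single precomposed code point (U+00C1).
latin_len = len(nfc_latin)

# Devanagari has no precomposed form for this pair, so the vowel sign
# remains a separate character after normalization.
deva_len = len(nfc_deva)
```

Here latin_len is 1 and deva_len is 2. The vowel sign I is also drawn to
the left of the consonant it follows in the text, so a display model
that draws each incoming character in the next cell cannot handle it,
whatever normalization form the stream is in.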

--Ken


