Kenneth Whistler wrote on 1999-08-27 22:33 UTC:
> But in the Unicode world this does not work. You have to architect
> the layers:
> Layer 1: Map the plain text characters into a rendering space (implies
> smarts about scripts, a non one-to-one character to glyph
> mapping, information about the font metrics, and bidi layout).
> Layer 2: Embed the glyph vectors into the control code framework for
> terminal control and cursor positioning.
> Layer 1 is host business entirely. Only there do you have access to
> the plain text store and a sufficient model of the text to do the
> right thing.
> Layer 2 can be modeled on the current terminal control protocols. You
> just need to be aware of the fact that you are dealing with glyph
> codes that map into the terminal display fonts -- *not* with characters.
We certainly agree that Unicode requires, for some scripts, a somewhat
non-trivial processing step between the memory representation and the
glyph sequence that shows up on the screen in the end. We will probably
end up with more GUI-like libraries (comparable to, say, the [n]curses
library) that sit between the host application and the terminal
emulator and keep track of things like the cursor split necessary
for bidi rendering, etc. I increasingly get the feeling that handling
Hebrew, Arabic, and the various Indic scripts will not be feasible by
extending the terminal semantics alone, in a way that still allows us to just
dump the memory representation to the terminal with printf() and have it
somehow magically sort everything out in real time. I can well imagine that
this works for combining characters, which have fairly simple
semantics: all the state required to interpret a combining character is
the cell coordinates of the last printed character, which a terminal
can easily save. But I already have doubts for Arabic, and most
certainly for the Indic scripts.
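That combining-character bookkeeping can be illustrated with a minimal
sketch (a toy model written for this argument, not any real terminal
emulator's implementation; the ToyTerminal class and its names are
invented for illustration):

```python
import unicodedata

class Cell:
    """One character cell on a toy terminal screen."""
    def __init__(self):
        self.grapheme = " "

class ToyTerminal:
    """Toy model of the state described above: the emulator only needs
    to remember the coordinates of the last printed base character, so
    a combining character can be merged into that cell after the fact."""
    def __init__(self, cols=80, rows=24):
        self.cells = [[Cell() for _ in range(cols)] for _ in range(rows)]
        self.row, self.col = 0, 0
        self.last = None  # (row, col) of the last printed base character

    def put(self, ch):
        if unicodedata.combining(ch):
            # Combining mark: attach it to the previously printed cell
            # (if any) instead of advancing the cursor.
            if self.last is not None:
                r, c = self.last
                self.cells[r][c].grapheme += ch
        else:
            self.cells[self.row][self.col].grapheme = ch
            self.last = (self.row, self.col)
            self.col += 1  # line wrapping omitted in this sketch

term = ToyTerminal()
for ch in "a\u0308b":  # 'a' + COMBINING DIAERESIS + 'b'
    term.put(ch)
```

After the loop, cell (0,0) holds the two-code-point grapheme "a" plus
the diaeresis, and "b" occupies the next cell. Arabic shaping or Indic
reordering cannot be handled with such purely local state, which is
exactly the point above.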
The answer will simply be that the traditional dump-memory-to-terminal
printf() applications (cat and echo are typical trivial representatives of
this class) will not work with such scripts.
However, we can make a large number of scripts accessible under Unix in
the old non-layered model rather easily, and there is no reason not to
do it. It would in my opinion be a fatal mistake to stay with ISO 8859
instead of moving to UTF-8, just because we currently shy away from
writing a full Devanagari renderer for xterm.
My reasons for staying with precomposed characters in the Unix non-GUI
environment for quite some time are:
- The current font infrastructure does not provide the glyph
annotations necessary for good automatic placement of combining
characters. We therefore have to work with precomposed glyphs
and would have to apply Normalization Form C before display.
- Many applications have to count characters in strings. This
is trivial with both ISO 8859-1 (count bytes) and UTF-8
(count bytes, except those in the range 0x80-0xBF), but it
becomes more complicated and requires table lookups once
combining characters are introduced. We can't expect all
applications to change overnight to more sophisticated text
access techniques, and there will be heavy resistance if we
take away beloved simple output methods such as printf().
- There is no immediate advantage in using combining characters.
They require more storage, have (at the moment) to be recomposed
before display anyway, and only save (arguably) a few CPU cycles in
algorithms such as collation.
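The recomposition step from the first point corresponds to Normalization
Form C, which can be sketched with Python's standard unicodedata module
(a modern convenience used purely for illustration, obviously not part
of the Unix terminal environment discussed here):

```python
import unicodedata

# 'a' followed by U+0308 COMBINING DIAERESIS: the decomposed form.
decomposed = "a\u0308"

# Normalization Form C recomposes it into the single precomposed
# character U+00E4 LATIN SMALL LETTER A WITH DIAERESIS.
precomposed = unicodedata.normalize("NFC", decomposed)

assert len(decomposed) == 2
assert len(precomposed) == 1
assert precomposed == "\u00e4"
```

A display layer that only has precomposed glyphs in its fonts would have
to perform exactly this mapping before looking up glyphs.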
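The counting rule from the second point (count bytes, except those in
the range 0x80-0xBF, i.e. skip UTF-8 continuation bytes) can be sketched
as follows; the function name is invented for illustration. Note that
this counts code points, and once combining characters appear, a
user-perceived character may consist of several code points, which is
precisely the complication described above:

```python
def utf8_char_count(data: bytes) -> int:
    """Count the characters in a UTF-8 byte string by skipping
    continuation bytes, i.e. bytes in the range 0x80-0xBF."""
    return sum(1 for b in data if not 0x80 <= b <= 0xBF)

# "ägypten" is 7 characters but 8 bytes in UTF-8, because the
# precomposed "ä" encodes as two bytes (0xC3 0xA4).
assert utf8_char_count("ägypten".encode("utf-8")) == 7
assert utf8_char_count(b"hello") == 5
```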
More philosophical (and therefore more fun to discuss :):
- I also fail to see why a decomposed form should be in any way
more natural. I see the decomposed form more as a technically necessary
brief intermediate step for rendering with fonts that achieve
compression by storing commonly occurring glyph fragments (e.g., the base
glyphs and accents, hooks, descenders, etc.) separately and combining
them only on demand at rendering time. The choices about which
glyph components (and yes, we are talking about glyphs and not characters
here) deserve to become Unicode characters in their own right do not
appear very systematic to me, and seem more influenced
by historic perception than by a clean technical analysis. I have to
agree with the argument that there is no reason why "ä" can be decomposed
into a + ¨, but "i", "j", ";", and ":" can't be decomposed into a
sequence with a combining dot-above character. After all, all of
them also exist without the dot above, and many also with
many other things above (iìíîï). Why isn't Q represented as an
O with a lower-left stroke? Because all these precomposed characters
have simply stopped being perceived as composed by those who
designed Unicode and its predecessors (ASCII, Baudot, Morse, etc.).
Nevertheless, G is historically a C combined with a hook, W is
two Vs (or Us) with negative space in between, + is just a
"not -" and therefore crossed out, $ = S + |, and @ is just an
"a" in a circle. It would only be fair to decompose ASCII before
you start treating the ä as a second-class citizen. :)
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>