Re: combining/fullwidth support for xterm

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Tue Aug 17 1999 - 03:38:59 EDT


Asmus Freytag wrote on 1999-08-17 02:19 UTC:
> Markus,
>
> You should make clear in your function headers what version of the standard
> the data apply to. The current most up-to-date version is Unicode-3.0.0.beta.

Will do. The iscombining table was automatically generated from
Unicode-3.0.0.beta; the iswide table was manually generated from the
TR 11 tables (because I also wanted to cover unassigned code numbers
in purely wide blocks, in order to make it less likely that the table
will have to be changed much with future extensions).

> Second, characters with EastAsianWidth A may also well be wide in the given
> application domain. A means iswide returns either true or false depending
> on other context information (i.e. language or locale id, knowledge of
> ultimate data source or destination being an EA legacy character set etc.).

This is exactly what I want to avoid! I wrote these functions to give a
clear guideline as to which characters xterm should take from the
fullwidth font (say 12x13ja.bdf) and which from the halfwidth font (say
6x13.bdf). In this way I want to eliminate exactly the ambiguities that
TR 11 leaves here. The goal is to have a fixed and predictable wide
character behaviour for VT100 terminal emulators. The goal is
deliberately NOT to be backwards compatible with the cruel and unusual
wide-character behaviour of the various pre-Unicode East Asian legacy
character sets. Their wide Latin/Greek/Cyrillic characters look just
plain ugly, and they are not something we should carry over into the
Unicode terminal emulator world. Therefore, I suggest that xterm,
kermit, the Linux console, etc. all take any character with the
EastAsian Ambiguous (A) property from the normal (half-width) font (one
character = one cell). The implementations are free to offer special
additional backwards-compatibility modes for EUC-style behaviour, etc.,
but this should not be the normal standard mode of operations under
UTF-8.

I do NOT want to need complex termcap extensions with tables that
describe which characters a terminal emulator treats as wide or
non-spacing. I want to have a simple standard here. The simple rule
should be that everything that smells like a CJK ideograph (i.e.,
belongs to the W or F class in TR 11) occupies two cells, and the rest
occupies one cell. Nonspacing characters in category Mn or Me occupy
zero cells. That should be all there is.
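
To make the rule concrete, here is a minimal sketch in C. The names
cell_width, in_table and struct interval are made up for this
illustration, and the two tables are tiny excerpts (whole blocks, for
brevity), not the real iscombining and iswide data:

```c
#include <stddef.h>

/* Inclusive range of code points. */
struct interval { unsigned long first, last; };

/* ASSUMPTION: illustrative excerpts only, sorted by code point.
 * The real tables would be derived from the Unicode data files. */
static const struct interval combining[] = {
    { 0x0300, 0x036F },   /* Combining Diacritical Marks block */
    { 0x20D0, 0x20FF },   /* Combining Marks for Symbols block */
};

static const struct interval wide[] = {
    { 0x4E00, 0x9FFF },   /* CJK Unified Ideographs block (W) */
    { 0xAC00, 0xD7A3 },   /* Hangul syllables (W) */
    { 0xFF01, 0xFF5E },   /* fullwidth ASCII variants (F) */
};

/* Binary search: returns 1 if ucs lies in one of the n sorted ranges. */
static int in_table(unsigned long ucs, const struct interval *t, size_t n)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ucs > t[mid].last)
            lo = mid + 1;
        else if (ucs < t[mid].first)
            hi = mid;
        else
            return 1;
    }
    return 0;
}

/* Number of character cells occupied by ucs: 0, 1 or 2.
 * Everything not listed as combining or wide, including the
 * EastAsian Ambiguous (A) characters, gets exactly one cell. */
int cell_width(unsigned long ucs)
{
    if (in_table(ucs, combining, sizeof combining / sizeof combining[0]))
        return 0;
    if (in_table(ucs, wide, sizeof wide / sizeof wide[0]))
        return 2;
    return 1;
}
```

With these excerpts, cell_width(0x4E2D) is 2, cell_width(0x0301) is 0,
and an ambiguous character such as GREEK SMALL LETTER ALPHA (U+03B1)
falls through to 1 without any locale lookup.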

Sure, I could think of more clever arrangements, in which some selected
Latin characters are also taken from the wide font. Especially the
various Latin and Cyrillic digraphs come to mind, as do the ligatures
and the em dash. But I feel that this would just complicate things
unnecessarily and that introducing a few double-width characters into
the Latin script is not really worth the hassle. Life stays simple if
terminal users who do not use CJK ideographs never have to worry about
full-width characters.

> Finally, a small compiler that reads the data files and produces the source
> code you showed, would be so much more useful as it would allow people to
> update from the Unicode data base.

I'll probably do this for iscombining. For iswide, I'll probably only
provide a verifier that checks that the function nowhere contradicts
the latest Unicode tables. I want unassigned positions in the CJK
blocks to map to wide characters, to accommodate the most likely future
extensions, and that is easier done with manual table generation.
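
Such a verifier could look roughly like the sketch below. Everything
here is hypothetical: the record layout, the names count_contradictions
and demo_iswide, and the handful of sample records stand in for the
real iswide function and for data parsed from the TR 11 tables. The
point is only that code points without a record (i.e. unassigned ones)
are never checked, so the hand-made table stays free to call them wide:

```c
#include <stddef.h>

/* East Asian Width classes as named in TR 11. */
enum ea_width { EA_N, EA_NA, EA_H, EA_A, EA_W, EA_F };

/* One assigned code point together with its published width class. */
struct ea_record { unsigned long ucs; enum ea_width width; };

/* ASSUMPTION: stand-in for the real iswide(); covers only two blocks. */
static int demo_iswide(unsigned long ucs)
{
    return (ucs >= 0x4E00 && ucs <= 0x9FFF) ||
           (ucs >= 0xAC00 && ucs <= 0xD7A3);
}

/* ASSUMPTION: a few hand-picked records instead of a parsed data file. */
static const struct ea_record sample[] = {
    { 0x0041, EA_NA },   /* LATIN CAPITAL LETTER A */
    { 0x03B1, EA_A  },   /* GREEK SMALL LETTER ALPHA */
    { 0x4E2D, EA_W  },   /* a CJK ideograph */
    { 0xAC00, EA_W  },   /* first Hangul syllable */
    { 0xFF71, EA_H  },   /* HALFWIDTH KATAKANA LETTER A */
};

/* Returns how many records the given iswide function contradicts.
 * Only W and F must be wide; every other class, A included, must
 * be narrow. Code points that have no record are not examined. */
size_t count_contradictions(int (*iswide)(unsigned long),
                            const struct ea_record *recs, size_t n)
{
    size_t i, bad = 0;

    for (i = 0; i < n; i++) {
        int expect_wide = (recs[i].width == EA_W || recs[i].width == EA_F);
        if ((iswide(recs[i].ucs) != 0) != expect_wide)
            bad++;
    }
    return bad;
}
```

A result of zero from
count_contradictions(demo_iswide, sample, sizeof sample / sizeof sample[0])
means the function agrees with every record it was shown; it says
nothing about positions the data file does not mention.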

The time is now really ripe to think about writing a draft technical
report on guidelines for Unicode extensions to VT100/ISO 6429 terminal
emulators such as xterm (and perhaps even MS-Windows' ANSI.SYS ;-). I
have already presented my ideas related to biwidth and combining
characters. I am still quite puzzled about bidi behaviour and also a bit
about how terminal emulators should react to the PARAGRAPH SEPARATOR
and LINE SEPARATOR characters or the Plane 14 language tags. There
should also be some strict conventions on precisely how to react to the
various types of illegal and overlong UTF-8 sequences, so that the host
can still predict the exact cursor position even after such sequences.
For handling illegal UTF-8 sequences, I suggest that the current XFree86
xterm algorithm is adequate.
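
To illustrate what a strict convention of this kind might look like,
here is a sketch of a decoder with one fixed policy. It is NOT claimed
to be the XFree86 xterm algorithm; the policy is an assumption made
purely for this example: every malformed, overlong or out-of-range
sequence yields exactly one U+FFFD, and a truncated sequence consumes
only the bytes seen before the offending byte. The name utf8_decode is
made up, and the sketch accepts sequences of at most four bytes:

```c
#include <stddef.h>

#define REPLACEMENT 0xFFFDul   /* U+FFFD REPLACEMENT CHARACTER */

/* Decodes one character from the n bytes at s and stores it in *out.
 * Returns the number of bytes consumed (0 only if n is 0). */
size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *out)
{
    /* Smallest code point that legitimately needs a given length. */
    static const unsigned long min_value[] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned long ucs;
    size_t len, i;

    if (n == 0) {
        *out = REPLACEMENT;
        return 0;
    }
    if (s[0] < 0x80) {                    /* plain ASCII */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; ucs = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; ucs = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; ucs = s[0] & 0x07;
    } else {                              /* stray continuation or bad lead */
        *out = REPLACEMENT;
        return 1;
    }

    for (i = 1; i < len; i++) {
        if (i >= n || (s[i] & 0xC0) != 0x80) {
            *out = REPLACEMENT;           /* truncated sequence */
            return i;                     /* leave the offending byte alone */
        }
        ucs = (ucs << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates and values beyond U+10FFFF. */
    if (ucs < min_value[len] || ucs > 0x10FFFFul ||
        (ucs >= 0xD800 && ucs <= 0xDFFF))
        *out = REPLACEMENT;
    else
        *out = ucs;
    return len;
}
```

Under this policy each call emits exactly one character, so a host that
knows the rule can work out how far the cursor moved even after sending
garbage, provided it also knows the width of U+FFFD.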

The result could be published both as a Unicode technical report and
as an addendum to ECMA 48 and ISO 6429.

Interested?

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


