Re: bidi support for xterm

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Mon Aug 16 1999 - 18:04:32 EDT


Juliusz Chroboczek wrote on 1999-08-16 16:44 UTC:
> Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk>:
>
> MK> However, mere implementations of the Unicode bidi algorithm are far from
> MK> what we need to really understand how to handle bidi text in xterm or
> MK> other VT100/ISO 6429 emulators.
>
> Before we can ever start wondering about this, we need to decide how
> much is to be handled in the terminal emulator and how much in the
> application.

Fully agreed. When we embed UTF-8 in the very fabric of GNU/Linux,
I feel that it is extremely helpful to discuss how applications should
behave by discussing three very simple benchmark applications:

  a) echo (or cat)
  b) ls
  c) readline()

a) Programs such as echo or cat should require no modifications
whatsoever in order to be usable in a UTF-8 environment.

  => it would be desirable if xterm automatically applied the
     Unicode bidi algorithm to the sequences of characters that it
     receives.

b) It is unavoidable that programs such as "ls" need some
modification before they become fully usable under UTF-8. The critical
characteristic of ls here is that it outputs file names in a table
layout, and therefore it has to predict how many character cells a
string will occupy, which in turn determines the number of table
columns that can be used. Three considerations have to be made here:

  1) ls must know that UTF-8 bytes of the form 10xxxxxx do not
     occupy their own character cell as they are continuation bytes

  2) ls must know that combining characters do not occupy their own
     character cell

  3) ls must know that characters with the East Asian Wide or FullWidth
     property (see TR #11) occupy two character cells.
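
A minimal sketch of these three rules, assuming Python and its standard
unicodedata module. The function names and the 80-cell terminal width
are hypothetical, chosen only for illustration. Treating "combining" as
"non-zero canonical combining class" is a simplification: some
zero-width characters have class 0, and control characters are not
handled at all.

```python
import unicodedata

def count_code_points(raw: bytes) -> int:
    """Rule 1: bytes of the form 10xxxxxx are continuation bytes, skip them."""
    return sum(1 for b in raw if (b & 0xC0) != 0x80)

def cell_width(ch: str) -> int:
    """Rules 2 and 3: number of character cells one character occupies."""
    if unicodedata.combining(ch) != 0:
        return 0  # combining character: shares the previous cell
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # East Asian Wide or FullWidth
    return 1

def display_width(raw: bytes) -> int:
    """Predicted number of cells for a UTF-8 encoded file name."""
    return sum(cell_width(ch) for ch in raw.decode("utf-8"))

def table_columns(names, terminal_cells=80, gap=2):
    """How many table columns fit if every column is as wide as the widest name."""
    widest = max(display_width(n) for n in names) + gap
    return max(1, terminal_cells // widest)
```

The point of the sketch is that none of the three counts (bytes, code
points, cells) can stand in for another: "e" followed by U+0301 is three
bytes, two code points, and one cell.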

I frequently hear that programs such as ls should internally convert
everything to wchar_t, as if this would solve all Unicode problems. This
is naive, because in the context of combining and East Asian FullWidth
characters, there still is no 1:1 relationship between character cells
and wchar_t characters. The temporary decoding into wchar_t is still
of use here, because it will simplify the tests for whether a character
is combining or wide, but this decoding can be done on the fly for just
one character at a time. There is no need to keep wchar_t strings
internally.
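
The on-the-fly decoding could look roughly like the following Python
sketch, in which a plain integer plays the role of the temporary
wchar_t. The names iter_code_points and width_on_the_fly are
hypothetical. The decoder only checks the basic byte structure; it is
assumed that a production decoder would also reject overlong forms and
surrogates.

```python
import unicodedata

def iter_code_points(raw: bytes):
    """Yield one code point at a time from UTF-8 bytes, keeping no decoded string."""
    it = iter(raw)
    for b in it:
        if b < 0x80:
            need, cp = 0, b            # 0xxxxxxx: single byte
        elif b < 0xC0:
            raise ValueError("unexpected continuation byte")
        elif b < 0xE0:
            need, cp = 1, b & 0x1F     # 110xxxxx
        elif b < 0xF0:
            need, cp = 2, b & 0x0F     # 1110xxxx
        elif b < 0xF8:
            need, cp = 3, b & 0x07     # 11110xxx
        else:
            raise ValueError("invalid lead byte")
        for _ in range(need):
            nxt = next(it, None)
            if nxt is None or (nxt & 0xC0) != 0x80:
                raise ValueError("truncated or malformed sequence")
            cp = (cp << 6) | (nxt & 0x3F)
        yield cp

def width_on_the_fly(raw: bytes) -> int:
    """Sum cell widths while decoding; only one character exists at a time."""
    cells = 0
    for cp in iter_code_points(raw):
        ch = chr(cp)
        if unicodedata.combining(ch) != 0:
            continue                   # combining: no cell of its own
        cells += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return cells
```

Only the running cell count and the current character are kept, so the
file name itself can stay in its UTF-8 form throughout.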

I hope we can find a solution for xterm such that ls can remain fully
ignorant with regard to the bidi properties of characters in filenames.
But then considering that RTL script tables would naturally be right
aligned, I do not think that there is a perfect solution for how ls
should behave with Hebrew file names.

When I write xterm here, I mean of course all VT100 emulators, including
also kermit, the Linux console, etc. All these should behave
identically here, which will require careful discussion and
documentation.

c) Editors such as the readline() function (the command line editor
of the shell and similar programs) will need to be fully aware of
the bidirectional characteristics of the edited string, because this
affects cursor control and numerous other things. For readline,
it would be best if (as Juliusz suggested) xterm remained ignorant
of the bidi properties of characters and always remained in LTR mode,
as it does today. Then an editor such as readline() could fully
control the correct appearance and behaviour of the bidi-capable editing
functionality. However, if xterm also aims at taking over the bidi
formatting for applications such as cat, the situation becomes doubly
complicated, because not only does readline() have to take care of its
own bidi functionality, but it also has to undo whatever xterm will try
to apply. If things worked out extremely elegantly, it might be possible
that the double complexities that arise for an editor cancel each
other out, but I have no implementation experience here.

The big questions are:

A) Do we want to have for Linux that plain text files, file names,
streams, environment variables, etc. follow the Unicode convention that
characters always appear in reading order in memory?

B) Do we want that trivial applications such as cat, echo, and even ls
can remain fully ignorant about bidi characters?

I think the answer should be YES in both cases if we want to take
ubiquitous Hebrew/Arabic support really seriously.

This requires that the bidi algorithm is by default actively operating
in xterm. It might be a good idea to provide some ESC sequences to
deactivate the Unicode bidi algorithm in xterm for full-screen editors
such as readline, vi, emacs, etc. in order to make life simpler for the
authors of these. We could simply specify that any editing ESC sequence
(cursor repositioning, character inserting/deleting, line inserting/
deleting, etc.) automatically deactivates the Unicode bidi algorithm in
xterm, and it is then the responsibility of the application to handle
bidi formatting itself. Every received LF would indicate that plaintext
output has started again and would put the xterm back into the default
mode, in which it executes the bidi algorithm (starting with the line
that was terminated by the first LF). Executing the bidi algorithm in
xterm only when an LF is received, and letting it operate on all
characters received since the last LF or other cursor manipulation might
indeed be one possible simple model of how bidi could potentially be
integrated smoothly into xterm operation.
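
As a toy illustration of this model (Python, all names hypothetical):
characters are buffered until an LF arrives, and the reordering then
runs on everything received since the last LF or editing sequence. The
function toy_reorder is only a placeholder that reverses runs of strong
RTL characters; it is NOT the Unicode bidi algorithm. It is also an
assumption of this sketch that text received before an editing sequence
simply stays in the order in which it arrived, and real cursor
positioning is ignored entirely.

```python
import unicodedata

def toy_reorder(line: str) -> str:
    """Placeholder for the bidi algorithm: reverse each run of class R/AL chars."""
    out, run = [], []
    for ch in line:
        if unicodedata.bidirectional(ch) in ("R", "AL"):
            run.append(ch)
        else:
            out.extend(reversed(run))
            run.clear()
            out.append(ch)
    out.extend(reversed(run))
    return "".join(out)

class LineModel:
    """Toy terminal that reorders only when an LF is received."""

    def __init__(self):
        self.pending = []  # characters since the last LF or editing sequence
        self.fixed = []    # text already committed in the order received
        self.screen = []   # finished lines, in visual order

    def receive_editing_sequence(self):
        """Stand-in for any cursor movement or insert/delete sequence."""
        self.fixed.append("".join(self.pending))
        self.pending.clear()

    def receive_text(self, text: str):
        for ch in text:
            if ch == "\n":
                self._finish_line()
            else:
                self.pending.append(ch)

    def _finish_line(self):
        # The LF triggers reordering of what arrived since the last
        # LF or editing sequence, and starts a fresh plain-text line.
        self.fixed.append(toy_reorder("".join(self.pending)))
        self.screen.append("".join(self.fixed))
        self.fixed.clear()
        self.pending.clear()
```

Feeding "ab", three Hebrew letters, "cd" and an LF yields a line whose
Hebrew run is reversed, whereas Hebrew text sent before an editing
sequence is left exactly as the application sent it.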

Whether something like this is practical, only practical experience will
show.

> I would be glad to know which model is mandated by ISO 6429.

You can find ISO 6429 = ECMA 48 on

  http://www.ecma.ch/stand/ECMA-048.HTM

ISO 6429 predates Unicode and therefore does (if I remember correctly)
make no provision for the terminal automatically handling bidi switching
as Unicode requires it. ISO 6429 is therefore completely useless here
for the goal that "cat", "echo", and "ls" can remain ignorant about
the bidi properties of characters, while we keep the writing order
in stored strings. We need something beyond ISO 6429. ISO 6429 does
provide ESC sequences with which the application can switch the writing
direction. These are pretty useless, because if the application knows
already about the bidi properties of the characters that it sends to the
terminal, then it can as well do the string reversing itself instead of
telling the terminal in which direction to write. ISO 6429 is only
useful if you want to switch the entire terminal into a pure RTL mode
for exclusive display of Hebrew texts, something that is rarely required
among the followers of the Unix religion, whose sacred applications tend
to clutter output with Latin fragments (ever seen hexadecimal numbers
and "http://" written purely in Hebrew? :-).

My personal (and probably politically incorrect) conclusion:

We can try to implement bidi in xterm, but no matter what we do, it is
guaranteed to become a big mess in any case. I personally would much
prefer to restrict usage of correctly Unicode-encoded Hebrew and Arabic
to special applications that take care of bidi completely (web
browsers, multilingual word processors), and to recommend that
Arabic/Hebrew scripts are for the time being considered to be unsuitable
for use in file names, shell commands, and other low-level usages. The
same might perhaps apply to some of the Indic scripts with more complex
rendering requirements. Better to keep these out of xterm.

Very old scripts such as Hebrew and Arabic are RTL, because this is
optimal for right-handed chiseling of letters into stone with a hammer.
More recent scripts are LTR (or TTB), because this is optimal for
right-handed writing with ink on paper. RTL scripts are perhaps an
example where we are much better off (in terms of implementation
complexity) if we do *not* try to support supposed user requirements on
all levels (e.g., file names) at any cost, or we will quickly end up
with de facto unmaintainable software.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


