Re: UTF-8 in Linux console

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Tue Jan 12 1999 - 16:26:37 EST


"H. Peter Anvin" wrote on 1999-01-12 19:41 UTC:
> > > I dropped kernel work on it after 1.3.0, when it looked like hpa was
> > > taking over. However, roughly speaking nothing happened afterwards,
> > > and the current situation is far from satisfactory.
> >
> > Same here.
>
> Yes, I think we've all rather suffered from the confusion of
> responsibility. I myself have been reluctant to go in because I felt
> I'd step on your toes...

Ah, thanks for making this clear. Another misunderstanding resolved,
because I had assumed that you had taken over.

I am *very* happy to hereby release any responsibility whatsoever that I
might still have had for the UTF-8 code in the Linux console driver to
you.

I will work on other UTF-8 fronts, e.g. the -fixed-*-iso10646-1 fonts
plus overall stimulation/coordination of various other Linux UTF-8
projects. Please consider yourself hereby to be fully in charge of UTF-8
and Unicode in the Linux console. Please do not worry about breaking any
backwards compatibility if this is necessary for a clean design, because
practically nobody understands or uses the existing UTF-8 console
support today (and I have the impression that the keyboard support is
currently broken with nobody even noticing it).

Suggestions for what can be fixed rather quickly (maybe even in 2.2?):

  - Please make clear in the documentation or comments that you are
    now maintaining the UTF-8 aspects of the console.

  - Please remove the old ESC % 8 activation code for UTF-8. I had
    introduced this only as a temporary hack since at that time the
    now official ESC % G was not yet defined by ISO/ECMA. I hate to
    see it being left in there for the next few decades in the name of
    unnecessary compatibility ... :-)

  - Please make sure that the ISO 2022 ESC codes switch both the console
    display and the console keyboard simultaneously. This is also what
    all other ISO 2022 terminal emulators (kermit, xterm) are expected
    to do.

  - Please add the three officially registered ISO 2022 ESC sequences

      ESC % / G
      ESC % / H
      ESC % / I

    as alternatives for ESC % G, but with the difference that when the
    switch to UTF-8 was done with one of these three, no return with ESC % @
    is possible (see <http://www.itscj.ipsj.or.jp/ISO-IR/>). These three
    ESC sequences announce the three levels of ISO 10646-1, but since for
    a terminal emulator the ISO 10646-1 implementation levels do not
    make any difference, just handle all three sequences as synonyms.
    It is nice to be able to permanently disable ISO 2022 for the remaining
    session, such that accidental binary dumps can't switch into an
    uncontrolled ISO 2022 state any more.

  - At the moment, illegal UTF-8 characters are silently ignored.
    I now believe that this is neither a good idea (makes debugging
    more difficult) nor in conformance with ISO 10646-1. Illegal
    UTF-8 sequences such as 0xfe, 0xff, and unexpected or missing
    10xxxxxx sequences should be indicated by the REPLACEMENT CHARACTER
    as specified in ISO 10646-1 section R.7 "Incorrect sequences of
    octets: Interpretation by receiving devices" (see
    <ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/ISO-10646-UTF-8.html>).

  - In my original code, certain non-spacing Unicode characters such as
    the "zero-width no-break space" were completely ignored and did not
    advance the cursor by one position. I now believe that this was a
    bad idea. *Every* graphical Unicode character should advance the
    cursor by exactly one cell in a VT100 emulator (we can't
    handle two-cell-wide East Asian characters in VGA text mode), and
    if the application wants to ignore a few characters such as U+FEFF
    or U+200B-200D for output on a VT100 terminal as non-spacing ones,
    then the application shall have the sole responsibility for removing
    these characters, and not the terminal emulator. Everything else
    just would make debugging and compatibility much more difficult.
    This is also what I will suggest to the authors of other UTF-8
    terminal emulators.

  - Unicode introduces two new control codes that the console driver is
    not handling at the moment. I suggest handling them as follows:

      U+2028 LINE SEPARATOR        handle just like CR LF
      U+2029 PARAGRAPH SEPARATOR   handle just like CR LF LF
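
To make the output-side suggestions above more concrete, here is a rough
Python sketch. It is not code from the console driver, and the names
decode_utf8 and advance_cursor are invented for this illustration. It
assumes the original UTF-8 definition with sequences of up to six bytes,
does not check for overlong forms, ignores line wrap, and is not claimed
to match section R.7 in every detail. It emits one U+FFFD per malformed
sequence and lets every graphical character advance the cursor by
exactly one cell.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def _lead_info(byte):
    """Return (continuation bytes expected, payload bits), or None if the
    byte cannot start a sequence."""
    if byte < 0x80:
        return 0, byte
    if 0xC0 <= byte <= 0xDF:
        return 1, byte & 0x1F
    if 0xE0 <= byte <= 0xEF:
        return 2, byte & 0x0F
    if 0xF0 <= byte <= 0xF7:
        return 3, byte & 0x07
    if 0xF8 <= byte <= 0xFB:
        return 4, byte & 0x03
    if 0xFC <= byte <= 0xFD:
        return 5, byte & 0x01
    return None  # stray 10xxxxxx byte, 0xFE or 0xFF


def decode_utf8(data):
    """Decode bytes to a list of code points, flagging bad input with
    U+FFFD instead of dropping it silently."""
    out = []
    pending = 0  # continuation bytes still expected
    value = 0
    for byte in data:
        if pending:
            if 0x80 <= byte <= 0xBF:
                value = (value << 6) | (byte & 0x3F)
                pending -= 1
                if pending == 0:
                    out.append(value)
                continue
            # Missing continuation byte: flag the truncated sequence,
            # then fall through and treat this byte as a fresh start.
            out.append(REPLACEMENT)
            pending = 0
        info = _lead_info(byte)
        if info is None:
            out.append(REPLACEMENT)
        else:
            pending, value = info
            if pending == 0:
                out.append(value)
    if pending:
        out.append(REPLACEMENT)  # input ended in the middle of a sequence
    return out


def advance_cursor(code_points):
    """Return (row, col) after writing the code points from (0, 0)."""
    row = col = 0
    for cp in code_points:
        if cp == 0x2028:    # LINE SEPARATOR: like CR LF
            row, col = row + 1, 0
        elif cp == 0x2029:  # PARAGRAPH SEPARATOR: like CR LF LF
            row, col = row + 2, 0
        elif cp == 0x0D:    # CR
            col = 0
        elif cp == 0x0A:    # LF
            row += 1
        elif cp < 0x20 or 0x7F <= cp <= 0x9F:
            pass  # other control codes are not modelled in this sketch
        else:
            col += 1  # every graphical character, even U+FEFF: one cell
    return row, col
```

For example, decode_utf8(b"a\xffb") yields three code points with U+FFFD
in the middle, so advance_cursor leaves the cursor in column 3 instead
of column 2, and the bad byte is visible on the screen.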

Most of these suggestions apply also to other VT100 terminal emulators
(kermit, xterm, etc.). Maybe I'll set up a web page covering such
suggested conventions for UTF-8 capable VT100 emulators.
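
As one example of such a convention, the ESC % handling suggested in the
list above could be modelled roughly as follows. This is only a sketch
under assumptions: track_mode is a made-up name, all other escape
sequences are ignored, and the old ESC % 8 is deliberately not
recognized. A single pair of flags is returned so that display and
keyboard can both consult the same state and therefore switch together.

```python
def track_mode(data, utf8=False, locked=False):
    """Scan a byte string and return (utf8, locked) after applying every
    ESC % sequence found in it.

    utf8   -- True while the terminal is in UTF-8 mode
    locked -- True once UTF-8 was entered via ESC % / G, H or I, after
              which ESC % @ no longer returns to ISO 2022
    """
    i = 0
    while i < len(data):
        if data[i:i + 2] == b"\x1b%":
            rest = data[i + 2:i + 4]
            if rest[:1] == b"G":
                # ESC % G: enter UTF-8, return with ESC % @ is allowed
                utf8 = True
                i += 3
                continue
            if rest[:1] == b"@":
                # ESC % @: back to ISO 2022, unless UTF-8 was locked in
                if not locked:
                    utf8 = False
                i += 3
                continue
            if rest[:1] == b"/" and rest[1:2] in (b"G", b"H", b"I"):
                # ESC % / G, H, I: treated as synonyms, enter UTF-8
                # with no way back for the rest of the session
                utf8 = True
                locked = True
                i += 4
                continue
        i += 1
    return utf8, locked
```

With this model, track_mode(b"\x1b%/G\x1b%@") leaves the terminal in
UTF-8 mode, which is the point of the lock: an accidental binary dump
containing ESC % @ cannot switch back into an ISO 2022 state.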

> Yes, it's a mess right now. Especially with fbcon simulating VGA
> limitations and all... yuck. I'm sort of interested in what the
> KGIcon people have been up to, as well.
>
> I also believe the choice of ioctl()s as the setting mechanism for a
> lot of these things was really bad. It makes it hard to implement
> things out of the kernel where appropriate.

Agreed.

> Anyway, I think we need to decide what we want to do in 2.3 and do
> it. If that means a rewrite from scratch, I'm still game...

Great.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


