Re: UTF-8 and POSIX

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Wed Jun 23 1999 - 15:43:19 EDT


Keld Jørn Simonsen wrote on 1999-06-23 17:40 UTC:
> On Wed, Jun 23, 1999 at 07:37:15AM -0700, Markus Kuhn wrote:
> > Is there any work going on to review the POSIX.1 and POSIX.2 standards
> > systematically to add proper UTF-8 support? For instance,
> > the terminal driver can be set into a "cooked" mode where a
> > single-line editing mechanism is applied before sending a line to an
> > application, and the implementation of the erase function there has to
> > know how many bytes to remove when a character is erased, which makes a
> > difference between UTF-8 and ISO 8859-1 for instance. There should be a
> > standard way to tell the terminal that it is in UTF-8 mode and has to
> > perform character erase actions accordingly.
>
> Hmm, why should UTF-8 support differ here from say EUC support?
> The support should be there already.

I see neither EUC nor UTF-8 support in any POSIX document for system
calls such as tcsetattr() that would allow me to tell the terminal in
c_lflag|ICANON mode how many bytes to remove when it receives an ERASE
character. I don't care much about EUC support, because this is not an
ISO standard, but UTF-8 is one and should be fully and consistently
supported here IMHO.
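
To make the gap concrete, here is a minimal sketch of what the POSIX.1
termios interface does let a program configure for canonical mode. The
helper name set_canonical_erase() is hypothetical and only for
illustration; tcgetattr(), tcsetattr(), ICANON and VERASE are the
standard names.

```c
#include <termios.h>
#include <unistd.h>

/* Hypothetical helper (the name is illustrative, not from any standard):
 * switch the terminal on fd into canonical mode and choose the ERASE
 * character. Returns 0 on success, -1 on error. */
int set_canonical_erase(int fd, cc_t erase_char)
{
    struct termios t;

    if (tcgetattr(fd, &t) == -1)
        return -1;

    t.c_lflag |= ICANON;          /* line editing is done by the driver */
    t.c_cc[VERASE] = erase_char;  /* which input byte triggers ERASE */

    /* POSIX.1 termios has no flag or field at this point that says how
     * many bytes one character occupies, so the encoding of the line
     * buffer cannot be communicated through this interface. */
    return tcsetattr(fd, TCSANOW, &t);
}
```

A call such as set_canonical_erase(STDIN_FILENO, 0x7f) selects DEL as
the ERASE character, but nothing in the call can express "one ERASE
removes one UTF-8 character".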

Vendors are setting up proprietary and non-portable solutions to work
around such deficiencies in the POSIX standard regarding UTF-8. For
example (quoting from an email from Tomas Vanhala
<vanhala@ling.helsinki.fi>):

   I am curious of this, because at least on Solaris 7, it is also
   possible to utilize the UTF-8 locale support built into the OS.

   If you go to http://docs.sun.com/, choose the "Solaris 7 Software
   Developer Collection" and then the "Solaris Internationalization Guide
   For Developers", you will find that the document contains a section
   titled "Overview of en_US.UTF-8 Locale Support". The paragraph
   "TTY Environment Setup" of the subsection "System Environment"
   explains some UTF-8 specific STREAMS modules, e.g.

   /usr/kernel/strmod/eucu8 UTF-8 STREAMS module for tail side
   /usr/kernel/strmod/u8euc UTF-8 STREAMS module for head side

   Further down on the page, it is stated that:

   The dtterm(1) and any terminal that supports input and output of the
   UTF-8 codeset should have the following STREAMS configuration:

   head <-> ttcompat <-> u8euc <-> ldterm <-> eucu8 <-> pseudo-TTY

   This can be set up with the strchg(1) user-level program, if the
   appropriate kernel modules have been loaded.

Is this really specified by POSIX?

The Linux version of stty and the tty driver in the kernel are currently
being extended to accommodate UTF-8. Unfortunately, POSIX.1:1996
does not give us any guidance on how to do this in a portable way. (See
<ftp://ftp.ilog.fr/pub/Users/haible/utf8/> for the patches.)
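
What a UTF-8-aware ERASE has to compute can be sketched as follows.
This is only an illustration under stated assumptions, not the code
from the patches above: the function names are hypothetical, the line
buffer is assumed to hold well-formed UTF-8, and combining characters
and display width are ignored.

```c
#include <stddef.h>

/* Bytes to drop from the end of the line buffer for one ERASE when
 * every character is a single byte (e.g. ISO 8859-1). */
size_t erase_len_single_byte(const unsigned char *buf, size_t len)
{
    (void)buf;                 /* the content does not matter here */
    return len > 0 ? 1 : 0;
}

/* Bytes to drop for one ERASE when the buffer holds UTF-8: walk back
 * over continuation bytes (bit pattern 10xxxxxx) and stop after the
 * first byte that is not one, i.e. the start of the last character. */
size_t erase_len_utf8(const unsigned char *buf, size_t len)
{
    size_t n = 0;

    while (n < len) {
        unsigned char c = buf[len - 1 - n];
        n++;
        if ((c & 0xC0) != 0x80)
            break;             /* reached an ASCII or lead byte */
    }
    return n;
}
```

For a buffer ending in the two bytes 0xC3 0xA4 (a-umlaut in UTF-8) the
UTF-8 variant removes two bytes, while the single-byte variant removes
one and leaves a dangling lead byte behind.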

> We have in WG20 enhanced the locale syntax to be able to cater for
> ISO 10646 in the forthcoming ISO/IEC 14652 TR.

Very interesting! URL???

> UTF-8 does not need to be implemented as a charmap, it could be
> implemented as something special.

If there really is now a new syntax defined to activate this "something
special" in the locale definition files, then I am very happy to hear
that and I am looking forward to seeing the details.

> > Does anyone know the current status of UTF-8 and POSIX?
>
> I wrote a paper on 10646 support for WG15, which is now
> included in the current draft of TR 14766. Its basic idea was using UTF-8
> as a standard in all POSIX standards.

I know of

  http://www.cl.cam.ac.uk/~mgk25/ucs/iso-tr-14766.txt

which I had to dig, with some Emacs artistry, out of a proprietary word
processing file format found at

  http://anubis.dkuug.dk/jtc1/sc22/wg15/iso14766/gnp3.wp

Hm, but this contains little that wasn't already obvious from the old
USENIX Pike/Thompson Plan9/FSS-UTF paper in

  ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/UTF-8-Plan9-paper.ps.gz

Is there an updated version of your paper available that also covers the
newer, less obvious issues such as non-charmap processing in locale
specifications and tcsetattr() kernel terminal driver configuration for
UTF-8?

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


