UTF-8 and POSIX

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Wed Jun 23 1999 - 10:42:27 EDT


Is there any work going on to review the POSIX.1 and POSIX.2 standards
systematically to add proper UTF-8 support?

I don't think much has to be done, but there are a few crucial bits. For
instance, the terminal driver can be set into a "cooked" mode, in which a
single-line editing mechanism is applied before a line is sent to the
application. The implementation of the erase function there has to know
how many bytes to remove when a character is erased: always one byte in
ISO 8859-1, but a variable number of bytes in UTF-8. There should be a
standard way to tell the terminal driver that it is in UTF-8 mode and has
to perform character erase actions accordingly.
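
To make the erase problem concrete, here is a minimal sketch in C of what
a UTF-8-aware erase has to do. The function name utf8_erase_last is
hypothetical and not part of any POSIX interface, and the sketch assumes
the line buffer already holds well-formed UTF-8. It relies only on the
fact that every UTF-8 continuation byte has the bit pattern 10xxxxxx.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a POSIX interface): given a line buffer of
 * `len` bytes, return the length the buffer should have after the last
 * character has been erased. Assumes buf[0..len) is well-formed UTF-8.
 */
size_t utf8_erase_last(const unsigned char *buf, size_t len)
{
    if (len == 0)
        return 0;               /* nothing to erase */

    len--;                      /* drop the final byte of the character */

    /* Continuation bytes look like 10xxxxxx. Keep dropping bytes until
     * the byte at buf[len] is a lead byte or an ASCII byte. */
    while (len > 0 && (buf[len] & 0xC0) == 0x80)
        len--;

    return len;
}
```

An ISO 8859-1 erase would simply return len - 1; the loop above is the
only extra work that UTF-8 requires.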

Also, the syntax of the entire locale database mechanism was really
designed for small 8-bit character sets and becomes rather horrible when
applied to UTF-8. I get the impression that wchar_t <-> UTF-8 conversion
is supposed to be done by table lookup of UTF-8 byte sequences, as
opposed to the obvious conversion algorithm. UTF-8 would certainly
deserve some special treatment here as a recognized encoding in the
locale system.
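
For illustration, the algorithmic conversion can be sketched in a few
lines of C, with no tables involved. Everything named here is an
assumption made for the example: utf8_encode and utf8_decode are
hypothetical helpers, code points are held in a uint32_t rather than a
wchar_t (whose width is implementation-defined), and the sketch follows
the four-byte form of UTF-8 limited to U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode code point cp into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp cannot be encoded. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;           /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                   /* beyond the range assumed here */
}

/* Hypothetical helper: decode one character from s[0..n) into *cp.
 * Returns the number of bytes consumed, or 0 on malformed input. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    uint32_t c, min;
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        c = s[0] & 0x1F; len = 2; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        c = s[0] & 0x0F; len = 3; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        c = s[0] & 0x07; len = 4; min = 0x10000;
    } else {
        return 0;               /* stray continuation or invalid lead byte */
    }
    if (n < len)
        return 0;               /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;           /* expected a 10xxxxxx continuation byte */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and out-of-range values. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}
```

Both directions need only shifts and masks on the lead byte and the
continuation bytes, which is why a per-sequence lookup table in the
locale definition looks unnecessary for this particular encoding.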

Does anyone know the current status of UTF-8 support in POSIX?

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


