Re: UTF-8 and POSIX

From: Ienup Sung (ienup.sung@eng.sun.com)
Date: Wed Jun 23 1999 - 19:54:24 EDT


(As for the Solaris part of your email, I know you didn't actually ask it as a
question, but I would like to answer it anyway, as below.)

POSIX/XPG specifications do not dictate every detail of how a particular
codeset/encoding is implemented. I certainly agree with you, though, that in
certain areas it would be nice to have a specific definition in
the specifications, e.g., which control mode I should pass to
stty(1) to make the line discipline of the current Stream understand
the current locale (which contains the codeset). That's something
we would like to have in a future version of the specifications. Other things,
such as which STREAMS modules I should have in my Stream and how to stack them,
should not be defined by the specification, because I believe those are
implementation-specific.
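
To make the gap concrete, here is a minimal sketch in C of what the standard
termios interface does let a program set for canonical mode. The helper names
set_cooked() and apply_cooked() are hypothetical, chosen only for this
illustration; tcgetattr(), tcsetattr(), ICANON, ECHO, ECHOE and VERASE are the
standard POSIX names. As the quoted message below points out, there is no
standard field here that tells the line discipline which codeset is in use.

```c
#include <termios.h>

/* Hypothetical helper: enable canonical ("cooked") mode in *t and choose
 * the ERASE key. It only edits the structure; nothing is applied yet. */
static void set_cooked(struct termios *t, cc_t erase_char)
{
    t->c_lflag |= ICANON | ECHO | ECHOE; /* line editing, echo, visual erase */
    t->c_cc[VERASE] = erase_char;        /* which input byte means "erase" */
    /* Nothing standard in this structure names the codeset, so the line
     * discipline cannot learn from it that a character may span bytes. */
}

/* Hypothetical helper: read the settings of fd, modify them, write them
 * back. Returns 0 on success and -1 if fd is not a usable terminal. */
static int apply_cooked(int fd, cc_t erase_char)
{
    struct termios t;

    if (tcgetattr(fd, &t) != 0)
        return -1;
    set_cooked(&t, erase_char);
    return tcsetattr(fd, TCSANOW, &t);
}
```

A call such as apply_cooked(0, 0x7F) would configure standard input with DEL
as the ERASE key; whether ERASE then removes one byte or one character is left
to the implementation.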

In any case, the method used in the Solaris 2.6 and 7 implementations is simply
one way of supporting UTF-8 in the line discipline component. We are also
currently trying to improve the support in the ldterm(7M) STREAMS module
so that, as you mentioned, stty(1) and ldterm(7M) are the only things
needed for a complete UTF-8 line discipline for Unicode locales in
the next major release of Solaris.
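
The character-erase problem raised in the quoted message can be sketched in a
few lines of C. This is only an illustration of the byte counting a UTF-8-aware
line discipline has to do, not the ldterm(7M) code. The function name
erase_len() is hypothetical, and the sketch assumes sequences of at most four
bytes and does not validate that the lead byte matches the number of
continuation bytes found.

```c
#include <stddef.h>

/* Hypothetical helper: return how many bytes at the end of buf[0..len)
 * form the last character, i.e. how many bytes ERASE should remove.
 * utf8 == 0 models a single-byte codeset such as ISO 8859-1. */
static size_t erase_len(const unsigned char *buf, size_t len, int utf8)
{
    size_t n = 0;

    if (len == 0)
        return 0;
    if (!utf8)
        return 1;
    /* Step back over trailing continuation bytes (bit pattern 10xxxxxx). */
    while (n < len && n < 3 && (buf[len - 1 - n] & 0xC0) == 0x80)
        n++;
    /* Include the lead byte (or a lone ASCII byte) if one is left. */
    if (n < len)
        n++;
    return n;
}
```

For the input line "a" followed by the two-byte sequence C3 A9 (e with acute
accent), this returns 2 in UTF-8 mode and 1 in single-byte mode, which is
exactly the difference the terminal driver needs to be told about.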

With regards,

Ienup

] Date: Wed, 23 Jun 1999 12:36:48 -0700 (PDT)
] From: Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk>
] Subject: Re: UTF-8 and POSIX
] To: Unicode List <unicode@unicode.org>
] MIME-version: 1.0
] Content-transfer-encoding: quoted-printable
]
] Keld Jørn Simonsen wrote on 1999-06-23 17:40 UTC:
] > On Wed, Jun 23, 1999 at 07:37:15AM -0700, Markus Kuhn wrote:
] > > Is there any work going on to review the POSIX.1 and POSIX.2 standards
] > > systematically to add proper UTF-8 support? For instance,
] > > the terminal driver can be set into a "cooked" mode where a
] > > single-line editing mechanism is applied before sending a line to an
] > > application, and the implementation of the erase function there has to
] > > know how many bytes to remove when a character is erased, which makes a
] > > difference between UTF-8 and ISO 8859-1 for instance. There should be a
] > > standard way to tell the terminal that it is in UTF-8 mode and has to
] > > perform character erase actions accordingly.
] >
] > Hmm, why should UTF-8 support differ here from say EUC support?
] > The support should be there already.
]
] I see neither EUC nor UTF-8 support in any POSIX document for system
] calls such as tcsetattr() that would allow me to tell the terminal in
] c_lflag|ICANON mode how many bytes to remove when it receives an ERASE
] character. I don't care much about EUC support, because this is not an
] ISO standard, but UTF-8 is one and should be fully and consistently
] supported here IMHO.
]
] Vendors are setting up proprietary and non-portable solutions to work
] around such deficiencies in the POSIX standard regarding UTF-8. For
] example (quoting from an email from Tomas Vanhala
] <vanhala@ling.helsinki.fi>):
]
] I am curious of this, because at least on Solaris 7, it is also
] possible to utilize the UTF-8 locale support built into the OS.
]
] If you go to http://docs.sun.com/, choose the "Solaris 7 Software
] Developer Collection" and then the "Solaris Internationalization Guide
] For Developers", you will find that the document contains a section
] titled "Overview of en_US.UTF-8 Locale Support". The paragraph
] "TTY Environment Setup" of the subsection "System Environment"
] explains some UTF-8 specific STREAMS modules, e.g.
]
] /usr/kernel/strmod/eucu8 UTF-8 STREAMS module for tail side
] /usr/kernel/strmod/u8euc UTF-8 STREAMS module for head side
]
] Further down on the page, it is stated that:
]
] The dtterm(1) and any terminal that supports input and output of the
] UTF-8 codeset should have the following STREAMS configuration:
]
] head <-> ttcompat <-> u8euc <-> ldterm <-> eucu8 <-> pseudo-TTY
]
] This can be setup with strchg(1) user-level program, if the
] appropriate kernel modules have been loaded.
]
] Is this really specified by POSIX?
]
] The Linux version of stty and the tty driver in the kernel are currently
] being extended to accommodate UTF-8. Unfortunately, POSIX.1:1996
] does not give us any guidance on how to do this in a portable way. (See
] <ftp://ftp.ilog.fr/pub/Users/haible/utf8/> for the patches.)
]
] > We have in WG20 enhanced the locale syntax to be able to cater for
] > ISO 10646 in the forthcoming ISO/IEC 14652 TR.
]
] Very interesting! URL???
]
] > UTF-8 does not need to be implemented as a charmap, it could be
] > implemented as something special.
]
] If there is now really a new syntax defined to activate this "something
] special" in the locale definition files, then I am very happy to hear
] that and I am looking forward to seeing the details.
]
] > > Anyone knowing on the current status of UTF-8 and POSIX?
] >
] > I wrote a paper on 10646 support for WG15, which is now
] > included in the current draft of TR 14766. Its basic idea was to use UTF-8
] > as a standard in all POSIX standards.
]
] I know of
]
] http://www.cl.cam.ac.uk/~mgk25/ucs/iso-tr-14766.txt
]
] which I had to dig with Emacs artistic out of a proprietary word
] processing file format found on
]
] http://anubis.dkuug.dk/jtc1/sc22/wg15/iso14766/gnp3.wp
]
] Hm, but this contains not much that wasn't already obvious from the old
] USENIX Pike/Thompson Plan9/FSS-UTF paper in
]
]
] ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/UTF-8-Plan9-paper.ps.gz
]
] Is there an updated version of your paper available that also covers new
] less obvious stuff such as non-charmap processing in locale
] specifications and tcsetattr() kernel terminal driver configuration for
] UTF-8?
]
] Markus
]
] --
] Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
] Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
]


