UTF-8, ISO C Am.1, and POSIX

From: Markus G. Kuhn (kuhn@cs.purdue.edu)
Date: Tue Aug 12 1997 - 11:37:45 EDT


Keld Jørn Simonsen wrote on 1997-08-10 20:12 UTC:
> We have in the ISO POSIX WG been through all POSIX standards to see
> what changes we should make to the standards to accommodate UCS.

I guess pretty much the only thing the POSIX standard requires for UTF-8
is a standardized way to tell the locale mechanism that the character
encoding in use is UTF-8. UTF-8 is a little more than yet another character
table, so there should be a locale flag or something similar that allows
me to tell libc that UTF-8 is the encoding in use.
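
A rough sketch of what such a query could look like from the application
side, assuming the X/Open nl_langinfo(CODESET) interface from <langinfo.h>
is available and that the library reports the encoding under the name
"UTF-8". Both function names below are hypothetical, and the exact codeset
string is an assumption about the implementation:

```c
#include <langinfo.h>
#include <string.h>

/* Hypothetical helper: returns 1 if a codeset name denotes UTF-8.
   Assumption: the library spells the name exactly "UTF-8". */
int is_utf8_codeset_name(const char *codeset)
{
    return codeset != NULL && strcmp(codeset, "UTF-8") == 0;
}

/* Hypothetical helper: asks the locale mechanism which character encoding
   the current LC_CTYPE locale uses. The caller must already have selected
   a locale, e.g. with setlocale(LC_CTYPE, ""). */
int current_locale_is_utf8(void)
{
    return is_utf8_codeset_name(nl_langinfo(CODESET));
}
```

With something like this, an application would not need to inspect the
locale name itself at all.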

So far, my preliminary trick has been that libc assumes UTF-8 encoding is
in use if the name of the locale matches the pattern "*[uU][tT][fF]-?8*",
in anticipation of what typical UTF-8 based locale names will look like.
Parsing the locale name (LANG, LC_CTYPE, etc.) is probably not a nice
long-term solution, though, even if many applications do it (I think Emacs
checks for the substring 8859 in LANG and LC_CTYPE).
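
The name-matching trick described above can be sketched roughly as follows.
This is only an illustration, not the actual libc code: the function names
are hypothetical, the match is read as "utf", an optional hyphen, then "8",
anywhere in the name and in any letter case, and the lookup order
LC_ALL, LC_CTYPE, LANG is the usual POSIX precedence for the LC_CTYPE
category.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if name contains "utf8" or "utf-8"
   in any letter case, otherwise 0. */
int locale_name_is_utf8(const char *name)
{
    const char *p;

    if (name == NULL)
        return 0;
    for (p = name; *p != '\0'; p++) {
        const char *q;

        /* Each test only runs if the previous byte was not NUL,
           so we never read past the end of the string. */
        if (p[0] != 'u' && p[0] != 'U') continue;
        if (p[1] != 't' && p[1] != 'T') continue;
        if (p[2] != 'f' && p[2] != 'F') continue;
        q = p + 3;
        if (*q == '-')          /* the hyphen is optional */
            q++;
        if (*q == '8')
            return 1;
    }
    return 0;
}

/* Hypothetical helper: picks the environment variable that determines
   LC_CTYPE (first non-empty of LC_ALL, LC_CTYPE, LANG), or NULL. */
const char *ctype_locale_name(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    size_t i;

    for (i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *value = getenv(vars[i]);
        if (value != NULL && value[0] != '\0')
            return value;
    }
    return NULL;
}
```

A caller would then test locale_name_is_utf8(ctype_locale_name()). Note
that a name such as "en_US.ISO8859-1" is correctly rejected, but any locale
whose name merely happens to contain "utf8" would be accepted, which is one
reason this is only a stopgap.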

What is the state of standardization with regard to specifying in a
locale that we use UTF-8? What would enUS.UTF-8 look like?

It might also be useful if POSIX clarified how all the new
ISO C Am.1 functions for wide streams and multibyte strings work in
detail once the UTF-8 encoding has been selected in the locale. The
ISO C standard does not talk about UTF-8, and the multibyte string
concept is pretty abstract, so I feel implementors will have problems
independently coming up with compatible UTF-8 implementations of all the
ISO C Am.1 functions.
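
To make the question concrete, here is a small sketch that uses one of the
ISO C Am.1 restartable conversion functions, mbrtowc(), to count the
characters in a multibyte string. The function name count_multibyte_chars
is hypothetical. The code is written purely against the abstract multibyte
model, so what it returns for a UTF-8 string depends entirely on how the
library implements mbrtowc() for the selected locale, which is the part
that would benefit from being specified.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: counts the multibyte characters in s according to
   the LC_CTYPE locale currently in effect. Returns (size_t)-1 if s
   contains an invalid or truncated multibyte sequence. */
size_t count_multibyte_chars(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);
    size_t count = 0;

    memset(&state, 0, sizeof state);    /* start in the initial state */
    while (remaining > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, remaining, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete sequence */
        if (n == 0)
            break;                      /* decoded a null character */
        s += n;
        remaining -= n;
        count++;
    }
    return count;
}
```

In the default "C" locale a plain ASCII string simply yields its byte
length. After a successful setlocale(LC_CTYPE, ...) call that selects a
UTF-8 locale, one would expect a two-byte sequence such as "\xC3\xA9" to
count as a single character, but that expectation is exactly the kind of
behaviour implementations need to agree on.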

I'd be very interested in any work that has already been done in this
field, so that we do not have to reinvent any wheels for Linux.

Markus

-- 
Markus G. Kuhn, Computer Science grad student, Purdue
University, Indiana, USA -- email: kuhn@cs.purdue.edu


