Re: UTF-8 and POSIX locales

From: Keld Jørn Simonsen (keld@dkuug.dk)
Date: Sat Jun 26 1999 - 12:58:05 EDT


On Thu, Jun 24, 1999 at 02:12:32AM -0700, Markus Kuhn wrote:
> Karlsson Kent - keka wrote on 1999-06-24 08:24 UTC:
> > It would be better to always use the data found in
> > the 'Unicode character database' file, rather than that found
> > in other lists of character properties, lists that are not kept
> > as up-to-date nor are as keenly reviewed. And even if they
> > so were, might still arbitrarily, and implicitly, diverge from
> > Unicode's data.
>
> Well, it should not be too difficult - I hope - to automatically
> generate one file from the other. If someone provides me access to the
> specification, I might find the time to write a small Perl script that
> does exactly that fully automatically and repeatably.

Yes, I think that could be done. The Internet character specifications
may have had much review too, and the ISO 15897 registrations have also
had much review (partly via the Internet process) and have the added
feature of being de jure standards.
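
The kind of conversion discussed above can be sketched briefly. This is
only an illustration, written in Python rather than the Perl mentioned
in the quoted text. It assumes the semicolon-separated field layout of
UnicodeData.txt (code point in field 0, general category in field 2,
simple uppercase mapping in field 12) and a <Uxxxx> style of symbolic
character names in the generated output. The function names and the
three embedded sample records are hypothetical; no real specification
file is read.

```python
# Sketch only: derive locale-source fragments from UnicodeData.txt-style
# records. The sample data and all function names are illustrative.
SAMPLE = """\
0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
"""


def parse_unicode_data(text):
    """Return one dict per record, keeping only the fields used below."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(";")
        records.append({
            "cp": int(fields[0], 16),   # code point, hexadecimal
            "category": fields[2],      # general category, e.g. "Lu"
            "upper": fields[12],        # simple uppercase mapping or ""
        })
    return records


def ucs_symbol(cp):
    """Format a code point as a <Uxxxx> symbolic name (assumed style)."""
    return "<U%04X>" % cp


def class_members(records, categories):
    """List the symbols whose general category is in `categories`."""
    return [ucs_symbol(r["cp"]) for r in records
            if r["category"] in categories]


def toupper_pairs(records):
    """List (lower,upper) pairs for records with an uppercase mapping."""
    return ["(%s,%s)" % (ucs_symbol(r["cp"]),
                         ucs_symbol(int(r["upper"], 16)))
            for r in records if r["upper"]]


if __name__ == "__main__":
    recs = parse_unicode_data(SAMPLE)
    print("digit   " + ";".join(class_members(recs, {"Nd"})))
    print("toupper " + ";".join(toupper_pairs(recs)))
```

Because the output is derived mechanically from the input records, the
script can be rerun whenever the source data changes, which is the
repeatability asked for above.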

> What I do not like at the moment about the locale mechanism is that it
> ties together the character encoding and the cultural conventions, which
> I think are two completely orthogonal things. Even worse, there are some
> systems out there that provide UTF-8 locales only tied to a cultural
> convention set, in the worst case they provide only en_US.UTF-8. So to
> get UTF-8, I would also have to accept the strange US date/time
> notations and in some programs even default settings for non-metric
> units and strange US paper sizes, US-specific terminology such as "ZIP
> code" instead of "postal code", etc., all derived from the "en_US" part
> of the locale name.

The locale and charmap concepts of POSIX are defined in an orthogonal way
and could be implemented orthogonally, e.g. with the locale always
stored internally in UCS and the charmaps then providing the
conversion in and out of UCS.
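
A minimal sketch of that separation, in Python: the charmap is reduced
to a decode/encode step at the boundary, and the rule itself only ever
sees UCS text. The function names are hypothetical, and Python's
built-in codecs and its default Unicode case mapping stand in for a real
charmap and a real locale definition, which is an assumption made only
for illustration.

```python
# Sketch only: the "charmap" converts in and out of UCS, while the
# locale-level operation works on UCS text and never sees the encoding.

def transform_via_ucs(data, charmap, operation):
    """Apply `operation` to bytes in `charmap` by going through UCS."""
    text = data.decode(charmap)      # charmap: external bytes -> UCS
    result = operation(text)         # rule applied on UCS text only
    return result.encode(charmap)    # charmap: UCS -> external bytes


def to_upper(text):
    """Stand-in for a locale rule; uses Python's default case mapping."""
    return text.upper()


if __name__ == "__main__":
    for charmap in ("iso-8859-1", "utf-8"):
        raw = "\u00e9t\u00e9".encode(charmap)
        print(charmap, transform_via_ucs(raw, charmap, to_upper))
```

The same to_upper rule serves both encodings; only the charmap argument
changes.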

> The standard should specify the name of some default locales that do
> specify an encoding, but that otherwise copy just the cultural
> conventions of the C or POSIX default locales. The names of these
> locales could for instance be
>
> POSIX.UTF-8
> POSIX.ISO_8859-1
>
> etc., or perhaps even better just
>
> UTF-8
> ISO_8859-1
>
> etc. I also like the idea of a standard locale named "ISO.UTF-8" or
> "international" that uses UTF-8 and fills in other cultural conventions
> according to ISO standards, e.g. ISO 8601 for the date/time notation
> and ISO 31 for the formatting of monetary units (currency appended with
> a space behind the number, just like any SI unit).
>
There is such a "locale" defined in ISO/IEC 14652.
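
To make the two axes in names such as en_US.UTF-8 or POSIX.UTF-8
concrete, here is a small Python sketch that splits a locale name into
its cultural part and its codeset, and swaps the codeset while leaving
the cultural part alone. It assumes the common
language[_territory][.codeset][@modifier] naming convention. The
function names are hypothetical, and nothing here is taken from the
text of any standard.

```python
# Sketch only: treat the codeset as a separate axis of a locale name.
# Assumes names of the form language[_territory][.codeset][@modifier].

def split_locale_name(name):
    """Split a locale name into its four conventional components."""
    rest, _, modifier = name.partition("@")
    rest, _, codeset = rest.partition(".")
    language, _, territory = rest.partition("_")
    return {
        "language": language,
        "territory": territory or None,
        "codeset": codeset or None,
        "modifier": modifier or None,
    }


def with_codeset(name, codeset):
    """Return `name` with its codeset replaced, cultural part unchanged."""
    parts = split_locale_name(name)
    out = parts["language"]
    if parts["territory"]:
        out += "_" + parts["territory"]
    out += "." + codeset
    if parts["modifier"]:
        out += "@" + parts["modifier"]
    return out


if __name__ == "__main__":
    print(split_locale_name("en_US.UTF-8"))
    print(with_codeset("POSIX", "UTF-8"))
    print(with_codeset("da_DK.ISO_8859-1", "UTF-8"))
```

Whether a system actually provides the resulting locale is a separate
question; the sketch only manipulates names.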

Keld


