Re: New Locale Proposal

From: Antoine Leca
Date: Thu Sep 21 2000 - 09:31:26 EDT

Carl W. Brown wrote:
> When these standards came out it was assumed that if you translated into
> Spanish that was that. Now you might want to have a primary Spanish and
> possibly one or two sub languages. Language like Spanish are not too bad
> but Korean, Chinese and Japanese take tremendous resources.

Spanish, in the typical implementations of the 15897 system, takes as many
resources as Japanese or Korean: because the eñe collates differently,
existing implementations are unable to reuse (in binary form) the
"normal" tables, so the result is exactly as heavy as a completely different
sort, like Shift-JIS order for Japanese.
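To make the point concrete, here is a minimal sketch in Java (whose collation
services descend from the Taligent work discussed below); the class name is
illustrative, and the Spanish result assumes a JDK whose locale data carries
the usual Spanish tailoring, as CLDR does:

```java
import java.text.Collator;
import java.util.Locale;

public class EneCollation {
    public static void main(String[] args) {
        // Root ordering: "ñ" is just "n" plus a diacritic, so the
        // primary comparison sees n = n and then a < z.
        Collator root = Collator.getInstance(Locale.ROOT);
        System.out.println(root.compare("ña", "nz") < 0);    // true

        // Spanish tailoring: "ñ" is a separate primary letter after "n",
        // so "ña" sorts after every plain "n..." word, including "nz".
        Collator spanish = Collator.getInstance(new Locale("es"));
        System.out.println(spanish.compare("ña", "nz") > 0); // true
    }
}
```

One extra primary letter is enough to force a whole new binary table, which
is why the Spanish tables cannot simply be shared with the root ones.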

In fact, it takes twice as many resources, because there are usually two
Spanish locales (this is something of a joke, since there is also more than
one Japanese locale).

> They also did not take into account not only collating tables and the like
> but things like word breaking dictionaries. Each language now takes about
> 64K by reusing resources. Hopefully by using a better locale system and
> other techniques, this can be reduced.

What you are criticizing is a kind of implementation rather than the
locale system itself. I am not sure about POSIX, but the C locale system
can certainly be implemented while reusing the work done in the Taligent
research (as is done in the JDK or ICU, though the ICU system does not
attempt to conform to the C standard), including the dynamic aspects.

Also, it appears quite possible to me to convert between 15897 specifications
and the static forms of the Taligent/JDK/ICU model. I agree that dynamic
specifications are quite another thing.

> Another example. Currently we are debating Turkic languages. The 15897 is
> a code page based standard. Our locale being Unicode oriented is devoid of
> codepage requirements.

The argument is not very strong, since ISO/IEC 10646 is the preferred code
page to use with 15897; furthermore, the fact that it is code-page based
allows one to write, e.g., a Klingon locale, which may be quite difficult to
do if Unicode-based (one has to use the PUA, which impedes portability).

> Therefore changing Tatar to use a Latin script does not affect the locale.

Hmm, are you sure?
I certainly do not know your locales, but I would be *very* surprised if
a locale implementation written, say, one year ago had "I" to "dotless i" as
the tolower mapping for Tatar, just because this is the kind of guess that
is more than difficult to make (not everybody owns a crystal ball).
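The JDK itself illustrates this nicely: its lowercasing special-cases the
languages that were known to need the dotless "i" when the code was written
(Turkish, Azeri), and Tatar is not among them. A small sketch, with an
illustrative class name:

```java
import java.util.Locale;

public class DotlessI {
    public static void main(String[] args) {
        // Root lowercasing: 'I' -> 'i'.
        System.out.println("ISTANBUL".toLowerCase(Locale.ROOT));

        // Turkish lowercasing: 'I' -> dotless 'ı' (U+0131),
        // a special case hard-wired into the JDK.
        System.out.println("ISTANBUL".toLowerCase(new Locale("tr")));

        // Tatar ("tt") gets no such special case, so 'I' -> dotted 'i':
        // exactly the crystal-ball problem described above.
        System.out.println("ISTANBUL".toLowerCase(new Locale("tt")));
    }
}
```

So an implementation written before the script change simply does not know
about the new mapping, whatever locale model it uses.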

And BTW, it certainly does affect it, since some aspects of the locale
implementation involve strings to display in some cases, such as the names
of the days... unless there is a mechanism in the locale implementation
that "detects" the surrounding script in use and reacts accordingly; not
impossible, but probably costly and overkill for many uses.
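Day names are a good example because they are stored as fixed per-locale
strings, in one fixed script; a sketch in Java (class name illustrative):

```java
import java.time.DayOfWeek;
import java.time.format.TextStyle;
import java.util.Locale;

public class DayNames {
    public static void main(String[] args) {
        DayOfWeek monday = DayOfWeek.MONDAY;

        // Each locale carries the name as literal data in one script:
        System.out.println(monday.getDisplayName(
                TextStyle.FULL, new Locale("es"))); // lunes (Latin)
        System.out.println(monday.getDisplayName(
                TextStyle.FULL, new Locale("ru"))); // Cyrillic name
    }
}
```

If a language switches script, this data must be re-authored; nothing in the
lookup "detects" which script the surrounding text happens to use.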

> In fact it does not change any Unicode processing is as we are discussing
> you handle the case shifting properly.

I cannot parse your point.

> If Turkmen decides to follow suit and shift to a Latin script the same would
> apply. Turkmen currently uses Cyrillic and Arabic scripts.

Can you decide today whether the to-be-determined Latin Turkmen will use one
"i" or two? And if you can, what is the difference from Keld's 15897
proposition?

> The locale system is designed for so that users can implement locales with
> subtleties that go beyond the 433 ISO 639 languages as well as be platform
> independent.

I would say, before going beyond the "433" ISO 639 languages, it would be
better to fill them in... (BTW, my count is closer to 1000 than 433 for
ISO 639-2.)

OTOH, I grant you that the last time I looked, 15897 was limited to ISO
639-*1* language codes, i.e., more like 200 than 400; so there is no way to
cover a number of languages, and furthermore no private-use codes. However,
it is quite easy to go beyond this limitation, although you then lose the
interoperability benefit.


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:13 EDT