Re: New to Unicode

From: Philippe Verdy
Date: Mon Jul 24 2006 - 04:43:12 CDT


    From: "Peter Constable" <>
    > [at the risk of straying off topic... ]
    > And when you decide you need to distinguish between simplified and traditional Chinese text, don't use zh-CN and zh-TW to distinguish these; use zh-Hans and zh-Hant.

    Why say "don't"? There are many legacy applications with no special support for ISO 15924 codes in locale IDs; the region part of a locale ID most often indicates a place where the language is spoken, not how the language is written. This is the case in Java, where the Locale class was designed at a time when ISO 15924 did not exist.

    In Java, the standard (legacy) behavior is to detect Traditional Chinese and Simplified Chinese installations and map their properties to "zh-TW" and "zh-CN" respectively. There is no place in the Locale constructor to specify the script code, and the way locales are resolved and inherited means that putting a script code in the country/region parameter will not work correctly with other built-in resources such as number and date formats. Using the third parameter for variant codes is possible, but the variant is ignored if you don't also specify a country/region, and if you want the correct number/date settings you need to specify CN or TW anyway.
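A minimal sketch of the constructor limitation described above, using the standard java.util.Locale class (the specific locales are just examples):

```java
import java.util.Locale;

public class LegacyLocaleDemo {
    public static void main(String[] args) {
        // The legacy constructors take only (language), (language, country)
        // or (language, country, variant); no parameter accepts an
        // ISO 15924 script code.
        Locale traditional = new Locale("zh", "TW"); // Traditional Chinese
        Locale simplified  = new Locale("zh", "CN"); // Simplified Chinese
        System.out.println(traditional); // zh_TW
        System.out.println(simplified);  // zh_CN

        // A script code can be smuggled into the variant field, but the
        // variant only participates in resource lookup when a country
        // is also present.
        Locale withVariant = new Locale("zh", "TW", "Hant");
        System.out.println(withVariant); // zh_TW_Hant
    }
}
```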

    For now, the only solution that works with the locale inheritance and resolution system is to create a new language code by concatenating the language code (in lower case) and the script code (with a leading capital, the remaining letters in lower case).

    So something that works quite well is to use "zh" for Simplified Chinese and "zhHant" for Traditional Chinese in the language code. I use the same trick to map Serbian in the Latin script as "srLatn", keeping "sr" for legacy Serbian resources created with the Cyrillic script.
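The concatenation trick can be sketched as follows. This is a workaround, not standard usage, and one caveat applies: java.util.Locale normalizes the language field to lower case, so the leading capital of the script code survives only in whatever external naming convention you adopt, not inside the Locale object itself.

```java
import java.util.Locale;

public class ScriptInLanguageDemo {
    public static void main(String[] args) {
        // Fold the script code into the language field, since the legacy
        // Locale class has no script slot. "zhHant" and "srLatn" are the
        // conventions described in the text, not standard language subtags.
        Locale zhTraditional = new Locale("zhHant", "TW");
        Locale srLatin = new Locale("srLatn");

        // Caveat: the constructor lower-cases the language field, so
        // resource bundle lookups will use the all-lowercase form.
        System.out.println(zhTraditional.getLanguage()); // zhhant
        System.out.println(srLatin.getLanguage());       // srlatn
    }
}
```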

    I really think that Sun should find a way to extend the support of locales and the way they are resolved. There are various incompatible solutions in several Java programs and libraries, but they all fail to preserve full compatibility with legacy data and applications. This is a problem for applications deployed in application servers, but it also affects some standalone applications, and it has long been a problem when the Java application is an applet running in the context of a browser: unless users explicitly specify which language they use, the default language depends on the browser parameters and on how they are mapped to the default system Locale used in the JVM.

    Things would be easier if RFC 3066 were updated now with clear guidelines on how to specify and handle ISO 15924 script codes in locale IDs. Microsoft proposes something in .NET locales, but even this proposal changes across applications (and Microsoft has changed its usage notices and recommendations to programmers several times). The .NET class is also evolving in various directions, some of which are specific to Vista and won't work on Windows XP, even though Vista is still an unsupported beta, not deployed to end users.

    I think that even Microsoft does not know clearly how to build a clean framework design for handling multiple locales. Now that ISO 15924 has been around for some time, it becomes urgent to create a working team for handling locale IDs, to completely revamp RFC 3066 and write its successor.

    The whole design of the new RFC should also address the complex cases: locale inheritance, user preferences, how to designate and use custom locales, how to handle legacy locale IDs containing codes that have been withdrawn from ISO 3166 or ISO 639, how to handle ISO 639-3 (still a beta too, but more widely used now because it maps more languages than the earlier ISO 639-1 and -2 parts), and how to handle language families. But there is an immediate, strong need to handle the most common cases, which are not handled very cleanly or interoperably:
    * Chinese: same written language, many oral variants, one script, two major usages of the script (simplified or traditional). What to do for Singapore, where "zh-CN" and "zh-TW" are inappropriate, and "zh-SG" is most often not supported in the existing locale data and so inherits from "zh", which is most often the same as "zh-CN" (in most cases "zh-CN" contains no locale data; everything is in its parent "zh" locale)?
    * languages of the countries of the former Yugoslavia: how to map the many legacy resources built with "sh" (Serbo-Croatian), then with various codes ("sr", "bs", ...), and then handled in a more complex way when two separate orthographies, Latin and Cyrillic, appeared in the same area. How to handle regional variants (like Ziborian, which is quite far from official Serbian but is still not Bosnian)?
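The zh-SG fallback problem in the first bullet can be observed directly with the standard ResourceBundle.Control candidate list (the bundle base name "Messages" is hypothetical; on recent JVMs the chain may also include script-qualified candidates such as zh_Hans_SG):

```java
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

public class FallbackChainDemo {
    public static void main(String[] args) {
        // The candidate chain ResourceBundle tries for zh-SG: if no
        // zh_SG bundle exists, lookup silently falls back to the "zh"
        // parent, which in practice often holds Simplified-script data,
        // and finally to the root bundle.
        List<Locale> chain = ResourceBundle.Control
                .getControl(ResourceBundle.Control.FORMAT_DEFAULT)
                .getCandidateLocales("Messages", new Locale("zh", "SG"));
        System.out.println(chain);
    }
}
```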

    These two areas are those for which designing localized data is the most problematic, with users complaining that they don't see the expected language or script, or with missing localized data replaced by a "root" default (most often in English), even though it would work better if the default were taken from a nearer language.

    In contrast, there are far fewer issues with resources for India, even though it has dozens of languages and many scripts, and some of those languages can often be written in several scripts. I've never seen anyone in India complain that their language was written with the wrong script (and that they could not read the text); or maybe most Indians connected to the web can also read Hindi or English, which pose no problem for selecting the appropriate script.

    This archive was generated by hypermail 2.1.5 : Mon Jul 24 2006 - 04:50:36 CDT