Re: are Unicode codes somehow specified in official national linguistic literature ? (worldwide)

From: Philippe Verdy (
Date: Wed Jun 14 2006 - 16:45:09 CDT


    I already submitted some remarks there, but that was a long time ago. The CLDR has evolved since then (as has the ICU library), so my initial comments may look outdated in light of the new developments.

    But this bug report does not really discuss the fallback from one language to a language family. It discusses the fallback from a variant to a language, or the fallback for languages that have multiple codes or legacy codes (he/iw, in/id). This is seen in Java VMs, where the legacy codes (iw, in) are still the only ones that work, because this preserves compatibility with old applications that depended on them to find their resources with the standard class loader of Java 1.3/1.4 (and even 5.0).
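
    To illustrate the he/iw and in/id problem, here is a minimal sketch of an alias table that yields every code worth trying when looking up a resource. The class name LegacyLanguageCodes and the method lookupCodes are hypothetical, invented for this example; they are not part of Java or ICU, and the table holds only the two pairs mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: not a JDK or ICU class.
final class LegacyLanguageCodes {
    // Legacy code -> current code (only the pairs named in this message).
    private static final Map<String, String> LEGACY_TO_CURRENT =
            Map.of("iw", "he", "in", "id");

    // Returns the current code first, followed by any legacy alias,
    // so a resource stored under either name can still be found.
    static List<String> lookupCodes(String code) {
        String lower = code.toLowerCase(Locale.ROOT);
        String current = LEGACY_TO_CURRENT.getOrDefault(lower, lower);
        List<String> out = new ArrayList<>();
        out.add(current);
        for (Map.Entry<String, String> e : LEGACY_TO_CURRENT.entrySet()) {
            if (e.getValue().equals(current)) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lookupCodes("iw"));
        System.out.println(lookupCodes("id"));
        System.out.println(lookupCodes("fr"));
    }
}
```

    A resource loader could then try each returned code in order, which keeps old bundles named with iw or in reachable without forcing applications to use the legacy codes.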

    I still hope that the successor of RFC 3066 will come soon to describe correctly the new locale identifiers (and especially the new ISO 15924 field for the indication of scripts).

    But given that ISO 639-3 is still not finalized, it will be hard to find a definitive solution for designating locales and all their known aliases while still preserving the compatibility of legacy applications that depend on these identifiers.

    For now, ICU proposes a temporary solution for resolving the resource fallback path, but it certainly requires more thought to handle all possible cases (including the interaction of language identifiers with ISO 3166 country/region identifiers, or the new aliases now introduced by deprecating the ISO 3166 country/region identifiers in favor of more precise ISO 639-3 language identifiers).

    The locale fallback mechanism implemented in legacy applications is most often fixed, and different systems use different fallback algorithms to determine alternate locales. In Java, for example, this mechanism interacts not only with the user settings but also with the local system settings, when no user locale matches a given resource id. But there is still no way in Java to fall back any further once only the first field of the locale id (the language) is left, because its parent is the single root and not another locale.
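
    The truncation-style fallback described here can be sketched as follows. This is a simplification under stated assumptions: it ignores the step where Java also consults the default (system) locale, and the class SimpleFallbackChain is a hypothetical name for illustration only, not the JDK's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of truncation fallback; not JDK code.
final class SimpleFallbackChain {
    // Builds the list of resource ids to try, most specific first.
    // The empty string stands for the root bundle.
    static List<String> chain(String language, String country, String variant) {
        List<String> out = new ArrayList<>();
        boolean hasLang = !language.isEmpty();
        boolean hasCountry = !country.isEmpty();
        if (hasLang && hasCountry && !variant.isEmpty()) {
            out.add(language + "_" + country + "_" + variant);
        }
        if (hasLang && hasCountry) {
            out.add(language + "_" + country);
        }
        if (hasLang) {
            out.add(language);
        }
        out.add(""); // after the language, only the root is left
        return out;
    }

    public static void main(String[] args) {
        System.out.println(chain("fr", "CA", ""));
    }
}
```

    Note that the step after the bare language code is always the root: nothing in this scheme lets one language name another locale as its parent.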

    Even the Java Locale class still does not include a constructor that takes a script identifier. One could put the script in the variant identifier, but its place in third position, after the country identifier, is not the best one for correct locale resolution: the script should come in second position, between the language code and the region code. If one instead uses the field normally reserved for the country to hold the script code, it won't interact cleanly with legacy applications that use country codes.
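
    As a rough sketch of why the position of the script matters, the class below parses an identifier with an optional four-letter ISO 15924 script in second position and derives a fallback chain that drops the region before the script. ExtendedLocaleId is a hypothetical class written for this example; it is an assumption about one possible design, handles only two-letter regions, and is not how the Java Locale class or ICU behaves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical identifier of the form language[_Script][_REGION].
final class ExtendedLocaleId {
    final String language;
    final String script; // empty when absent
    final String region; // empty when absent

    ExtendedLocaleId(String language, String script, String region) {
        this.language = language;
        this.script = script;
        this.region = region;
    }

    // Accepts '-' or '_' as separators; a 4-letter second field is a script.
    static ExtendedLocaleId parse(String id) {
        String[] parts = id.split("[-_]");
        String language = parts[0].toLowerCase(Locale.ROOT);
        String script = "";
        String region = "";
        int i = 1;
        if (i < parts.length && parts[i].matches("[A-Za-z]{4}")) {
            String s = parts[i++];
            script = s.substring(0, 1).toUpperCase(Locale.ROOT)
                    + s.substring(1).toLowerCase(Locale.ROOT);
        }
        if (i < parts.length && parts[i].matches("[A-Za-z]{2}")) {
            region = parts[i].toUpperCase(Locale.ROOT);
        }
        return new ExtendedLocaleId(language, script, region);
    }

    // Most specific first: the region is dropped before the script.
    List<String> fallbackChain() {
        List<String> out = new ArrayList<>();
        String withScript = script.isEmpty() ? language : language + "_" + script;
        if (!region.isEmpty()) {
            out.add(withScript + "_" + region);
        }
        if (!script.isEmpty()) {
            out.add(withScript);
        }
        out.add(language);
        out.add(""); // root
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parse("sr-latn-cs").fallbackChain());
    }
}
```

    With the script in second position, simple truncation from the right keeps the script as long as possible. If the script were stored in the variant field instead, truncation would discard it first and keep the region, which is the wrong order for resolution.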

    So one must use one's own class loader with its own fallback mechanism, create a new class to extend the Locale object, and implement various tricks to make it work with the standard locale interface. This is more or less what ICU does to support extended locale identifiers and aliases.
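
    A custom fallback mechanism of this kind could look roughly like the sketch below, where a locale may declare another locale as its parent instead of going straight to the root. ParentAwareResolver is a hypothetical class for illustration and does not reproduce ICU's actual algorithm; the codes qaa, qab and XX used in main are private-use placeholders, not real language or region data.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical resolver with explicit, non-root parents; not ICU code.
final class ParentAwareResolver {
    private final Map<String, String> explicitParents;
    private final Set<String> available;

    ParentAwareResolver(Map<String, String> explicitParents, Set<String> available) {
        this.explicitParents = explicitParents;
        this.available = available;
    }

    // An explicit parent wins; otherwise truncate the last field.
    // The empty string is the root, which has no parent.
    String parentOf(String id) {
        if (id.isEmpty()) {
            return null;
        }
        String explicit = explicitParents.get(id);
        if (explicit != null) {
            return explicit;
        }
        int cut = id.lastIndexOf('_');
        return cut < 0 ? "" : id.substring(0, cut);
    }

    // Walks up the parent chain until an available resource id is found.
    // The 'seen' set guards against cycles in the explicit parent table.
    String resolve(String requested) {
        Set<String> seen = new HashSet<>();
        for (String id = requested; id != null && seen.add(id); id = parentOf(id)) {
            if (available.contains(id)) {
                return id;
            }
        }
        return ""; // fall back to the root
    }

    public static void main(String[] args) {
        ParentAwareResolver r = new ParentAwareResolver(
                Map.of("qaa", "qab"),   // placeholder: qaa falls back to qab
                Set.of("qab", ""));     // resources that actually exist
        System.out.println(r.resolve("qaa_XX"));
    }
}
```

    The design choice here is to keep the parent table as data supplied by the caller, so aliases and language-to-language fallbacks can be updated without touching the resolution code.
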
      ----- Original Message -----
      From: Mark Davis
      To: Philippe Verdy
      Cc: Erkki Kolehmainen ; Cristian Secară ;
      Sent: Wednesday, June 14, 2006 9:08 PM
      Subject: Re: are Unicode codes somehow specified in official national linguistic literature ? (worldwide)

      There is a planned mechanism: see

      (This was planned for 1.4, but delayed since we didn't have enough data to warrant adding the mechanism.)

    This archive was generated by hypermail 2.1.5 : Wed Jun 14 2006 - 16:49:12 CDT