Re: Common Locale Data Repository Project

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Apr 23 2004 - 20:15:52 EDT


    From: "Peter Constable" <petercon@microsoft.com>
    > > I think that the CLDR database is extremely important for software
    > > implementations, because it avoids some caveats that come from other
    > > unstable standards such as ISO 3166 and ISO 639.
    >
    > ISO 639 is not unstable. It is an open code set that is being added to
    > over time, but I don't think that should be referred to as unstable --
    > that term suggests other things.

    By unstable I mean in fact ambiguous, even for the correct designation of
    languages with a code that can be recognized. Even the proposal to supersede
    RFC 3066 with new tags has its caveats: which code must an application use
    when the standard already defines multiple ones (is this number bounded?) to
    refer to the same language?

    The problem arises in software when a user specifies a preferred language
    in his locale with a code that is not understood by an application that
    only understands another one. This becomes worse when one program requires
    one code in the user's locale to support a language and another requires
    a different code in the user's locale to support the same language.

    Look, for example, at the case of Norwegian: is it no, nn or nb or no-nynorsk
    or no-bokmal?
    Even with the algorithm based on common prefixes, you won't be able to match
    them all. So there's a need to specify an algorithm that allows aliases to be
    resolved. With multi-subtag language identifiers the resolution order becomes
    unpredictable if one supports aliases for one subtag and not the other.
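
As a minimal sketch of this alias problem (in Python; the alias table, the choice of canonical forms and the function names are all hypothetical and are not taken from ICU, CLDR or any standard), the following shows matching on common prefixes failing for the Norwegian tags above, and an alias step applied before matching:

```python
# Hypothetical alias table: which tags count as equivalent, and which form is
# treated as canonical, is an assumption made purely for illustration.
ALIASES = {
    "no-bokmal": "nb",
    "no-nynorsk": "nn",
}


def prefix_match(requested: str, supported: str) -> bool:
    """True if `supported` equals `requested` or is a prefix of it,
    compared subtag by subtag (case-insensitive)."""
    req = requested.lower().split("-")
    sup = supported.lower().split("-")
    return req[:len(sup)] == sup


def resolve(tag: str) -> str:
    """Replace a whole tag by its alias target, if one is known."""
    lowered = tag.lower()
    return ALIASES.get(lowered, lowered)


def alias_aware_match(requested: str, supported: str) -> bool:
    """Resolve aliases on both sides first, then fall back to prefix matching."""
    return prefix_match(resolve(requested), resolve(supported))
```

With these definitions, `prefix_match("no-nynorsk", "nn")` is false although both tags name the same language, while `alias_aware_match("no-nynorsk", "nn")` is true. The sketch resolves whole tags only; resolving aliases subtag by subtag is where the ordering question raised above would come in.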

    What is already unstable in ISO 639 is the deprecation of "iw" and the addition
    of "he", same thing for "in" and "id" or for "ji" and "yi". Don't you call that
    instability? OK, these codes are deprecated, not reassigned. But they still
    cause problems.
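
One way an application might cope with these deprecated codes is to normalize them before any comparison. The sketch below (Python; the table contains only the three pairs named above, and `canonical_language` is a hypothetical helper, not an API of any library) rewrites the primary language subtag and leaves the remaining subtags untouched:

```python
# Deprecated ISO 639 two-letter codes mentioned above -> their replacements.
DEPRECATED_ISO639 = {"iw": "he", "in": "id", "ji": "yi"}


def canonical_language(tag: str) -> str:
    """Replace a deprecated primary language subtag; keep the other subtags."""
    primary, sep, rest = tag.partition("-")
    primary = primary.lower()
    primary = DEPRECATED_ISO639.get(primary, primary)
    return primary + sep + rest
```

For example, `canonical_language("iw-IL")` gives `"he-IL"`, so a user locale written with the old code and an application that only knows the new one can still agree.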

    Think more recently about the new codification for Serbo-Croatian, and the split
    of "sh", with no definition except that it is country-based (Serbian, Croatian,
    Bosnian, Montenegrin), assuming that one country uses only one language when
    in fact there are several in the same one, which are shared by multiple
    countries and differ mostly by their script...

    Also, if ISO 3166 is unstable (CS: is that the former Czechoslovakia or the
    newer Serbia-Montenegro?), then it introduces instability too within RFC 3066
    or its proposed replacement... for the identification of languages.

    For now, the only workable solution to these issues is found in
    supplementary libraries in ICU which support locale aliases. (Yes, I use the
    term Locale because this is the term that Java gives to this identification,
    based on a language code consisting of a single subtag, a country/territory
    code and a variant code with possibly multiple subtags, and no reference to the
    needed script code; I wonder how the newer RFC 3066 model will fit here.)
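
The three-part locale model described above can be sketched as follows (Python, not Java's actual `Locale` class; `SimpleLocale` and `parse_locale` are hypothetical names, and the underscore-separated input form is assumed for illustration):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleLocale:
    language: str       # a single subtag
    country: str = ""   # country/territory code
    variant: str = ""   # may itself contain several "_"-separated subtags
    # Note: there is no field for a script code in this model.


def parse_locale(identifier: str) -> SimpleLocale:
    """Split 'language_COUNTRY_VARIANT'; everything after the second
    underscore is kept together as the variant."""
    parts = identifier.split("_", 2)
    parts += [""] * (3 - len(parts))
    language, country, variant = parts
    return SimpleLocale(language.lower(), country.upper(), variant)
```

Under this model, Nynorsk has to be expressed through the variant, as in `parse_locale("no_NO_NY")`, and a distinction by script would have to be squeezed into the variant as well, since no dedicated slot exists for it.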



    This archive was generated by hypermail 2.1.5 : Fri Apr 23 2004 - 20:42:44 EDT