Re: Common Locale Data Repository Project

From: John Cowan (cowan@ccil.org)
Date: Fri Apr 23 2004 - 21:28:21 EDT


    Philippe Verdy scripsit:

    > By unstable I mean in fact ambiguous, even for the correct designation
    > of languages with a code that can be recognized. Even the proposal to
    > supersede RFC 3066 with new tags has its caveats: which code must an
    > application use when it already defines multiple ones (is this number
    > bounded?) to refer to the same language?

    RFC 3066 always requires that the 2-letter code be used in place of either
    3-letter code if it exists. In all other cases, there is only one 3-letter
    code, and it is used.
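
    A minimal sketch of that rule, assuming a small hand-written table (the
    table contents shown are only an illustrative subset of ISO 639, and the
    name preferred_language_code is hypothetical, not part of RFC 3066):

```python
# Illustrative subset only: a real table would be built from the ISO 639 registry.
# Maps each 3-letter code (bibliographic or terminologic) to its 2-letter code.
THREE_TO_TWO = {
    "ger": "de", "deu": "de",   # German has two 3-letter codes
    "fre": "fr", "fra": "fr",   # so does French
    "heb": "he",
    "nor": "no", "nob": "nb", "nno": "nn",
}


def preferred_language_code(code: str) -> str:
    """Return the 2-letter code when one exists, else the code unchanged."""
    code = code.lower()
    return THREE_TO_TWO.get(code, code)
```

    A language with no 2-letter code, such as haw (Hawaiian), simply passes
    through as its single 3-letter code.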

    Some codes are vague, in the sense that they do not fully specify which
    language is in use. For that reason, ISO 639-3 is being defined as an
    upward compatible extension of ISO 639-2.

    > Look, for example, at the case of Norwegian: is it no, nn, nb,
    > no-nynorsk or no-bokmal?

    There are two issues here. The first is that no-nynorsk and no-bokmal
    are now deprecated codes: that is, no application should require them,
    every application that accepts nn or nb should accept them, and no
    application should produce them. Older versions of applications will be
    less forgiving and should be upgraded.

    The second is that no is unique, or nearly so: it designates nn and nb
    jointly. Now everyone who can read one can read the other, so Norwegian
    applications should accept any of no, nb, nn in data. But no is meaningless
    to a spell-checker, which should require either nb or nn.
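
    Both points can be sketched roughly as follows. The function names are
    hypothetical, and this assumes tags are compared case-insensitively; it
    is an illustration of the policy, not any particular library's behavior:

```python
# Deprecated tags are accepted on input and rewritten; they are never produced.
DEPRECATED_NORWEGIAN = {"no-nynorsk": "nn", "no-bokmal": "nb"}
NORWEGIAN = {"no", "nb", "nn"}


def normalize_norwegian(tag: str) -> str:
    """Lower-case the tag and replace a deprecated Norwegian tag, if any."""
    tag = tag.lower()
    return DEPRECATED_NORWEGIAN.get(tag, tag)


def accepts_as_norwegian(tag: str) -> bool:
    """A general Norwegian application accepts any of no, nb, nn in data."""
    return normalize_norwegian(tag) in NORWEGIAN


def spellcheck_dictionary(tag: str) -> str:
    """A spell-checker needs a specific written standard: nb or nn only."""
    tag = normalize_norwegian(tag)
    if tag not in ("nb", "nn"):
        raise ValueError(f"{tag!r} does not identify a written standard")
    return tag
```

    Here accepts_as_norwegian("no") is true, while spellcheck_dictionary("no")
    raises an error, since no alone does not say which dictionary to load.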

    > What is already unstable in ISO 639 is the deprecation of "iw" and
    > the addition of "he", same thing for "in" and "id" or for "ji" and
    > "yi". Don't you call that instability? OK, these codes are deprecated,
    > not reassigned. But they still cause problems.

    Not really. Again, all applications should generate he and accept
    both iw and he.
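
    In code, "accept both, generate one" amounts to canonicalizing on input.
    A rough sketch, assuming only the primary subtag needs rewriting and
    using the three pairs mentioned above (canonical_tag is a hypothetical
    name):

```python
# Deprecated ISO 639 code -> current code, for the pairs discussed above.
DEPRECATED_CODES = {"iw": "he", "in": "id", "ji": "yi"}


def canonical_tag(tag: str) -> str:
    """Rewrite a deprecated primary language subtag; leave the rest as given."""
    primary, sep, rest = tag.partition("-")
    primary = primary.lower()
    return DEPRECATED_CODES.get(primary, primary) + sep + rest
```

    An application that runs every incoming tag through such a function
    accepts iw and he alike, and only ever stores or emits he.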

    > Also if ISO 3166 is unstable (CS: is that the former Czechoslovakia
    > or the newer Serbia-Montenegro?), then it introduces instability too
    > within RFC 3066 or its proposed replacement... for the identification
    > of languages.

    RFC 3066bis specifies that CS will always mean Czechoslovakia, and the
    highly stable 3-digit code will be used for Serbia-Montenegro.

    > For now, the only workable solution to these issues is found in
    > supplementary libraries in ICU which support locale aliases. (Yes, I
    > use the term Locale because this is the term that Java gives to this
    > identification, based on a language code consisting of a single
    > subtag, a country/territory code and a variant code with possibly
    > multiple subtags, and no reference to the needed script code; I wonder
    > how the newer RFC 3066 model will fit here).

    Language specifiers are conceptually different from locale specifiers.
    One might specify a locale of da_us to mean Danish language, U.S.
    measurement systems, but the language da-us would be the U.S. dialect
    of Danish, a very different thing.
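
    One way to keep the two from being confused in a program is to give
    them distinct types. A small sketch, assuming the simple two-part forms
    used above (the class and function names are hypothetical):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageTag:
    language: str
    region: str  # region whose variety of the language is meant; "" if none


@dataclass(frozen=True)
class Locale:
    language: str
    territory: str  # territory whose conventions (measurement etc.) apply


def parse_language_tag(text: str) -> LanguageTag:
    """Parse a hyphen-separated language tag such as da-us."""
    language, _, region = text.partition("-")
    return LanguageTag(language.lower(), region.upper())


def parse_locale(text: str) -> Locale:
    """Parse an underscore-separated locale specifier such as da_us."""
    language, _, territory = text.partition("_")
    return Locale(language.lower(), territory.upper())
```

    The two parsed values carry the same strings but never compare equal,
    which reflects that they answer different questions.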

    -- 
    John Cowan  www.ccil.org/~cowan  www.reutershealth.com  jcowan@reutershealth.com
    In might the Feanorians / that swore the unforgotten oath
    brought war into Arvernien / with burning and with broken troth.
    and Elwing from her fastness dim / then cast her in the waters wide,
    but like a mew was swiftly borne, / uplifted o'er the roaring tide.
            --the Earendillinwe
    

