Re: Serbian-Latin "sh" alias and ISO-639-1 within CLDR

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Mar 14 2005 - 13:40:44 CST

    From: "Jukka K. Korpela" <jkorpela@cs.tut.fi>
    > On Mon, 14 Mar 2005, Philippe Verdy wrote:
    >
    >> I have just seen in the CLDR repository a reference to the 2-letter code
    >> "sh" used as an alias for the Serbian language with the Latin variant.
    >
    > The code "sh" was assigned to Serbo-Croatian. It was deprecated
    > 2000-02-18 in favor of the codes "sr" for Serbian, "hr" for Croatian.
    > I suppose the political issues behind this are widely known.
    > As far as I can see, "sh" was a code for Serbo-Croatian irrespective of
    > the writing system (script).
    >
    >> According to ISO-639-1, "sh" does not seem assigned, but it may be still
    >> an interesting code for software localization purpose, because using "hr"
    >> (Croatian) for handling the Serbian vocabulary which shares the same Latin
    >> script does not seem appropriate, and using "sr" is already needed for
    >> localizing software to traditional Serbian Cyrillic.
    >
    > For new data, "hr" and "sr" are to be used, and they indicate language
    > forms, not necessarily implying a writing system. When Serbian is written
    > in Latin letters, then the script can be specified separately, instead of
    > encoding it into the primary language code.

    Unfortunately, script selection is not available in many localization APIs,
    at least not in Java, whose locales have only these fields:
    - language code, according to ISO-639-1 or -2
    - country/region code, according to ISO-3166 (but with lots of caveats
    because of its instability, and because once it is used to differentiate
    languages or scripts it loses its ability to designate the effective
    country/region; see for example zh_TW, used in fact to designate Traditional
    Chinese whether it is used in mainland Southern China, Hong Kong, or
    Taiwan...)
    - variant code, which obeys no standard and is just used to tweak resources
    in non-interoperable ways
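
    As a minimal sketch of these three fields, assuming nothing beyond
    java.util.Locale itself: the class name LocaleFields and the "Latn" value
    placed in the variant slot are purely illustrative, an application-private
    convention of exactly the non-interoperable kind described above.

```java
import java.util.Locale;

public class LocaleFields {
    // Builds a locale from the only three fields discussed above:
    // language, country/region, and free-form variant.
    static Locale of(String language, String country, String variant) {
        return new Locale(language, country, variant);
    }

    public static void main(String[] args) {
        // The region code doubles as a script hint here: "TW" is being asked
        // to mean "Traditional Chinese", not only "Taiwan".
        Locale traditionalChinese = of("zh", "TW", "");
        System.out.println(traditionalChinese.getLanguage());
        System.out.println(traditionalChinese.getCountry());

        // Hypothetical convention: smuggling the script into the variant.
        // No standard tells another application what "Latn" means here.
        Locale serbianLatin = of("sr", "", "Latn");
        System.out.println(serbianLatin.getLanguage());
        System.out.println(serbianLatin.getVariant());
    }
}
```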

    I expect that the future ISO locale code standard will standardize not only
    the new form of locale codes, but also a *working* API or algorithm to
    correctly match locales in all their aspects: linguistic, orthographic
    (script), legal (countries)... Parsing locale codes should not require manual
    tweaks in every application; notably, one should be able to set a user locale
    that works independently of the target application that uses it. I am
    really not satisfied with the two simplistic algorithms present for now in
    Java...
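
    To illustrate why simple matching falls short, here is a sketch of a
    truncation-style fallback: drop the variant, then the country, and stop at
    the bare language. This is a simplified illustration, not Java's actual
    lookup code; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class TruncationFallback {
    // Returns the candidate locale names tried in order, most specific first.
    // Fields that are absent are passed as empty strings.
    static List<String> candidates(String language, String country, String variant) {
        List<String> out = new ArrayList<>();
        if (!variant.isEmpty()) {
            out.add(language + "_" + country + "_" + variant);
        }
        if (!country.isEmpty()) {
            out.add(language + "_" + country);
        }
        out.add(language);
        return out;
    }

    public static void main(String[] args) {
        // A Hong Kong user never reaches zh_TW resources, although both
        // would normally be written in Traditional Chinese.
        System.out.println(candidates("zh", "HK", ""));
    }
}
```

    With no script field to match on, the chain for zh_HK contains only zh_HK
    and zh, so resources filed under zh_TW are never considered.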

    If "sh" is indeed deprecated, this alias in CLDR may simplify the
    distinction between Serbian Cyrillic (sr) and Serbian Latin (sh), leaving
    Bosnian Latin with its code (bs), as well as Croatian (hr), and without
    needing to manage script codes...
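
    A sketch of how an application might consume such an alias, assuming a
    plain lookup table. The class name, the table, and the "sr-Latn" target
    spelling are illustrative assumptions, not the actual CLDR data format.

```java
import java.util.HashMap;
import java.util.Map;

public class LanguageAliases {
    // Hypothetical alias table: legacy or convenience codes on the left,
    // the form the application resolves them to on the right.
    static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("sh", "sr-Latn"); // assumed spelling for Serbian in Latin script
    }

    // Returns the alias target if one exists, otherwise the code unchanged.
    static String resolve(String code) {
        return ALIASES.getOrDefault(code, code);
    }

    public static void main(String[] args) {
        System.out.println(resolve("sh"));
        System.out.println(resolve("hr"));
    }
}
```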

    I am much less concerned about the legacy use of "sh", which was ambiguous
    (was Serbo-Croatian labelled with "sh" really Latin in fact?) and does not
    seem to conflict with a more precise use of this code for modern applications
    that need a distinction between the two scripts used for Serbian... As a
    transitory measure, the alias has its utility because it helps
    disambiguate things...


