Re: Serbian-Latin "sh" alias and ISO-639-1 within CLDR

From: Mark Davis
Date: Wed Mar 16 2005 - 12:27:13 CST

    > >Let me try this one more time. "sh" was fairly widely used to stand for
    > >Serbian written in Latin.
    > What?
    > Where? When? By whom?

    Let me be a bit more explicit. 'sh' does mean Serbo-Croatian, and nothing we
    are doing denies that. The issue revolves around the different usages of
    language tags.

    1. Matching. Use of 'sh' says that you want to match Serbo-Croatian of
    whatever form: doesn't depend on the country, doesn't depend on the script.
    Use of, say, sh-Latn-IT means that you want Serbo-Croatian, but limited to
    Latin script, and as used in Italy. If I want to have a document query, then
    I could use one of the above to restrict the documents that match my query.
    I could get back 1000 documents or none, according to what database of
    documents I am searching and how restrictive my query is.
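    As a rough illustration of the matching case, here is a minimal Python
    sketch. The helper names (matches, filter_documents) and the sample
    documents are hypothetical, and the rule it uses (a document matches when
    its tag starts with every subtag of the query) is an assumption chosen for
    illustration, not a description of any particular implementation.

```python
def matches(query: str, tag: str) -> bool:
    """Return True if `tag` starts with every subtag of `query`.

    Comparison is case-insensitive. This prefix rule is an assumption
    made for illustration only.
    """
    q = query.lower().split("-")
    t = tag.lower().split("-")
    return t[:len(q)] == q


def filter_documents(query, documents):
    """Keep the names of documents whose language tag matches the query."""
    return [name for name, tag in documents if matches(query, tag)]


# Hypothetical document database: (name, language tag) pairs.
documents = [
    ("a.txt", "sh-Latn-IT"),
    ("b.txt", "sh-Cyrl"),
    ("c.txt", "sr-Latn"),
    ("d.txt", "sh"),
]

broad = filter_documents("sh", documents)           # any Serbo-Croatian
narrow = filter_documents("sh-Latn-IT", documents)  # Latin script, Italy only
```

    With this sample data the broad query returns three documents and the
    narrow query returns one; against a database with no sh-Latn-IT documents
    the narrow query would return an empty list, which is acceptable for
    matching.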

    2. Lookup. When you look up language data, for example for display on a web
    page, it is a different process. You typically don't have the choice of
    displaying nothing if there is not an exact match. Instead, you fall back.
    If you don't have data exactly matching sh-Latn-IT, you fall back to data for
    sh-Latn. If you don't have that either, you fall back to data for sh. Now,
    whatever data contents someone has associated with 'sh', it has to be a
    single consistent type of data, so it will be in one of Latn or Cyrl. What
    CLDR does, for the contents of the data associated with 'sh', is use
    Serbian data in Latin script.

    This is no different from, for example, the use of, say, American English in
    data associated with en for lookup purposes. Any distinctions according to
    country would be stored separately in an en-AU, en-CA, en-IE, etc., and if
    available, would be found in a lookup. But if someone came in with en-JP,
    and there was no separate data source for that, it would fall back to the
    data associated with 'en', which would be American English. That doesn't
    imply that CLDR is treating 'en' as equivalent to 'en-US' in terms of the
    semantics of the tags -- it is not.
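
    The lookup process described above can be sketched in a few lines of
    Python. This is only an illustration under stated assumptions: the
    function name lookup, the dictionary layout, and the data strings are all
    hypothetical, and real locale data has more fallback rules than simple
    truncation of the tag.

```python
def lookup(tag, data):
    """Find data for `tag`, dropping trailing subtags until something is found.

    Returns a (matched_tag, value) pair, or None if not even the bare
    language subtag has data. Truncation-only fallback is a simplifying
    assumption for this sketch.
    """
    subtags = tag.split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in data:
            return candidate, data[candidate]
        subtags.pop()  # e.g. sh-Latn-IT -> sh-Latn -> sh
    return None


# Hypothetical data store keyed by language tag.
data = {
    "sh": "Serbian data in Latin script",
    "en": "American English data",
    "en-AU": "Australian English data",
}

sh_result = lookup("sh-Latn-IT", data)  # falls back twice, to 'sh'
jp_result = lookup("en-JP", data)       # no en-JP data, falls back to 'en'
au_result = lookup("en-AU", data)       # exact match, no fallback
```

    Note that the fallback says something about which data is served, not
    about what the tags mean: en-JP ending up with the 'en' data does not
    make 'en' and 'en-US' equivalent tags.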


    ----- Original Message -----
    From: "Mark Davis" <>
    To: "Unicode Discussion" <>; "Michael Everson"
    Sent: Monday, March 14, 2005 22:37
    Subject: Re: Serbian-Latin "sh" alias and ISO-639-1 within CLDR

    > Well, reality appears to be rather fluid. Mysteriously the single language
    > Serbo-Croatian suddenly split into two languages about ten years ago. We
    > may someday look back on the day when the Californian language split off
    > from English after the War of Pacific Secession.
    > Mark
    > ----- Original Message -----
    > From: "Michael Everson" <>
    > To: "Unicode Discussion" <>
    > Sent: Monday, March 14, 2005 18:08
    > Subject: Re: Serbian-Latin "sh" alias and ISO-639-1 within CLDR
    > > At 18:00 -0800 2005-03-14, Mark Davis wrote:
    > > >Let me try this one more time. "sh" was fairly widely used to stand for
    > > >Serbian written in Latin.
    > >
    > > What?
    > > Where? When? By whom?
    > >
    > > "sh" was used to tag tens or hundreds of thousands of books worldwide
    > > in "Serbo-Croatian", which means Serbian or Croatian, in Latin or
    > > Cyrillic, for DECADES. There are far more examples of hr-Latn
    > > and sr-Cyrl that were tagged as sh than there are of either hr-Cyrl
    > > or sr-Latn.
    > >
    > > >We do not defend that usage, but for backwards compatibility we've
    > > >maintained it in CLDR. Our recommendation, as I have stated, is to
    > > >use sr-Latn instead of "sh" for that usage.
    > >
    > > That particular recommendation seems to have little to do with reality.
    > > --
    > > Michael Everson * * Everson Typography * *
    > >
    > >

    This archive was generated by hypermail 2.1.5 : Wed Mar 16 2005 - 12:28:00 CST