transforms and language identifiers (was Re: Dozenal chars in music)

From: Mark Davis
Date: Sun May 24 2009 - 14:51:31 CDT

    Changed the name to better reflect the subject.

    On Sun, May 24, 2009 at 11:14, Julian Bradfield wrote:

    > >I don't believe the Ethnologue does so. If it did, it would disagree with
    > >ISO and IETF BCP 47, in which en means any English; en-US, en-UK, ...
    > It seems rather silly to say you don't believe something which you can
    > trivially check.
    > Yes, it does encompass other dialects such as US Englishes, but
    > they're listed under "also spoken in", not in the head definition.

    Well, what can I say? Perhaps I am 'silly', but first, you assume that 'en'
    is defined as in the Ethnologue; and in this case, we don't; we follow IETF
    BCP 47 (more precisely, Unicode language IDs, which have some small
    variation from BCP 47). Secondly, you are looking at an out-of-date version
    of the Ethnologue. While the current version lists the UK first, it also
    says that the population of *all (native) speakers* is 309,352,280, and
    that 210,000,000 speak English in the US. Those 210M speakers probably
    don't speak British English, I'm a guessin'.

    It is fine to root for the home team (or English variant), but UK English is
    not currently the most common form of English. And who knows, at some time
    in the future en-IN may be the most common form of English.

    > However, that doesn't really matter - all that matters is that en is
    > not identical to en-US, and your transformation varies between types
    > of en.
    > >Often one has to make a choice; for example, if I ask for an 'en' web page,
    > >I need to get either en-US or en-UK, or en-CA, or en-AU, etc. If you know
    > >other information about the user, you may be able to pick the best one. In
    > >the absence of such information, the typical choice is to go with the
    > >variant with the most users: en-US for English, fr-FR for French, etc.
    > Different situation. If you claim to transform from en to X, your
    > transformation should be correct for anything that is en. If you can't
    > do that, because the transformation varies between subtypes of en, you
    > must include the subtype in the specified domain of your transformation.

    > Think of it in terms of subtyping in programming languages: if you ask
    > for an "en" Web page, then returning you an "en-UK" or "en-US" page is
    > fine, because "en-UK" is a subtype of "en". But if you have a function
    > that actually behaves correctly only on "en-US" arguments, it's unsound to
    > declare it to have argument type "en".

    First, that is not how locale models work; programming language subclassing
    is not particularly analogous to the situation. What locale models do is
    give the user "the best shot". If I ask for en-UK-x-Yorkshire, it gives me
    the most common variant of Yorkshire if available, otherwise the most common
    variant of en-UK, otherwise the most common variant of en. That way you get
    something as close to the request as possible, and your user doesn't just
    get a 404 if you don't have an exact match. A second part of that locale
    model is that you can also query (in APIs) what was returned, and decide on
    that basis if you want to do something special (like tell the user that they
    aren't getting an exact match, or throw them a 404). Google "locale
    inheritance model" for more info.
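
As a rough illustration of the "best shot" lookup described above, here is a minimal Python sketch. It is not the algorithm of any particular library: the names `AVAILABLE`, `MOST_COMMON`, `fallback_chain` and `resolve` are hypothetical, the catalogue contents are made up, and the tags are written as they appear in this thread (the registered region subtag for the United Kingdom is actually GB).

```python
# Hypothetical catalogue of the resources we happen to have, keyed by tag.
AVAILABLE = {
    "en-US": "US English page",
    "en-UK": "UK English page",
    "fr-FR": "French (France) page",
}

# Assumed "most common variant" to use when only a bare language is known.
MOST_COMMON = {"en": "en-US", "fr": "fr-FR"}


def fallback_chain(tag):
    """Yield the tag itself, then each parent made by dropping trailing subtags."""
    parts = tag.split("-")
    while parts:
        yield "-".join(parts)
        parts.pop()
        # A private-use singleton "x" means nothing without its payload.
        if parts and parts[-1] == "x":
            parts.pop()


def resolve(requested):
    """Return (matched_tag, resource) for the closest available match, or (None, None)."""
    for candidate in fallback_chain(requested):
        if candidate in AVAILABLE:
            return candidate, AVAILABLE[candidate]
        default = MOST_COMMON.get(candidate)
        if default in AVAILABLE:
            return default, AVAILABLE[default]
    return None, None


matched, page = resolve("en-UK-x-Yorkshire")
exact = matched == "en-UK-x-Yorkshire"  # caller can inspect what was returned
```

Because `resolve` returns the tag it actually matched, the caller can compare it with the request and decide whether to warn the user or give up, which is the "query what was returned" half of the model.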

    Secondly, of course, all language tags are approximations. The code en-UK is
    not uniform in denotation. If you mean RP, according to the BL, it is spoken
    by as few as 2% of the UK population.
    You say "If you claim to transform from en to X, your transformation should
    be correct for anything that is en." If we followed your argument, the
    transformation for en-UK should be correct for anything that is "en-UK". By
    that account, one couldn't use "en-UK" either in mapping to IPA, since it is
    not completely determinate; it means any of the variants of English as
    spoken in the UK. If we followed your logic to the bitter end, we'd have to
    specify down to the very narrow dialect, maybe even idiolect. That's simply
    not a practical model.

    Thirdly, you use the phrasing: "you **must** include the subtype". That
    presumes some kind of consequence. Examples:

    You must include the subtype, or...

       - you are not conformant with the Ethnologue (false, btw).
       - you are not conformant with IETF BCP 47 (false, btw).
       - ...
       - we will send Guido around to have a bit of a chat.
       - ...

    It is very unclear exactly which of these you are talking about. (I'm hoping
    not the Guido one).
