transforms and language identifiers (was Re: Dozenal chars in music)

From: Mark Davis (mark.edward.davis@gmail.com)
Date: Sun May 24 2009 - 14:51:31 CDT

Next message: Doug Ewell: "Re: transforms and language identifiers (was Re: Dozenal chars in music)"

Previous message: Roozbeh Pournader: "Re: Pb with Unicode Tifinagh with Internet Explorer"
Next in thread: Julian Bradfield: "transforms and language identifiers (was Re: Dozenal chars in music)"
Reply: Julian Bradfield: "transforms and language identifiers (was Re: Dozenal chars in music)"

Changed the name to better reflect the subject.

On Sun, May 24, 2009 at 11:14, Julian Bradfield
<jcb+unicode@inf.ed.ac.uk> wrote:

> >I don't believe the Ethnologue does so. If it did, it would disagree with
> >ISO and IETF BCP 47, in which en means any English; en-US, en-UK, ...
>
> It seems rather silly to say you don't believe something which you can
> trivially check.
> http://www.ethnologue.com/14/show_iso639.asp?code=en
> Yes, it does encompass other dialects such as US Englishes, but
> they're listed under "also spoken in", not in the head definition.

Well, what can I say? Perhaps I am 'silly', but first, you assume that 'en'
is defined as in the Ethnologue. In this case it isn't: we follow IETF
BCP 47 (more precisely, Unicode language IDs, which vary slightly from
BCP 47). Secondly, you are looking at an out-of-date version
of the Ethnologue; the current version is
http://www.ethnologue.com/show_language.asp?code=eng. While it lists the UK
first, it also says that the population of *all (native) speakers* is
309,352,280, and that 210,000,000 speak English in the US. Those 210M
speakers probably don't speak British English, I'm a guessin'.

It is fine to root for the home team (or English variant), but UK English is
not currently the most common form of English. And who knows, at some time
in the future en-IN may be the most common form of English.

> However, that doesn't really matter - all that matters is that en is
> not identical to en-US, and your transformation varies between types
> of en.
>
> >Often one has to make a choice; for example, if I ask for an 'en' web page,
> >I need to get either en-US or en-UK, or en-CA, or en-AU, etc. If you know
> >other information about the user, you may be able to pick the best one. In
> >the absence of such information, the typical choice is to go with the
> >variant with the most users: en-US for English, fr-FR for French, etc.
>
> Different situation. If you claim to transform from en to X, your
> transformation should be correct for anything that is en. If you can't
> do that, because the transformation varies between subtypes of en, you
> must include the subtype in the specified domain of your transformation.

>
> Think of it in terms of subtyping in programming languages: if you ask
> for an "en" Web page, then returning you an "en-UK" or "en-US" page is
> fine, because "en-UK" is a subtype of "en". But if you have a function
> that actually behaves correctly only on "en-US" arguments, it's unsound to
> declare it to have argument type "en".

First, that is not how locale models work; programming language subclassing
is not particularly analogous to the situation. What locale models do is
give the user "the best shot". If I ask for en-UK-x-Yorkshire, it gives me
the most common variant of Yorkshire if available, otherwise the most common
variant of en-UK, otherwise the most common variant of en. That way you get
something as close to the request as possible; and your user doesn't just
get a 404 <http://en.wikipedia.org/wiki/HTTP_404> if you don't have an exact
match. A second part of that locale model is that you can also query (in
APIs) what was returned, and decide on that basis if you want to do
something special (like tell the user that they aren't getting an exact
match, or throw them a 404 <http://en.wikipedia.org/wiki/HTTP_404>). Google
"locale inheritance model" for more info.
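
To make the "best shot" lookup concrete, here is a minimal sketch in Python.
It is an illustration only, not how any particular library implements it. The
names fallback_chain and resolve are hypothetical, and it assumes that the
resource stored under a bare tag such as "en" already holds the most common
variant's content. It simply truncates the tag from the right and reports
whether the match was exact, so the caller can decide what to do about it.

```python
def fallback_chain(tag):
    """Yield the requested tag, then each progressively shorter parent tag."""
    parts = tag.split("-")
    while parts:
        yield "-".join(parts)
        parts.pop()
        # A dangling private-use singleton ("x") is not a usable tag by itself,
        # so drop it together with the subtag that followed it.
        if parts and parts[-1].lower() == "x":
            parts.pop()


def resolve(requested, available):
    """Return (matched_tag, is_exact) for the closest available tag.

    `available` is any container of tags that have resources (assumed here).
    Returns (None, False) when nothing in the chain is available.
    """
    for candidate in fallback_chain(requested):
        if candidate in available:
            return candidate, candidate == requested
    return None, False


if __name__ == "__main__":
    have = {"en", "en-UK", "fr"}
    print(resolve("en-UK-x-Yorkshire", have))
    print(resolve("en-AU", have))
    print(resolve("de", have))
```

The second element of the result is the "query what was returned" part: a
caller that insists on an exact match can check it and issue its own 404.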

Secondly, of course, all language tags are approximations. The code en-UK is
not uniform in denotation. If you mean RP, according to the BL, it is spoken
by as few as 2% of the UK population (
http://www.bl.uk/learning/langlit/sounds/case-studies/received-pronunciation/).
You say "If you claim to transform from en to X, your transformation should
be correct for anything that is en." If we followed your argument, the
transformation for en-UK should be correct for anything that is "en-UK". By
that account, one couldn't use "en-UK" either in mapping to IPA, since it is
not completely determinate; it means any of the variants of English as
spoken in the UK. If we followed your logic to the bitter end, we'd have to
specify down to the very narrow dialect, maybe even idiolect. That's simply
not a practical model.

Thirdly, you use the phrasing: "you **must** include the subtype". That
presumes some kind of consequence. Examples:

You must include the subtype, or...

   - you are not conformant with the Ethnologue (false, btw).
   - you are not conformant with IETF BCP 47 (false, btw).
   - ...
   - we will send Guido around to have a bit of a chat.
   - ...

It is very unclear exactly which of these you are talking about. (I'm hoping
not the Guido one).

>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>


This archive was generated by hypermail 2.1.5 : Sun May 24 2009 - 14:54:15 CDT