Re: transforms and language identifiers (was Re: Dozenal chars in music)

From: Mark Davis
Date: Tue May 26 2009 - 20:35:30 CDT



    On Sun, May 24, 2009 at 15:10, Julian Bradfield wrote:

    > >Changed the name to better reflect the subject.
    > And it's even working back to Unicode, via locales!
    > >Well, what can I say? Perhaps I am 'silly', but first, you assume that 'en'
    > >is defined as in the Ethnologue; and in this case, we don't; we follow IETF
    > That was an assumption which I put into the first message! On the
    > basis that ISO doesn't define it. I admit that I'd never heard of BCPs
    > - I'm sure more people know about Ethnologue than BCP 47, by some
    > orders of magnitude, so I view best current practice as following
    > Ethnologue.

    Actually, when it comes to programmers, far more people have heard of BCP 47
    (and the associated RFC 4646, and the former versions RFC 3066 and RFC 1766)
    than have heard of the Ethnologue, since those RFCs are the standards
    required by HTML, XML, Unicode locales, Windows, .... The Ethnologue, as
    great a resource as it is, doesn't set the standard for identifiers of
    languages used by programmers.

    > >It is fine to root for the home team (or English variant), but UK English is
    > >not currently the most common form of English. And who knows, at some time
    > >in the future en-IN may be the most common form of English.
    > The most common is probably already that version of English spoken by
    > Chinese learners!

    Good point; we restrict ourselves to fluent speakers (ideally it would be
    something like 7-day active users as well, but that data is hard to come
    by).

    > >First, that is not how locale models work; programming language subclassing
    > >is not particularly analogous to the situation. What locale models do is
    > >give the user "the best shot". If I ask for en-UK-x-Yorkshire, it gives me
    > >the most common variant of Yorkshire if available, otherwise the most common
    > >variant of en-UK, otherwise the most common variant of en. That way you get
    > >something as close to the request as possible; and your user doesn't just
    > >get a 404 if you don't have an exact
    > >match. A second part of that locale model is that you can also query (in
    > >APIs) what was returned, and decide on that basis if you want to do
    > >something special (like tell the user that they aren't getting an exact
    > >match, or throw them a 404). Google
    > >"locale inheritance model" for more info.
    > That's still a covariant situation: the user's asking for an output
    > locale, and you're saying "can't do it, here's something that is at
    > least a cousin of your requested type, are you happy?". Your
    > transformation program is contravariant in its input. It takes input,
    > which may be specified as en-UK, or en-SG, or whatever, transforms it
    > to the output locale fonipa, and silently *gives the wrong answer* --
    > not a "best shot" answer with notification. (An especially unfortunate
    > answer, since the GA /fɑks/ sounds to British ears more like RP /fʌks/
    > than /fɒks/.) Unless the output "locale" is labelled as being a
    > representation of the "en-US" version of the input, the user isn't
    > getting the information you claim that the locale models should give -
    > and if you do label the output as being a representation of en-US,
    > then you might better declare up front that the input locale is en-US.

    The API does not actually do that. It returns precisely which transform was
    chosen, so the user has a choice, as I said, of discarding the transform or
    using it. So you can ask for "en_GB-ipa". We don't have one available
    currently, so you would get back "en-ipa". According to the machine-readable
    CLDR data, that is equivalent to an en_US-ipa transform. You can at that
    point simply reject it, and tell your user that there is nothing available,
    if you judge that it is better to fail than to return a different variant of
    English than the one you want.
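
    A minimal sketch of that request/answer pattern, in Python. This is not
    the CLDR or ICU API: the names AVAILABLE, EQUIVALENT, resolve and
    transform_or_reject are hypothetical, the registry holds only the one
    transform mentioned above, and fallback is assumed to work by truncating
    an underscore-separated source locale.

```python
from typing import Iterator, Optional, Tuple

# Hypothetical registry: the only transform that exists in this sketch.
AVAILABLE = {"en-ipa"}

# Hypothetical stand-in for the data saying which variant a generic ID represents.
EQUIVALENT = {"en-ipa": "en_US-ipa"}


def fallback_chain(source: str) -> Iterator[str]:
    """Yield the source locale, then each parent obtained by truncation."""
    parts = source.split("_")
    for end in range(len(parts), 0, -1):
        yield "_".join(parts[:end])


def resolve(requested: str) -> Optional[Tuple[str, bool]]:
    """Return (chosen transform ID, whether it is an exact match), or None."""
    source, target = requested.rsplit("-", 1)
    for candidate in fallback_chain(source):
        transform_id = f"{candidate}-{target}"
        if transform_id in AVAILABLE:
            return transform_id, transform_id == requested
    return None


def transform_or_reject(requested: str, strict: bool) -> Optional[str]:
    """Caller-side policy: in strict mode, refuse anything but an exact match."""
    resolved = resolve(requested)
    if resolved is None:
        return None
    transform_id, exact = resolved
    if strict and not exact:
        return None
    return transform_id
```

    With this sketch, resolve("en_GB-ipa") reports that "en-ipa" was chosen
    and that it is not an exact match; a caller who prefers failure to a
    different variant passes strict=True and gets nothing back.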

    I am simply not able to make myself clear on this issue, so perhaps it is
    not worth pursuing.

    > >Secondly, of course, all language tags are approximations. The code en-UK is
    > >not uniform in denotation. If you mean RP, according to the BL, it is spoken
    > >by as few as 2% of the UK population.
    > Of course. But most en-UK speakers accept RP as a reference standard
    > pronunciation, although they no longer consider it a normative
    > standard. Likewise people accept GA as an American reference standard,
    > not a normative standard.

    I can't speak to the former, but as to the latter, I don't know that the
    average non-GA American would necessarily consider it "the reference
    standard".

    > I think it's not entirely clear whether UK or US English is viewed as
    > the reference standard for English, if you're only interested in
    > numbers. US clearly dominates the native-English-speaking world, but
    > probably many of the L2 English speakers still think of UK English as
    > a nearer reference standard than US, especially in those places where
    > there are many L2 speakers.

    If you have some hard figures on that it would be useful to consider them.

    > >You say "If you claim to transform from en to X, your transformation should
    > >be correct for anything that is en." If we followed your argument, the
    > >transformation for en-UK should be correct for anything that is "en-UK". By
    > >that account, one couldn't use "en-UK" either in mapping to IPA, since it is
    > >not completely determinant; it means any of the variants of English as
    > >spoken in the UK. If we followed your logic to the bitter end, we'd have to
    > >specify down to the very narrow dialect, maybe even idiolect. That's simply
    > >not a practical model.
    > No, that argument doesn't fly, because the output may be at a level (a
    > broad phonemic transcription) that covers all of en-UK. (Some dialects
    > make distinctions that RP doesn't, so you'd need to make those
    > distinctions to get it really right.) Indeed, such a transcription
    > could also suffice for GA - you can convert from RP to GA pretty
    > well. But a GA transcription has less information than an RP
    > transcription, so can't be transformed to be right for RP.
    > Similarly, such a transcription should include all the /r/s, even
    > those that non-rhotic speakers (e.g. RP) don't pronounce, because
    > non-rhotic speakers can remove the /r/s, but rhotic speakers can't
    > insert /r/s that aren't there in the transcription.

    I don't know that that is really the case. And you can't reliably transform
    from RP to GA (or the reverse). See, for example, and

    Wells has, for example,

    ɑː  start, father
    ɔː  thought, law, north, war
    You couldn't map those reliably to GA, because some are rhotic and some are
    not. And there are some cases that are even clearer, like "privacy".
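
    To make the point concrete, here is a small Python sketch. The word lists
    are the ones quoted from Wells above; the RHOTIC_IN_GA flags (whether a
    rhotic accent pronounces an /r/ after the vowel) and the function name
    are assumptions added for illustration, not data from Wells or CLDR.

```python
# RP vowel -> example words, as quoted from Wells in the message.
RP_EXAMPLES = {
    "ɑː": ["start", "father"],
    "ɔː": ["thought", "law", "north", "war"],
}

# Assumption for illustration: does a rhotic accent such as GA
# pronounce an /r/ after this vowel in the given word?
RHOTIC_IN_GA = {
    "start": True,
    "father": False,
    "thought": False,
    "law": False,
    "north": True,
    "war": True,
}


def is_ambiguous(rp_vowel: str) -> bool:
    """True if words sharing this RP vowel split into rhotic and non-rhotic."""
    outcomes = {RHOTIC_IN_GA[word] for word in RP_EXAMPLES[rp_vowel]}
    return len(outcomes) > 1
```

    Both vowels come out ambiguous here, so a table keyed on the RP vowel
    alone cannot decide whether to emit an /r/.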

    > In fact, your system already does some of this: it transforms
    > When will Merry Mary marry?
    > to
    > wɛn wɪl mɛri meri mæri?
    > although most Americans don't make the three-way distinction.

    That is because the data table that it is based on tries to use rules where
    possible, and falls back to an exception table only where really necessary.
    And, as I said multiple times, it is only draft data.

    Myself, I'd pronounce them all mɛri.

    > All this kind of stuff has of course been considered ad nauseam in the
    > various proposals for "phonetic" orthographies for English. Or even by
    > lexicographers - some dictionaries avoid giving separate UK and US
    > pronunciations by using a system that can be mapped to either.

    see above.

    > Anyway, perhaps the real issue is that doing en-ipa as an example of
    > Unicode transliteration is a weird idea! IPA is about transcription of
    > spoken language, not transliteration of written language. Transforming
    > from en to ipa by transcribing some random dialectal pronunciation of
    > the written input is on a par with transforming from en to fr by
    > translating it, which is surely beyond the scope of Unicode transforms!

    The goal of this test was to have a pivot for transforms, so that we could
    have a relatively simple mechanism for getting a wide variety of
    transcriptions, rather than having to hand-craft each pair. Example:
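
    A rough Python sketch of the pivot idea follows. It is not the example
    from the original message and not the CLDR implementation: the one-word
    lexicon reuses the GA form /fɑks/ quoted earlier in this thread, the
    "ascii" respelling is made up, and every name in it is hypothetical.

```python
from typing import Callable, Dict

Transform = Callable[[str], str]

# Toy "to pivot" step: a one-word lexicon using the GA form quoted in the thread.
EN_LEXICON = {"fox": "fɑks"}


def en_to_ipa(text: str) -> str:
    """Look each word up in the toy lexicon; unknown words pass through."""
    return " ".join(EN_LEXICON.get(word, word) for word in text.lower().split())


# Toy "from pivot" step: a made-up ASCII respelling, purely for illustration.
IPA_TO_ASCII = str.maketrans({"ɑ": "A"})


def ipa_to_ascii(text: str) -> str:
    return text.translate(IPA_TO_ASCII)


TO_PIVOT: Dict[str, Transform] = {"en": en_to_ipa}
FROM_PIVOT: Dict[str, Transform] = {"ascii": ipa_to_ascii}


def via_pivot(source: str, target: str) -> Transform:
    """Build a source -> target transform by composing through the IPA pivot."""
    to_pivot = TO_PIVOT[source]
    from_pivot = FROM_PIVOT[target]
    return lambda text: from_pivot(to_pivot(text))


def transforms_needed(n_sources: int, n_targets: int) -> Dict[str, int]:
    """Compare hand-crafting every pair with routing everything via a pivot."""
    return {"pairwise": n_sources * n_targets, "pivot": n_sources + n_targets}
```

    With ten sources and ten targets, the pairwise approach needs 100
    hand-crafted transforms, while the pivot approach needs 20.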

    > >Thirdly, you use the phrasing: "you **must** include the subtype". That
    > >presumes some kind of consequence. Examples:
    > ... or you give the user the wrong answer without telling them so.

    As I tried to make clear, the convention is clear, and the user of the
    services has the choice of what to do with the answer. It is more like
    someone's asking:

    Q. What is the temperature in America right now?

    A. Well, we don't know what the temperature is at all points in America, but
    in Menlo Park it is 72.

    If the information in the answer is useful to you, you can use it; if not,
    you don't have to.

