From: Julian Bradfield (email@example.com)
Date: Sun May 24 2009 - 17:10:25 CDT
>Changed the name to better reflect the subject.
And it's even working back to Unicode, via locales!
>Well, what can I say? Perhaps I am 'silly', but first, you assume that 'en'
>is defined as in the Ethnologue; and in this case, we don't; we follow IETF
That was an assumption I stated in my first message, on the basis that
ISO doesn't define it. I admit that I'd never heard of BCPs - I'm sure
more people know about Ethnologue than BCP 47, by some orders of
magnitude, so I view best current practice as following the Ethnologue.
>It is fine to root for the home team (or English variant), but UK English is
>not currently the most common form of English. And who knows, at some time
>in the future en-IN may be the most common form of English.
The most common is probably already that version of English spoken in India.
>First, that is not how locale models work; programming language subclassing
>is not particularly analogous to the situation. What locale models do is
>give the user "the best shot". If I ask for en-UK-x-Yorkshire, it gives me
>the most common variant of Yorkshire if available, otherwise the most common
>variant of en-UK, otherwise the most common variant of en. That way you get
>something as close to the request as possible; and your user doesn't just
>get a 404 <http://en.wikipedia.org/wiki/HTTP_404> if you don't have an exact
>match. A second part of that locale model is that you can also query (in
>APIs) what was returned, and decide on that basis if you want to do
>something special (like tell the user that they aren't getting an exact
>match, or throw them a 404 <http://en.wikipedia.org/wiki/HTTP_404>). Google
>"locale inheritance model" for more info.
That's still a covariant situation: the user's asking for an output
locale, and you're saying "can't do it, here's something that is at
least a cousin of your requested type, are you happy?". Your
transformation program is contravariant in its input. It takes input,
which may be specified as en-UK, or en-SG, or whatever, transforms it
to the output locale fonipa, and silently *gives the wrong answer* --
not a "best shot" answer with notification. (An especially unfortunate
answer, since the GA /fɑks/ sounds to British ears more like RP /fʌks/
than /fɒks/.) Unless the output "locale" is labelled as a
representation of the "en-US" version of the input, the user isn't
getting the information you claim the locale models should give -
and if you do label the output as a representation of en-US,
then you might as well declare up front that the input locale is en-US.
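The contravariance point can be made concrete with a toy transform (the one-word pronunciation table is invented for the example; a real en-to-IPA transform is of course rule-based and far larger). Rules that are correct only for en-US, but declared as accepting any "en" input, silently hand an en-UK caller the GA form:

```python
# Toy illustration of the contravariance problem: a transform whose
# rules are correct for en-US, declared (wrongly) for all of 'en'.
# The tiny pronunciation table is invented for this example.

GA_IPA = {"fox": "fɑks"}   # General American value
RP_IPA = {"fox": "fɒks"}   # what an en-UK user would expect

def en_to_ipa(word, input_locale="en"):
    # Declared input: 'en' (any English). Actual rules: en-US only.
    # No check, no notification -- the silent wrong answer.
    return GA_IPA[word]

# An en-UK caller gets the GA form with no warning:
assert en_to_ipa("fox", input_locale="en-UK") == "fɑks"  # not RP "fɒks"
```

A safer version would either declare its input as en-US, or return the locale it actually applied so the caller can detect the mismatch, mirroring the "query what was returned" part of the quoted locale model.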
>Secondly, of course, all language tags are approximations. The code en-UK is
>not uniform in denotation. If you mean RP, according to the BL, it is spoken
>by as few as 2% of the UK population (
Of course. But most en-UK speakers accept RP as a reference standard
pronunciation, although they no longer consider it a normative
standard. Likewise people accept GA as an American reference standard,
not a normative standard.
If you go only by numbers, I think it's not entirely clear whether UK
or US English is viewed as the reference standard for English. US
clearly dominates the native-English-speaking world, but many L2
English speakers probably still think of UK English as a nearer
reference standard than US, especially in the regions where L2
speakers are most numerous.
>You say "If you claim to transform from en to X, your transformation should
>be correct for anything that is en." If we followed your argument, the
>transformation for en-UK should be correct for anything that is "en-UK". By
>that account, one couldn't use "en-UK" either in mapping to IPA, since it is
>not completely determinant; it means any of the variants of English as
>spoken in the UK. If we followed your logic to the bitter end, we'd have to
>specify down to the very narrow dialect, maybe even idiolect. That's simply
>not a practical model.
No, that argument doesn't fly, because the output may be at a level (a
broad phonemic transcription) that covers all of en-UK. (Some dialects
make distinctions that RP doesn't, so you'd need to make those
distinctions to get it really right.) Indeed, such a transcription
could also suffice for GA - you can convert from RP to GA pretty
well. But a GA transcription has less information than an RP
transcription, so can't be transformed to be right for RP.
Similarly, such a transcription should include all the /r/s, even
those that non-rhotic speakers (e.g. RP) don't pronounce, because non-rhotic
speakers can remove the /r/s, but rhotic speakers can't insert /r/s
that aren't there in the transcription.
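The asymmetry is easy to show mechanically (the words and transcriptions below are simplified ASCII stand-ins, not careful IPA): from a rhotic broad transcription you can derive a non-rhotic form by deleting /r/ except before a vowel, but the reverse is not a function, since the non-rhotic form no longer records where an /r/ belonged.

```python
# Sketch: derive a non-rhotic (RP-like) form from a rhotic broad
# transcription by dropping 'r' unless a vowel follows.
# Transcriptions here are simplified stand-ins, not careful IPA.

VOWELS = set("aeiou")

def derhoticize(broad):
    out = []
    for i, ch in enumerate(broad):
        nxt = broad[i + 1] if i + 1 < len(broad) else ""
        if ch == "r" and nxt not in VOWELS:
            continue  # non-prevocalic /r/: silent for non-rhotic speakers
        out.append(ch)
    return "".join(out)

print(derhoticize("kart"))  # 'kat'  -- "cart" loses its /r/
print(derhoticize("kari"))  # 'kari' -- prevocalic /r/ survives
# The other direction is impossible: 'kat' could be "cart" or "cat".
```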
In fact, your system already does some of this: it transforms
When will Merry Mary marry?
wɛn wɪl mɛri meri mæri?
although most Americans don't make the three-way distinction.
All this kind of stuff has of course been considered ad nauseam in the
various proposals for "phonetic" orthographies for English. Or even by
lexicographers - some dictionaries avoid giving separate UK and US
pronunciations by using a system that can be mapped to either.
Anyway, perhaps the real issue is that doing en-ipa as an example of
Unicode transliteration is a weird idea! IPA is about transcription of
spoken language, not transliteration of written language. Transforming
from en to ipa by transcribing some random dialectal pronunciation of
the written input is on a par with transforming from en to fr by translating
it, which is surely beyond the scope of Unicode transforms!
>Thirdly, you use the phrasing: "you **must** include the subtype". That
>presumes some kind of consequence. Examples:
... or you give the user the wrong answer without telling them so.
-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
This archive was generated by hypermail 2.1.5 : Sun May 24 2009 - 17:13:35 CDT