Re: New extension for transformed languages

From: Philippe Verdy <>
Date: Sun, 4 Mar 2012 23:56:52 +0100

This looks OK then: "ru-t-it" would implicitly mean "ru-Cyrl-t-it-Latn", but
only because the IANA registry indicates that the Cyrillic script is implied
for Russian (when not specified), and that the Latin script is implied for
Italian (when not specified in the "t" extension).

However, this will work only if implied scripts are registered for both
languages. Many languages have no implied script registered in the IANA
database. Breton, for example, should implicitly use the Latin script: other
scripts are almost impossible to find in actual usage or in publications,
except maybe in Breton language courses in Japan or Russia. Yet the implied
"Latn" script was not registered for the "br" language subtag the last time
I checked the IANA registry.

In the general case, a tag of the form "xx-t-yy", where xx and yy are only
language subtags (2 or 3 letters), will not indicate that any transliteration
was applied, as it is not possible to infer the pair of scripts.
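
As a rough sketch of the inference discussed above (not how any real
implementation does it), the following Python fills in the two scripts of a
simple "xx-t-yy" tag from a small table. The table IMPLIED_SCRIPT and the
function expand_transform_tag are hypothetical names of mine; the table only
lists the Russian and Italian defaults mentioned in this thread, and a real
tool would read the implied scripts from the full IANA registry.

```python
import re

# Hypothetical excerpt of implied-script data: only the two languages
# discussed in this thread. A real tool would load the IANA registry.
IMPLIED_SCRIPT = {"ru": "Cyrl", "it": "Latn"}

# Only the simple shape lang[-Script]-t-lang[-Script] is handled here;
# regions, variants and other extensions are deliberately ignored.
SIMPLE_T_TAG = re.compile(
    r"^([a-z]{2,3})(?:-([a-z]{4}))?-t-([a-z]{2,3})(?:-([a-z]{4}))?$"
)

def expand_transform_tag(tag):
    """Return the tag with both scripts explicit, or None if the tag is
    not of the simple shape or a script cannot be inferred."""
    m = SIMPLE_T_TAG.match(tag.lower())  # tags are case-insensitive
    if m is None:
        return None
    target, target_script, source, source_script = m.groups()
    # Fall back to the implied script only when none was written explicitly.
    target_script = target_script or IMPLIED_SCRIPT.get(target)
    source_script = source_script or IMPLIED_SCRIPT.get(source)
    if target_script is None or source_script is None:
        return None  # the pair of scripts cannot be inferred
    # Casing follows the spelling used in this thread.
    return (f"{target}-{target_script.title()}"
            f"-t-{source}-{source_script.title()}")
```

With this table, "ru-t-it" expands to "ru-Cyrl-t-it-Latn" and an explicit
script such as the one in "ru-Arab-t-it" is kept, while "br-t-ja" yields None
because neither language has an entry.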

You may be able to get the script of the target language from the text
content (if there is textual content, it uses a text encoding form, and it
is not just an image facsimile or an audio recording). But that tells you
nothing about the initial script (it is not even certain that the source was
a written form; it could also be a transcription of an audio recording in a
foreign language).
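
To illustrate the first half of that point, here is a minimal heuristic in
Python that guesses the script of some text from its characters. It relies on
the first word of each Unicode character name, which is only an
approximation of the real Script property; the mapping NAME_PREFIX_TO_SCRIPT
and the function guess_script are hypothetical and deliberately incomplete.

```python
import unicodedata
from collections import Counter

# Hypothetical, incomplete mapping from the first word of a Unicode
# character name to an ISO 15924 script code.
NAME_PREFIX_TO_SCRIPT = {
    "LATIN": "Latn",
    "CYRILLIC": "Cyrl",
    "ARABIC": "Arab",
    "KATAKANA": "Kana",
}

def guess_script(text):
    """Return the most frequent known script among the letters of text,
    or None if no letter belongs to a script listed in the mapping."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation and spaces
        # unicodedata.name() returns the default "" for unnamed characters.
        prefix = unicodedata.name(ch, "").split(" ")[0]
        script = NAME_PREFIX_TO_SCRIPT.get(prefix)
        if script is not None:
            counts[script] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

This only ever describes the text at hand, that is, the target side of the
tag; nothing in the content reveals which script the source was in.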

To get the most benefit from the "xx-t-yy" form, the IANA registry should
contain much more information about implied scripts for the existing
registered language subtags.

On 4 March 2012 at 20:42, Mark Davis ☕ <> wrote:

> ru-t-it indicates a transform from Italian to Russian. The exact mechanism
> is unspecified. For more specific transformations (such as an UNGEGN
> transliteration), that would need to be added.
> Normally Russian is written in Cyrillic, and Italian written in Latin
> characters. For unusual cases, you'd specify the script (e.g. ru-Arab-t-it or
> ru-Latn-t-it-Kana). Otherwise the presumption is ru-Cyrl-t-it-Latn: a
> transform from Italian (written in Latin characters) to Russian (written in
> Cyrillic characters).
> ------------------------------
> Mark <>
> *— Il meglio è l’inimico del bene —*
> On Sun, Mar 4, 2012 at 09:33, Philippe Verdy <> wrote:
>> If the "t" singleton subtag is used to indicate a transliteration to
>> letters used in another language, this does not make clear where to
>> indicate in which script those letters are selected.
>> As the "t" singleton (with its value being another language code) is an
>> extension, it should appear at end of the tag,
>> so that "ru-t-it" actually means (if the Latin script is implied for
>> Italian) "ru-Latn-t-it", preferable to "ru-t-it-Latn" which is not ordered
>> correctly.
>> One problem is that there's no mechanism to imply the script from the
>> target language of the transliteration. The implicit script is only derived
>> from the first language code, not from an extension.
>> This means that "ru-t-it" can't be made formally equivalent
>> to "ru-Latn-t-it", because the implied script for Russian is still
>> Cyrillic!
>> So "ru-t-it" still actually means the same as "ru-Cyrl-t-it", i.e.
>> Russian transliterated in the Cyrillic letters of Italian (sic!).
>> And you'll need to use "ru-Latn-t-it" instead, which is compatible (with
>> the "inheritance" mechanism) with "ru-Latn" that indicates a
>> transliteration of Russian to the Latin script (with an unspecified
>> alphabet, the default being according to the Russian standard, even if it
>> requires extended Latin letters not used in Italian).
>> On 24 February 2012 at 17:45, Mark Davis ☕ <> wrote:
>>> Just got a new spec published that for the first time gives a standard
>>> way to specify transliterations between languages. So "Italian
>>> transliterated into Russian letters" can be requested or tagged with the
>>> code "ru-t-it". It is now in the standard for identifying languages on the
>>> internet (
>>> [FYI, the page below is an example of how transliteration is used at
>>> Google.]
>>> Greetings from Santa Kurara, Kariforunia - Google Open Source Blog<>
>>> The spec is at, and supported in the latest
>>> version of the Unicode Locale Data specification (LDML,
>>> Thanks to +Yoshito Umaoka<>
>>> , +Addison Phillips <> ,
>>> Courtney Falk, Doug Ewell, +Pete Resnick<>,
>>> and others for helping to put this together!
>>> [Original post:
>>> ------------------------------
>>> Mark <>
>>> *— Il meglio è l’inimico del bene —*
Received on Sun Mar 04 2012 - 17:04:28 CST
