Re: Language identifier proposals request

From: Asmus Freytag
Date: Sun Sep 03 1995 - 16:23:50 EDT


I should perhaps add another requirement to my list below: proposals
need to spell out their relation (mapping or otherwise) to existing
standards. One issue many practitioners have is that the rules of the
ISO standards have not addressed the 'permanency' of tags. This is
especially worrisome for the country tags. If these tags are to be
useful, they need to be applicable to archiving purposes, so once a
tag exists, it must exist forever, although if a country goes away,
new data would no longer use it.

When I worked on language tags at MS, we considered substitution on
a very narrow basis, counting Danish and Norwegian as separate
primary languages. For machine-provided language processing (spell
checking or grammar checking is a standard example), languages need
to be tagged accurately, and one needs to distinguish between Swiss
German and German, as there are some differences. The notion of
substitution should be just strong enough to let you overcome this
fine-grained tagging for common tasks (message retrieval); more
elaborate schemes, such as defaulting sequences, may or may not make
the cutoff.

The MS method of message retrieval works like this. It attempts, in
order:
- full match
- match on primary language only
- match to 'neutral'
- match to default language
- any language actually present
Here, 'neutral' covers the case where a message (or icon) is not
language specific.
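
The lookup order above can be sketched in a few lines of Python. This
is only an illustration under stated assumptions, not the actual MS
code: the function name find_message, the "neutral" key, and the use
of "language_COUNTRY" strings as tags are all hypothetical, and
"match on primary language only" is read here as "prefer a tag that is
just the primary language, else any tag sharing that primary language".

```python
from typing import Mapping, Optional

NEUTRAL = "neutral"  # hypothetical key for resources that are not language specific


def primary_of(tag: str) -> str:
    """Return the primary-language part of a tag such as 'de_CH' -> 'de'."""
    return tag.split("_", 1)[0]


def find_message(messages: Mapping[str, str], requested: str,
                 default: str) -> Optional[str]:
    """Pick a message using the five-step order described above."""
    # 1. Full match on the requested tag.
    if requested in messages:
        return messages[requested]
    # 2. Match on primary language only (assumed interpretation, see lead-in).
    primary = primary_of(requested)
    if primary in messages:
        return messages[primary]
    for tag in sorted(messages):
        if tag != NEUTRAL and primary_of(tag) == primary:
            return messages[tag]
    # 3. Match to 'neutral'.
    if NEUTRAL in messages:
        return messages[NEUTRAL]
    # 4. Match to the default language.
    if default in messages:
        return messages[default]
    # 5. Any language actually present (sorted only to be deterministic).
    for tag in sorted(messages):
        return messages[tag]
    return None
```

With messages for "de_DE" and "en_US" only, a request for "de_CH"
falls through step 1 and is satisfied at step 2 by the "de_DE" text,
which is the narrow kind of substitution meant here.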


You wrote:
>unicode@Unicode.ORG writes:
>> [elided]
>> Asmus> To summarize: Any proposal needs to address these issues
>> Asmus> - how the ID is designed (numeric, string, etc.)
>> Asmus> - how one can tell from the id that 2 languages are
>> Asmus> - how the ID is incorporated into a data stream (default
>> Asmus> protocol)
>> Asmus> - suggested initial assignments of ID values
>> Asmus, thanks for the description.
>> What was going to be our proposal to the UTC will now simply be
>> submitted to the mailing list after an internal review. Rick
>> kindly reminded me that language identification is not really within
>> the scope of the Unicode Standard.
>> Our approach details three of the four points you mentioned above,
>> but doesn't really discuss "substitutability" per se. That kind of
>> information can easily be encoded in our approach.
>> I can see the necessity of "substitutability"; particularly in the
>> context of many commercial systems that provide language support in
>> modular form.
>There is discussion within ISO and CEN on how to do this
>"language substitutability". As presented here it is related
>to what happens when messages for a locale
>are not available, for example when Danish messages are not available
>then a user might want to use the Norwegian messages instead,
>then Swedish, then English and then German. What we are considering
>is using a kind of "locale path" to locate the most suitable
>locale. Danish, Norwegian, Swedish, English and German all
>are part of the same family of languages, but in different
>proximities, and the proximity is dependent on the user and
>his/her abilities. But normally each of these languages
>is considered distinct and not just substitutable.
>The language codes from ISO 639 I distributed earlier are normally
>extended with a country code from ISO 3166, so you can get
>"American English" or "British English" by saying resp. en_US or
>en_GB (in POSIX locale notation - a similar notation is being
>proposed in the Internet).
>I think that if Unicode proposes a new standard on this they should
>be aligned as much as possible with ISO standards.
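
To make the "locale path" idea quoted above concrete, here is a
minimal Python sketch. It assumes the path is simply a per-user
ordered list of locale names and that names use the plain
language_COUNTRY form (POSIX codeset and modifier suffixes are
ignored). The function names split_locale and resolve_locale are
hypothetical and do not come from any ISO or CEN draft.

```python
from typing import Iterable, Optional, Set, Tuple


def split_locale(name: str) -> Tuple[str, Optional[str]]:
    """Split 'en_US' into ('en', 'US'); a bare 'da' gives ('da', None)."""
    language, _, country = name.partition("_")
    return language, country or None


def resolve_locale(locale_path: Iterable[str],
                   available: Set[str]) -> Optional[str]:
    """Return the first locale on the user's path that is available."""
    for name in locale_path:
        if name in available:
            return name
    return None


# A Danish user's preference order, as in the example quoted above.
path = ["da", "no", "sv", "en", "de"]
chosen = resolve_locale(path, {"sv", "en"})  # picks "sv"
```

The order of the path belongs to the user rather than to the
languages, which matches the point that proximity depends on the user
and his/her abilities.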
