Re: Language identifier proposals request

From: Olle Jarnefors (ojarnef@admin.kth.se)
Date: Mon Sep 04 1995 - 13:19:38 EDT


Asmus Freytag <asmusf@ix.netcom.com> wrote:

> I should perhaps add another requirement in my list below, that
> proposals need to spell out their relation (mapping or otherwise) to
> existing standards.

A very reasonable requirement.

Question 1: Are you who write about language tagging schemes
that are more or less independent of ISO standards really
contemplating setting up other registration mechanisms for
language codes (and country codes), that will compete with the
ISO registration authorities?

> One issue a lot of practitioners have is that
> the rules of the ISO standards have not addressed the issue of
> 'permanency' of tags; this is especially worrisome for the country
> tags. If these tags are to be useful, they need to be applicable to
> archiving purposes, so once a tag exists, it must exist forever,
> although if a country goes away, new data wouldn't use it any more.

Well, the recent combined Unicode--ISO debacle involving the
removal of 6656 Hangul characters (and soon offering their code
positions for allocation of other characters), despite the
well-understood, unanimous, sincere promise of the standard
fathers/mothers to never move characters, written down in the
standards themselves, shows, I think, that one should never trust
statements about the future permanence of the meaning of tags.

Not that the ISO track record is that bad. No reallocation of
ISO 639 language tags has been made. (The standard itself only
empowers the registration authority Infoterm to allocate
_additional_ language symbols, by the way (clause 4.2).) I'm not
aware of any ISO 3166 country code actually having been
reassigned to a new country.

Question 2: Does anybody else have an example of that?

ISO 3166 explicitly allows reassignment after 5 years, however,
and there may certainly be good reasons in future cases for
doing that: Most new countries will want to have a code whose
first letter is equal to the first letter in the name of the
country and whose second letter is a "not insignificant" other
letter of the name. There aren't many codes starting with S left,
for example.

The only workable method to get unique permanent tags that I see
is to be prepared to add a distinguishing suffix, for example
the year of registration, to any tag that is reassigned to a new
entity. In applications where eternal uniqueness is important, a
short tag will then always mean the originally allocated entity.
When a re-allocated entity is implied, the long tag must be
used. In applications where eternal uniqueness isn't important,
the short tag may be used in both cases.
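
A minimal sketch of this suffix scheme, under assumptions of my
own: the "-" separator, the class name TagRegistry, the code "XX"
and the entities "Oldland" and "Newland" are all hypothetical and
taken from no standard. Letting the short tag mean the most
recent holder in non-archival use is also just one possible
reading of "may be used in both cases".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TagRegistry:
    # code -> [(year registered, entity)], kept sorted oldest first
    history: Dict[str, List[Tuple[int, str]]] = field(default_factory=dict)

    def register(self, code: str, year: int, entity: str) -> None:
        entries = self.history.setdefault(code, [])
        entries.append((year, entity))
        entries.sort()

    def resolve(self, tag: str, archival: bool = True) -> str:
        # A long tag looks like "XX-1999"; a short tag has no suffix.
        code, _, suffix = tag.partition("-")
        entries = self.history[code]
        if suffix:
            for year, entity in entries:
                if year == int(suffix):
                    return entity
            raise KeyError(tag)
        # Archival use: the short tag always means the original entity.
        # Otherwise (assumption): it means the most recent holder.
        return entries[0][1] if archival else entries[-1][1]


# Hypothetical data: "XX" is first allocated, then reassigned.
registry = TagRegistry()
registry.register("XX", 1974, "Oldland")
registry.register("XX", 1999, "Newland")
```

Here registry.resolve("XX") gives the original entity, while
registry.resolve("XX-1999") is needed to name the later one.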

Question 3 - for you who propose narrow tag spaces, for example
10 bits for "primary language": Why is this necessary? Why not
consider a less crowded, variable-length scheme like that
defined in RFC 1766?

Only 1024 language tags seems risky to me, considering that
there are more than 6000 described human languages. And I see no
real need to confine all tags to 16 or 32 bits. The "current
language" of a text is typically changed quite infrequently or
not at all. Most people seem content with not having "rich text
attributes" like italics, boldface, superscript, subscript,
halfwidth, fullwidth, proportional, monospaced, red text, blue
text etc. available as single Unicodes. Very compact
representation of language tags, which anyway don't affect the
semantics of a text as much as these attributes can do, seems
even less needed.
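
To make the arithmetic concrete, here is a small sketch. The
function name bits_needed is my own, and 6000 is only the rough
lower bound mentioned above, not an exact count.

```python
def bits_needed(n_values: int) -> int:
    """Smallest number of bits that can distinguish n_values items."""
    return max(1, (n_values - 1).bit_length())


print(2 ** 10)            # tags available in a 10-bit field: 1024
print(bits_needed(6000))  # bits needed for 6000 languages: 13
```

So even a fixed-width field would need at least 13 bits for the
primary language alone, before any room for growth.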

> When I worked at language tags at MS, we considered substitution on
> a very narrow basis, counting both Danish and Norwegian as separate
> primary languages.

Substitutability is a very important property, but it is also
quite fuzzy and very dependent on the qualifications of the
individual user. It may also be quite different for the spoken
forms of two languages and for the written forms. For example
two related languages like Danish and Swedish may be
substitutable for most of the inhabitants of both countries for
_written_ text but substitutable only for a rather small part of
the population in at least one of the countries when it comes to
spoken communication. (This is true for Sweden in this case.)

Another sad phenomenon is that the question of substitution may
become a highly inflamed political matter, such as in the case
of the recently sharply diverging Serbian and Croatian "languages"
(and the sudden appearance of a third Bosnian "language").

For these reasons I'd like to warn against trying to
"smart-encode" any substitution relationship in the language tags
themselves. A much more robust solution would be to provide
a separate table for the substitution relation(s), which can be
customized by the user.
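
A rough sketch of such a separate table, assuming a hypothetical
layout keyed by (language, modality) so that written and spoken
forms can differ. The names DEFAULT_TABLE, acceptable and pick
are my own. The entries only illustrate the Danish/Swedish
example above and are not meant as linguistic facts.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Key = Tuple[str, str]  # (language wanted, "written" or "spoken")

# Illustrative defaults only; the tags themselves stay opaque.
DEFAULT_TABLE: Dict[Key, List[str]] = {
    ("sv", "written"): ["da", "no"],
    ("sv", "spoken"): ["no"],
    ("da", "written"): ["sv", "no"],
}


def acceptable(wanted: str, modality: str,
               user_table: Optional[Dict[Key, List[str]]] = None) -> List[str]:
    """Languages the user accepts, best first; user entries override defaults."""
    table = dict(DEFAULT_TABLE)
    if user_table:
        table.update(user_table)
    return [wanted] + table.get((wanted, modality), [])


def pick(available: Iterable[str], wanted: str, modality: str,
         user_table: Optional[Dict[Key, List[str]]] = None) -> Optional[str]:
    """First acceptable language that is actually available, else None."""
    on_offer = set(available)
    for lang in acceptable(wanted, modality, user_table):
        if lang in on_offer:
            return lang
    return None
```

A user who does understand spoken Danish simply supplies
{("sv", "spoken"): ["da", "no"]} as user_table; no tag has to
change.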

> For machine provided language processing (spell
> check or grammar check is a standard example) languages need to be
> tagged accurately and one needs to make a distinction between Swiss
> German and German as there are some differences.

Question 4 - to Asmus, just out of curiosity: How do you think
that the ordinary typist entering a multilingual text can be
enticed to input the perhaps burdensome and at least
unnecessary-looking language tags correctly at every point in
the text where the language is changed?

For my own part I have only been able to come up with the idea
that language tags can be automatically input when the typist
switches "logical keyboard layout", for example between an
ordinary English logical keyboard and a Greek logical keyboard.
But this only works when the languages use different scripts;
the distinction between, for example, English and Swedish
segments of a text can't be captured in this way.
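
The idea, and its limitation, can be sketched as follows. The
layout names "us" and "gr", the table LAYOUT_TO_LANGUAGE and the
function tag_runs are hypothetical, not taken from any real
input system.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical mapping from logical keyboard layout to language tag.
LAYOUT_TO_LANGUAGE = {"us": "en", "gr": "el"}


def tag_runs(keystrokes: Iterable[Tuple[str, str]]) -> List[Tuple[Optional[str], str]]:
    """Turn (layout, text) events into (language, text) runs.

    Adjacent events with the same language are merged into one run.
    An unknown layout yields a language of None.
    """
    runs: List[Tuple[Optional[str], str]] = []
    for layout, text in keystrokes:
        lang = LAYOUT_TO_LANGUAGE.get(layout)
        if runs and runs[-1][0] == lang:
            runs[-1] = (lang, runs[-1][1] + text)
        else:
            runs.append((lang, text))
    return runs


# The Swedish word "hej" is typed without switching layout,
# so it ends up tagged as English.
example = tag_runs([("us", "Hello "), ("gr", "κόσμε"), ("us", " hej")])
```

The Greek segment is tagged correctly, but the final Swedish
word inherits the "en" tag of the layout it was typed on.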

/Olle

--
Olle Jarnefors, Royal Institute of Technology, Stockholm <ojarnef@admin.kth.se>


