Re: About names again...

From: John Cowan (cowan@locke.ccil.org)
Date: Tue Jan 06 1998 - 13:34:36 EST


I am responding to all recipients mentioned in the text, as well as the
Unicode and ISO 15924 mailing lists. Apologies to those who get this
message more than once.

Alain LaBonté wrote:

> Did
> anybody ever complain about using phone numbers or catalog numbers to order
> goods?

Everyone who has ever made or received a phone call to the wrong number,
or who has received the wrong product, has implicitly recognized the
problems associated with meaningless codes. See my peroration at the end.

Jake Knoppers writes:

> In CAW N021, I state the case for using the 3 digit numeric as the common
> interchange value for country codes among application interfaces since it
> is (1) the most stable, i.e. it does not change unless the physical entity
> referenced changes, unlike the two and three alpha codes;

Countries are not "physical entities"; their whole existence is juristic
(granted that some of the countries in ISO 3166 are de facto). When a
country changes its name, it typically also novates its relationships
with surrounding countries.

> Further, I argued that the three alpha code be reserved for CUC since it is
> its primary reference and unique reference for codes representing
> currencies and funds.

It has been said that the worst problem facing computing in the years to
come is that there are only 26^3 = 17576 three-letter acronyms. I hardly
think that there will be much confusion between ENG and GBP.
 
> (i) the ISO 639 convert to a four-digit numeric as the primary and
> unambiguous identification for language codes
>
> (ii) that ISO 10646 serve as the source of the repertoire of any combination
> of characters/symbols, etc. i.e. as a set for referencing ISO 639 language
> codes

I do not understand how a character repertoire can "reference" a language
code. Languages, and also language codes, are written using a repertoire
of characters: the repertoire for the codes is LATIN CAPITAL LETTER A through
Z, or under this proposal DIGIT ZERO through NINE.

> (iii) that the 0000-1999 ISO 639 language codes series be reserved for
> those languages which represent those character sets/symbols/notations

Now we have languages which "represent" character sets (presumably
repertoires). I suppose what is meant by this pseudo-standardese is
that to each language there corresponds a character repertoire, or
perhaps more than one if the language is written with more than one
script or script variant (e.g. Serbian, Mongolian, Chinese).

> ,i.e. as an officially approved user profile of ISO 10646, officially
> recognized for use by countries within their physical boundaries as
> represented by/linked to the three-digit numeric country codes as found in
> ISO-3166 [COC] as well as needed characters/symbols for the associated
> applicable currencies as found in ISO 4217 [CUC]

Countries have many languages: the languages widely used within a country
may not be official. This proposal would recognize English within Cameroon
and Canada (where it is official), but not in the U.S.A., the U.K., or Australia
(where it is not official). Furthermore, no such "officially approved
user profile[s] of ISO 10646" exist, or are likely to exist, for more than
a fraction of the world's official languages. If the U.S. were to
promulgate such a profile for the purpose of printing Government documents,
it would have to handle (at a minimum) English, French, Spanish, and
traditional-orthography Chinese, none of which is official.
 
> (iv) that the 2000-3999 ISO language code series block be reserved for
> "languages" to be registered by linguists (via their professional
> associations) for languages and associated character sets which either are
> (1) considered "dialects" from a 0000-1999 perspective; and/or (2) no
> longer in use but once, in the past, forming part of languages/dialects
> which while not meeting the criteria of the 0000-1999 series are written
> languages the mapping of which can be supported through ISO 10646

The implicit trichotomy of natural languages into official languages,
supposed dialects thereof, and extinct languages does not come close to
representing the diversity of the world's languages, even excluding those
(such as Burushaski, spoken in Pakistan and not a "dialect" of, or even
related to, any other language) which are at present unwritten. Furthermore,
"unwritten" is an unstable condition; languages are being reduced to
writing constantly.
 
> (v) the 4000-5999 ISO language code series block numeric be reserved
> for representing mapping/user profiles of ISO/IEC 10646 in support of
> formal scientific and technical languages, i.e. special languages according
> to TC37, via their (internationally recognized) professional associations.

No useful purpose is served by aggregating technical "languages" with
natural ones. We do not see translations from English into C, or from
Prolog into Polish; the question of whether a document is a novel or
an operating-system listing is not well-served by examining a language
code. (Unless by "special languages" is meant restricted-vocabulary
versions of natural languages such as Xerox Multinational English,
in which case I agree that codes for them would be useful.)
 
> (vii) the 9000-9999 ISO language code series block be reserved for user
> extensions such as pig-latin, klingon, esperanto, etc.

Any such code should have a private zone, so I have no trouble endorsing
this idea. Esperanto in particular, however, is hardly equivalent to
Klingon or Pig-Latin: it has a substantial original literature and
several thousand native speakers, plus a representation in ISO 8859-3
(and a fortiori ISO 10646).
 
> What do you say? What do you think?

I think that although meaningful codes cannot be instantly meaningful to
everyone, they are far superior to meaningless codes (whether numeric or
letteral makes no difference). It requires some additional knowledge for an
anglophone to realize why DE=Germany and CH=Switzerland, but how
much worse if we all, not merely anglophones, had to memorize the ISO 3166
equivalent codes 276 and 756 respectively! It is simply not true that such
codes are solely the domain of machines: they leak into the Real World,
as anyone with an email address outside the U.S. can plainly see.
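
To make the comparison concrete, here is a minimal sketch of the lookup a
program (or a person) must perform to move between the two forms. It holds
only the two entries mentioned above; the table and function names are
illustrative assumptions, not part of ISO 3166 or of any library.

```python
# Illustrative table: only the two ISO 3166 entries cited in the text.
ALPHA2_TO_NUMERIC = {"DE": "276", "CH": "756"}

# Reverse table, derived so the two directions cannot drift apart.
NUMERIC_TO_ALPHA2 = {num: alpha for alpha, num in ALPHA2_TO_NUMERIC.items()}


def to_numeric(alpha2: str) -> str:
    """Return the three-digit numeric code for a two-letter code."""
    return ALPHA2_TO_NUMERIC[alpha2.upper()]


def to_alpha2(numeric: str) -> str:
    """Return the two-letter code for a three-digit numeric code."""
    return NUMERIC_TO_ALPHA2[numeric]
```

The machine is indifferent to which column it is given; the human reading
a log file or an email address is not.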

In particular, HTML 4.0 introduces the LANG attribute, which can be specified
on almost every HTML tag, and associates an (ISO 639) language tag,
possibly qualified by a country tag, with the associated span of text.
It is fairly easy to remember to tag my text with LANG=en, but quite
unreasonable to expect LANG=1248, and it is not the case that all
(or most) HTML is written using a specialized tool.
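
As a rough sketch of what an author is asked to type, the helper below
assembles such an attribute from a language code and an optional country
code. The function name and the quoting of the value are my own assumptions
for illustration; only the language-plus-optional-country shape comes from
the description above.

```python
from typing import Optional


def lang_attribute(language: str, country: Optional[str] = None) -> str:
    """Build a LANG attribute from a language code and optional country code."""
    tag = language.lower()
    if country:
        # The country qualifier follows the language code after a hyphen.
        tag = f"{tag}-{country.upper()}"
    return f'LANG="{tag}"'
```

Calling lang_attribute("en") yields something a person can remember and
type by hand; the same call with a hypothetical numeric code such as
"1248" works just as well for the program, but not for the author.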

-- 
John Cowan	http://www.ccil.org/~cowan		cowan@ccil.org
			e'osai ko sarji la lojban


