Tagging orthographic systems (was: (iso639.186) the Ethnologue)

From: Otto Stolz (Otto.Stolz@uni-konstanz.de)
Date: Wed Sep 13 2000 - 18:18:34 EDT

On 2000-09-12 at 17:43 UTC, Peter Constable wrote:
> ISO 639 codes were primarily intended for bibliography purposes.
> Gary and I point out in our paper that the needs of that sector do
> not necessarily correspond to the general needs of IT, particularly
> for language-specific processing. [...] For example, if all you know
> about the language of some information object is that it is an Athapascan
> language, you can't spell-check that information. The intro to ISO 639
> claims that the standard is intending to serve the needs of a variety of
> sectors, but in its current state it is failing to adequately serve some.
> Furthermore, we would contend that the categories enumerated in the
> Ethnologue by-and-large *are* the categories that need to be identified for
> general IT purposes. In the majority of cases, the distinctions made are
> those that would be needed to successfully spell-check, for example. (We
> acknowledge that that is not true in all cases; for example, Chinese
> spelling would cross multiple languages; and alternate English spellings
> are needed for what would generally be considered one language. But these
> are the exceptions, not the norm.)

For many language-specific IT processes involving written language,
such as spell-checking, hyphenating, transliterating (e. g. to Braille),
or audible rendering, it is not enough to know which language you are
dealing with: you also need information about the orthography used.

Orthography is subject to change over time, and sometimes several
orthographies for the same language co-exist, e. g. during transition
periods or in neighbouring countries.

For example,
- German orthography was reformed in 1996; currently, two orthographies
  are legal (e. g. accepted in school assignments): the old one,
  established in 1902, until 2005-07-31, and the new one, effective
  since 1998-08-01; cf. (in German)
  <http://www.ids-mannheim.de/reform/zeitafel.html> (time schedule),
  <http://www.ids-mannheim.de/pub/sprachreport/sr98-extra.pdf> (tutorial),
  and <http://www.ids-mannheim.de/grammis/reform/inhalt.html> (rules);
- France had an orthographic reform for French in 1991;
- the Dutch spelling reform of 1934 was enacted in 1943 in Belgium,
  and in 1947 in the Netherlands; Dutch spelling was again (marginally)
  reformed in 1995, effective since 1996-08-01;
- Norwegian spelling was reformed in 1907, 1917, and 1938;
- Danish in 1948;
- Spanish in 1910, and again in 1852/55;
- Greek in 1982;
to name just a few. The co-existence of en_US and en_UK has already been
mentioned in this thread.

Hence, I plead for a tagging system that can represent these
differences. Currently, all of my WWW pages contain the line:
  <HTML LANG=de><!--neue Rechtschreibung-->
I would prefer to incorporate the comment into the tag, as in
the hypothetical:
  <HTML LANG=de-sp1996>
and likewise for other languages, and other applications.

Note that this issue is orthogonal to the country code of RFC 1766.
E. g., each of de-AT, de-CH, and de-DE could be spelled either the 1902
or the 1996 way. Hence, the spelling subtag and the country subtag
should both be optional, and independent of each other.
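
To illustrate the point, here is a minimal sketch of how software might
read such tags. It assumes the hypothetical "sp" + year form from the
example above (e. g. sp1996) and a two-letter country subtag; neither
the spelling subtag nor the names LangTag and parse_tag are part of
RFC 1766 or any existing library, and accepting the two subtags in
either order is my own assumption.

```python
import re
from typing import NamedTuple, Optional


class LangTag(NamedTuple):
    language: str            # e.g. "de"
    country: Optional[str]   # e.g. "AT", or None if absent
    spelling: Optional[str]  # e.g. "1996", or None if absent


# Hypothetical spelling subtag: "sp" followed by a four-digit year.
_SPELLING = re.compile(r"sp(\d{4})")
# Two-letter country subtag, as used in tags like de-AT.
_COUNTRY = re.compile(r"[a-z]{2}")


def parse_tag(tag: str) -> LangTag:
    """Split a tag into language, optional country, optional spelling."""
    parts = tag.lower().split("-")
    language = parts[0]
    if not language.isalpha():
        raise ValueError(f"bad language subtag in {tag!r}")
    country = None
    spelling = None
    for sub in parts[1:]:
        m = _SPELLING.fullmatch(sub)
        if m and spelling is None:
            spelling = m.group(1)
        elif _COUNTRY.fullmatch(sub) and country is None:
            country = sub.upper()
        else:
            raise ValueError(f"unexpected subtag {sub!r} in {tag!r}")
    return LangTag(language, country, spelling)
```

With this sketch, de, de-CH, de-sp1996, and de-AT-sp1902 all parse, and
a spell-checker could select its word list from the (language, country,
spelling) triple, falling back to a default wherever a subtag is absent.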

I think the Ethnologue lacks information about variant orthographies.
(I last looked at it a couple of months ago.) Both RFC 1766 and
ISO 639 ignore the issue of variant orthographies.

Best wishes,
   Otto Stolz

