John Hudson writes:
> Yes, this is another type of information that could usefully be tagged but
> which is ignored by existing standards, and non-standards, of 'language
> codes' (whatever the hell one of those is). Would it be too radical to
> suggest that 'language codes', per se, are one of the least useful things
> for IT tagging? A blind code, that offers no information about orthography,
> script variant, or even whether a language is written at all, simply does
> not convey enough information by itself. To be useful at all it needs to be
> combined with other codes that indicate combinations of script, language
> and orthography.

One (well, the only) problem I have with explicit orthographic tagging
is that it assumes a single, consistent orthography is used throughout
a document, which isn't necessarily the case. Mixed orthography is
particularly common in East Asian languages:

Japanese verbs will have a standard form, along with several possible
okurigana variants as well as possible use of hiragana instead of
kanji. Consider a literal translation of "A hen that lays golden eggs"
--- 'kin no tamago o umu niwatori'. There are 24 different ways one
could write this, all valid.
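
To make the combinatorics concrete, here is a rough sketch in Python.
The per-word variant lists are my own assumption (the count above is
given without a breakdown); they are simply one plausible split that
multiplies out to 24. The point is only that independent choices for
each word multiply, so a single orthography tag for the whole phrase
cannot say which combination was used.

```python
from itertools import product

# Hypothetical variant lists, one per word of the phrase.
# These are illustrative assumptions, not an authoritative inventory.
SLOTS = [
    ["金", "きん"],            # kin: kanji or hiragana
    ["の"],                    # no: particle, no variation
    ["卵", "玉子", "たまご"],  # tamago: two kanji spellings or hiragana
    ["を"],                    # object particle, no variation
    ["産む", "生む"],          # umu: two kanji choices
    ["鶏", "にわとり"],        # niwatori: kanji or hiragana
]


def spellings(slots):
    """Return every way to write the phrase by picking one variant per slot."""
    return ["".join(parts) for parts in product(*slots)]


variants = spellings(SLOTS)
print(len(variants))  # 2 * 1 * 3 * 1 * 2 * 2 = 24
```

Adding a hiragana spelling of the verb, or a katakana spelling of
'niwatori', would push the count higher still.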

Orthographic variation is rampant in Taiwan as well: more and more,
simplified forms are being used instead of the "correct" traditional
forms. While this has long been true of handwritten communication,
one sees it now online as well.

I would think that even within European languages it is possible to
find texts which use a mixture of orthographies.


