email@example.com (Mark Leisher) writes:
> Adding Embedded Language Identifiers to Unicode Text
> [Proposal to encode language using string of 1 to 16 characters, each
> defining a bit of a 16-bit language number]
I don't think any language numbering scheme will work well, because it
makes it hard or impossible to add new languages or variants of
existing ones.
No list of languages will contain everything the users will need, so
the only working solution is string-based, i.e. something like the
locale names (e.g. en_us, en_gb, ...) used in many places:
- The basic language codes (the first 2 or 3 letters; 3 need to be
allowed too) are standardized, so you can always get to the major
language.
- They're easy to extend, as the remaining part can be anything,
including user-defined stuff.
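As a rough sketch of the two points above, here is how such a name could
be split into its standardized base and its free-form remainder. The
function name and the exact pattern are my own assumptions for
illustration, not part of any standard.

```python
import re

# Assumed pattern: 2 or 3 ASCII letters for the base language, then an
# optional "_" or "." separator followed by a free-form detail part.
TAG_RE = re.compile(r"^([A-Za-z]{2,3})(?:[_.](.+))?$")

def split_language_tag(tag):
    """Return (base_language, detail), or None if the tag is malformed."""
    match = TAG_RE.match(tag)
    if match is None:
        return None
    base, detail = match.groups()
    # The base is lowercased for easy comparison; the detail is left as-is
    # because it may carry user-defined content.
    return base.lower(), detail

print(split_language_tag("en_us"))
print(split_language_tag("fin"))
print(split_language_tag("en_x-my-dialect"))
```

The detail part is deliberately not validated, since it may be
user-defined.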
Many standards already identify languages in this way (e.g. HTML 3),
and it has been found to work very well.
The only practical way to add custom languages to an enumerated system
is to reserve a range of codes for custom languages, but an unknown
number does not say anything.
With a string approach, a name such as en_fi still tells you it's
English, probably with a Finnish specialty (e.g. currencies or numbers
might be in Finnish format), but you still know it's basically English.
When you get an unknown number, there is nothing you can do.
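To make the difference concrete, here is a small sketch of the fallback
I mean. Both tables and both function names are made up for the example;
a real application would have its own.

```python
# Hypothetical table of the languages one application happens to know.
KNOWN_TAGS = {"en": "English", "fi": "Finnish", "en_us": "English (US)"}

# Hypothetical numeric table, as an enumeration scheme would require.
KNOWN_NUMBERS = {1: "English", 2: "Finnish"}

def resolve_tag(tag):
    """Try the full tag first, then fall back to its base language."""
    tag = tag.lower().replace(".", "_")
    if tag in KNOWN_TAGS:
        return KNOWN_TAGS[tag]
    base = tag.split("_", 1)[0]
    # None only if even the base language is unknown.
    return KNOWN_TAGS.get(base)

def resolve_number(code):
    """An unknown number carries no structure to fall back on."""
    return KNOWN_NUMBERS.get(code)

print(resolve_tag("en_fi"))   # not in the table, but the base "en" is
print(resolve_number(4711))   # unknown number: nothing to go on
```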
Every application/system will need a mapping table from the language
numbers to whatever they use internally, which is probably something
locale-style. Having one more mapping table to keep up-to-date is a
burden.
Also, your proposal would introduce a strange set of combining
characters to Unicode, whose parsing is different from everything else
and thus would complicate the standard.
Anyway, we have learned from the simple, working Internet RFC
conventions (and many other places) that it's much better to use
strings to describe things instead of magic numbers, especially
size-limited numbers. Strings are about as easy to handle, offer
infinite extensibility, and can be human readable too.
Even if you wanted to do a language enumeration system, it would be
better to do it using the extension to about 1 million codepoints and
reserve from there a range of codepoints for language IDs.
I agree that we might need some mechanism to indicate language in
Unicode *plain text* files, as everywhere else you already have some
form of tagging on top of plain Unicode, e.g. SGML-based, so you can
use that for language identification too.
But better than some binary enumeration would be, e.g., to define 2
additional characters, LANG_ID_START and LANG_ID_END, and define that
language would be indicated by enclosing the following between them:
<2 or 3 character standard language code, ASCII values only>
_ or . (optional)
<1*N character detail code> (optional)
This is not really good either, but it would be a little better than
the proposed numbering approach, I think. It's about as easy to parse,
and doesn't suffer from the limitations of numeric enumeration.
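A minimal sketch of how such markers might be parsed follows. No code
points are assigned to LANG_ID_START and LANG_ID_END above, so two
private-use characters stand in for them here; those values, the function
names, and the choice to normalize the separator to "_" are all
assumptions of this example.

```python
import re

# Assumption: private-use code points stand in for the two new characters.
LANG_ID_START = "\uE000"
LANG_ID_END = "\uE001"

# 2 or 3 ASCII letters, then an optional "_" or "." plus a detail code.
MARKER_RE = re.compile(
    LANG_ID_START
    + r"([A-Za-z]{2,3})(?:[_.]([^" + LANG_ID_END + r"]+))?"
    + LANG_ID_END
)

def split_runs(text, default=None):
    """Return a list of (language_tag, run_of_text) pairs."""
    runs = []
    current = default
    pos = 0
    for m in MARKER_RE.finditer(text):
        if m.start() > pos:
            runs.append((current, text[pos:m.start()]))
        base, detail = m.groups()
        # The separator is normalized to "_" when a detail code is present.
        current = base if detail is None else base + "_" + detail
        pos = m.end()
    if pos < len(text):
        runs.append((current, text[pos:]))
    return runs

def strip_markers(text):
    """Remove all language markers, leaving only the plain text."""
    return MARKER_RE.sub("", text)

sample = (LANG_ID_START + "en" + LANG_ID_END + "Hello "
          + LANG_ID_START + "fi_FI" + LANG_ID_END + "Hei")
print(split_runs(sample))
print(strip_markers(sample))
```

Text before the first marker gets the default tag, and a program that
does not care about language can simply strip the markers.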
We should not trade a little programmer convenience for major
long-term limitations in any standard. The programming work will (or
should) be done once in the framework or OS libraries anyway, so it
won't trouble most programmers.
Most applications will have their own higher-level tagging (SGML-style
markup will probably dominate), so having a different tagging
mechanism for indicating language would require handling two different
kinds of markup.
The really good and desirable approach, I think, would be to raise the
abstraction level of the "plain Unicode text file" to include
e.g. SGML/HTML-style tagging as a standardized part of "plain text
Unicode files". Then we could really easily indicate language, and
whatever else you might want, in a standard, easily parsable (and
easily ignorable or removable) markup format.
ASCII-style plain text files deserve to die.
-- Hannu Aronsson <firstname.lastname@example.org> <email@example.com> <firstname.lastname@example.org> Kuusitie 9 A 29, 00270 Helsinki, FINLAND http://www.niksula.cs.hut.fi/~haa/