I never submitted this as a proposal to Unicode before, but if I were
to, I would suggest the language id scheme that I helped invent when I
worked for MS. [It is documented in Nadine Kano's book, as well as in
the header files for any Win32 development environment.]
The proposed language id is a numeric id, packed into a single short. There
are 10 bits of primary language and 6 bits (the high bits) of secondary
language id. The rule is that matching primary language id means a
language family where each language can be freely substituted for
another from the same family in user interface, documentation or
helpfiles, yet still be understandable.
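A minimal sketch of this packing in C, assuming the primary language sits in the low 10 bits and the secondary language in the high 6 bits as described above. The type and function names (lang_id_t, make_lang_id, substitutable) are made up for illustration and are not the names used in the Win32 headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Layout assumed from the description: bits 0-9 primary, bits 10-15 secondary. */
#define PRIMARY_BITS   10u
#define PRIMARY_MASK   0x03FFu
#define SECONDARY_MASK 0x003Fu

typedef uint16_t lang_id_t;   /* hypothetical name for the packed id */

/* Pack a primary and a secondary language number into one 16-bit id. */
static lang_id_t make_lang_id(unsigned primary, unsigned secondary)
{
    return (lang_id_t)(((secondary & SECONDARY_MASK) << PRIMARY_BITS) |
                       (primary & PRIMARY_MASK));
}

static unsigned primary_lang(lang_id_t id)
{
    return id & PRIMARY_MASK;
}

static unsigned secondary_lang(lang_id_t id)
{
    return (unsigned)id >> PRIMARY_BITS;
}

/* The substitution rule: two ids are interchangeable when their
   primary language parts match. */
static bool substitutable(lang_id_t a, lang_id_t b)
{
    return primary_lang(a) == primary_lang(b);
}
```

For example, with primary 0x09 and secondary 0x01 or 0x02 (the values the Win32 headers use for US and UK English), make_lang_id yields 0x0409 and 0x0809, and substitutable(0x0409, 0x0809) is true.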
The 10th and 16th bit designate an ID as 'user defined'. This allows
researchers or vendors to define IDs for obscure dialects or other
linguistic variations of languages, or to define IDs for rare or
historic languages while preserving the substitution abilities and
allowing data thus tagged to coexist with data using 'standard' (i.e.
Since the tags thus fit into 16-bits, one can play all sorts of games
with how to insert them into a stream of Unicodes. For a pseudo
plain-text approach you could insert ESC <xxx> <yyy> where <xxx> is a
code that designates that this is a language id escape and <yyy> can
immediately be the language id.
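The ESC <xxx> <yyy> idea could look roughly like the sketch below for a buffer of 16-bit code units. The value chosen for <xxx> (LANG_INTRODUCER) is purely an assumption for illustration, as are the function names; the only part taken from the text is the three-unit shape of the escape.

```c
#include <stddef.h>
#include <stdint.h>

#define ESC_UNIT        0x001Bu  /* U+001B ESCAPE */
#define LANG_INTRODUCER 0x004Cu  /* hypothetical <xxx>: "a language id follows" */

/* Write ESC <xxx> <yyy> into out. Returns the number of code units
   written, or 0 if the buffer has room for fewer than three. */
static size_t put_lang_escape(uint16_t *out, size_t cap, uint16_t lang_id)
{
    if (cap < 3)
        return 0;
    out[0] = ESC_UNIT;
    out[1] = LANG_INTRODUCER;
    out[2] = lang_id;            /* the id is stored as one raw 16-bit unit */
    return 3;
}

/* If in starts with a language escape, store the id and return 1;
   otherwise return 0 and leave *lang_id untouched. */
static int get_lang_escape(const uint16_t *in, size_t len, uint16_t *lang_id)
{
    if (len < 3 || in[0] != ESC_UNIT || in[1] != LANG_INTRODUCER)
        return 0;
    *lang_id = in[2];
    return 1;
}
```

Note that the id unit can take any 16-bit value, so a reader has to consume it by position after the introducer rather than interpret it as a character.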
[Other suggestions I have heard use the private use space, typically by
reserving 2 sets of 256 codes each of which carries one byte of the
language id in its lower byte. These shave some string length at a cost
of splitting the ids and risking overlap with other uses of P.U. Area.]
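For comparison, the private use variant might be sketched as follows. The two base values (U+F000 and U+F100) are arbitrary picks inside the Private Use Area made for this example, not part of any actual suggestion; each block of 256 codes carries one byte of the id in its lower byte.

```c
#include <stdint.h>

/* Hypothetical private-use blocks; any two 256-code ranges would do. */
#define PU_HIGH_BASE 0xF000u   /* carries the high byte of the id */
#define PU_LOW_BASE  0xF100u   /* carries the low byte of the id */

/* Split a 16-bit language id into two private-use code units. */
static void split_lang_id(uint16_t id, uint16_t out[2])
{
    out[0] = (uint16_t)(PU_HIGH_BASE | (id >> 8));
    out[1] = (uint16_t)(PU_LOW_BASE | (id & 0x00FFu));
}

/* Rejoin the two halves. Returns 1 on success, 0 if either unit lies
   outside its expected block. */
static int join_lang_id(const uint16_t in[2], uint16_t *id)
{
    if ((in[0] & 0xFF00u) != PU_HIGH_BASE || (in[1] & 0xFF00u) != PU_LOW_BASE)
        return 0;
    *id = (uint16_t)(((in[0] & 0x00FFu) << 8) | (in[1] & 0x00FFu));
    return 1;
}
```

This takes two code units instead of three, but the id no longer travels as one value and the 512 codes are unavailable for any other private use.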
Another advantage of the 16-bit key is that it is conveniently usable
as a numeric constant in an API call, without padding or pointer
dereferencing as would be the case for strings of 3-letter
abbreviations or similar schemes.
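A short sketch of what that buys in practice: the id is passed by value, can be used in a switch, and the 'user defined' test mentioned earlier reduces to a mask. The function names are hypothetical. The bit positions assume that "10th" and "16th" count from 1 at the least significant bit, and the primary language numbers 0x09 (English) and 0x07 (German) are the ones used in the Win32 headers.

```c
#include <stdint.h>

#define PRIMARY_MASK      0x03FFu
#define USER_PRIMARY_BIT  0x0200u  /* 10th bit: user-defined primary language */
#define USER_SECONDARY_BIT 0x8000u /* 16th bit: user-defined secondary language */

/* Nonzero if either part of the id is in the user-defined range. */
static int lang_is_user_defined(uint16_t lang_id)
{
    return (lang_id & (USER_PRIMARY_BIT | USER_SECONDARY_BIT)) != 0;
}

/* Hypothetical API entry point: picks a string by primary language,
   so every member of a language family gets the same answer. */
static const char *ui_greeting(uint16_t lang_id)
{
    switch (lang_id & PRIMARY_MASK) {
    case 0x09: return "Hello";        /* English family */
    case 0x07: return "Guten Tag";    /* German family */
    default:   return "?";            /* no resource for this family */
    }
}
```

A call site is then simply ui_greeting(0x0809), with no string allocation or comparison involved.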
To summarize: Any proposal needs to address these issues:
- how the ID is designed (numeric, string, etc.)
- how one can tell from the id that 2 languages are substitutable
- how the ID is incorporated into a data stream (default protocol)
- suggested initial assignments of ID values
>We are interested in any previous proposals to the Unicode Technical
>Committee with regard to language identifiers.
>If you can provide a copy of any of these types of proposals, we would
>We aren't looking for a language identifier approach, we are checking
>for previous proposals that might overlap or encompass one we might
>Mark Leisher "The trick is not gaining the
>Computing Research Lab but surviving the lessons."
>New Mexico State University -- "Svaha," Charles de
>Box 30001, Dept. 3CRL
>Las Cruces, NM 88003