Re: Language Tagging And Unicode

From: A. Vine (avine@eng.sun.com)
Date: Wed Jan 19 2000 - 18:48:45 EST


Janko, et al,

Just a repeat of the principles that guided the formation of Unicode (they are
clearly stated in the Unicode 2.0 book):

      16-bit characters
      Full encoding
      Characters, not glyphs
      Semantics
      Plain text
      Logical order
      Unification
      Dynamic composition
      Equivalent sequence
      Convertibility
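
Two of these are easy to see in practice: "dynamic composition" means a
character like é can be built from a base letter plus a combining mark, and
"equivalent sequence" means the composed and decomposed forms count as the
same text. A quick sketch in modern Python terms (my illustration, not
anything from the book):

      import unicodedata

      composed = "\u00e9"          # é as one precomposed character
      decomposed = "e\u0301"       # e followed by COMBINING ACUTE ACCENT

      # Canonical normalization treats the two forms as the same text.
      assert unicodedata.normalize("NFC", decomposed) == composed
      assert unicodedata.normalize("NFD", composed) == decomposed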

Under "convertibility" it's important to note that accurate round-trip
convertibility is guaranteed between the Unicode Standard and other standards
in wide usage as of May 1993.
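
To make the round-trip guarantee concrete, here is a rough Python sketch
(Shift-JIS stands in for any charset that was in wide use by 1993):

      # Bytes in a legacy charset decode to Unicode and re-encode to the
      # identical byte sequence; nothing is lost in either direction.
      legacy = "文字".encode("shift_jis")        # some Shift-JIS text
      text = legacy.decode("shift_jis")          # legacy -> Unicode
      assert text.encode("shift_jis") == legacy  # Unicode -> legacy, lossless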

Many of the above principles are in conflict. In those cases, the principle
which would enable easier adoption took precedence. One of the reasons some
other "universal" charsets have failed is that they did not take existing
bodies of data into account. Had Unicode combined or split up many characters
such that reliable conversion from existing charsets was difficult, or
required a table lookup for every character in every conversion, or formed
all characters via shape combinations, etc., it would not be as widely
available as it is today. Unicode, like so many other character sets, would
have failed.

One of the difficulties with the Han unification, for example, was that the
Han characters are ordered according to a series of dictionaries, not
according to existing character sets. This makes conversion more cumbersome
(though not impossible) than for, say, the ISO-8859 character sets. For more
information on the principles of Han unification, see Chapter 6, section 6.4,
and Appendix E in the Unicode 2.0 book.
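
The difference is easy to show (a Python sketch of my own; Big5 stands in
for any Han charset here):

      # ISO-8859-1 needs no table at all: each byte maps to the code point
      # with the same numeric value.
      assert bytes([0xE9]).decode("iso-8859-1") == "\u00e9"  # é

      # A Han charset such as Big5 has no arithmetic relationship to
      # Unicode; each character goes through a mapping table that the
      # codec carries internally.
      assert bytes([0xA4, 0xA4]).decode("big5") == "\u4e2d"  # 中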

It is important to discuss new proposals with an understanding of why things
are the way they are in the current version of Unicode. Pointing to an
existing element in Unicode as an illustration of why something similar
should be added is not, by itself, enough reason to add it. Considering how
that element got in there in the first place is important. Most of the
"glyph" encodings are in Unicode due to the round-trip convertibility
principle. They are mistakes of the character sets which were in use as of
May 1993, not necessarily Unicode's mistakes. But had Unicode tried to
correct them, it would never have been adopted. Without adoption, all other
points are moot.
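
A concrete case: U+FB01 LATIN SMALL LIGATURE FI is a glyph-level character
carried for round-trip compatibility with older charsets (the Macintosh
character set among them), and its compatibility decomposition points back
to the plain letters. A quick Python check (again my illustration):

      import unicodedata

      # The fi ligature is a "glyph" encoding kept for round-trip
      # compatibility; compatibility normalization maps it to f + i.
      assert unicodedata.normalize("NFKC", "\ufb01") == "fi"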

Andrea

-- 
Andrea Vine, avine@eng.sun.com, Sun-Netscape Alliance i18n architect
"So I just don't see this as an either-or issue as much as an apples 
are yummy, and oranges are yummy, too, issue, and every now and then
fruit salad is tasty." -- Matthew Wall


