Re: Language Tagging And Unicode

From: Richard Gillam (rgillam@jtcsv.com)
Date: Tue Jan 18 2000 - 18:37:50 EST


This whole discussion on Serbian and Russian Cyrillic is getting silly and seems
to be disappearing further and further down a rathole.

I have yet to hear a good reason why the Serbian/Russian problem is anything
more than a font-selection issue. It's the same problem you have with
Greek/Coptic, Arabic/Urdu/Persian, and Traditional Chinese/Simplified
Chinese/Japanese. In *all* of these cases you have characters whose appearance
is determined by the language of the text, even though they're semantically the
same character. In all of these cases, the correct shape to draw is controlled
by some type of out-of-band information and not by Unicode plain text itself.
This is because the only difference is visual presentation, not the semantics of
the character itself. You don't have different character codes for italic or
bold versions of the letter a; so too you don't have different character codes
for the Russian and Serbian italic versions of the letter ghe.

I don't think the idea that "you'll always know whether the text is Russian or
Serbian and therefore it'll only use one or the other set of these characters"
holds much water. The Russian and Serbian te occur in the same place in the
alphabet, have the same pronunciation, and (most importantly) LOOK EXACTLY THE
SAME, except for the small proportion of text that's in italics or a decorative
script font. And I have trouble believing that Russian and Serbian are so
different that the set of character sequences that constitute legal words in
both languages is empty (although it might be pretty small). Encoding the
letters separately would dramatically complicate searching for any such shared
word. To add the characters would require input-method
support that isn't there anywhere right now, and would probably never be
universal, leading to many situations where naive users or programs used the
"Russian" variants for the "Serbian" letters or vice versa.

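To make the searching point concrete, here is a minimal Python sketch. It
assumes a separately encoded "Serbian te", which does not exist in Unicode; a
Private Use Area code point (U+E000) stands in for it, and the names
HYPOTHETICAL_SERBIAN_TE and fold_variants are invented for illustration only.

```python
# Real character: U+0442 CYRILLIC SMALL LETTER TE.
CYRILLIC_TE = "\u0442"
# Assumption: a private-use placeholder for a separately encoded variant.
# No such character exists in Unicode.
HYPOTHETICAL_SERBIAN_TE = "\ue000"


def fold_variants(text):
    """Map the hypothetical variant back onto the ordinary letter."""
    return text.replace(HYPOTHETICAL_SERBIAN_TE, CYRILLIC_TE)


# The same two-letter word, typed once with each "te".
word = CYRILLIC_TE + "\u043e"
variant_word = HYPOTHETICAL_SERBIAN_TE + "\u043e"
document = "... " + variant_word + " ..."

# A plain substring search misses the word, because the code points differ.
naive_hit = word in document

# Every search would need a folding step like this to find it again.
folded_hit = fold_variants(word) in fold_variants(document)
```

In this sketch naive_hit comes out False and folded_hit comes out True: every
search tool would have to learn the folding step before it could treat the two
spellings as the same word.
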
Of course, this means I have some questions about other "alternative" Cyrillic
characters, such as the Byelorussian i, but the kicker is that this character
looks considerably different from the Russian i in all styles.

Selecting the proper glyph shape is the province of out-of-band information or
markup. This might take the form of language tagging of some kind, but can just
as easily be done by specifying a Russian-specific or Serbian-specific font, or
by enabling or disabling a font feature on a "smart" font that supports both
languages. Nothing more is necessary, and the argument that this prevents both
languages from being used together in the same document is specious.
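
As a rough illustration of what "out-of-band" means here, the following Python
sketch keeps the language as an attribute on a styled run and lets it pick the
font. The Run class, the font names, and the pick_font helper are all
hypothetical; they are not any real rendering API.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A stretch of text plus out-of-band styling information."""
    text: str
    lang: str            # language tag such as "ru" or "sr"
    italic: bool = False


# Hypothetical font names, one italic face per language.
ITALIC_FONT_BY_LANG = {
    "ru": "ExampleRussianItalic",
    "sr": "ExampleSerbianItalic",
}


def pick_font(run, default="ExampleCyrillic"):
    """Choose a font from the run's attributes, never from its code points."""
    if not run.italic:
        return default  # upright letterforms are shared by both languages
    return ITALIC_FONT_BY_LANG.get(run.lang, default)


def to_plain_text(runs):
    """Dropping the markup leaves identical code points for both languages."""
    return "".join(run.text for run in runs)


# One document, two languages, the same letter ghe (U+0433) in each run.
runs = [Run("\u0433", "ru", italic=True), Run("\u0433", "sr", italic=True)]
fonts = [pick_font(run) for run in runs]
plain = to_plain_text(runs)
```

Both runs sit side by side in one document and get different fonts, while the
plain text underneath is the same code point twice.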

Remember that the justification for adding these characters is that they look
very different in italic or script versions of the font. Preserving this
distinction in plain text doesn't make any sense: converting to plain text
would already lose the "italic" or "script" attribute, leading to text that
would always look the same regardless of language.

In other words, the solution to the problem is PICK AN APPROPRIATE FONT,
period. If there isn't adequate font support out there, that's a problem with
the fonts, not with Unicode and not with the display technology.

If this is such an important issue, submit a formal proposal to the UTC. I
guarantee, however, that they won't look kindly on it without much stronger
justification than I've seen to date.

(Please pardon me if my tone here is too harsh. I mean no disrespect; I'm just
a little tired today.)

--Rich Gillam
  Unicode Technology team
  IBM


