Language Tagging And Unicode

From: Janko Stamenovic (janko@teletrader.com)
Date: Tue Jan 18 2000 - 07:35:14 EST


Information I received from Doug Ewell [mailto:dewell@compuserve.com] in the
topic "Unicode Cyrillic GHE DE PE TE in Serbian" raised a lot of new
questions for me, which I think deserve a new topic.

In short, the goal was simple: to find a standard way to display both
Russian and Serbian Cyrillic text properly. Most participants in that
discussion argued that this should not be solved by adding new Unicode
characters, and that applications must instead be able to recognize whether
a given piece of Unicode-encoded text is Russian or Serbian.

As Doug Ewell pointed out:

> After all, one of the most highly touted ideals of Unicode is
> to make it possible to enter, process, and display different scripts
> in one document without special "language packs" or different OSes.
> Russian Cyrillic and Serb Cyrillic ought to be no exception.

So we must look at what was proposed instead of adding "Serbian" characters:
making display engines, operating systems and applications aware of language
tags.

As most people pointed out, there is a standardized way to tag text in HTML.
In general, though, tagging cannot be standardized independently of knowing
what is being tagged.

If I understand correctly, the "Plane 14 technique" proposes a method for
tagging Unicode text itself. However, as proposed, it offers nothing more
than a possibility -- it is explicitly said that
"most of the times it will not be used because text will be tagged the way
the containing document handles it".
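
To make the mechanism concrete, here is a minimal sketch in Python of how
such an in-text language tag could be formed, assuming the scheme from the
Plane 14 proposal: U+E0001 LANGUAGE TAG followed by tag characters that
mirror the ASCII letters of the language code at offset U+E0000. The function
names are my own and purely illustrative; they are not part of any standard
API.

```python
# Sketch only: builds a Plane 14 language tag as described in the proposal.
# Function names are hypothetical, chosen for this illustration.

LANGUAGE_TAG = 0xE0001  # U+E0001 LANGUAGE TAG
TAG_BASE = 0xE0000      # tag character = TAG_BASE + ASCII code (0x20..0x7E)


def plane14_language_tag(lang: str) -> str:
    """Return the invisible tag sequence for a language code such as 'sr'."""
    if not lang or not all(0x20 <= ord(c) <= 0x7E for c in lang):
        raise ValueError("language code must be non-empty printable ASCII")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_BASE + ord(c)) for c in lang)


def tag_text(lang: str, text: str) -> str:
    """Prefix text with a language tag; the text itself is left unchanged."""
    return plane14_language_tag(lang) + text


# Cyrillic GHE DE PE TE, tagged as Serbian.
tagged = tag_text("sr", "\u0433\u0434\u043f\u0442")
print(" ".join(f"U+{ord(c):04X}" for c in tagged))
```

The tagged string carries three extra code points in front of the four
letters, and nothing in the letters themselves changes. Whether anything
downstream looks at those three code points is exactly the open question.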

This raises a very big question: if most of the time the tagging is not
going to be part of the Unicode text, how can the goal of displaying Russian
and Serbian text properly be achieved? (If you don't like to read about this
specific case, imagine that I am trying to solve things in a more general
way. I talk about this particular problem because it has its own weight, and
no matter how much we philosophize about "general solutions", we always
build solutions because we know about real problems that we want to solve.)

Now I'm going to come down to the real world, as people who program for a
living see it. I'm going to use the terms of the Win32 API, but you can
safely imagine your favorite operating system or API. The Win32 API is a more
than good example, since it is arguably the most widely used Unicode-capable
system in the world.

If we are to render text differently for different LANGUAGES using
information from external tags, and cannot expect Unicode itself to solve our
problems, then we are back where we were before Unicode.

With all due respect, "Plane 14 Language Tagging", as far as I can see from
the proposal, is just "the possibility to put tags in Unicode text" and not
something that will really be used throughout applications and operating
systems.

OK, let's break the problem down into two issues:

1) External tagging

Here we would need:

- An extension of the API to pass language information during font
creation.

- The application must create/use a new font each time the language tag
changes.

- The font engine must be changed to associate different glyphs with the same
characters, based on the language information.

- I know that OpenType will contain language information. What I want to
know is: does anybody know how it will be used in practice? Is there really
no way for me to do this using the current API?

That is, among the following charsets:

ANSI_CHARSET
BALTIC_CHARSET
CHINESEBIG5_CHARSET
DEFAULT_CHARSET
EASTEUROPE_CHARSET
GB2312_CHARSET
GREEK_CHARSET
HANGUL_CHARSET
MAC_CHARSET
OEM_CHARSET
RUSSIAN_CHARSET
SHIFTJIS_CHARSET
SYMBOL_CHARSET
TURKISH_CHARSET

Will a "SERBIAN_CHARSET" be added here only because of a difference of five
letters from Russian? Hmm...
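
As an illustration of the font-engine change listed above, here is a minimal
sketch in Python of glyph selection keyed by both language and character.
It is not the Win32 API or any real font format: the glyph names, the two
tables and the functions `select_glyph` and `shape` are all hypothetical, and
the language code is assumed to arrive from outside the text (for example
from an HTML tag).

```python
# Sketch only: language-dependent glyph lookup with a default fallback.
# All glyph names and tables below are made up for this illustration.
from typing import Dict, List, Optional, Tuple

# Default glyph for each character (code point -> glyph name).
DEFAULT_GLYPHS: Dict[int, str] = {
    0x0433: "ghe",
    0x0434: "de",
    0x043F: "pe",
    0x0442: "te",
}

# Glyphs that differ for a specific language ((language, code point) -> name).
LANGUAGE_OVERRIDES: Dict[Tuple[str, int], str] = {
    ("sr", 0x0433): "ghe.sr",
    ("sr", 0x0434): "de.sr",
    ("sr", 0x043F): "pe.sr",
    ("sr", 0x0442): "te.sr",
}


def select_glyph(codepoint: int, lang: Optional[str] = None) -> str:
    """Pick the language-specific glyph if one exists, else the default."""
    if lang is not None:
        override = LANGUAGE_OVERRIDES.get((lang, codepoint))
        if override is not None:
            return override
    return DEFAULT_GLYPHS.get(codepoint, ".notdef")


def shape(text: str, lang: Optional[str] = None) -> List[str]:
    """Map each character of text to a glyph name for the given language."""
    return [select_glyph(ord(c), lang) for c in text]


print(shape("\u0433\u0434", "ru"))
print(shape("\u0433\u0434", "sr"))
```

The same two characters come out as different glyphs depending only on the
language argument, which is why the language has to reach the font engine
somehow, through the API or otherwise.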

2) Internal (plane 14) tagging

Here we would need:

1) Changing applications so that they actually generate additional Unicode
sequences based on the application's language tags (e.g. HTML tags). These
sequences would be sent to the font engine even though they are never
printed, just to inform the font engine that the language has changed.

2) Changing the font engine to recognize plane 14 tags as "table switching"
commands (to use different glyphs for the same Unicode characters).
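
The second change can be sketched as follows, again in Python and again only
as an illustration. The scanner below assumes the Plane 14 scheme (U+E0001
LANGUAGE TAG, tag characters U+E0020..U+E007E, U+E007F CANCEL TAG) and
simplifies it: any CANCEL TAG resets the language, and stray tag characters
are silently dropped. The function name `split_language_runs` is hypothetical.

```python
# Sketch only: split text into (language, visible text) runs so that a font
# engine could switch glyph tables at each run boundary.
from typing import List, Optional, Tuple

LANGUAGE_TAG = 0xE0001               # U+E0001 LANGUAGE TAG
CANCEL_TAG = 0xE007F                 # U+E007F CANCEL TAG
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E
TAG_BASE = 0xE0000                   # tag character - TAG_BASE = ASCII code


def split_language_runs(text: str) -> List[Tuple[Optional[str], str]]:
    runs: List[Tuple[Optional[str], str]] = []
    lang: Optional[str] = None       # language currently in effect
    visible: List[str] = []          # printable characters of the current run

    def flush() -> None:
        # Close the current run under the language in effect so far.
        if visible:
            runs.append((lang, "".join(visible)))
            visible.clear()

    i = 0
    while i < len(text):
        cp = ord(text[i])
        if cp == LANGUAGE_TAG:
            flush()
            i += 1
            start = i
            while i < len(text) and TAG_FIRST <= ord(text[i]) <= TAG_LAST:
                i += 1
            # Decode the tag characters back to an ASCII language code.
            lang = "".join(chr(ord(c) - TAG_BASE) for c in text[start:i]) or None
        elif cp == CANCEL_TAG:
            flush()
            lang = None              # simplification: cancel resets language
            i += 1
        elif TAG_FIRST <= cp <= TAG_LAST:
            i += 1                   # stray tag character: never displayed
        else:
            visible.append(text[i])
            i += 1
    flush()
    return runs


sample = ("\U000E0001\U000E0072\U000E0075" + "\u0433"    # tagged "ru"
          + "\U000E0001\U000E0073\U000E0072" + "\u0434"  # tagged "sr"
          + "\U000E007F" + "x")                          # tag cancelled
print(split_language_runs(sample))
```

Each run could then be drawn with the glyph table for its language. The tag
characters never reach the display, but the engine must still see them, which
is the change to the application described in 1).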

As I see it at the moment, Unicode does not intend to standardize either 1)
or 2) as something that applications must support. And I readily admit that
this would be too big a request.

Is there anybody reading this who actually implements Unicode support in
real-life operating systems?

I would also like the opinion of the people who are involved in Unicode
standardization:

- Why is it such a bad idea to have five new characters for Serbian Cyrillic
in Unicode anyway? What negative effects would this really have? As we saw,
the different glyphs must exist in any case, so fonts and tables would not
get any smaller without the new characters -- in fact, even now, when one
glyph is shared by several characters, it appears in the font definition
only once. For this particular problem I don't see that accepting the new
characters would cause any domino effect.


