RE: Language Tagging And Unicode

From: Janko Stamenovic (
Date: Thu Jan 20 2000 - 14:34:51 EST

> -----Original Message-----
> From: John Cowan []
> Sent: Thursday, January 20, 2000 7:25 PM
> Michael Everson wrote:
> > Both derive directly from Old Slavonic letter tvrdo.
> That proves too much: they also derive directly from tau,
> as does Latin "t".
> Serbian has, AFAIK, a unique position* among the world's
> written languages: it has two scripts but only one writing
> system (unlike Mongolian or Javanese, where there are two
> completely separate writing systems).

Maybe somebody here knows more about Moldavian: they used Cyrillic and have
now started to use Latin.

I don't know whether the two have a 1-1 correspondence.

> It is common,
> I am told, for a manuscript to be submitted in Latin
> script even though it is to be printed in Cyrillic, e.g.
> Transliteration is completely mechanical, requiring
> no knowledge of Serbian spelling rules.

Yes! Especially if you submit it in Unicode. :)

Now more seriously -- conversion from Cyrillic to Latin is trivial. To make
the reverse trivial as well, the Latin text must be written in Unicode, or
with some other font in which the letters "lj", "nj" and "dz" (here the z
must have a caron -- if I remember correctly, Lucida Sans Unicode rendered
that one badly) are each typed as one character, which not everybody does.
Otherwise you have to know the spelling of some critical words: we have
words of foreign origin like "injekcija" (injection), where "nj" should not
become one letter. People don't always type them that way because a) the
same Latin characters used for Serbian are used in Croatian, but since
Croatians don't like to print their texts in Cyrillic, Croatian fonts don't
have these three letters, and b) often enough the correspondence can be made
unique even without them. As you can see, such cases often involved 1-byte
to 2-byte and 2-byte to 1-byte conversion(!), again something that some
computer people first discovered only in languages outside Europe.
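
To make the asymmetry concrete, here is a minimal Python sketch. The names
CYR_TO_LAT, cyr_to_lat and lat_to_cyr_naive are only illustrative, the table
covers lowercase letters only, and the reverse direction is a deliberately
naive greedy match, not a real transliterator.

```python
# Serbian Cyrillic lowercase letters and their Latin counterparts.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

def cyr_to_lat(text):
    """Cyrillic to Latin: one table lookup per character."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)

# Reverse table, plus the single-character Unicode digraphs
# U+01C9 (lj), U+01CC (nj) and U+01C6 (dž).
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}
LAT_TO_CYR.update({"\u01c9": "љ", "\u01cc": "њ", "\u01c6": "џ"})

def lat_to_cyr_naive(text):
    """Latin to Cyrillic, greedily treating l+j, n+j, d+ž as one letter."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in LAT_TO_CYR:
            out.append(LAT_TO_CYR[pair])
            i += 2
        else:
            out.append(LAT_TO_CYR.get(text[i], text[i]))
            i += 1
    return "".join(out)

print(cyr_to_lat("инјекција"))        # always safe
print(lat_to_cyr_naive("injekcija"))  # merges n+j, which is wrong here
print(lat_to_cyr_naive("ko\u01cc"))   # digraph typed as one character
```

With the digraphs typed as single characters the greedy branch is never
needed and the reverse mapping is again one lookup per character. Without
them, only a word list can tell that the "nj" in "injekcija" is two letters.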

That's why if you are really interested in languages, scripts, fonts,
Unicode and such you simply have to love Serbian! Just like my dentist
sometimes says "oh, this is *very* interesting" looking at some of my teeth.

And did I mention that the letters combined with the stress marks used in
Serbian for analysing pronunciation also don't have their own characters in
Unicode? :)) Ah well, I don't think they must be added to Unicode if they
can be built from combining characters -- but most current mainstream
text processing programs do not support the combinations either! Here's
another place to test their power. I don't claim that these accented
characters are not needed at all, only that they are not used for mainstream
communication, unlike letters such as t and p!
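
For anyone who wants to test this, a small Python sketch of building such
a letter from a base character plus a combining mark. The helper name
with_mark is made up for illustration, and I am assuming the double grave
(U+030F) and inverted breve (U+0311) as two of the marks in question.

```python
import unicodedata

DOUBLE_GRAVE = "\u030f"    # COMBINING DOUBLE GRAVE ACCENT
INVERTED_BREVE = "\u0311"  # COMBINING INVERTED BREVE

def with_mark(letter, mark):
    """Attach a combining mark and compose (NFC) if a composed form exists."""
    return unicodedata.normalize("NFC", letter + mark)

for mark in (DOUBLE_GRAVE, INVERTED_BREVE):
    accented = with_mark("а", mark)  # Cyrillic а, not Latin a
    # NFC has nothing to compose to, so this stays base letter + mark.
    print(unicodedata.name(mark), [hex(ord(c)) for c in accented])
```

Whether the result is drawn as one properly accented letter is then up to
the font and the text processing program, which is exactly the test I mean.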

I'd say that once standard software can print and process Serbian the way
that today has to be done ad hoc (custom fonts!), this would be "a big step
for the computer world and a small one for Serbs, who in the meantime still
have to print their texts without compatibility with others"!

> So in the Serbian context it actually makes sense to
> say that U+0411 and U+0042 are mere glyphic variants
> of the same underlying character!

Yes, I've been talking about this for a while... Actually, I first felt how
powerful this approach is when I saw Greek texts (printed in capitals, not
in small letters, which are too different) and found that I could figure
out the "semantics of the letter" very fast. Some of them were equal to
Latin, some to Cyrillic and some were unique -- it fit perfectly with what I
was used to. :)

> * I will not enter into the discussion of how many
> languages are named by the labels "Serbian" and "Croatian"
> and other more recently applied names. I am using
> "Serbian" in this posting for convenience.

You are really very smart and well informed!

