Date: Thu Oct 09 2008 - 08:32:14 CDT
Unicode is hardly used for the
Mongolian-Uigur script, probably because its encoding is very complicated
to implement and even to use, and is inconsistent with the rest of Unicode.
For instance, they decided that the initial, medial, final and
stand-alone forms of a letter would be one character, while in Latin and
Cyrillic (the 2 other scripts I know with such different forms of a
character), capital and small letters are 2 different characters.
On the other hand, the vowels are encoded differently according
to their... pronunciation, even when they share a single glyph, as if, in
English, the "i" of "liberty" had a different code from the one in "nice".
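
To make the contrast concrete, here is a minimal Python sketch that only
prints the names Unicode assigns to a few characters. The helper name
describe() is mine, and I am assuming that the vowels meant above are the
pairs O/U and OE/UE (U+1823 to U+1826).

```python
import unicodedata

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (illustrative helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Latin and Cyrillic: the capital and the small letter are 2 code points.
for ch in ("a", "A", "\u0430", "\u0410"):
    print(describe(ch))

# Mongolian: one code point per letter whatever its position in the word,
# but one code point per vowel *pronunciation* (assumed pairs: O/U, OE/UE).
for ch in ("\u1823", "\u1824", "\u1825", "\u1826"):
    print(describe(ch))
```

The positional shapes never appear in this listing: selecting them is left
to the font and the rendering engine, not to the encoded text.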
Since the pronunciation varies geographically and historically, in a
few cases one has to be quite well educated to know the standard
historical pronunciation and decide how to write a word correctly, for
instance "төвшин"/"түвшин" in Mongolian-Uigur script. If phishing is a
problem with Latin and Cyrillic, it will be much more of one with
Mongolian-Uigur script if it is used in URLs. Searching for a character
string inside a text can also be very tricky. Optical character
recognition needs to work at the word level (that is to have a big
dictionary), not at the letter level, since the encoding doesn't
depend only on the letters you see. In a few cases, even the word
level is not sufficient: you have to understand the meaning of the word
to write it correctly, such as for "оноо"
v. "унаа" or "онох" v. "унах", so that character recognition needs
a process as complicated as... automated translation. And if, rarely,
both meanings are possible, then you have to make a decision, that is
to add to the encoding a piece of "information" (restriction of
meaning) which was not in the text you're encoding.
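
The searching problem can be sketched as follows. This is only an
illustration under assumptions of mine: the folding table FOLD and the two
function names are hypothetical, the table assumes that O/U and OE/UE are
the pairs that look alike, and the sample strings are arbitrary letter
sequences, not real words. A real table would need more care.

```python
# Hypothetical folding table: keep one code point per written shape.
# Assumption: U (U+1824) looks like O (U+1823), UE (U+1826) like OE (U+1825).
FOLD = str.maketrans({"\u1824": "\u1823", "\u1826": "\u1825"})

def fold_vowels(text: str) -> str:
    """Replace each vowel by the representative of its assumed shape pair."""
    return text.translate(FOLD)

def find_by_shape(haystack: str, needle: str) -> int:
    """Like str.find, but ignoring the pronunciation-only vowel distinction.

    The folding is one code point to one code point, so the returned index
    is also valid in the original haystack.
    """
    return fold_vowels(haystack).find(fold_vowels(needle))

# Arbitrary sequences (not real words) differing only by OE versus UE.
text = "\u1832\u1825\u182A"
query = "\u1832\u1826\u182A"
print(text.find(query))            # plain search: no match
print(find_by_shape(text, query))  # shape-based search: match at index 0
```

A search done this way finds what the eye would find, at the price of
ignoring exactly the information the encoding added.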
The advantage of including pronunciation in the encoding is that automatic reading is made easier.
If we now think that Unicode should be a system to encode what is
written, no more, then a way to get out of this difficulty would be to
deprecate half of the vowels' codes and keep only one per actual
form. On the other hand, these vowels have always been pronounced in 2
different ways. The Uigur script has never been sufficient at the
letter level for the Mongolian language, a difficulty historically addressed
by adding artificial spelling differences to many words in order to
distinguish them, if I understand the process correctly. This
difficulty is similar, for instance, to the one of old Hebrew where the
vowels were pronounced but not written. In English too, not only is the
script insufficient to determine the pronunciation at the letter level,
but in the case of "read" ("I've read this book.", "I read this
book."), the word level itself is not enough: you need the grammar.
So the real question is: "Is Unicode supposed to record the language, or just the script of a language?"
Richard Ishida wrote:
struggling to find any sample text in the Mongolian script in Unicode
on the Web. Does anyone have / know of any text they can point me
towards / send to me, that I'd be able to use for examples.
In particular, IE8beta now supports writing-mode:tb-lr, so I want
to include some real Mongolian script in the tests I am currently
putting together for vertical script support.
Also any suggestions for useful fonts would be welcome. Of the three listed at http://www.wazu.jp/gallery/Fonts_Mongolian.html only Code2000 seems to be doing a reasonable job.
W3C (World Wide Web Consortium)
--- Henri de Solages' web site: http://Solages.site.voila.fr/index_en.html