Re: Mongolian script samples

Date: Thu Oct 09 2008 - 08:32:14 CDT



    Unicode is hardly used for
    Mongolian-Uigur script, probably because it's very complicated to
    implement and even to use, and inconsistent with the rest of Unicode.

    For instance, they decided that the initial, middle, final and
    stand-alone forms of a letter would be one character, while in Latin and
    Cyrillic (the 2 other scripts I know with such different forms of a
    character), capital and small letters are 2 different characters.
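
    The contrast can be checked with Python's standard unicodedata
    module. This is only a sketch: it looks at the names and case mappings
    recorded in the Unicode character database and assumes nothing about
    how a font or rendering engine shapes the Mongolian letter.

```python
import unicodedata

# Latin: capital and small forms are separate code points,
# linked to each other only through case mapping.
latin_capital, latin_small = "A", "a"
print(hex(ord(latin_capital)), unicodedata.name(latin_capital))
print(hex(ord(latin_small)), unicodedata.name(latin_small))

# Mongolian: a single code point per letter. The initial, middle, final
# and stand-alone shapes are left to the rendering engine.
mongolian_a = "\u1820"
print(hex(ord(mongolian_a)), unicodedata.name(mongolian_a))

# There is no case mapping for the Mongolian letter: upper() is a no-op.
print(mongolian_a.upper() == mongolian_a)
```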

    On the other hand, the vowels are encoded differently according
    to their... pronunciation, even if they share a single glyph, as if, in
    English, the "i" of "liberty" had a different code from the one in "nice".
    Since the pronunciation varies geographically and historically, in a
    few cases one has to be quite well educated to know the standard
    historical pronunciation and decide how to write correctly, for
    instance, "төвшин"/"түвшин" in Mongolian-Uigur script. If phishing is a
    problem with Latin and Cyrillic, it will be a much bigger one with
    Mongolian-Uigur script if it is used in URLs. Searching for a character
    string inside a text can also be very tricky. Optical character
    recognition needs to work at the word level (that is to have a big
    dictionary), not at the letter level, since the encoding doesn't
    depend only on the letters you see. In a few cases, even the word
    level is not sufficient: you have to understand the meaning of the word
    to write it correctly, such as for "оноо" v. "унаа" or "онох" v. "унах",
    so that character recognition needs
    a process as complicated as... automated translation. And if, rarely,
    both meanings are possible, then you have to make a decision, that is
    to add to the encoding a piece of "information" (restriction of
    meaning) which was not in the text you're encoding.
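
    The searching and recognition problems can be sketched in Python.
    This is a minimal illustration, not an established algorithm: the
    folding table GLYPH_FOLD and the helpers fold, glyph_contains and
    candidate_encodings are hypothetical names, and the table assumes that
    the pairs meant above are O/U (U+1823/U+1824) and OE/UE
    (U+1825/U+1826). The two test strings are arbitrary letter sequences,
    not real words.

```python
import itertools

# Hypothetical folding table (an assumption for illustration): map each
# vowel to one representative of the pair it is assumed to share a
# written form with.
GLYPH_FOLD = str.maketrans({
    "\u1824": "\u1823",  # MONGOLIAN LETTER U  -> MONGOLIAN LETTER O
    "\u1826": "\u1825",  # MONGOLIAN LETTER UE -> MONGOLIAN LETTER OE
})


def fold(text):
    """Reduce text to one code per assumed written form."""
    return text.translate(GLYPH_FOLD)


def glyph_contains(haystack, needle):
    """Substring search that ignores the pronunciation-only distinction."""
    return fold(needle) in fold(haystack)


def candidate_encodings(text):
    """List every encoding that folds to the same written form as text.

    This is what a letter-level recognizer would have to choose between.
    """
    pairs = {"\u1823": "\u1823\u1824", "\u1825": "\u1825\u1826"}
    choices = [pairs.get(ch, ch) for ch in fold(text)]
    return ["".join(combo) for combo in itertools.product(*choices)]


# Arbitrary sequences: vowel O or U followed by MONGOLIAN LETTER NA.
seq_o = "\u1823\u1828"
seq_u = "\u1824\u1828"

print(seq_u in seq_o)                    # plain code point search
print(glyph_contains(seq_o, seq_u))      # search after folding
print(len(candidate_encodings(seq_o)))   # encodings sharing one written form
```

    The number of candidates doubles with every such vowel in a word,
    which is why a dictionary, or more, is needed to pick the right one.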

    The advantage of including pronunciation in the encoding is that automatic reading is made easier.

    If we now think that Unicode should be a system to encode what is
    written, no more, then a way to get out of this difficulty would be to
    deprecate half of the vowels' codes and keep only one per actual
    form. On the other hand, these vowels have always been pronounced in 2
    different ways. The Uigur script has never been sufficient at the
    letter level for the Mongolian language, a difficulty historically addressed
    by adding artificial spelling differences to many words in order to
    distinguish them, if I understand the process correctly. This
    difficulty is similar, for instance, to the one of old Hebrew where the
    vowels were pronounced but not written. In English too, not only is the
    script insufficient to know the pronunciation at the letter level,
    but in the case of "read" ("I've read this book.", "I read this
    book."), the word level itself is not enough: you need the grammar.

    So the real question is: "Is Unicode supposed to note the language, or just the script of a language?".

    Yours sincerely.

    Richard Ishida wrote:

    I'm really
    struggling to find any sample text in the Mongolian script in Unicode
    on the Web.  Does anyone have / know of any text they can point me
    towards / send to me, that I'd be able to use for examples?
    In particular, IE8beta now supports writing-mode:tb-lr, so I want
    to include some real Mongolian script in the tests I am currently
    putting together for vertical script support.

    Also any suggestions for useful fonts would be welcome.  Of the three listed at only Code2000 seems to be doing a reasonable job.




    Richard Ishida

    Internationalization Lead

    W3C (World Wide Web Consortium)

    Henri de Solages' web site:
