Mongolian-Uigur encoding model

From: mongolie2006-unicode@yahoo.fr
Date: Fri Oct 10 2008 - 11:23:09 CDT


    Edward Cherlin, you misunderstood my second point, partly because of my wording, partly because you probably haven't read ch. 13, "Additional Modern Scripts", of the Unicode 5.0 book (Unicode has a very broad notion of "modernity" indeed!).

    You claimed that "The same principles are used for encoding characters in all languages.", but chapter 13 itself admits that:
    "Encoding Principles. The encoding model for Mongolian is somewhat different from that for any other script within Unicode, and in many respects it is the most complicated."

    I had written, as a concession:
    >> The advantage of including pronunciation into the encoding is that automatic
    >> reading is eased.
    and you answered
    > Won't happen, can't happen, unless Mongols decide to write their
    > language differently, and come up with a new standard orthography

    I meant that, whether you like it or not, the pronunciation IS presently part of the definition of Mongolian characters in Unicode. Ch. 13 explains:
    "The Semitic alphabet from which the Mongolian script was ultimately derived is fundamentally inadequate for representing the sounds of the Mongolian language. As a result, many of the Mongolian letters are used to represent two different sounds, and the correct pronunciation of a letter may be known only from the context. In this respect, Mongolian orthography is similar to English spelling, in which the pronunciation of a letter such as c may be known only from the context.
    Unlike in the Latin script, in which c /k/ and c /s/ are treated as the same letter and encoded as a single character, in the Mongolian script different phonetic values of the same glyph may be encoded as distinct characters. Modern Mongolian grammars consider the phonetic value of a letter to be its distinguishing feature, rather than its glyph shape." (end of quotation).
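    To make the quoted point concrete, here is a minimal Python sketch that prints the code points concerned. The grouping into pairs is an assumption made for illustration (it reflects the claim that these vowels share glyph shapes and differ only in pronunciation); the character names themselves come from Python's standard unicodedata module.

```python
import unicodedata

# Pairs assumed, for illustration only, to share glyph shapes while being
# encoded as distinct characters because they are pronounced differently.
SAME_SHAPE_PAIRS = [
    ("\u1823", "\u1824"),  # MONGOLIAN LETTER O  / MONGOLIAN LETTER U
    ("\u1825", "\u1826"),  # MONGOLIAN LETTER OE / MONGOLIAN LETTER UE
]


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for first, second in SAME_SHAPE_PAIRS:
    print(describe(first), "|", describe(second))
```

    Each pair prints two different code points and two different names, which is exactly the situation the chapter describes: the distinction lives in the encoding, not in what the reader sees.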

    And what you considered in my post as "language issues, not Unicode issues" are proper Unicode issues. If I word it a bit more precisely, it becomes:
    Since the pronunciation varies geographically and historically, in a few cases one has to be quite well educated to know the standard historical pronunciation and decide how to ENCODE correctly, for instance "төвшин"/"түвшин" in the Unicode encoding of the Mongolian-Uigur script. If phishing is a problem with Latin and Cyrillic, it will be a much bigger one with the Mongolian-Uigur script if it is used in URLs. Searching for a character string inside a text can also be very tricky. Optical character recognition needs to work at the word level (that is, it needs a big dictionary), not at the letter level, since the encoding doesn't depend only on the letters you see. In a few cases, even the word level is not sufficient: you have to understand the meaning of the word to ENCODE it correctly, such as for "оноо" vs. "унаа" or "онох" vs. "унах", so that O.C.R. needs a process as complicated as... automated translation.
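    A minimal Python sketch of the string-search problem, under stated assumptions: the folding table FOLD, the helpers skeleton and loose_find, and the two sample letter sequences are all hypothetical and made up for illustration. The table is not a Unicode-defined mapping; it only assumes that U+1823/U+1824 and U+1825/U+1826 cannot be told apart by eye.

```python
# Hypothetical folding table: each letter is mapped to one representative
# of its assumed same-shape class. Illustrative assumption only, not a
# mapping defined by Unicode.
FOLD = str.maketrans({
    "\u1824": "\u1823",  # MONGOLIAN LETTER U  -> MONGOLIAN LETTER O
    "\u1826": "\u1825",  # MONGOLIAN LETTER UE -> MONGOLIAN LETTER OE
})


def skeleton(text):
    """Replace each folded letter by its class representative."""
    return text.translate(FOLD)


def loose_find(haystack, needle):
    """Search while ignoring the distinctions erased by skeleton().

    The folding is one-to-one per character, so the returned index is
    also valid in the original haystack.
    """
    return skeleton(haystack).find(skeleton(needle))


# Two made-up letter sequences (not claimed to be real words) that
# differ only in OE versus UE.
typed_oe = "\u1832\u1825\u182A"
typed_ue = "\u1832\u1826\u182A"

print("exact match:", typed_oe.find(typed_ue) != -1)
print("loose match:", loose_find(typed_oe, typed_ue) != -1)
```

    The exact search fails although both sequences are assumed to look the same on the page, while the folded search succeeds. The price of such folding is that it also merges words that really are different, which is the trade-off discussed in this post.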

    This is why I suggested, as a possibility, that half of the vowel codes be declared deprecated, i.e. that the pronunciation no longer be taken into account.

    To my question "Is Unicode supposed to note the language, or just the script of a language?", it's easy to answer as you did: "Just the characters.", but ch. 2 of the Unicode 5.0 book explains better than I could how difficult it is to define what a "character" should be, and the "trade-offs" needed in the process of listing characters.
    They propose:
    "Characters are the abstract representations of the smallest components of written language that have semantic value." So the pronunciation should not be taken into account. Or you could consider Latin "a", Cyrillic "а" and Mongolian-Uigur "a" as mere glyphs of a single character, since they are just different forms of one smallest component of the one Mongolian language having semantic value. The Mongolian language has historically been written in 10 different alphabets, and is currently written every day with these 3. The situation is further complicated by the fact that in real life the Cyrillic and Latin scripts are mixed even inside a sentence to express proper names, so that one could consider Latin and Cyrillic together as one script. If they are considered as 2 different scripts, then there is no reason (as far as Mongolian is concerned) to regard the Mongolian-Uigur alphabet for the Mongolian language as the same script as the Mongolian-Uigur alphabet used to write foreign words (sometimes called the Mongolian-Uigur "sub-alphabet" by scholars). This sub-alphabet has letters of its own, which are not part of the alphabet used to write native Mongolian words, but many letters are common to both. Now, in writing native Mongolian words, the 3 forms of the "t"/"d" are just glyphs (initial, medial, final forms) of one character. But in writing foreign words, 2 of these very forms have different semantic values, and are used to distinguish "t" and "d". So, according to the definition of "characters", they should be given 3 different codes (one for the main alphabet, 2 for the sub-alphabet). That is not the case.

    The situation is very complicated, so I understand that not entirely satisfactory decisions have been taken. But I question that solution because, by taking the pronunciation into account,
    1) it's very different from the other scripts' encoding principles,
    2) it's very far from what is written. This eases 2 processes (automatic reading and automatic transliteration) but makes others more complicated (string search, anti-phishing security checks) and, which is worse, sometimes (rarely) arbitrary: OCR, typing. In these 2 processes, we reach levels of complexity far higher than those of the "Basic text processes" listed in ch. 2 of the Unicode book.


