Re: Indic Devanagari Query

From: Michael Everson
Date: Wed Jan 29 2003 - 08:11:29 EST


    At 02:13 -0800 2003-01-29, Keyur Shroff wrote:
    >I beg to differ with you on this point. Merely having some provision for
    >composing a character doesn't mean that the character is not a candidate
    >for inclusion as a separate code point.

    Yes, it does.

    >India is a big country with millions of people geographically
    >divided and speaking a variety of languages. Sentiments are attached
    >to cultures, which may vary from one geographical area to another.
    >So when one of the many languages falling under the same script
    >dominates the entire encoding for the script, then other groups of
    >people may feel that their language has not been represented
    >properly in the encoding.

    A lot of these "feelings" are simply WRONG, and that has to be faced.
    The syllable KSSA may be treated as a single letter, but this does
    not change the fact that it is a ligature of KA and SSA and that it
    can be represented in Unicode by a string of three characters.
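A minimal sketch of that three-character string, using only Python's standard library. The specific code points (KA, VIRAMA, SSA) are filled in here for illustration and are not spelled out in the message itself.

```python
import unicodedata

# KSSA written as the sequence KA + VIRAMA + SSA.
KSSA = "\u0915\u094D\u0937"

# List each code point with its Unicode character name.
for ch in KSSA:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Three code points; in UTF-8 each of them takes three bytes.
print(len(KSSA), len(KSSA.encode("utf-8")))
```

A font with the appropriate shaping tables would normally render this sequence as the single conjunct form, even though the stored text stays three characters long.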

    >In Unicode many characters have been given code points regardless of the
    >fact that the same character could have been rendered through some
    >composition mechanism. This includes Indic scripts as well as other
    >scripts. For example, in Devanagari script some code points are allocated
    >to characters (ConsonantNukta) even though the same characters could be
    >produced with a combination of the consonant and Nukta.

    There are historical and compatibility reasons that most of this
    stuff, as well as the similar stuff in the Latin range, was encoded.
    At one point some years ago the line was drawn, normalization was
    enacted, and that was that.
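One way to see the effect of normalization on the precomposed consonant-plus-nukta letters is sketched below. It assumes U+0958 DEVANAGARI LETTER QA as the example, which is a choice made here and is not named in the message. That letter has a canonical decomposition to KA + NUKTA and is listed among the composition exclusions, so both normalization forms should yield the two-character sequence.

```python
import unicodedata

QA = "\u0958"                # DEVANAGARI LETTER QA (precomposed consonant + nukta)
KA_NUKTA = "\u0915\u093C"    # DEVANAGARI LETTER KA + DEVANAGARI SIGN NUKTA

nfd = unicodedata.normalize("NFD", QA)
nfc = unicodedata.normalize("NFC", QA)

# Both forms decompose QA: it is a composition exclusion, so NFC does not
# recompose KA + NUKTA back into the single code point.
print(nfd == KA_NUKTA, nfc == KA_NUKTA)
print(len(QA), len(nfc))
```

In other words, normalized text already treats the precomposed letter and the sequence as the same thing.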

    >Also, many times processing of text depends on the smallest addressable
    >unit of that language. Again, as discussed in earlier e-mails, this may
    >vary from one language to another in the same script. Consider a case when
    >a language processor/application wants to count the number of characters
    >in some text in order to find the number of keystrokes required to input
    >the text.

    I can't think of any reason why this would be useful. And what if you
    were not typing, but speaking to your computer? Then there would be
    no keystrokes at all!

    >Further assume that API functions used for this purpose are based on either
    >WChar (wide characters) or UTF-8. In this case it is very much necessary
    >that you assign the character, say Kssha, to the class "consonant". Since
    >assignment to this class "consonant" applies to a single code point (the
    >smallest addressable unit) and not to a sequence of codes, it is very
    >much necessary to have a single code point for the character "Kssha".

    We are not going to encode KSSA as a single character. It is a
    ligature of KA and SSA, and can already be represented in Unicode.
    You need to handle this "consonant" issue with some other protocol.
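As a rough sketch of what such a higher-level protocol might look like, the code below keeps the text as plain Unicode sequences and lets the application decide which sequences count as one letter. The table `CONJUNCT_LETTERS` and the functions `split_letters` and `letter_class` are hypothetical names invented for this illustration; they are not part of Unicode or of any library.

```python
# Hypothetical per-language table: sequences treated as a single letter,
# mapped to the class the application wants to assign to them.
CONJUNCT_LETTERS = {
    "\u0915\u094D\u0937": "consonant",  # KA + VIRAMA + SSA (KSSA)
}


def split_letters(text, table=CONJUNCT_LETTERS):
    """Split text into units, preferring the longest sequence found in table."""
    longest = max((len(seq) for seq in table), default=1)
    units = []
    i = 0
    while i < len(text):
        for size in range(longest, 0, -1):
            chunk = text[i:i + size]
            # Fall back to a single code point when no listed sequence matches.
            if size == 1 or chunk in table:
                units.append(chunk)
                i += len(chunk)
                break
    return units


def letter_class(unit, table=CONJUNCT_LETTERS):
    """Return the application-defined class of a unit, if it has one."""
    return table.get(unit, "unclassified")


text = "\u0915\u094D\u0937\u0915"  # KSSA followed by KA
units = split_letters(text)
print(len(text), len(units), [letter_class(u) for u in units])
```

Here the four stored code points are counted as two letters, and KSSA is classed as a consonant, without needing a dedicated code point for it.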

    Michael Everson * Everson Typography
