Re: Indic Devanagari Query

From: Keyur Shroff
Date: Wed Jan 29 2003 - 05:13:04 EST


    --- Asmus Freytag <> wrote:

    > >
    > >All of the above can be composed through following consonant clusters:
    > > jna -> ja halant nya
    > > shra -> sha halant ra
    > > ksh -> ka halant ssha
    > >
    > >The point that the above sequences are considered as characters in some
    > >of the Indian languages has merit. If there is demand from native
    > >speakers then a proposal can be submitted to Unicode. There is a
    > >predefined procedure for proposal submission. Once this is discussed
    > >with concerned people and agreed upon then these ligatures can be added
    > >in Devanagari script itself because Devanagari script represents all
    > >three languages you mentioned namely Sanskrit, Marathi, and Hindi.
    > >Meanwhile you can write rules for composing them from the consonant
    > >clusters.
    > I wouldn't go so far. The fact that clusters belong together is something
    > that can be handled by the software. Collation and other data processing
    > needs to deal with such issues already for many other languages. See
    > on the collation algorithm.

    I beg to differ with you on this point. Merely having some provision for
    composing a character doesn't mean that the character is not a candidate
    for inclusion as a separate code point. India is a big country whose
    millions of people are geographically divided and speak a variety of
    languages. Sentiments are attached to cultures, which may vary from one
    geographical area to another. So when one of the many languages written
    in the same script dominates the entire encoding for that script, other
    groups of people may feel that their language has not been represented
    properly in the encoding. While Unicode encodes scripts only, the aim was
    to provide sufficient representation to as many languages as possible.

    In Unicode, many characters have been given code points even though the
    same characters could have been rendered through some composition
    mechanism. This includes Indic scripts as well as other scripts. For
    example, in the Devanagari script some code points are allocated to
    characters (ConsonantNukta) even though the same characters could be
    produced by combining the consonant with a Nukta. Similarly, in the
    Latin-1 range [U+0080-U+00FF] there are a few characters which can be
    produced otherwise. That is why text should be normalized to either a
    pre-composed or a de-composed character sequence before further
    processing in operations like searching and sorting.
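
As a minimal sketch of the normalization step described above, the following
uses Python's standard unicodedata module. The choice of DEVANAGARI LETTER QA
(U+0958) and LATIN SMALL LETTER E WITH ACUTE (U+00E9) as examples, and the
helper name same_after_normalizing, are my own assumptions for illustration;
they do not come from the earlier messages.

```python
import unicodedata

# Precomposed DEVANAGARI LETTER QA versus KA followed by SIGN NUKTA.
qa_precomposed = "\u0958"
qa_sequence = "\u0915\u093C"

# Precomposed LATIN SMALL LETTER E WITH ACUTE versus "e" + COMBINING ACUTE ACCENT.
e_precomposed = "\u00E9"
e_sequence = "e\u0301"


def same_after_normalizing(a, b, form="NFD"):
    """Hypothetical helper: compare two strings after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Without normalization the two spellings compare as different strings.
raw_equal = qa_precomposed == qa_sequence

# After normalization both spellings compare as equal.
qa_equal = same_after_normalizing(qa_precomposed, qa_sequence, "NFD")
e_equal = same_after_normalizing(e_precomposed, e_sequence, "NFC")

print(raw_equal, qa_equal, e_equal)
```

Either form works for comparison as long as both sides use the same one. Note
that under NFC the Devanagari ConsonantNukta letters U+0958-U+095F come out
decomposed, because they are listed as composition exclusions.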

    Also, processing of text often depends on the smallest addressable unit
    of the language. Again, as discussed in earlier e-mails, this may vary
    from one language to another within the same script. Consider a case
    where a language processor/application wants to count the number of
    characters in some text in order to find the number of keystrokes
    required to input it. Further assume that the API functions used for
    this purpose are based on either WChar (wide characters) or UTF-8. In
    this case it is necessary to assign the character, say Kssha, to the
    class "consonant". Since assignment to the class "consonant" applies to
    a single code point (the smallest addressable unit) and not to a
    sequence of codes, it is necessary to have a single code point for the
    character "Kssha".
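
To make the counting scenario concrete, here is a rough sketch in Python of
what a code-point-based or UTF-8-based count reports for Kssha (ka halant
ssha, U+0915 U+094D U+0937), next to a count that treats a consonant cluster
as one unit. The function names is_consonant and count_units are
hypothetical, and the rule "consonants are U+0915-U+0939, joined by halant"
is a deliberate simplification of my own: it ignores nukta, vowel signs and
other marks, which this sketch would count as separate units.

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)

KSSHA = "\u0915\u094D\u0937"  # ka + halant + ssha
JNA = "\u091C\u094D\u091E"    # ja + halant + nya


def is_consonant(ch):
    # Simplifying assumption: only the basic consonants KA..HA.
    return "\u0915" <= ch <= "\u0939"


def count_units(text):
    """Count units, treating consonant (halant consonant)* as one unit."""
    count = 0
    i = 0
    n = len(text)
    while i < n:
        if is_consonant(text[i]):
            i += 1
            # Absorb each following "halant + consonant" pair into the unit.
            while i + 1 < n and text[i] == VIRAMA and is_consonant(text[i + 1]):
                i += 2
        else:
            i += 1
        count += 1
    return count


code_points = len(KSSHA)                 # what a code-point API reports
utf8_bytes = len(KSSHA.encode("utf-8"))  # what a byte-oriented API reports
units = count_units(KSSHA)               # cluster treated as one unit

print(code_points, utf8_bytes, units)
```

The first two numbers are what an API working on single code points or
bytes sees; the third depends entirely on the cluster rule chosen, which is
the point of disagreement in this thread.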

    This is my understanding. Please enlighten me if I am wrong.



    This archive was generated by hypermail 2.1.5 : Wed Jan 29 2003 - 06:15:36 EST