Re: Indic Devanagari Query

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Wed Jan 29 2003 - 04:35:45 EST


    At 11:54 PM 1/28/03 -0800, Keyur Shroff wrote:
    >--- Aditya Gokhale <aditya@cdacindia.com> wrote:
    >
    > >
    > > 2. Implementation Query -
    > > In an implementation where I need to send / process Hindi, Marathi
    > > and Sanskrit data, how do I differentiate between the languages (Hindi,
    > > Marathi and Sanskrit)? Say, for example, I am writing a translation
    > > engine, and I want to translate a document having Hindi, Marathi and
    > > Sanskrit text in it. How do I know from the code points between 0x0900
    > > and 0x097F whether the data under perusal is Hindi / Marathi / Sanskrit?
    >
    >Unicode is not divided into code pages. Unlike a few older encodings, there
    >is only one code page for the entire Unicode standard. However, for better
    >readability and quick user reference, the entire chart has been divided into
    >different sections, which you might interpret as code pages.

    This seems similar to the question, how can one tell from text using
    characters in the ranges 0020-007E and 00A0-00FF whether the text is in
    Danish, German or French?

    It turns out that there are several kinds of approaches. One class of
    approaches looks at the different distribution of letters for the different
    languages. Letter frequency, pair and triplet distribution, and in some
    sense also the 'short word' method are all of this type. For languages that
    use the same script, but that are otherwise not too similar, such methods
    work well.
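
    As a rough illustration of the first class of approaches, here is a
    minimal sketch that compares character-trigram frequencies. The function
    names and the one-sentence training samples are made up for this example;
    a usable profile would need far more text per language.

```python
from collections import Counter
import math


def trigram_profile(text):
    """Return the relative frequency of each character trigram in text."""
    # Normalize case and whitespace, and pad so word edges form trigrams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(weight * q.get(gram, 0.0) for gram, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def guess_language(text, profiles):
    """Return the key of the profile most similar to text."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(sample, profiles[lang]))


# Hypothetical, deliberately tiny training samples (illustration only).
TRAINING = {
    "da": "det er en god dag og jeg vil gerne have en kop kaffe",
    "de": "das ist ein guter Tag und ich möchte gerne eine Tasse Kaffee",
    "fr": "c'est une bonne journée et je voudrais une tasse de café",
}
PROFILES = {lang: trigram_profile(sample) for lang, sample in TRAINING.items()}

print(guess_language("ich möchte eine Tasse", PROFILES))
```

    The same code works unchanged on Devanagari text, since it only counts
    sequences of code points; whether it separates Hindi, Marathi and
    Sanskrit well depends entirely on the training text.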

    Another class of approaches uses unique letters and other unique features
    of each language to make the distinction. Some aspects of the short word
    method could be classed here as well.
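
    The second class can be sketched just as briefly. The letter sets below
    are an assumption made for illustration, not an authoritative list: they
    are letters from the 00A0-00FF range that are typical of one language and
    rare in the other two, and loanwords or names will produce false hits.

```python
# Illustrative marker letters only; real text needs a more careful list.
DISTINCTIVE = {
    "da": set("æøå"),
    "de": set("äöüß"),
    "fr": set("àâçèéêëîïôùû"),
}


def languages_by_marker(text):
    """Map each language to the marker letters of it that occur in text."""
    found = set(text.lower())
    return {
        lang: sorted(found & letters)
        for lang, letters in DISTINCTIVE.items()
        if found & letters
    }


print(languages_by_marker("Straße"))
```

    An empty result means no marker letter was seen, which is exactly the
    situation where a frequency-based method has to take over.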

    In the case at hand, if all three languages share the same alphabet in
    full, then the first class of methods must be used. Automatic recognition
    of a language can be combined with keeping track of the keyboard layout
    used to type a document and other information about a user's or document's
    context in order to make determination of the language reliable and easy
    for the user.

    > > I would suggest that we should give different code pages to Marathi,
    > > Hindi and Sanskrit. Maybe the current code page of Devanagari can be
    > > treated as Hindi, and two new code pages for Marathi and Sanskrit can
    > > be added. This could solve these issues. If there is any better way of
    > > solving this, anyone may suggest it.

    There are 6000 languages (or so). If all were written, with an average
    of 100 characters each, encoding each character separately for each
    language would mean 600,000 characters. It is true that many languages are
    not written, but many writing systems used for a variety of languages have
    many more than 100 symbols.

    Allowing each language a duplicate copy of the script it uses means forcing
    everybody to know precisely what language each word is in, otherwise the
    character codes are wrong. And what about quoted foreign words, borrowed
    foreign words, or foreign words that are almost, but not quite, assimilated?
    Which 'code pages' would these use?

    Finally, it would then be necessary to cross-correlate these 'code pages'
    for some forms of searches. With possibly up to 6,000 of them, there are
    nearly 18 million possible correlations between all these code pages. In
    short, that was the problem that Unicode was invented to solve in the first
    place.
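
    The two figures above can be checked with a few lines of arithmetic. The
    inputs are the round estimates used in this message, not measured counts,
    and the variable names are chosen only for this sketch.

```python
from math import comb

LANGUAGES = 6000      # rough estimate used above
AVG_CHARACTERS = 100  # assumed average number of characters per language

# One private copy of every character for every language.
duplicated_characters = LANGUAGES * AVG_CHARACTERS

# Every unordered pair of per-language 'code pages' needs its own mapping.
pairwise_correlations = comb(LANGUAGES, 2)

print(duplicated_characters)   # 600000
print(pairwise_correlations)   # 17997000
```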

    Sharing a single encoding of a script across all languages that use it is
    usually not a problem. The technology that can handle the fine details of
    display and other issues that arise from this approach exists (e.g.
    OpenType for fonts and the rendering engines that support the relevant
    features).

    > > 3. Character codes for jna, shra, ksh -
    > >
    > > In Sanskrit and Marathi, jna, shra and ksh are considered separate
    > > characters and not ligatures. How do we take care of this? Can I get
    > > the overall views of the group on the matter? In my opinion they should
    > > be given different code points in the specific language code page.
    > > Please find below the character glyphs -
    > >
    > > jna
    > > shra
    > > ksh
    >
    >All of the above can be composed through the following consonant clusters:
    > jna -> ja halant nya
    > shra -> sha halant ra
    > ksh -> ka halant ssha
    >
    >The point that the above sequences are considered characters in some of
    >the Indian languages has merit. If there is demand from native speakers,
    >then a proposal can be submitted to Unicode. There is a predefined
    >procedure for proposal submission. Once this is discussed with the people
    >concerned and agreed upon, these ligatures can be added to the Devanagari
    >script itself, because the Devanagari script represents all three languages
    >you mentioned, namely Sanskrit, Marathi, and Hindi. Meanwhile you can write
    >rules for composing them from the consonant clusters.

    I wouldn't go so far. The fact that clusters belong together is something
    that can be handled by the software. Collation and other data processing
    need to deal with such issues already for many other languages. See
    http://www.unicode.org/reports/tr10 on the collation algorithm.
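
    To make "handled by the software" concrete, here is a minimal sketch that
    treats the three clusters as single units when splitting text. It is not
    the collation algorithm from TR10, only an illustration of the idea; the
    names CONJUNCTS and split_units are made up for this example. The halant
    is U+094D (DEVANAGARI SIGN VIRAMA), and "ssha" is U+0937, which Unicode
    names DEVANAGARI LETTER SSA.

```python
import re

VIRAMA = "\u094D"  # halant

# The three clusters discussed above, as plain code point sequences.
CONJUNCTS = {
    "jna": "\u091C" + VIRAMA + "\u091E",   # ja + halant + nya
    "shra": "\u0936" + VIRAMA + "\u0930",  # sha + halant + ra
    "ksh": "\u0915" + VIRAMA + "\u0937",   # ka + halant + ssha
}

# Try the known clusters first, then fall back to a single code point.
_UNIT = re.compile(
    "|".join(re.escape(seq) for seq in CONJUNCTS.values()) + "|.",
    re.DOTALL,
)


def split_units(text):
    """Split text into units, keeping the listed clusters together."""
    return _UNIT.findall(text)


word = CONJUNCTS["ksh"] + "\u0924"  # ksh followed by ta
print(len(word), len(split_units(word)))
```

    A sorting or searching routine built on such units sees jna, shra and
    ksh as one item each, while the stored text still uses the ordinary
    Devanagari code points.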

    A./


