From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Wed Jan 29 2003 - 04:35:45 EST
At 11:54 PM 1/28/03 -0800, Keyur Shroff wrote:
>--- Aditya Gokhale <aditya@cdacindia.com> wrote:
>
> >
> > 2. Implementation Query -
> > In an implementation where I need to send / process Hindi, Marathi
> > and Sanskrit data, how do I differentiate between languages (Hindi,
> > Marathi and Sanskrit)? Say, for example, I am writing a translation
> > engine, and I want to translate a document having Hindi, Marathi and
> > Sanskrit text in it; how do I know from the code points between 0x0900
> > and 0x097F that the data under perusal is Hindi / Marathi / Sanskrit?
>
>Unicode is not divided into code pages. Unlike a few old encodings, there is
>only one code page for the entire Unicode standard. However, for better
>readability and quick user reference, the entire chart has been divided into
>different sections, which you might interpret as code pages.
This seems similar to the question, how can one tell from text using
characters in the ranges 0020-007E and 00A0-00FF whether the text is in
Danish, German or French?
It turns out that there are several kinds of approaches. One class of
approaches looks at the different distribution of letters for the different
languages. Letter frequency, pair and triplet distribution, and in some
sense also the 'short word' method are all of this type. For languages that
use the same script, but that are otherwise not too similar, such methods
work well.
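
As a rough illustration of the distribution-based class of methods, here is
a minimal sketch in Python that compares character-triplet frequencies. The
function names, the toy training sentences and the choice of cosine
similarity are all assumptions made for this example, not something taken
from the messages above; a usable detector would need much larger training
text per language.

```python
from collections import Counter


def trigram_profile(text):
    """Return relative frequencies of character trigrams in text."""
    # Collapse whitespace and pad so word boundaries show up in trigrams.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


def similarity(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm_p = sum(v * v for v in p.values()) ** 0.5
    norm_q = sum(v * v for v in q.values()) ** 0.5
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def guess_language(text, profiles):
    """Pick the language whose training profile is closest to the text."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda lang: similarity(sample, profiles[lang]))


# Toy training samples (hypothetical); real use needs far larger corpora.
SAMPLES = {
    "da": "jeg ved ikke hvad det er og jeg kan ikke finde det",
    "de": "ich weiss nicht was das ist und ich kann es nicht finden",
    "fr": "je ne sais pas ce que c'est et je ne peux pas le trouver",
}
PROFILES = {lang: trigram_profile(s) for lang, s in SAMPLES.items()}
```

Calling guess_language("das ist nicht gut", PROFILES) picks the profile with
the most similar triplet distribution. Nothing in the sketch depends on the
Latin script: it works on code points, so the same code could be trained on
Hindi, Marathi and Sanskrit samples written in Devanagari.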
Another class of approaches uses unique letters and other unique features
of each language to make the distinction. Some aspects of the short word
method could be classed here as well.
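
The second class of methods could be sketched along the following lines. The
marker sets below are illustrative assumptions for the Danish / German /
French example only; they are neither exhaustive nor authoritative, and the
function name is made up for this sketch.

```python
import unicodedata

# Hypothetical marker letters: each set lists letters assumed to be
# distinctive for one language among these three. Illustrative only.
MARKERS = {
    "da": set("æøå"),
    "de": set("äß"),
    "fr": set("çêœ"),
}


def languages_by_markers(text, markers=MARKERS):
    """Return the set of languages whose marker letters occur in text."""
    # Normalize to NFC so that 'a' + combining diaeresis matches 'ä'.
    letters = set(unicodedata.normalize("NFC", text).lower())
    return {lang for lang, marks in markers.items() if letters & marks}
```

An empty result means the text contains no marker letter at all, which is
exactly the situation where one has to fall back on the distribution-based
methods.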
In the case at hand, if all three languages share the same alphabet in
full, then the first class of methods must be used. Automatic recognition
of a language can be combined with keeping track of the keyboard layout
used to type a document and other information about a user's or document's
context in order to make determination of the language reliable and easy
for the user.
> > I would suggest that we should give different code pages for Marathi,
> > Hindi and Sanskrit. Maybe the current code page of Devanagari can be
> > treated as Hindi, and two new code pages for Marathi and Sanskrit be
> > added. This could solve these issues. If there is any better way of
> > solving this, anyone may suggest it.
There are 6000 languages (or so). If all were written, with an average
of 100 characters each, encoding each character separately for each
language would mean 600,000 characters. It is true that many languages are
not written, but many writing systems used for a variety of languages have
many more than 100 symbols.
Allowing each language a duplicate copy of the script it uses means forcing
everybody to know precisely what language each word is in; otherwise the
character codes are wrong. And what about quoted foreign words, borrowed
foreign words, or foreign words that are almost, but not quite, assimilated?
Which 'code pages' would these use?
Finally, it would then be necessary to cross-correlate these 'code pages'
for some forms of searches. At possibly up to 6,000 of them, there are
nearly 18 million possible correlations between all these code pages. In
short, that was the problem that Unicode was invented to solve in the first
place.
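
The two figures follow from simple counting, reproduced here in Python. The
inputs are the rough estimates given in this message (about 6000 languages,
an assumed average of 100 characters each); the variable names are made up
for the example.

```python
from math import comb

LANGUAGES = 6000          # rough figure quoted in the message
CHARS_PER_LANGUAGE = 100  # assumed average, also from the message

# One private copy of every character for every language.
duplicated_characters = LANGUAGES * CHARS_PER_LANGUAGE

# One correlation table per unordered pair of per-language code pages.
pairwise_correlations = comb(LANGUAGES, 2)

print(duplicated_characters)   # 600000
print(pairwise_correlations)   # 17997000
```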
Sharing a single encoding of a script across all languages that use it is
usually not a problem. The technology that can handle the fine details of
display and other issues that arise from this approach exists (e.g.
OpenType for fonts and the rendering engines that support the relevant
features).
> > 3. Character codes for jna, shra, ksh -
> >
> > In Sanskrit and Marathi jna, shra and ksh are considered as separate
> > characters and not ligatures. How do we take care of this? Can I get
> > overall views on the matter from the group? In my opinion they should
> > be given different code points in the specific language code page.
> > Please find below the character glyphs -
> >
> > jna
> > shra
> > ksh
>
>All of the above can be composed through following consonant clusters:
> jna -> ja halant nya
> shra -> sha halant ra
> ksh -> ka halant ssha
>
>The point that the above sequences are considered as characters in some of
>the Indian languages has merit. If there is demand from native speakers,
>then a proposal can be submitted to Unicode. There is a predefined
>procedure for proposal submission. Once this is discussed with the concerned
>people and agreed upon, then these ligatures can be added in the Devanagari
>script itself, because the Devanagari script represents all three languages
>you mentioned, namely Sanskrit, Marathi, and Hindi. Meanwhile you can write
>rules for composing them from the consonant clusters.
I wouldn't go so far. The fact that clusters belong together is something
that can be handled by the software. Collation and other data processing
need to deal with such issues already for many other languages. See
http://www.unicode.org/reports/tr10 on the collation algorithm.
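
To illustrate what "handled by the software" might look like, here is a
minimal Python sketch that treats the three sequences quoted above as single
units when splitting text. It is not the collation algorithm described in
the report, only an assumption-laden example: the names CLUSTERS and
split_units are made up, and everything other than the three listed clusters
(including dependent vowel signs) is simply returned one code point at a
time.

```python
HALANT = "\u094D"  # DEVANAGARI SIGN VIRAMA

# The three sequences from the quoted message, as code point strings.
CLUSTERS = {
    "jna": "\u091C" + HALANT + "\u091E",   # ja + halant + nya
    "shra": "\u0936" + HALANT + "\u0930",  # sha + halant + ra
    "ksh": "\u0915" + HALANT + "\u0937",   # ka + halant + ssha
}


def split_units(text, clusters=tuple(CLUSTERS.values())):
    """Split text into units, keeping each known cluster together."""
    units = []
    i = 0
    while i < len(text):
        for cluster in clusters:
            if text.startswith(cluster, i):
                units.append(cluster)
                i += len(cluster)
                break
        else:
            # No known cluster starts here: emit a single code point.
            units.append(text[i])
            i += 1
    return units
```

A sorting or searching routine built on top of split_units would then see
each of jna, shra and ksh as one item, even though each is stored as three
code points.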
A./
This archive was generated by hypermail 2.1.5 : Wed Jan 29 2003 - 05:05:56 EST