From: Keyur Shroff (firstname.lastname@example.org)
Date: Wed Jan 29 2003 - 03:44:05 EST
I forgot to reply to the implementation query. My reply is inline.
--- Aditya Gokhale <email@example.com> wrote:
> 2. Implementation Query -
> In an implementation where I need to send / process Hindi, Marathi
> and Sanskrit data, how do I differentiate between languages (Hindi,
> Marathi and Sanskrit). Say for example, I am writing a translation
> engine, and I want to translate a document having Hindi, Marathi and
> Sanskrit Text in it, how do I know from the code points between 0x0900
> and 0x097F, that the data under perusal is Hindi / Marathi / Sanskrit ?
> I would suggest that we should give different code pages for Marathi,
> Hindi and Sanskrit. May be current code page of Devanagari can be traded
> as Hindi and two new code pages for Marathi and Sanskrit be added. This
> could solve these issues. If there is any better way of solving this, any
> one suggest.
Instead of changing, or recommending a change to, an encoding standard, you
can best solve this problem in your application. You can use tags in your
text to specify the language. Unicode also provides language tag characters
for plain text, but their use is strongly discouraged. Instead, use a markup
language such as XML or HTML to mark language boundaries. Then parse your
text, identify the language boundaries, and do further processing depending
upon the language.
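As a rough illustration of the tagging approach, here is a minimal Python
sketch that reads language boundaries from XML markup. It assumes the text is
wrapped in XML and marked with the standard xml:lang attribute; the element
names (doc, p), the sample sentences, and the function name language_runs are
made up for this example and are not part of any standard.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def language_runs(xml_text, default_lang="und"):
    """Return a list of (language, text) runs in document order.

    Hypothetical helper: an element without xml:lang inherits the
    language of its nearest tagged ancestor.
    """
    root = ET.fromstring(xml_text)
    runs = []

    def walk(elem, inherited):
        lang = elem.get(XML_LANG, inherited)
        if elem.text and elem.text.strip():
            runs.append((lang, elem.text.strip()))
        for child in elem:
            walk(child, lang)
            # Tail text follows the child but is in the parent's language.
            if child.tail and child.tail.strip():
                runs.append((lang, child.tail.strip()))

    walk(root, default_lang)
    return runs


# Illustrative document: Hindi by default, with Marathi and Sanskrit spans.
sample = (
    '<doc xml:lang="hi">'
    '<p>यह हिंदी है।</p>'
    '<p xml:lang="mr">हे मराठी आहे.</p>'
    '<p xml:lang="sa">इदं संस्कृतम् अस्ति।</p>'
    '</doc>'
)

for lang, text in language_runs(sample):
    print(lang, text)
```

Each run can then be handed to the processing step for its language. All
three languages share the code points 0x0900 to 0x097F, so the language
information comes entirely from the markup, not from the characters.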
If you don't want to use tags in your text, you can predict the language
with a heuristic based on properties that differ among the three languages.
In this case your processing is divided into two phases: the first applies
the heuristic to identify language boundaries in the plain text, and the
second processes the text for translation. But beware that such heuristic
processing will not always give accurate results. Hence the use of tags is
recommended.
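To make the heuristic phase concrete, here is a minimal Python sketch that
guesses the language of a single sentence by counting common function words.
The marker word lists, the punctuation set, and the function name
guess_language are assumptions chosen for illustration. A real system would
need much larger lists or a statistical model, and this one will misclassify
or give up on many sentences.

```python
from collections import Counter

# Assumed marker words: a few frequent function words per language.
# These short lists are illustrative only, not a vetted linguistic resource.
MARKERS = {
    "hi": {"है", "हैं", "और", "में", "नहीं"},
    "mr": {"आहे", "आहेत", "आणि", "मध्ये", "नाही"},
    "sa": {"अस्ति", "सन्ति", "इति", "एव", "च"},
}

# Punctuation stripped from word edges, including the danda and double danda.
PUNCT = "।॥.,;:?!\"'()"


def guess_language(sentence):
    """Return 'hi', 'mr', 'sa', or None if there is no clear winner."""
    words = [w.strip(PUNCT) for w in sentence.split()]
    scores = Counter()
    for lang, markers in MARKERS.items():
        scores[lang] = sum(1 for w in words if w in markers)
    (best, best_n), (_, second_n) = scores.most_common(2)
    # No marker word found, or a tie: do not guess.
    if best_n == 0 or best_n == second_n:
        return None
    return best


for s in ["यह हिंदी है।", "हे मराठी आहे.", "इदं संस्कृतम् अस्ति।", "नमस्ते"]:
    print(s, guess_language(s))
```

Returning None on a tie or a miss, instead of forcing a guess, lets the
second phase fall back to a default or ask for tagged input, which fits the
caution above about heuristic accuracy.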