From: Keyur Shroff (firstname.lastname@example.org)
Date: Wed Jan 29 2003 - 05:13:04 EST
--- Asmus Freytag <email@example.com> wrote:
> >All of the above can be composed through the following consonant clusters:
> > jna -> ja halant nya
> > shra -> sha halant ra
> > ksh -> ka halant ssha
> >The point that the above sequences are considered as characters in some of
> >the Indian languages has merit. If there is demand from native speakers
> >then a proposal can be submitted to Unicode. There is a predefined
> >procedure for proposal submission. Once this is discussed with the concerned
> >people and agreed upon, then these ligatures can be added to the Devanagari
> >script itself, because the Devanagari script represents all three languages
> >mentioned, namely Sanskrit, Marathi, and Hindi. Meanwhile you can write
> >rules for composing them from the consonant clusters.
> I wouldn't go so far. The fact that clusters belong together is something
> that can be handled by the software. Collation and other data processing
> needs to deal with such issues already for many other languages. See
> http://www.unicode.org/reports/tr10 on the collation algorithm.
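For concreteness, here is how I read the quoted cluster sequences in code
point terms; a minimal Python sketch (the cluster names are just my labels
for the transliterations above):

```python
# Sketch: the conjuncts quoted above, spelled out as consonant + virama
# (halant) + consonant sequences of Devanagari code points.
import unicodedata

clusters = {
    "jna":  "\u091C\u094D\u091E",  # JA + VIRAMA + NYA
    "shra": "\u0936\u094D\u0930",  # SHA + VIRAMA + RA
    "ksha": "\u0915\u094D\u0937",  # KA + VIRAMA + SSA
}

for name, seq in clusters.items():
    print(name, [unicodedata.name(c) for c in seq])
```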
I beg to differ with you on this point. Merely having some provision for
composing a character doesn't mean that the character is not a candidate
for inclusion as a separate code point. India is a big country with millions
of people, geographically divided and speaking a variety of languages.
Sentiments are attached to cultures, and these may vary from one
geographical area to another. So when one of the many languages falling
under the same script dominates the entire encoding for that script, the
other groups of people may feel that their language has not been represented
properly in the encoding. While Unicode encodes scripts only, the aim was to
provide sufficient representation to as many languages as possible.
In Unicode many characters have been given code points regardless of the
fact that the same characters could have been rendered through some
composition mechanism. This includes Indic scripts as well as other scripts.
For example, in the Devanagari script some code points are allocated to
characters (consonant + nukta) even though the same characters could be
produced with a combination of the consonant and the nukta sign. Similarly,
in the Latin-1 range [U+0080-U+00FF] there are a few characters which can be
produced otherwise. That is why text should be normalized to either the
precomposed or the decomposed character sequence before further processing
in operations like searching and sorting.
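As a small illustration of that normalization step (a sketch in Python using
the standard `unicodedata` module; the specific characters are my choice of
examples):

```python
# Normalize to a common form before comparing, so that precomposed and
# decomposed spellings of the same character match.
import unicodedata

# Latin-1 example: e-acute as one code point vs. e + combining accent.
precomposed = "\u00E9"   # é, single code point
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
print(precomposed == decomposed)  # False: raw code point sequences differ
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# Devanagari example: NNNA decomposes into NA + NUKTA under NFD.
print(unicodedata.normalize("NFD", "\u0929") == "\u0928\u093C")  # True
```

(Note that the precomposed Devanagari nukta characters are composition
exclusions, so NFC leaves them decomposed once they have been decomposed.)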
Also, the processing of text often depends on the smallest addressable
unit of that language. Again, as discussed in earlier e-mails, this may vary
from one language to another within the same script. Consider a case where a
language processor/application wants to count the number of characters in
some text in order to find the number of keystrokes required to input it.
Further assume that the API functions used for this purpose are based on
either WChar (wide characters) or UTF-8. In this case it is very much
necessary to assign the character, say Kssha, to the class "consonant".
Since assignment to the class "consonant" applies to a single code point
(the smallest addressable unit) and not to a sequence of code points, it is
very much necessary to have a single code point for the character "Kssha".
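A short sketch of the counting problem I mean (Python, with the existing
three-code-point spelling of the conjunct):

```python
# With no single code point for "Kssha", an API that counts code points
# sees three units where a reader of Devanagari sees one character.
import unicodedata

ksha = "\u0915\u094D\u0937"  # KA + VIRAMA + SSA, rendered as one conjunct
print(len(ksha))             # counts code points: 3

# A per-code-point classifier can only label the pieces, not the cluster:
for c in ksha:
    print(hex(ord(c)), unicodedata.name(c))
```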
This is my understanding. Please enlighten me if I am wrong.
This archive was generated by hypermail 2.1.5 : Wed Jan 29 2003 - 06:15:36 EST