From: Richard Wordingham (firstname.lastname@example.org)
Date: Sun Jun 26 2005 - 06:05:31 CDT
Sinnathurai Srivas wrote:
>>> For example Tamil K will indicate k, h, g, q, x and other related
>>> while Devanagari would have individual character shapes representing
>>> individual phonemes. Tamil is based on Alphabet based phonemic system,
>>> while Devanagari is based on phonemic system.
>> I think you mean that Tamil spelling uses digraphs for consonants while
>> Devanagari uses single letters. Unless the Tamil digraphs are sorted like
>> single letters, this happens to be irrelevant for Unicode.
> No if by digraphs, you mean
Do you just mean then that Tamil orthography is ambiguous?
> each alphabet represent some related phonemes.
Vocabulary note: Unlike in Indian usage, 'alphabet' means the whole
system, not an individual _letter_.
>>> If Unicode changes its policy from the unimportant and non functioning
>>> transliteration based encoding to one of natural sorting based encoding
>>> would be a superior solution. However, expecting Unicode to change its
>>> encoding philosophy of ISCII based transliteration encoding to one of
>>> natural sorting based encoding is not going to be easy.
>> You may care to view the UCA weights as a temporary conversion to a
>> sorting-based encoding.
> Can you give some pointers.
I hope you have read the Unicode Collation Algorithm
(http://www.unicode.org/reports/tr10/). It proceeds in four main steps:
Step 1: Convert to Normal Form Decomposed (NFD) - probably not needed for
Tamil - See Section 7.2 of UCA.
Step 2: Look up the sequence of 'weights'.
Step 3: Form the sort key from the weights.
(Step 4: Use the sort keys like any other sorting algorithm.)
The 'Level 1' part of the weights is what I was suggesting be thought of as
a sorting-based encoding. For example, consider the ASCII characters 'B',
'C', 'b' and 'c' and the latest set of weights (in
'b' U+0062 Level 1 weight 0F85 Level 2 weight 0020 Level 3 weight
'B' U+0042 Level 1 weight 0F85 Level 2 weight 0020 Level 3 weight
'c' U+0063 Level 1 weight 0F9D Level 2 weight 0020 Level 3 weight
'C' U+0043 Level 1 weight 0F9D Level 2 weight 0020 Level 3 weight
The combination of weights is chosen so that 'b' and 'B' both come before
'c' and 'C', even though their binary Unicode encodings would give the order
'B', 'C', 'b', 'c'. The Level 3 weights differ so that although 'bc' comes
before 'Bc', 'Bb' comes before 'bc'. This is a complication that does not
exist in Tamil.
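
As a rough illustration of Steps 2 to 4, here is a minimal Python sketch of building sort keys from a weight table. The Level 1 and Level 2 values are the ones quoted above; the Level 3 values (0002 for lower case, 0008 for upper case) are missing from the listing above and are assumed here, and the names WEIGHTS and sort_key are invented for the example. It is a toy, not a UCA implementation.

```python
# Toy weight table: (Level 1, Level 2, Level 3) per character.
# Level 1 and Level 2 are the values quoted in the message.
# Level 3 (0x0002 lower case, 0x0008 upper case) is an ASSUMPTION.
WEIGHTS = {
    "b": (0x0F85, 0x0020, 0x0002),
    "B": (0x0F85, 0x0020, 0x0008),
    "c": (0x0F9D, 0x0020, 0x0002),
    "C": (0x0F9D, 0x0020, 0x0008),
}


def sort_key(word):
    """Steps 2 and 3: look up the weights, then form the key as all
    Level 1 weights, a zero separator, all Level 2 weights, a zero
    separator, and all Level 3 weights."""
    elements = [WEIGHTS[ch] for ch in word]
    key = []
    for level in range(3):
        if level:
            key.append(0)  # level separator, lower than any weight
        key.extend(element[level] for element in elements)
    return tuple(key)


words = ["bc", "Bc", "Bb", "C", "b", "c", "B"]

# Plain code point order puts every capital before every small letter.
binary_order = sorted(words)

# Step 4: an ordinary sort, but comparing the sort keys.
keyed_order = sorted(words, key=sort_key)

print(binary_order)
print(keyed_order)
```

Because every Level 1 weight precedes every Level 3 weight in the key, case only decides the order between words whose Level 1 weights are identical.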
>>> We will need to work on what is imposed on Tamil and find software
>>> solutions to resolve sorting requirements.
>> If Tamil sorting can be expressed purely by a sorting order of consonants
>> and vowels, then the answer for sorting words is simply to rearrange the
>> weights on vowels and letters in the default UCA to accord with this
> 99% yes.
> Simply, the pulli (virama!), the dependent vowels, vowels and Aytham need
> to be weighted and that's it.
> However, by Grammar, because of puLLi/virama there should not be conjuncts
> in Tamil. However Unicode has decided Tamil has one conjunct. (Not
> hundreds but one). Instead of treating the Grantham ksh as x, Unicode
> insists ksh is a conjunct. There are no other complications. So we may need
> to spend vast amount of money to fix this insistence by Unicode, does not
> matter if only one or a thousand Tamil has a conjunct in the form of ksh
> and if collation need to be implemented as in Tamil design, Tamil need to
> accept Unicode design and work with it.
This is not a big problem. In the look-up table of weights, one simply
inserts an entry for 'ksh' (a 'contraction' in UCA terms). See
discussion of VOWEL SIGN O below.
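
To make the idea of a contraction entry concrete, here is a minimal Python sketch of a look-up that prefers the longest matching sequence. The weight values are placeholders invented for the example, not real UCA weights, and the names TABLE and primary_weights are hypothetical. Real UCA matching has further rules that this sketch ignores.

```python
KA = "\u0B95"      # TAMIL LETTER KA
VIRAMA = "\u0BCD"  # TAMIL SIGN VIRAMA (pulli)
SSA = "\u0BB7"     # TAMIL LETTER SSA
KSH = KA + VIRAMA + SSA

# Hypothetical Level 1 weights only; the numbers are placeholders.
TABLE = {
    KA: 0x1001,
    SSA: 0x1002,
    KSH: 0x1003,     # the contraction: three code points, one weight
    VIRAMA: 0x1004,
}
MAX_LEN = max(len(sequence) for sequence in TABLE)


def primary_weights(text):
    """Look up weights, trying the longest table entry first."""
    weights = []
    i = 0
    while i < len(text):
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in TABLE:
                weights.append(TABLE[chunk])
                i += n
                break
        else:
            raise KeyError(f"no entry for U+{ord(text[i]):04X}")
    return weights


print(primary_weights(KSH))               # one weight for the whole 'ksh'
print(primary_weights(KA + VIRAMA + KA))  # no contraction applies here
```

With such an entry in the table, 'ksh' sorts as a single unit wherever its weight is placed, without any change to the encoding.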
> There are double encodings of some phenomena. Unicode violated its own
> policy of standardising language by double encoding in the name of
> canonism. This is also violation of Unicode architecture, whereby it
> violates linear and ligature philosophy by misunderstanding canonism. see
> http://www.geocities.com/avarangal/rfc/RFC-TA-content_Tamil.html This
> unwanted inclusion may cut the 99% simple algorithm to about 80% simple
> plus 20% extremely complicated and back breaking algorithm, that might
> cause problem for a long time to come.
The default weights already address this. The current weight entries for
VOWEL SIGN O and its decomposition are given in the table by:
0BCA ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
0BC6 0BBE ; [.197B.0020.0002.0BCA] # TAMIL VOWEL SIGN O
Note that the sorting algorithm will treat them as identical.
A similar entry for 'ksh' would start '0B95 0BCD 0BB7'.
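
The two VOWEL SIGN O entries above can be checked with a short Python sketch using the standard unicodedata module. The ENTRIES dictionary simply copies the two table rows quoted above; its name is invented for the example.

```python
import unicodedata

composed = "\u0BCA"          # TAMIL VOWEL SIGN O
decomposed = "\u0BC6\u0BBE"  # TAMIL VOWEL SIGN E + TAMIL VOWEL SIGN AA

# Step 1 (NFD) turns the single code point into the two-part sequence.
nfd = unicodedata.normalize("NFD", composed)
print([f"U+{ord(ch):04X}" for ch in nfd])

# The two quoted table rows: both spellings get the same weights.
ENTRIES = {
    composed: (0x197B, 0x0020, 0x0002),
    decomposed: (0x197B, 0x0020, 0x0002),
}
print(ENTRIES[composed] == ENTRIES[decomposed])
```

Either way, whether the text is normalised first or looked up as typed, the two spellings end up with the same weights.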
I'm not sure these canonical decompositions are breaches of architecture any
more than other canonical expansions. I can't get worked up about this
issue because for Thai, for example, only the decomposed form is available.