Indian scripts and Unicode

From: Jeroen Hellingman (etmjehe@genesis.etm.ericsson.se)
Date: Wed Sep 03 1997 - 06:56:05 EDT


For your information and comments. All remarks are welcome.

=====================================

Indian scripts and Unicode
==========================

Jeroen Hellingman

History:
        29-AUG-1997 Removed comments on Unicode 1.1 now resolved (JH)
                                Added remarks on Thai, Lao, and Tibetan.
        26-AUG-1997 Changed canonical ordering table (JH)
        25-AUG-1997 Added correction of canonical ordering table, and
                                list of graphical decompositions (JH)
        17-AUG-1997 Added note on canonical ordering of Indic characters
                            Several minor modifications. (JH)
        16-JUN-1997 Added more notes on conversion from ISCII to Unicode (JH)
        19-MAY-1997 Added notes on conversion from ISCII to Unicode (JH)
        11-MAY-1997 Revision with respect to Unicode 2.014 (JH)
        22-JUN-1994 Revision (JH)

Introduction
------------

Unicode, a.k.a. ISO 10646, covers all major scripts used in India today. However,
the standard has several inconsistencies, shortcomings and peculiarities,
which need to be known to be handled correctly. This document pinpoints
the caveats, and goes into detail on the conversion of ISCII into Unicode.
This document also includes a proposal for idioms to be used for rendering variants.
These will be worked out into a complete conjunct table for each script, which
will eventually appear as an appendix.

It is important to note that the way Indic scripts are encoded in Unicode
does not allow changing from one Indic script to another by simply adding a
constant to the character code. Tables will be required to achieve an acceptable
transcription in another Indic script. To minimize the number of tables
required, it is proposed to use the Devanagari block as a generic Indic block,
to and from which all other scripts are translated. The translation tables will
appear as an appendix.

Culturally expected sorting also requires special algorithms. Merely sorting
on code-point values will normally not be acceptable. The algorithms and
sorting orders for each script will appear as an appendix.

Since many Indian typists may type letters "graphically" instead of
phonetically, applications should recognise such graphically typed characters,
and convert them to the correct character. A list of such graphically typed
character sequences is included.

------------------------------------------------------------------------------

Comments on the Encoding of Indian Scripts in Unicode
=====================================================

An explanation of rendering should be included with each script, as each script has
its own peculiarities. Best would be to give an informative overview of each
script, with a proposed set of conjuncts.

A default collating order needs to be defined for each script.

GENERAL REMARKS

Vowel signs on vowels

Vowel signs on vowels are sometimes used in primary school books, in
some dictionaries for minority languages, for transliterating foreign
languages into Indic scripts, and in historical usage.

Decomposition of two part vowels

I disagree with the decision that
VOWEL SIGN O is equivalent to VOWEL SIGN E + VOWEL SIGN AA.
Although in Tamil this may look the same, it is not logically
the same, and the decomposition must be strongly discouraged,
as it may cause problems in searching, sorting, and transliteration
to other scripts where this is not the case. For all these
applications an extra step is needed to remove possible decomposed
vowel signs. The problem arises in all scripts using vowel signs
with parts before and after a consonant cluster.

For the same reason, one should warn against encoding Devanagari
VOWEL SIGN O as VOWEL SIGN AA + VOWEL SIGN E, or VOWEL AU as
VOWEL A + VOWEL SIGN AA + VOWEL SIGN AI, or VOWEL AA + VOWEL SIGN AI,
or VOWEL A + VOWEL SIGN AU. Together with the single character, these
are four possible spellings of VOWEL AU which will look very much the
same on screen, but complicate searching a lot.
(Actually, most fonts will use such a decomposition to render VOWEL SIGN AU.)
        
Similar examples can be found in all scripts.

Table of decomposed two part vowel signs:

U+09CB BENGALI VOWEL SIGN O
U+09CC BENGALI VOWEL SIGN AU
U+0B48 ORIYA VOWEL SIGN AI
U+0B4B ORIYA VOWEL SIGN O
U+0B4C ORIYA VOWEL SIGN AU
U+0B94 TAMIL LETTER AU
U+0BCA TAMIL VOWEL SIGN O
U+0BCB TAMIL VOWEL SIGN OO
U+0BCC TAMIL VOWEL SIGN AU
U+0C48 TELUGU VOWEL SIGN AI
U+0CC0 KANNADA VOWEL SIGN II
U+0CC7 KANNADA VOWEL SIGN EE
U+0CC8 KANNADA VOWEL SIGN AI
U+0CCA KANNADA VOWEL SIGN O
U+0CCB KANNADA VOWEL SIGN OO
U+0D4A MALAYALAM VOWEL SIGN O
U+0D4B MALAYALAM VOWEL SIGN OO
U+0D4C MALAYALAM VOWEL SIGN AU

Table of graphically decomposed Indic characters

The decompositions indicated here can best be removed from the representation
before attempting to do such things as automatic transcription. The two-part
vowels mentioned above are included in this table.

(Note: I once had 5 MBytes of Hindi text in Devanagari, typed by Gujarati
typists, containing all these artefacts of Gujarati script! Worse, they had also
systematically confused the repham with candra e, giving me a headache in getting
something reasonable out of the conversion.)

character  graphical decomposition  name

U+0906 0905 093E DEVANAGARI LETTER AA
U+090D 090F 0945 DEVANAGARI LETTER CANDRA E
U+090E 090F 0946 DEVANAGARI LETTER SHORT E
U+0910 090F 0947 DEVANAGARI LETTER AI
U+0911 0905 093E 0945 DEVANAGARI LETTER CANDRA O
                        0905 0949
                        0906 0945
U+0912 0905 093E 0946 DEVANAGARI LETTER SHORT O
                        0905 094A
                        0906 0946
U+0913 0905 093E 0947 DEVANAGARI LETTER O
                        0905 094B
                        0906 0947
U+0914 0905 093E 0948 DEVANAGARI LETTER AU
                        0905 094C
                        0906 0948
U+0949 093E 0945 DEVANAGARI VOWEL SIGN CANDRA O
U+094A 093E 0946 DEVANAGARI VOWEL SIGN SHORT O
U+094B 093E 0947 DEVANAGARI VOWEL SIGN O
U+094C 093E 0948 DEVANAGARI VOWEL SIGN AU

U+0986 0985 09BE BENGALI LETTER AA
U+09CB 09C7 09BE BENGALI VOWEL SIGN O
U+09CC 09C7 09D7 BENGALI VOWEL SIGN AU

U+0A06 0A05 0A3E GURMUKHI LETTER AA
U+0A07 0A72 0A3F GURMUKHI LETTER I
U+0A08 0A72 0A40 GURMUKHI LETTER II
U+0A09 0A73 0A41 GURMUKHI LETTER U
U+0A0A 0A73 0A42 GURMUKHI LETTER UU
U+0A0F 0A72 0A47 GURMUKHI LETTER EE
U+0A10 0A05 0A48 GURMUKHI LETTER AI
U+0A14 0A05 0A4C GURMUKHI LETTER AU
U+0A42 0A41 0A41 GURMUKHI VOWEL SIGN UU (if stacking)

U+0A86 0A85 0ABE GUJARATI LETTER AA
U+0A8D 0A85 0AC5 GUJARATI VOWEL CANDRA E
U+0A8F 0A85 0AC7 GUJARATI LETTER E
U+0A90 0A85 0AC8 GUJARATI LETTER AI
U+0A91 0A85 0ABE 0AC5 GUJARATI VOWEL CANDRA O
                        0A85 0AC9
                        0A86 0AC5
U+0A93 0A85 0ABE 0AC7 GUJARATI LETTER O
                        0A85 0ACB
                        0A86 0AC7
U+0A94 0A85 0ABE 0AC8 GUJARATI LETTER AU
                        0A85 0ACC
                        0A86 0AC8
U+0AC9 0ABE 0AC5 GUJARATI VOWEL SIGN CANDRA O
U+0ACB 0ABE 0AC7 GUJARATI VOWEL SIGN O
U+0ACC 0ABE 0AC8 GUJARATI VOWEL SIGN AU

U+0B06 0B05 0B3E ORIYA LETTER AA
U+0B4B 0B47 0B3E ORIYA VOWEL SIGN O
U+0B4C 0B47 0B57 ORIYA VOWEL SIGN AU

U+0B8A 0B89 0BD7 TAMIL LETTER UU (in some styles)
U+0B94 0B93 0BD7 TAMIL LETTER AU
U+0BCA 0BC6 0BBE TAMIL VOWEL SIGN O
U+0BCB 0BC7 0BBE TAMIL VOWEL SIGN OO
U+0BCC 0BC6 0BD7 TAMIL VOWEL SIGN AU

U+0C0B 0C2C 0C41 0C41 TELUGU LETTER VOCALIC R
U+0C13 0C12 0C55 TELUGU LETTER OO
U+0C14 0C12 0C4C TELUGU LETTER AU
(U+0C2E 0C35 0C41 TELUGU LETTER MA will not be confused,
 as the script uses a special rendering of 0C41 in this case.
 The same is done in several other apparent cases.)
U+0C47 0C45 0C55 TELUGU VOWEL SIGN EE
U+0C48 0C46 0C57 TELUGU VOWEL SIGN AI

U+0C94 0C92 0CCC KANNADA LETTER AU (approximate)
(same remark as for Telugu, many cases are disambiguated by rendering)
U+0CC0 0CBF 0CD5 KANNADA VOWEL SIGN II
U+0CC7 0CC6 0CD5 KANNADA VOWEL SIGN EE
U+0CC8 0CC6 0CD6 KANNADA VOWEL SIGN AI
U+0CCA 0CC6 0CC2 KANNADA VOWEL SIGN O
U+0CCB 0CC6 0CC2 0CD5 KANNADA VOWEL SIGN OO
                        0CCA 0CD5

U+0D08 0D07 0D57 MALAYALAM LETTER II
U+0D0A 0D09 0D57 MALAYALAM LETTER UU
U+0D10 0D0E 0D46 MALAYALAM LETTER AI
U+0D13 0D12 0D3E MALAYALAM LETTER OO
U+0D14 0D12 0D57 MALAYALAM LETTER AU
U+0D48 0D46 0D46 MALAYALAM VOWEL SIGN AI (if reordering works twice)
U+0D4A 0D46 0D3E MALAYALAM VOWEL SIGN O
U+0D4B 0D47 0D3E MALAYALAM VOWEL SIGN OO
U+0D4C 0D46 0D57 MALAYALAM VOWEL SIGN AU
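
As an illustration of how the decompositions listed above could be removed
before transcription, the following Python sketch folds a handful of the
Devanagari sequences from the table back to their single characters. The
name fold_graphical and the longest-match-first scan are assumptions made
for this example only; a real table would cover every row and every script.

```python
# Sketch only: a few Devanagari rows from the table above.
# Keys are graphically typed sequences, values the single intended character.
GRAPHICAL = {
    "\u0905\u093e": "\u0906",        # A + sign AA          -> LETTER AA
    "\u0905\u093e\u0947": "\u0913",  # A + sign AA + sign E -> LETTER O
    "\u0905\u094b": "\u0913",        # A + sign O           -> LETTER O
    "\u0906\u0947": "\u0913",        # AA + sign E          -> LETTER O
    "\u093e\u0947": "\u094b",        # sign AA + sign E     -> VOWEL SIGN O
    "\u093e\u0948": "\u094c",        # sign AA + sign AI    -> VOWEL SIGN AU
}
LONGEST = max(len(seq) for seq in GRAPHICAL)


def fold_graphical(text):
    """Replace graphically typed sequences, trying the longest match first."""
    out = []
    i = 0
    while i < len(text):
        for n in range(LONGEST, 1, -1):
            chunk = text[i:i + n]
            if chunk in GRAPHICAL:
                out.append(GRAPHICAL[chunk])
                i += len(chunk)
                break
        else:
            # No listed decomposition starts here; copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

Trying the longest sequence first matters: A + sign AA + sign E should become
LETTER O in one step, not LETTER AA followed by a stray sign E.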

COMMENTS ON CANONICAL ORDERING

Canonical ordering of combining characters is used to spell words in
a single, unambiguous way. This is described in sec 3.9 and table 4-3.
However, the order prescribed there as the canonical order gives the wrong
order for vowel signs and vowel modifiers. Rule R10 on page 6-40
(and also ISCII) prescribes that the vowel modifiers
(candrabindu, anusvara) follow the
vowel signs. However, when both the vowel sign and the vowel modifier
are combining marks according to the Unicode standard, they will be
reordered, for example:

              Expected:                  Reordered:
Devanagari:   <KA> <vs E> <anusvara>     <KA> <anusvara> <vs E>
              <KA> <vs I> <anusvara>     (not reordered)

The same remark can be made on the svaras (udatta and anudatta) in Devanagari.

The solution will be to re-order table 4.3 as follows: vowel signs first,
then vowel modifiers, then the svaras. As the latter may be applied
to other scripts as well, they should be put after all other Indic scripts.

Note also that Malayalam vowel signs are no longer combining characters, and
should not be re-ordered.

Finally, TAMIL SIGN ANUSVARA seems to be missing in the list.

See the comment on graphical typing of vowel signs above. This should be
undone for a canonical representation.

Proposed changes to table 4-3.

The best that can be done is to create a new class, Indic Vowel Modifiers,
before the diacritics and after the fixed position classes. Putting the
vowel modifiers in class 230 would result in ANUDATTA being ordered before
them, which is incorrect. UDATTA and ANUDATTA are 'true' diacritics.

I would also put all non-spacing vowel signs in single classes, depending on
relative position, but this is quite arbitrary, as they are normally not
combined. (In fact, the whole purpose of fixed position classes escapes me.)

10--199 fixed position classes

become:

 40 vowel sign below
 50 vowel sign above
 60 length mark below
 70 length mark above
 80 tones
 90 ?
100 vowel modifiers

Code Class Name

U+0D41 0 MALAYALAM VOWEL SIGN U
U+0D42 0 MALAYALAM VOWEL SIGN UU
U+0D43 0 MALAYALAM VOWEL SIGN VOCALIC R

Indic Non-Spacing Vowel Signs

Below
U+0941 40 DEVANAGARI VOWEL SIGN U
U+0942 40 DEVANAGARI VOWEL SIGN UU
U+0943 40 DEVANAGARI VOWEL SIGN VOCALIC R
U+0944 40 DEVANAGARI VOWEL SIGN VOCALIC RR
U+0962 40 DEVANAGARI VOWEL SIGN VOCALIC L
U+0963 40 DEVANAGARI VOWEL SIGN VOCALIC LL
U+09C1 40 BENGALI VOWEL SIGN U
U+09C2 40 BENGALI VOWEL SIGN UU
U+09C3 40 BENGALI VOWEL SIGN VOCALIC R
U+09C4 40 BENGALI VOWEL SIGN VOCALIC RR
U+09E2 40 BENGALI VOWEL SIGN VOCALIC L
U+09E3 40 BENGALI VOWEL SIGN VOCALIC LL
U+0A41 40 GURMUKHI VOWEL SIGN U
U+0A42 40 GURMUKHI VOWEL SIGN UU
U+0AC1 40 GUJARATI VOWEL SIGN U
U+0AC2 40 GUJARATI VOWEL SIGN UU
U+0AC3 40 GUJARATI VOWEL SIGN VOCALIC R
U+0AC4 40 GUJARATI VOWEL SIGN VOCALIC RR
U+0B41 40 ORIYA VOWEL SIGN U
U+0B42 40 ORIYA VOWEL SIGN UU
U+0B43 40 ORIYA VOWEL SIGN VOCALIC R

Above
U+0945 50 DEVANAGARI VOWEL SIGN CANDRA E
U+0946 50 DEVANAGARI VOWEL SIGN SHORT E
U+0947 50 DEVANAGARI VOWEL SIGN E
U+0948 50 DEVANAGARI VOWEL SIGN AI
U+0A47 50 GURMUKHI VOWEL SIGN EE
U+0A48 50 GURMUKHI VOWEL SIGN AI
U+0A4B 50 GURMUKHI VOWEL SIGN OO
U+0A4C 50 GURMUKHI VOWEL SIGN AU
U+0AC5 50 GUJARATI VOWEL SIGN CANDRA E
U+0AC7 50 GUJARATI VOWEL SIGN E
U+0AC8 50 GUJARATI VOWEL SIGN AI
U+0B3F 50 ORIYA VOWEL SIGN I (sometimes renders below)
U+0BC0 50 TAMIL VOWEL SIGN II (often ligates)
U+0C3E 50 TELUGU VOWEL SIGN AA
U+0C3F 50 TELUGU VOWEL SIGN I
U+0C40 50 TELUGU VOWEL SIGN II
U+0C46 50 TELUGU VOWEL SIGN E
U+0C47 50 TELUGU VOWEL SIGN EE
U+0C4A 50 TELUGU VOWEL SIGN O
U+0C4B 50 TELUGU VOWEL SIGN OO
U+0C4C 50 TELUGU VOWEL SIGN AU
U+0CBF 50 KANNADA VOWEL SIGN I
U+0CC6 50 KANNADA VOWEL SIGN E
U+0CCC 50 KANNADA VOWEL SIGN AU

Indic Non-Spacing Length Marks
Below
U+0C56 60 TELUGU AI LENGTH MARK

Above
U+0B56 70 ORIYA AI LENGTH MARK (missing in original table)
U+0C55 70 TELUGU LENGTH MARK

Indic Non-Spacing Vowel Modifiers (all above)
U+0902 100 DEVANAGARI SIGN ANUSVARA
U+0901 100 DEVANAGARI SIGN CANDRABINDU
U+0981 100 BENGALI SIGN CANDRABINDU
U+0A02 100 GURMUKHI SIGN BINDI
U+0A70 100 GURMUKHI SIGN TIPPI
U+0A71 100 GURMUKHI SIGN ADDAK
U+0A82 100 GUJARATI SIGN ANUSVARA
U+0A81 100 GUJARATI SIGN CANDRABINDU
U+0B01 100 ORIYA SIGN CANDRABINDU
U+0B82 100 TAMIL SIGN ANUSVARA (missing in original table)

Indic Svaras
U+0952 220 DEVANAGARI STRESS SIGN ANUDATTA

U+0951 230 DEVANAGARI STRESS SIGN UDATTA
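
To make the effect of the proposed classes concrete, here is a small Python
sketch of canonical ordering as a stable sort of each run of combining marks
by class. The class values are the proposed ones from the table above (a
Devanagari subset only); the names PROPOSED_CLASS and canonical_order are
made up for this example, and the sketch is not the normative algorithm text.

```python
# Proposed classes from the table above (a small Devanagari subset).
PROPOSED_CLASS = {
    "\u0941": 40,   # VOWEL SIGN U (below)
    "\u0947": 50,   # VOWEL SIGN E (above)
    "\u0901": 100,  # SIGN CANDRABINDU
    "\u0902": 100,  # SIGN ANUSVARA
    "\u0952": 220,  # STRESS SIGN ANUDATTA
    "\u0951": 230,  # STRESS SIGN UDATTA
}


def canonical_order(text, classes=PROPOSED_CLASS):
    """Stable-sort each run of non-zero-class marks by combining class."""
    out, run = [], []
    for ch in text:
        if classes.get(ch, 0) == 0:
            # Class 0 character: flush the preceding marks in class order.
            out.extend(sorted(run, key=lambda c: classes[c]))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=lambda c: classes[c]))
    return "".join(out)
```

With these values, <KA> <anusvara> <vs E> is brought into the expected order
<KA> <vs E> <anusvara>, and a svara typed too early ends up after the vowel
modifier.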

The same classes for Thai and Lao (needs some more research)

Thai

U+0E31 50 THAI CHARACTER MAI HAN-AKAT
U+0E34 50 THAI CHARACTER SARA I
U+0E35 50 THAI CHARACTER SARA II
U+0E36 50 THAI CHARACTER SARA UE
U+0E37 50 THAI CHARACTER SARA UEE
U+0E38 40 THAI CHARACTER SARA U
U+0E39 40 THAI CHARACTER SARA UU
U+0E3A 9 THAI CHARACTER PHINTHU (Pali virama)
U+0E47 50 THAI CHARACTER MAITAIKHU (is this really a vowel?)
U+0E48 80 THAI CHARACTER MAI EK
U+0E49 80 THAI CHARACTER MAI THO
U+0E4A 80 THAI CHARACTER MAI TRI
U+0E4B 80 THAI CHARACTER MAI CHATTAWA
U+0E4C 90 THAI CHARACTER THANTHAKHAT
U+0E4D 90 THAI CHARACTER NIKHAHIT
U+0E4E 100 THAI CHARACTER YAMAKKAN (what is this?)

(Different semantics of Thai vowel signs need to be taken care of when
transliterating Pali from Devanagari to Thai or vice versa)

Lao

U+0EB1 50 LAO VOWEL SIGN MAI KAN
U+0EB4 50 LAO VOWEL SIGN I
U+0EB5 50 LAO VOWEL SIGN II
U+0EB6 50 LAO VOWEL SIGN Y
U+0EB7 50 LAO VOWEL SIGN YY
U+0EB8 40 LAO VOWEL SIGN U
U+0EB9 40 LAO VOWEL SIGN UU
U+0EBB 50 LAO VOWEL SIGN MAI KON
U+0EBC 40 LAO SEMIVOWEL SIGN LO
U+0EC8 80 LAO TONE MAI EK
U+0EC9 80 LAO TONE MAI THO
U+0ECA 80 LAO TONE MAI TI
U+0ECB 80 LAO TONE MAI CATAWA
U+0ECC 90 LAO CANCELLATION MARK
U+0ECD 90 LAO NIGGAHITA

The same for Tibetan

(This still needs some research)

TODO: The decompositions need to be checked.

U+0F71 40 TIBETAN VOWEL SIGN AA
U+0F72 50 TIBETAN VOWEL SIGN I
U+0F74 40 TIBETAN VOWEL SIGN U
U+0F75 40 TIBETAN VOWEL SIGN UU
U+0F7A 50 TIBETAN VOWEL SIGN E
U+0F7B 50 TIBETAN VOWEL SIGN EE
U+0F7C 50 TIBETAN VOWEL SIGN O
U+0F7D 50 TIBETAN VOWEL SIGN OO

TODO: more.

------------------------------------------------------------------------------
COMMENTS ON DEVANAGARI

There is also a doubled version of DEVANAGARI (LETTER | VOWEL SIGN) SHORT (AI | AU),
used for transcribing short ai or au in some Indian languages [Grierson]. I suggest
adding

0971 DEVANAGARI LETTER SHORT AI
0972 DEVANAGARI LETTER SHORT AU
0973 DEVANAGARI VOWEL SIGN SHORT AI
0974 DEVANAGARI VOWEL SIGN SHORT AU

Source: Grierson: A Linguistic Survey of India.

The ISCII-91 standard defines a large number of Vedic accents (which can be
compared with Hebrew cantillation marks). These are not included in Unicode,
but need to be. They are included in a separate proposal. As long as that
proposal is not accepted, private use area characters will have to be used for
this purpose.

------------------------------------------------------------------------------
COMMENTS ON BENGALI

ISCII-91 makes a distinction between VA and BA (and uses a slightly
different glyph for VA), so I suggest including

09B1 BENGALI LETTER VA

There is some confusion possible between YA and YYA.

YA is pronounced like JA, YYA is pronounced like YA, and acts like YA
in the other Indic scripts. In ISCII-91 Bengali YYA is coded parallel with YA
in Devanagari, and Bengali YA with Devanagari YYA.

This issue should be clarified.

source: ISCII-91 standard

There is no indicated way of encoding the secondary YYA (ya-phala),
following the vowels A and O. I propose to give the following composition
rules:

A + VIRAMA + YYA => A + secondary YYA
O + VIRAMA + YYA => O + secondary YYA

source: S.K. Chatterji, Bengali Self-taught

Danda is commonly used in Bengali, but not encoded. It can be
borrowed from the Devanagari block.

------------------------------------------------------------------------------
COMMENTS ON GURMUKHI

Note: PUNJABI may be a better name than GURMUKHI, as it is clearer.

In ISCII, the distinction between TIPPI and BINDI is made by context,
both are encoded by the same character.

Why GURMUKHI LETTER EE and OO, not E and O?

Given in ISCII-91

0A58 GURMUKHI LETTER QA
0A5D GURMUKHI LETTER RHA

The Gurmukhi alphabetical order is different from the other Indic scripts.
------------------------------------------------------------------------------
COMMENTS ON GUJARATI

Why are the added Gujarati vowels called

        GUJARATI VOWEL CANDRA E
        GUJARATI VOWEL CANDRA O

when all other vowels are called LETTER?
Change the names to

        GUJARATI LETTER CANDRA E
        GUJARATI LETTER CANDRA O

(still not changed in 2.014)

Danda is commonly used in Gujarati, but not encoded. It can be
borrowed from the Devanagari block.

[v2.0: Gujarati avagraha: glyph printed upside-down?]

------------------------------------------------------------------------------
COMMENTS ON ORIYA

ISCII-91 makes a distinction between VA and BA (and uses a
different glyph for VA), so I suggest including

0B31 ORIYA LETTER VA

as this character is graphically an O with a subscript BA, it can also
be represented as O + VIRAMA + BA; see the remark below.

ORIYA SIGN CANDRABINDU is sometimes treated as a spacing mark.

There seems to be a confusion of the two ya's. The shapes of the glyphs suggest

0B2F ORIYA LETTER YA <-> ISCII-91 CE Consonant JYA
0B5F ORIYA LETTER YYA <-> ISCII-91 CD Consonant YA

YYA behaves much like YA, so the ISCII encoding of YYA in parallel with YA seems
more sensible. (also see the note with Bengali)

The secondary YYA may also appear after the vowels E and O, and the secondary
letter BA can appear after O. This can be encoded using VIRAMA.

E + VIRAMA + YYA => E + secondary YYA
O + VIRAMA + YYA => O + secondary YYA
O + VIRAMA + BA => O + secondary BA

The decomposition of YYA into YA NUKTA in the 2.014 table seems incorrect.
No Oriyan will consider YYA a YA with a NUKTA (dot).

(see also the comments on Bengali)

Vowel signs for VOCALIC RR, VOCALIC L, VOCALIC LL do not
exist, so they are correctly omitted from the standard.

Danda is commonly used in Oriya, but not encoded. It can be
borrowed from the Devanagari block.

------------------------------------------------------------------------------
COMMENTS ON TAMIL

Notice how much TAMIL AU LENGTH MARK looks like a TAMIL LETTER LLA.

Tamil numerals

Tamil numerals do not combine like decimal digits, but rather
like ideographic numbers, i.e. there is no zero, but there are
signs for 10, 100, 1000. To represent 1980, one writes

TAMIL DIGIT ONE
TAMIL NUMBER ONE THOUSAND
TAMIL DIGIT NINE
TAMIL NUMBER ONE HUNDRED
TAMIL DIGIT EIGHT
TAMIL NUMBER TEN,

i.e. 1*1000 + 9*100 + 8*10 = 1980
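
The multiplicative scheme can be sketched in Python as follows. The function
name tamil_number is made up for this example, and the sketch assumes that
the multiplying digit is always written (also when it is one), as in the
1980 example above; actual practice may differ on that point. Only 1 to 9999
is covered.

```python
ONES = "\u0be7\u0be8\u0be9\u0bea\u0beb\u0bec\u0bed\u0bee\u0bef"  # signs for 1..9
TEN, HUNDRED, THOUSAND = "\u0bf0", "\u0bf1", "\u0bf2"


def tamil_number(n):
    """Spell 1..9999 as digit-times-multiplier groups, as in the 1980 example."""
    if not 1 <= n <= 9999:
        raise ValueError("sketch only covers 1..9999")
    out = []
    for value, sign in ((1000, THOUSAND), (100, HUNDRED), (10, TEN), (1, "")):
        digit, n = divmod(n, value)
        if digit:
            # Assumption: the multiplying digit is always written.
            # Zero digits are skipped, since there is no zero sign.
            out.append(ONES[digit - 1] + sign)
    return "".join(out)
```

For 1980 this yields the six characters listed above, in the same order.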

I suggest changing the names of the characters to reflect this,
and to create a new class of numerals that combine in this way,
as the same is seen with Tamil, Sinhalese, and ideographic numbers.

0BE7 TAMIL NUMBER ONE
0BE8 TAMIL NUMBER TWO
0BE9 TAMIL NUMBER THREE
0BEA TAMIL NUMBER FOUR
0BEB TAMIL NUMBER FIVE
0BEC TAMIL NUMBER SIX
0BED TAMIL NUMBER SEVEN
0BEE TAMIL NUMBER EIGHT
0BEF TAMIL NUMBER NINE
0BF0 TAMIL NUMBER TEN
0BF1 TAMIL NUMBER ONE HUNDRED
0BF2 TAMIL NUMBER ONE THOUSAND

Nowadays, international numerals are used in Tamil.

It seems that genuine decimal Tamil numerals are in use in Mauritius. At least,
I have seen a bank-note with a Tamil zero. I am still seeking more information.

There are special symbols for year, month and day in Tamil. These can be added
after the numerals

0BF3 TAMIL SYMBOL FOR YEAR
0BF4 TAMIL SYMBOL FOR MONTH
0BF5 TAMIL SYMBOL FOR DAY

There is also a Tamil Om sign, consisting of an O with an inscribed MA with ANUSVARA.

0BF6 TAMIL OM SIGN

I can provide samples of the glyphs.

Tamil Grantha

Tamil Grantha is not included in Unicode, but can be encoded in parallel with
Tamil. However, a separate encoding seems to be in the making already.

I will work out a proposal for Tamil Grantha.

------------------------------------------------------------------------------
COMMENTS ON TELUGU

Why not add TELUGU SIGN CHECK MARK (or rendering rule to get it)?

ADDITIONAL CHARACTERS IN TELUGU

Source: Lakshmi V.S. Mukkavilli, _TeluguTeX_.

@ dependent vowel signs
0C62 TELUGU VOWEL SIGN VOCALIC L
0C63 TELUGU VOWEL SIGN VOCALIC LL

@ various signs
        TELUGU ARASUNNA
          : telugu sign candrabindu
        TELUGU SUNNA
          : telugu sign anusvara
        TELUGU VISARGA
          : telugu sign visarga
???? TELUGU SIGN ARDHAVISARGA
          visarga of which the circles are open below
0C3D TELUGU SIGN AVAGRAHA
???? TELUGU ?
          sign of which no name is given, looks like an upside down
???? TELUGU SIGN NAKARAPOLLU
???? TELUGU SIGN VALAPALAGILAKA
          looks like telugu sign virama, but is a non-spacing mark
        TELUGU STRESS SIGN UDATTA
          : devanagari stress sign udatta
        TELUGU STRESS SIGN DOUBLE UDATTA
          : vedic long svarita
        TELUGU STRESS SIGN ANUDATTA
          : devanagari stress sign anudatta

Nakanishi gives symbols for one quarter, one half, and three quarters.

                TELUGU SYMBOL FOR ONE QUARTER
                TELUGU SYMBOL FOR ONE HALF
                TELUGU SYMBOL FOR THREE QUARTERS

Danda is sometimes used in Telugu, but not encoded. It can be
borrowed from the Devanagari block.

------------------------------------------------------------------------------
COMMENTS ON KANNADA

(Kannada rra and fa are not in ISCII-91)

Why not add KANNADA SIGN CHECK MARK (or rendering rule to get it)?

Danda is sometimes used in Kannada, but not encoded. It can be
borrowed from the Devanagari block.

------------------------------------------------------------------------------
COMMENTS ON MALAYALAM

cillu letters

Malayalam uses ligatures of the virama with some characters (so-called cillu
letters). These should be distinguished from the same letters with a virama.
I suggest using <character><virama><zwj> to produce these letters, and
<character><virama> for the explicit virama. This is in parallel with the
ISCII standard and Unicode conventions. ISCII uses the soft halant for this purpose
as well as for forcing half-consonants in several North Indian scripts.
In one exceptional case, this cillu letter can also appear in a conjunct
(N + RRA). The proposed idiom for this combination is: N<virama><zwj><virama>RRA.
(This is admittedly ugly.)

traditional versus reformed script

The description in the Unicode standard is based on traditional script.
Since 1974 reformed script has been in use.
It should be noted that in reformed Malayalam script, VOWEL SIGN U,
VOWEL SIGN UU, and VOWEL SIGN VOCALIC R are no longer non-spacing marks,
and that VOWEL SIGN VOCALIC RR, and VOCALIC RR, VOCALIC L, VOCALIC LL
have been abolished.

Traditional Malayalam script allows placing a virama on a cluster already
carrying the u matra, to denote a short u sound.

(Malayalam has also been written in Arabic script, using some extra letters,
I will try to find out details)

ADDITIONAL CHARACTERS IN MALAYALAM

@ dependent vowel signs
0D44 MALAYALAM VOWEL SIGN VOCALIC RR

The dependent vowel signs for VOCALIC L and VOCALIC LL appear to be
the same glyphs written as subscripts, but I haven't found a good reference on
this yet. In any case they are a curiosity.

Also found a single reference to a sign for ONE HALF [Frohnmeyer]. Possibly it is
something like the Bengali currency numerators.

Malayalam numerals are hardly ever used nowadays.

------------------------------------------------------------------------------
COMMENTS ON THAI

TODO (canonical ordering)
------------------------------------------------------------------------------
COMMENTS ON LAO

TODO (canonical ordering)
------------------------------------------------------------------------------
COMMENTS ON TIBETAN

TODO (canonical ordering, composition, meaning of symbols in English)
------------------------------------------------------------------------------

Some Notes on converting ISCII to Unicode
=========================================

Today, ISCII, in its various incarnations, is the most widely used character
set for Indian languages. However, round-trip conversion between the latest
version of ISCII, from 1991, and Unicode is not possible, for the following
reasons:

1. ISCII makes a distinction between Bengali and Assamese,
2. ISCII encodes font changes,
3. ISCII uses various techniques to encode variant renderings,
4. ISCII encodes several characters not in Unicode.

The first two problems are not within the scope of Unicode, the third is
partly in the scope of Unicode (as the variants are sometimes significant),
and the last definitely is, and needs fixing.

Some further issues are described with the respective scripts above.

It should be noted that Unicode is based on an older version of ISCII. This is
no problem, as translation tables/routines should be used in any case. It should also
be noted that translation tables/routines should be used to translate from one
Indic script to another. A mere addition of a constant to the character codes
does not yield defined Unicode characters in all cases.
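
A quick way to see why a constant offset is not enough is to try it. The
Python sketch below shifts Devanagari letters by the distance between the
Devanagari (U+0900) and Tamil (U+0B80) block starts and looks up the result
with the standard unicodedata module. The helper name shift_to_tamil is made
up for this example, and the outcome depends on the Unicode data shipped with
the Python version in use.

```python
import unicodedata

DEVANAGARI_TO_TAMIL_OFFSET = 0x0B80 - 0x0900  # distance between block starts


def shift_to_tamil(ch):
    """Naive conversion: add the block offset and report what results."""
    shifted = chr(ord(ch) + DEVANAGARI_TO_TAMIL_OFFSET)
    # unicodedata.name returns the given default (None) for unassigned code points.
    return shifted, unicodedata.name(shifted, None)
```

Shifting DEVANAGARI LETTER KA (U+0915) lands on TAMIL LETTER KA, but shifting
DEVANAGARI LETTER KHA (U+0916) lands on U+0B96, which has no name because
Tamil has no KHA; a table has to decide what to do in such cases.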

Semantics of Joiner and Non-Joiner in Relation with ISCII Usage
---------------------------------------------------------------

ISCII-91 uses the doubling of some characters, and nukta, to indicate rendering
variants. These idioms will have to be translated to their Unicode equivalents.
The idioms proposed here are chosen such that they follow Unicode conventions
already in place, and such that they have minimal impact on systems that do not
use these conventions: both joiner and non-joiner will not print. [now adapted
to v2.0 conventions]. Some of these conventions are not part of ISCII, but
are implemented by its major promoter, C-DAC in Pune, in its products.

C - consonant
H - halant/viram
N - nukta
J - joiner
X - non-joiner
V - vowel
M - vowel sign
D - vowel modifier
S - non-spacing diacritic

de facto        proposed
ISCII-91        Unicode
idiom           idiom           Semantics

C1 H C2         C1 H C2         Use standard conjunct of C1 and C2.
C1 H N C2       C1 H J C2       Use half-form of C1, followed by full C2.
C1 H H C2       C1 H X C2       Use C1 with halant, followed by C2 (no conjunct).
C1 H H H C2     C1 H J J C2     Use variant conjunct of C1 and C2
                                (applications may use more J to indicate further variants).
C1 H N          C1 H J          (Malayalam) Use cillu letter.
C1 H N N C2     C1 H J H C2     (Malayalam) Use secondary consonant under cillu letter.

C M             C M             Use optionally ligating vowel sign.
C M M           C X M           Use standard vowel sign (no ligating).
C M M M         C M J           Use ligating variant of vowel sign
                                (applications may use more J to indicate further variants).
                                                                
In the above three cases, ISCII implements this behaviour only for
a few consonant--vowel sign combinations, for example:

Devanagari: Ru, Ruu, Hr
Gujarati: Ru, Ruu

Further idioms can be used for special usages.

                V M             Apply vowel sign to a vowel.
                V H C           Apply secondary consonant to a vowel.
                C M H           Apply both vowel sign and halant (as is used in Malayalam).
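
The idiom table above can be read as a rewriting of runs of H, N and M. The
following Python sketch works on strings of the class letters from the legend
(C, H, N, J, X, M), not on real ISCII bytes or Unicode code points, and it
assumes that a doubled or tripled M stands for the same vowel sign repeated.
The names IDIOMS and iscii_to_unicode_idiom are made up for this example.

```python
import re

# ISCII-91 idiom -> proposed Unicode idiom, on class letters only.
IDIOMS = {
    "HHH": "HJJ",  # variant conjunct
    "HNN": "HJH",  # Malayalam: secondary consonant under cillu letter
    "HH": "HX",    # explicit halant, no conjunct
    "HN": "HJ",    # half-form, or Malayalam cillu letter
    "MMM": "MJ",   # ligating variant of vowel sign
    "MM": "XM",    # standard (non-ligating) vowel sign
}
# Longer idioms must be tried before their prefixes.
PATTERN = re.compile("|".join(sorted(IDIOMS, key=len, reverse=True)))


def iscii_to_unicode_idiom(classes):
    """Rewrite ISCII-91 idioms in a class-letter string to the proposed ones."""
    return PATTERN.sub(lambda m: IDIOMS[m.group(0)], classes)
```

A single H or a single M is left alone, which corresponds to the standard
conjunct and the optionally ligating vowel sign.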

------------------------------------------------------------------------------

The syntax of an Indic syllable:

word ::=
  initial-syllable syllables.

(note: in Bengali, word-initial vowel sign e, o, etc. are sometimes rendered
with a different glyph, hence the difference in the syntax)

syllables ::=
    syllable syllables
  | empty

syllable ::=
  base ending
                                
base ::=
    vowel
  | consonant
  | cluster

cluster ::=
    consonant halant consonant
  | consonant halant cluster
  
consonant ::=
  C [ N ]
  
halant ::=
  H [ J ] [ J | X ]

ending ::=
  [ [ X ] M ] [ J ] [ D ]
  | [ [ M ] H ]
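
The syllable part of this syntax can be tried out with a regular expression
over the class letters. The Python sketch below is one possible reading, not
a definitive one: it takes a cluster to be one or more consonant-halant pairs
followed by a final consonant, and it checks a single syllable only, not a
whole word. The name is_syllable is made up for this example.

```python
import re

CONSONANT = r"CN?"    # consonant ::= C [ N ]
HALANT = r"HJ?[JX]?"  # halant ::= H [ J ] [ J | X ]
# base ::= vowel | consonant | cluster, with a cluster read as
# (consonant halant)+ consonant, which is an assumption about the intended rule.
BASE = rf"(?:V|{CONSONANT}(?:{HALANT}{CONSONANT})*)"
# ending ::= [ [ X ] M ] [ J ] [ D ]  |  [ [ M ] H ]
ENDING = r"(?:(?:X?M)?J?D?|(?:M?H)?)"
SYLLABLE = re.compile(BASE + ENDING)


def is_syllable(classes):
    """True if the whole class-letter string is one syllable under the grammar."""
    return SYLLABLE.fullmatch(classes) is not None
```

For example, "CHCM" (conjunct plus vowel sign) and "CMH" (vowel sign plus
halant, as in Malayalam) are accepted, while a bare "M" is not.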

------------------------------------------------------------------------------
Notes on Converting from One Indic Script to Another

Many people in India know more than one language, but only one or two scripts.
For example, somebody living in Orissa may have no problem understanding
Bengali and Hindi, but may not be able to read their scripts. For this
reason, script conversion will often be beneficial. For this purpose,
tables have to be prepared to convert one script to another. I propose
here a set of tables, which map to and from the Devanagari block, to allow such
conversions. The Devanagari block is chosen because it is the most complete
block in the standard, and Devanagari is the most widely known script in India.

It should be noted that these tables do not give round-trip conversions.

The tables should take into account the possibility of decomposition of
NNNA into N + NUKTA, etc.

DEVANAGARI -> GUJARATI -> DEVANAGARI
DEVANAGARI -> ORIYA -> DEVANAGARI
DEVANAGARI -> BENGALI -> DEVANAGARI
DEVANAGARI -> ASSAMESE -> DEVANAGARI
DEVANAGARI -> GURMUKHI -> DEVANAGARI
DEVANAGARI -> MALAYALAM -> DEVANAGARI
DEVANAGARI -> TAMIL -> DEVANAGARI
DEVANAGARI -> KANNADA -> DEVANAGARI
DEVANAGARI -> TELUGU -> DEVANAGARI

In addition, the following tables will be produced:

DEVANAGARI -> ARABIC -> DEVANAGARI (DV->AR is not Hindi->Urdu, because
                                    the Urdu orthography differs)
GURMUKHI -> ARABIC -> GURMUKHI
DEVANAGARI -> ROMAN -> DEVANAGARI (using diacritics)
DEVANAGARI -> ROMAN (without diacritics)

and all other scripts to Roman.

Characters not given in the table should not be changed.
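
How such tables would be used can be sketched in Python as below. The two
dictionaries are placeholders containing the letter KA only, and the names
TO_DEVANAGARI, FROM_DEVANAGARI and convert are made up for this example; the
real tables are still to be written.

```python
# Tiny placeholder tables (letter KA only); full tables are still to be prepared.
TO_DEVANAGARI = {
    "gujarati": {"\u0a95": "\u0915"},
    "bengali": {"\u0995": "\u0915"},
}
FROM_DEVANAGARI = {
    "gujarati": {"\u0915": "\u0a95"},
    "bengali": {"\u0915": "\u0995"},
}


def convert(text, source, target):
    """Convert via the Devanagari block; unmapped characters pass through."""
    to_dev = TO_DEVANAGARI[source]
    from_dev = FROM_DEVANAGARI[target]
    pivot = "".join(to_dev.get(ch, ch) for ch in text)
    return "".join(from_dev.get(ch, ch) for ch in pivot)
```

Going through Devanagari keeps the number of tables at two per script,
instead of one for every pair of scripts.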

TODO add tables
------------------------------------------------------------------------------

+-----------------------------------+-----------------------------+
| Jeroen Hellingman | private: |
| Ericsson Telecommunicatie B.V. | 't Zand 2 |
| Ericssonstraat 2, Rijen, Holland | 4133 TB Vianen |
| Department ETM/RPU, Room 17107 | The Netherlands |
| Tel: +31 161 242022 (834 2022) | |
| Fax: +31 161 249699 | Tel: +31 347 373935 |
| E-mail: <etmjehe@etm.ericsson.se> | |
+-----------------------------------+-----------------------------+


