Indic Scripts and Languages
Q: What is ISCII?
A: Indian Standard Code for Information Interchange (ISCII)
is the character code for Indian languages that originate from Brahmi
script. ISCII was evolved by a standardization committee under the
Department of Electronics during 1986-88, and adopted by the Bureau of
Indian Standards (BIS) in 1991. Unlike Unicode, ISCII is an 8-bit encoding
that uses escape sequences to announce the particular Indic script
represented by a following coded character sequence. The ISCII document is
IS13194:1991, available from the BIS offices.
The ISCII Standard can be found on the web, for example at
Sourceforge.
Q: How does Unicode differ from ISCII?
A: Except for a few minor differences, they
correspond directly. Unicode is designed to be a multilingual
encoding that requires no escape sequences or switching between scripts.
For any given Indic script, the consonant and vowel letter codes of
Unicode are based on ISCII. ISCII allowed control over character formation
by combining letters with the characters NUKTA, INV, & HALANT. Unicode
provides similar control with the ZWJ & ZWNJ characters.
The prototypical example is the "explicit halant":
| ISCII: | Halant + Halant |
| Unicode: | Halant + ZWNJ |
The "soft halant" of ISCII is expressed:
| ISCII: | Halant + Nukta |
| Unicode: | Halant + ZWJ |
The "explicit halant" is discussed in the ISCII standard,
section 6.3.1 and "soft halant" is discussed in 6.3.2.
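As a rough illustration of the two correspondences above, the following Python sketch spells out the Unicode side of each pair. It assumes Devanagari (U+094D as the halant); the table `ISCII_TO_UNICODE` and the helper `describe` are hypothetical names introduced only for this example, and the ISCII side is represented by names rather than by real ISCII byte values.

```python
import unicodedata

HALANT = "\u094D"  # DEVANAGARI SIGN VIRAMA, assumed here as the halant
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"     # ZERO WIDTH JOINER

# Hypothetical lookup from ISCII sequence names to Unicode strings.
ISCII_TO_UNICODE = {
    ("HALANT", "HALANT"): HALANT + ZWNJ,  # explicit halant
    ("HALANT", "NUKTA"): HALANT + ZWJ,    # soft halant
}

def describe(text):
    """Return 'U+XXXX NAME' for every code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

for iscii, uni in ISCII_TO_UNICODE.items():
    print(" + ".join(iscii), "->", describe(uni))
```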
There are several categories of such differences. See also
Chapter 9.1 - 9.9 of the Unicode Standard for details. Unicode also
includes the right side "pieces" of some two-part vowel signs for
compatibility with some software. For more on vowel pieces, see below.
The ISCII Attribute code (ATR) is not represented in the
Unicode Standard, which is a plain text standard. The ISCII Attribute code
is intended to explicitly define a font attribute applicable to following
characters, and thus represents an embedded control for the kinds of font
and style information which is not carried in a plain text encoding.
The ISCII Extension code (EXT) is also not represented
directly in the Unicode Standard. The Extension code is an escape
mechanism, allowing the 8-bit ISCII standard to define an extended
repertoire via an escaped reencoding of certain byte values. Such a
mechanism is not required in the Unicode Standard, which simply uses
additional code points to encode any additional character repertoire.
Q: Unicode doesn't have an "invisible
letter" (INV) like ISCII. How can I form the combinations that use INV in
ISCII?
A: There are four uses of Nukta in ISCII. Unicode only uses
the first two. Unicode doesn't use nukta for soft halant and doesn't use
it for code extension. Unicode does use nukta to represent the nukta
diacritic, either in cases such as "qa" U+0958 (ka + nukta) or in
cases like "nnna" U+0929. Unicode doesn't use nukta for the "om" character
(e.g., chandrabindu + nukta in ISCII, which is encoded as a separate character in
Unicode).
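The nukta cases can be inspected with Python's standard `unicodedata` module. This is only a sketch: `canonical_parts` is a hypothetical helper, and it assumes Devanagari, where U+0958 and U+0929 have canonical decompositions ending in the nukta and the "om" character is the separate code point U+0950.

```python
import unicodedata

def canonical_parts(ch):
    """Return the names of the code points in the NFD form of ch."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", ch)]

# Both letters decompose into a base consonant followed by the nukta.
print(canonical_parts("\u0958"))
print(canonical_parts("\u0929"))

# "Om" is a single character of its own, not a nukta combination.
print(unicodedata.name("\u0950"))
```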
One other use of INV in ISCII is as a base letter. This may be
expressed with a space or no-break space in Unicode, depending on whether
the result is to be a "word-like" character or not:
| ISCII | Unicode |
| INV + vowel-sign | SPACE + vowel-sign |
| INV + vowel-sign | NBSP + vowel-sign |
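A minimal sketch of the table above, assuming the caller already knows which of the two bases is wanted. The function name `isolated_vowel_sign` and its `no_break` flag are hypothetical and exist only for this illustration.

```python
SPACE = "\u0020"  # SPACE
NBSP = "\u00A0"   # NO-BREAK SPACE

def isolated_vowel_sign(vowel_sign, no_break=False):
    """Place a vowel sign on SPACE or NBSP, standing in for ISCII INV."""
    base = NBSP if no_break else SPACE
    return base + vowel_sign

# U+093F is DEVANAGARI VOWEL SIGN I.
print([f"U+{ord(c):04X}" for c in isolated_vowel_sign("\u093F")])
print([f"U+{ord(c):04X}" for c in isolated_vowel_sign("\u093F", no_break=True)])
```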
Q: Is India involved in Unicode?
A: The Government of India is a member of the Unicode
Consortium, and has been engaged in a dialogue with the UTC about
additional characters in the Indic blocks and improvements to the textual
descriptions and annotations.
Q: How do the Indic scripts work in Unicode?
A: See Chapter 9 of the Unicode Standard,
South Asian Scripts-I.
Particularly relevant is the section on Devanagari, which not only describes
the Devanagari script in detail but also outlines the model
used for all similarly structured scripts in the standard. This model is
based on the ISCII model.
Information about the OpenType format and Uniscribe can be found in
the excellent article
Windows Glyph Processing by John Hudson.
[AJ]
Q: Does Unicode cover Vedic accents?
A: Not at this time. No formal proposal has been made, but
the issues are being investigated. Please see the web page
Proposals to Encode Vedic Characters in the UCS, maintained by
Michael Everson.
Q: What is the difference between Unicode
fonts and other fonts?
A: (First, for "What is a Unicode Font" see the
Font FAQ.) The
font would need to contain a glyph for each allocated code point of the
script. For example, Gujarati would contain glyphs for the allocated code
points in the range: U+0A80 - U+0AFF. In addition to these, the font
should have: (a) glyphs for conjuncts; (b) variants for vowel signs
(matras), vowel modifiers (Chandrabindu, Anuswar), the consonant modifier
(Nukta); (c) digits and any appropriate punctuation marks (perhaps some
that are appropriate from the Latin ranges).
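One way to list the allocated code points of a script block is sketched below. It relies on the Unicode version bundled with the Python interpreter, so the result can differ from the current code charts. `assigned_code_points` and `missing_glyphs` are hypothetical helpers, and `font_code_points` stands for whatever set of code points a font's character map reports; reading that set from a real font file is outside this sketch.

```python
import unicodedata

GUJARATI_BLOCK = range(0x0A80, 0x0B00)  # U+0A80 - U+0AFF

def assigned_code_points(block=GUJARATI_BLOCK):
    """Code points in the block that have a character name assigned."""
    return [cp for cp in block if unicodedata.name(chr(cp), None) is not None]

def missing_glyphs(font_code_points, block=GUJARATI_BLOCK):
    """Assigned code points of the block that the font does not cover."""
    return [cp for cp in assigned_code_points(block) if cp not in font_code_points]

# Example with a made-up font that only covers GUJARATI LETTER KA.
gaps = missing_glyphs({0x0A95})
print(len(gaps), "assigned Gujarati code points lack a glyph")
```

Conjuncts and mark variants, items (a) and (b), are not code points and would not show up in such a check.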
The contents of (a) and (b) depend not only on the typographical quality
the font is intended to achieve but also whether the font has glyphs just
in contemporary use or also includes those used in traditional formats.
The contents of (a) and (b) can be accessed by providing a Glyph
Substitution table in the font. Such a table is more often than not a
necessity for Indic scripts. A Glyph Positioning table is also needed to
achieve the minimal required mark positioning in such scripts. More
information on these issues is contained in the
OpenType
Specifications.
There is also a specification for
Creating and Supporting OpenType Fonts for Indic Scripts.
[AJ]
Q: Are there separate Unicode fonts?
A: A font that has glyphs mapped as above is a Unicode font.
Some tables for such fonts are common and a necessity (cmap,
name, OS/2, etc.); others will depend on the type of glyph outlines
(TrueType, PostScript...). [AJ]
Q: If yes, where are they available?
A: Microsoft has made several OpenType Indic script fonts
with TrueType outlines, such as:
Latha (Tamil)
Mangal (Devanagari)
Raavi (Gurmukhi and Devanagari)
Shruti (Gujarati and Devanagari)
Tunga (Kannada and Devanagari)
These fonts are also available for download
from the community site of VOLT (see below).
The Indic fonts shipped with Apple's Indic Language Kit have the proper
AAT tables to do Indic languages via Unicode. See the
Apple Computer main web site.
There are also many other small development teams creating Indic fonts. Many
of them are listed on Alan
Wood's Unicode Fonts page.
Q: Is it possible to convert other fonts to
Unicode?
A: Yes, many tools have been released that allow a conversion. Some
of the better known ones are:
Microsoft's Visual OpenType Layout Tool (VOLT)
Apple's Font Tools
Adobe's Font Development Kit
Pyrus' FontLab
PFAEDIT (X11-based; for Linux, Mac OS X, Cygwin, etc.)
Also see the specification for
Creating and Supporting OpenType Fonts for Indic Scripts.
Q: Do I need an IME to properly input Indic script languages?
A: Indic languages can be input via a traditional keyboard,
with a proper keyboard mapping. The work then falls to the rendering
engine to display the characters in their proper order and shape.
[CW]
Q: Is the keyboard arrangement in a Unicode
system different from that of the regular "TTF" fonts?
A: Keyboarding questions are separate from the questions of
encoding. Some of the keyboards provided with Windows can be
seen on Microsoft's
Windows Keyboard Layout website. [AJ]
Q: I have specific questions about Tamil. Where are the answers?
A: See the FAQ devoted to Tamil questions.
Q: What are the Bengali characters used to
transcribe the sound "a" (as in English "bat") in Unicode?
A: Bengali uses the symbol অ্যা and sometimes এ্যা to represent this sound when it
begins a word.
These symbols graphically appear to be made up of the letters অ (0985)
or এ (098F) plus a squiggle known as ya-phalaa, and a final া (09BE).
Ya-phalaa is the form the letter YA often
takes when it is the last component of a consonant conjunct. E.g., TA+VIRAMA+YA
may be displayed as TA+YA-PHALAA.
In many cases a sequence ...+VIRAMA+YA may be expected to produce
a YA-PHALAA.
In view of the graphical appearance plus the common '+VIRAMA+YA' behavior, the
recommendation is to encode these characters as follows:
0985 + 09CD + 09AF + 09BE
098F + 09CD + 09AF + 09BE

If a candrabindu or other combining mark needs
to be added in the sequence, it comes at the end of the sequence. For
example: 0985 + 09CD + 09AF + 09BE + 0981
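The sequences this answer describes can be assembled as in the sketch below. It assumes the two base letters are BENGALI LETTER A (U+0985) and BENGALI LETTER E (U+098F), with VIRAMA + YA producing the ya-phalaa; `ae_sound` is a hypothetical helper name. How the result is drawn depends on the font and the rendering engine.

```python
import unicodedata

A, E = "\u0985", "\u098F"  # BENGALI LETTER A, BENGALI LETTER E
VIRAMA = "\u09CD"          # BENGALI SIGN VIRAMA
YA = "\u09AF"              # BENGALI LETTER YA
AA = "\u09BE"              # BENGALI VOWEL SIGN AA
CANDRABINDU = "\u0981"     # BENGALI SIGN CANDRABINDU

def ae_sound(base, candrabindu=False):
    """base + VIRAMA + YA + AA, with an optional candrabindu at the end."""
    seq = base + VIRAMA + YA + AA
    return seq + CANDRABINDU if candrabindu else seq

for text in (ae_sound(A), ae_sound(E), ae_sound(A, candrabindu=True)):
    print(text, [unicodedata.name(c) for c in text])
```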

Q: Can you provide a clarification of Bengali Reph and Ya-phalaa usage?
A: The formation of the Reph form is defined in
the Unicode Book, Section 9.1, Rules for Rendering, R2. Basically, the Reph is formed when a Ra which has the inherent vowel killed by the virama/halant
begins a syllable. This is shown in the following example.

The Ya-phalaa is a post-base form of Ya and is formed when the Ya is the
final consonant of a syllable cluster. In this case, the previous
consonant retains its base shape and the virama/halant is combined with the
following Ya. This is shown in the following example.

An ambiguous situation
arises when the combination of Ra + virama/halant + Ya is
encountered.

To resolve the
ambiguity with this combination and to have consistent behavior, we need
to look at the processing order of the Bengali script. When parsing the
text, the ability to form the Reph is identified first and therefore the
Reph form should have priority in processing. Thus, it is necessary to
insert a ZWNJ character into the stream between the Ra and virama/halant
to allow the virama/halant and Ya to be grouped together during
processing: RA + ZWNJ + VIRAMA + YA (U+09B0 + U+200C + U+09CD + U+09AF).

In the example above, the ZWNJ is used because we are
saying that we want two characters that would join by default to remain as
separate entities. In cases other than where the RA is the first character
in the cluster the ZWNJ is not required for the formation of the Ya-phalaa.
However, for ease of placing the Ya-phalaa input as a single key input, it
should be permissible for the Ya-phalaa to be consistently formed by “ZWNJ +
VIRAMA + YA” (U+200C + U+09CD + U+09AF). [PN]
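The code point sequences discussed in this answer can be written out as in the sketch below. It follows the answer as stated here (ZWNJ between Ra and the virama) and makes no claim about how a particular font or shaping engine draws the result. The helpers `reph` and `ya_phalaa`, and the choice of KA as a sample consonant, are assumptions made for this illustration.

```python
RA = "\u09B0"      # BENGALI LETTER RA
YA = "\u09AF"      # BENGALI LETTER YA
KA = "\u0995"      # BENGALI LETTER KA, used only as a sample consonant
VIRAMA = "\u09CD"  # BENGALI SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER

def reph(consonant):
    """Ra + virama at the start of a cluster: candidate for the Reph form."""
    return RA + VIRAMA + consonant

def ya_phalaa(base, always_zwnj=False):
    """base + virama + Ya, with ZWNJ inserted when base is Ra (or always)."""
    joiner = ZWNJ if (base == RA or always_zwnj) else ""
    return base + joiner + VIRAMA + YA

def code_points(text):
    return " ".join(f"U+{ord(c):04X}" for c in text)

print(code_points(reph(KA)))
print(code_points(ya_phalaa(KA)))
print(code_points(ya_phalaa(RA)))
```

The `always_zwnj` flag mirrors the closing remark that a single-key Ya-phalaa input may consistently emit ZWNJ + VIRAMA + YA.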
Q: What about collation of Indic language
data? Is that just a binary sort?
A: No. Collation order is not the same as code point order. A
good treatment of some issues specific to collation in Indic languages can
be found in the paper
Issues in Indic Language Collation by Cathy Wissink.
Collation in general must proceed at the level of language or language
variant, not at the script or codepoint levels. See also
UTS #10: Unicode
Collation Algorithm. Some Indic-specific issues are also discussed
in that report.
Q: I cannot find on Unicode charts the "half
forms" of Devanagari letters (or any other Indic script). These characters
are needed to form words such as "patni".
A: Unicode does not encode half or subjoined letters for the
scripts of India. As in the ISCII standard, Unicode forms all "consonant
clusters" (such as the "tn" in "patni") by inserting the character
"virama" (or "halant") between the two relevant consonant letters. For
instance, the Devanagari syllable "tna" ("त्न") is encoded
with the following code points:
| U+0924 | त | DEVANAGARI LETTER TA |
| U+094D | ् | DEVANAGARI SIGN VIRAMA (= halant) |
| U+0928 | न | DEVANAGARI LETTER NA |
These three characters will normally be displayed using the
single glyph tna ligature "त्न".
But it is also possible that they are displayed using a half ta
glyph followed by a full na glyph "त्‍न", or
even with a full ta glyph combined with a virama glyph and
followed by a full na glyph "त्‌न".
Which form will be actually displayed is the decision of an
underlying software module called a "display engine", which bases this
decision on the availability of glyphs in the font.
If the sequence U+0924, U+094D is not followed by another
consonant letter (such as "na") it is always displayed as a full ta
glyph combined with the virama glyph "त्".
Unicode provides a way to force the display engine to show a
half letter form. To do this, an invisible character called ZERO WIDTH
JOINER should be inserted after the virama:
| U+0924 | त | DEVANAGARI LETTER TA |
| U+094D | ् | DEVANAGARI SIGN VIRAMA (= halant) |
| U+200D | | ZERO WIDTH JOINER |
| U+0928 | न | DEVANAGARI LETTER NA |
This sequence is always displayed as a half ta glyph
followed by a full na glyph "त्‍न".
Even if the consonant "na" is not present, the sequence U+0924, U+094D,
U+200D is displayed as a half ta glyph "त्‍".
Unicode also provides a way to force the display engine to
show the virama glyph. To do this, an invisible character called
ZERO WIDTH NON-JOINER should be inserted after the virama:
| U+0924 | त | DEVANAGARI LETTER TA |
| U+094D | ् | DEVANAGARI SIGN VIRAMA (= halant) |
| U+200C | | ZERO WIDTH NON-JOINER |
| U+0928 | न | DEVANAGARI LETTER NA |
This sequence is always displayed as a full ta glyph
combined with a virama glyph and followed by a full na glyph
"त्‌न".
For more detailed information, see Chapter 9 of the Unicode
Standard,
South Asian Scripts-I. For related issues, see "Where is My Character?"
[MC]
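The three encodings described in this answer can be built as in the following sketch. The function `cluster` and its `form` labels are hypothetical names chosen for this example; which glyphs actually appear still depends on the display engine and the font.

```python
TA, NA = "\u0924", "\u0928"  # DEVANAGARI LETTER TA, DEVANAGARI LETTER NA
VIRAMA = "\u094D"            # DEVANAGARI SIGN VIRAMA (= halant)
ZWJ = "\u200D"               # ZERO WIDTH JOINER: requests the half form
ZWNJ = "\u200C"              # ZERO WIDTH NON-JOINER: requests a visible virama

def cluster(first, second, form="default"):
    """Join two consonants with a virama, optionally followed by ZWJ or ZWNJ."""
    joiner = {"default": "", "half": ZWJ, "explicit-virama": ZWNJ}[form]
    return first + VIRAMA + joiner + second

for form in ("default", "half", "explicit-virama"):
    text = cluster(TA, NA, form)
    print(form, text, " ".join(f"U+{ord(c):04X}" for c in text))

# "patni": PA + (TA, VIRAMA, NA) + VOWEL SIGN II
patni = "\u092A" + cluster(TA, NA) + "\u0940"
print(patni)
```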
Q: Bangla should be used in the Unicode
Standard instead of Bengali. Bengali is misspelled too. Originally all
Bangalis (in West Bengal and Bangladesh) spell it as Bangla. What can I
do to correct the spelling?
A: Nothing, I'm afraid. The Unicode Standard is full of
things like this: many people would prefer the Myanmar letters to be
called Burmese; the term Tibetan is apparently offensive to the people of
Nepal; the letter (U+01A2, U+01A3) named LATIN LETTER OI should be called
LATIN LETTER GHA. As for the spelling 'Bengali', it's the standard English
spelling (my dictionary says from Hindi 'Bangali' in the late 18th
century) and although I'd guess people on this list would be familiar with
the spelling 'Bangla', it's not widespread.
[AV]
Q: I cannot find on Unicode charts the
Bengali khanda ta letter. This character is needed to form words such as
utkarsha.
A: The khanda ta letter was added to the Unicode Standard as of Version 4.1. It is encoded at U+09CE BENGALI LETTER KHANDA
TA. Use of this character is described in
Version 5.0, Section 9.2, Bengali.
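A short sketch showing the character in use. The spelling of "utkarsha" below (U+0989, U+09CE, U+0995, U+09B0, U+09CD, U+09B7) is an assumption made for illustration, not a sequence taken from the Standard.

```python
import unicodedata

KHANDA_TA = "\u09CE"  # BENGALI LETTER KHANDA TA

# Assumed spelling of "utkarsha": U, KHANDA TA, KA, RA, VIRAMA, SSA.
utkarsha = "\u0989" + KHANDA_TA + "\u0995\u09B0\u09CD\u09B7"

print(utkarsha)
for ch in utkarsha:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```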
Q: The Bangla "fullstop" is similar to the
Devanagari danda (U+0964) both being taken from the Brahmi script, but the
corresponding point in the Bengali block at U+09E4 is reserved. To write
Bangla end of sentence (dari) what should I use?
A: All Unicode characters are equally accessible, and many
punctuation elements are used across several scripts. You should use
U+0964 as the danda for several scripts, including Bengali. Also U+0965 is
the double danda for these scripts.
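As a small illustration, the sketch below ends a Bengali sentence with U+0964 and confirms, using the Unicode data bundled with Python, that U+09E4 has no character assigned. The sample sentence is made up for this example.

```python
import unicodedata

DANDA = "\u0964"         # DEVANAGARI DANDA, shared by several scripts
DOUBLE_DANDA = "\u0965"  # DEVANAGARI DOUBLE DANDA

# A made-up Bengali sentence terminated with the shared danda.
sentence = "আমি ভাত খাই" + DANDA
print(sentence)

print(unicodedata.name(DANDA))
print(unicodedata.name(DOUBLE_DANDA))
# The reserved position in the Bengali block has no name; None is returned.
print(unicodedata.name("\u09E4", None))
```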
Q: Can you rename the character called VIRAMA in my script to HALANT?
A: In the Unicode Standard, the sign indicating the absence of an inherent vowel in Indic scripts is
denoted by the Sanskrit word virama. In the particular languages another designation is often preferred. In Hindi,
for example, the word hal refers to the character itself, and halant refers to the consonant that has its inherent
vowel suppressed; in Tamil, the word pulli is used; in Bengali, the word hasant is used, and so on.
The Unicode stability policies prevent character names from being changed.
However, the code charts and character descriptions will contain annotations showing the preferred name, such as:
094D DEVANAGARI SIGN VIRAMA
= halant (the preferred Hindi name)
. suppresses inherent vowel
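The formal names can be read from Python's `unicodedata` module, while preferred local names have to be kept in a separate table. In the sketch below, `LOCAL_NAMES` is a hypothetical mapping that uses only the designations mentioned in this answer.

```python
import unicodedata

# Hypothetical table of preferred local designations, taken from the text above.
LOCAL_NAMES = {
    "\u094D": "halant",  # Devanagari (Hindi)
    "\u09CD": "hasant",  # Bengali
    "\u0BCD": "pulli",   # Tamil
}

for ch, local in LOCAL_NAMES.items():
    # The formal character name stays VIRAMA in every script.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} = {local}")
```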
For more information, see
http://www.unicode.org/charts/PDF/U0900.pdf
http://www.unicode.org/versions/Unicode5.0.0/ch09.pdf
http://www.unicode.org/policies/stability_policy.html
http://www.unicode.org/faq/basic_q.html
http://www.unicode.org/faq/indic.html
[MD]
Q: KANNADA VOWEL SIGN I (U+0CBF) and
KANNADA VOWEL SIGN E (U+0CC6) seem to have inconsistent character
properties. They have General Category Mn and Bidi Class L. However, UAX
#9 says that all Me and Mn category characters are Bidi Class NSM. Is
this right?
A: Yes. This was an explicit decision by the UTC for these
characters, to preserve canonical equivalence under the Bidirectional
Algorithm for two vowels involving these as parts of decompositions.
The BIDI algorithm is designed to maintain canonical
equivalence. Normally all of the combining characters have the BIDI
class NSM. There are combining characters that would cause problems
for canonical equivalence, and are thus given different BIDI classes.
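The properties in question can be checked with Python's `unicodedata` module, as sketched below. The helper `props` is a hypothetical name, and U+0CC0 is picked here as one vowel sign whose canonical decomposition contains U+0CBF; the values reflect the Unicode data bundled with the interpreter.

```python
import unicodedata

def props(ch):
    """Return (General Category, Bidi Class) for a character."""
    return unicodedata.category(ch), unicodedata.bidirectional(ch)

VOWEL_SIGN_I = "\u0CBF"   # KANNADA VOWEL SIGN I
VOWEL_SIGN_E = "\u0CC6"   # KANNADA VOWEL SIGN E
VOWEL_SIGN_II = "\u0CC0"  # KANNADA VOWEL SIGN II
VIRAMA = "\u0CCD"         # KANNADA SIGN VIRAMA, a typical Mn character

print(props(VOWEL_SIGN_I), props(VOWEL_SIGN_E))  # Mn with Bidi Class L
print(props(VIRAMA))                             # Mn with Bidi Class NSM
# U+0CC0 decomposes canonically into U+0CBF plus a length mark.
print(unicodedata.decomposition(VOWEL_SIGN_II))
```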
Q: How are the Sindhi implosives represented?
A: The characters U+097B DEVANAGARI LETTER GGA, U+097C DEVANAGARI LETTER JJA,
U+097E DEVANAGARI LETTER DDDA, and U+097F DEVANAGARI LETTER BBA are used to write Sindhi
implosive consonants, starting with Unicode 5.0. Previous versions of the Unicode Standard recommended representing
those characters as a combination of the usual consonants with nukta and
anudatta, but those combinations are no longer recommended. [EM]
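A brief sketch listing the four letters with Python's `unicodedata` module. The list name `SINDHI_IMPLOSIVES` is introduced only for this example.

```python
import unicodedata

SINDHI_IMPLOSIVES = ["\u097B", "\u097C", "\u097E", "\u097F"]

for ch in SINDHI_IMPLOSIVES:
    # Each letter is a single code point with no canonical decomposition.
    atomic = unicodedata.normalize("NFD", ch) == ch
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "atomic" if atomic else "decomposes")
```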