L2/08-085

Report of the South Asia Subcommittee Meeting in Chennai, 2008/01/23-24

draft

The following is a report on the South Asia Subcommittee Meeting in Chennai.

Initial discussions

The following are informal notes that summarize the address from the IT Secretary of the Tamil Nadu Government. The exact words will be attached later.

The government is concerned about some things:

1. Errors in encoding such as non-Tamil characters being in the encoding

2. Efficiencies - The Government of Tamil Nadu is undertaking a massive e-governance effort. Huge digital libraries are coming up, and the government doesn't want to migrate these massive databases in the future. The government relies on the experts in the task force, and 13 meetings have been held to review and analyze the TACE encoding.

3. There may be legal issues as well and the government has to be very careful.

The government's position is that one block for TACE-16 in Unicode would be desirable. The feasibility and practicality need to be investigated. Urged UTC/UC to look at the issues and suggest ways to resolve them. The problems are genuine. Tamil is an international language, used in official transactions in Sri Lanka, so there are international ramifications. We have proposed our solution. We would like UTC's recommendations.

As for the Tamil Nadu Government, it intends to accept the recommendations of the task force and declare a standard, and expects the Government of India's support as well.

We do appreciate that a lot of work has been done.

Appreciate everyone's being as frank and forthcoming as possible.

What are the objections and stance of the UC?

Also need to consider the Tamil community's stance.

The UC's views are vital, and the UC can suggest better solutions if available.

We would like the international community's opinion to make an informed decision.

Announced creation of a fund to ease the migration from the old to new.

Some teething problems are expected in moving from old to new.

Migration path can be considered.

The government of TN wants to hold a conference in coordination with INFITT.

Opening remarks from Mark Davis; main points in summary:

The UC also has the goal of making Tamil work correctly; we look forward to working with the TNG and the GOI to do this.
An important issue is stability - Unicode is a bit like the banking system - the confidence of members and implementers in the stability is key to its success.
Not breaking existing implementations is vital

Discussion

General agreement about the need to improve the situation for Tamil Users
Key problem is the lack of implementations of Tamil, and the correctness of those implementations. There were many examples of the need to improve.
For the meeting, we'll be following the South Asia subcommittee charter as per the August UTC meeting, copied below.

Step 1: Identification of the issues

Discussion of TACE-16 - proposal for 347 new characters replacing the Tamil block. The following issues were raised as part of that presentation.

Significant work since the May meeting on identifying the issues, developing concrete tests
The task force on TACE-16, whose members are technical advisors to the Government of Tamil Nadu, has recommended to the government that the Tamil All Character Encoding (TACE-16) be made a government standard to meet the following requirements:
- Handle the emerging needs of e-governance,
- Produce unambiguous and legally indisputable digital records of government documents
- Enable creation of documents that will stand the test of time,
- Be independent from any external shaping engine
- Be efficient in desktop publishing, linguistic and natural language processing
- Assure safe, unambiguous browsing resistant to domain name spoofing
Issues Raised
- What users think of as characters (syllabic)
- Ambiguous encodings in Unicode (length marks)
- Unicode characters not in collation order
- Simplification of natural-language processing
- Dependence on correct rendering engines
- Fonts not having correct OpenType tables
- Variable levels of support for OpenType fonts
- Efficiency in storage and processing
- See attached report for more details.
Conclusion by TACE task force that Unicode does not match the user's perceptions of characters, and is less optimal than TACE-16 for the measured operations.

In discussion, an issue was raised:

the results need to be reproducible: eg, data made available and at least pseudo-code for the operations

Discussion of Unicode principles relevant to above (Mark Davis, Michael Kaplan)

1. A Unicode character is a coded entity ≠ what the user thinks of as a "letter" or "character". Many examples from a variety of scripts. This is true for languages other than Tamil as well (e.g. the Swedish A with a ring character).
2. Canonical equivalence establishes identity; normalization (NFC) is used for unambiguous representation (specifically, the "broken" vowel pieces are combined). Used in important cases like IDNA.
3. Code order ≠ collation order for any language: eg, Z < a
4. Display ≠ character codes. Many scripts require more than linear layout. Some of the errors or inefficiencies may be triggered by problems in correctness of implementations (for example, collation, rendering, etc.). Glyphs for all of the 345 or so Tamil characters identified could be precomposed in OpenType fonts, with ligature tables mapping existing Unicode sequences to them quite efficiently.
5. Storage is an issue, but not predominant (discussion of UTFs, history of UTF-16). (See also #9)
6. Stability is a key issue. Unicode is like the banking system. People have to be able to trust that it won't change out from under them. Major clients of Unicode are very dependent on this -- some would much rather have stability than improvements.
7. Similarity of models helps with implementation. Perceived difficulty of implementation only increases if the language deviates from a family model and stands by itself. There is strength in being part of a family model where only slight modifications are needed to support a new language. (For example, other Indic scripts.)
8. There may be minority letters in a script, or "mistaken" characters like U+0B82 ( ஂ ) TAMIL SIGN ANUSVARA, which is not used in any language using the Tamil script. Characters can be annotated (as Anusvara is), or deprecated (stronger), but never removed. (There was discussion of options for this character, as to whether to annotate or deprecate.) Even the name of the character cannot be changed. (There are separate data files in the Unicode Character Database with name annotations, more information, and corrections.) Note: localized names for Tamil characters can be supplied (eg Pulli vs "VIRAMA", or visarga), so that vendors can display the correct name in programs like CharMap.
9. Results of efficiency vary dramatically according to the code used. Efficiency in storage/transmission is implementation dependent, and algorithms can be carefully optimized. Efficiency in processing is a desirable goal, but if stability of implementation forces a hit on efficiency, that is acceptable. (See also #5)
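
The points above about canonical equivalence and code order can be illustrated with a minimal sketch, assuming Python's standard unicodedata module. The variable names are illustrative only, and the case-folding sort key merely stands in for a real collation algorithm.

```python
import unicodedata

# KA + VOWEL SIGN E + VOWEL SIGN AA: the two-part vowel stored as separate pieces
split_form = "\u0B95\u0BC6\u0BBE"
# KA + VOWEL SIGN O: the same syllable using the single two-part vowel sign
joined_form = "\u0B95\u0BCA"

# As raw code point sequences the two spellings differ...
raw_equal = split_form == joined_form

# ...but NFC combines the vowel pieces, so both normalize to the same string.
nfc_split = unicodedata.normalize("NFC", split_form)
nfc_joined = unicodedata.normalize("NFC", joined_form)

# Code order is not collation order, even for English: "Z" (U+005A) < "a" (U+0061).
code_order = sorted(["apple", "Zebra"])
# A crude case-insensitive key, used here only as a stand-in for real collation.
folded_order = sorted(["apple", "Zebra"], key=str.casefold)

print(raw_equal, nfc_split == nfc_joined)
print(code_order, folded_order)
```

Comparing text only after normalization is what makes the two spellings interchangeable for purposes such as IDNA.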

Notes:

SMP vs BMP

BMP is from 0000 to FFFF. Most common characters, widely supported
SMP (used here for the supplementary planes) is from 10000 to 10FFFF; the Supplementary Multilingual Plane proper, plane 1, is from 10000 to 1FFFF. Infrequent characters, historic scripts. Support in major OS's began a few years ago, but many applications don't fully support them. (Examples: Vista supports plane 1 and 2 (only fonts for plane 2)).
BMP code points in the Tamil range are transmitted as 3 UTF-8 octets, while SMP code points require 4. In UTF-16, BMP characters take 2 bytes and SMP characters take 4. (In UTF-8 the difference is not double, as might be expected.)
Space in the SMP is not constrained, whereas space in the BMP is very confined at this point. In particular, certain areas are reserved for Right-Left characters, which cannot be changed without serious consequences.
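
The storage figures in these notes can be checked with a short sketch, assuming Python's built-in codecs. The helper name encoded_sizes is hypothetical, and U+10000 is used only as a representative supplementary code point.

```python
tamil_ka = "\u0B95"            # TAMIL LETTER KA, a BMP code point
supplementary = "\U00010000"   # first code point beyond the BMP

def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-be" is used so that no byte order mark is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))

bmp_sizes = encoded_sizes(tamil_ka)
smp_sizes = encoded_sizes(supplementary)

print("BMP Tamil letter:", bmp_sizes)
print("Supplementary code point:", smp_sizes)
```

The UTF-8 cost goes from 3 to 4 octets, while the UTF-16 cost doubles from 2 to 4 bytes.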

Discussion

Members of the TACE taskforce disputed the points about the efficiency/performance issues, and benefits of following the Indic model.
At the time that Tamil was first encoded, it could have followed a syllabic model for encoding like Ethiopic has now.
Implementations quite often may transform Unicode into different internal formats for processing, such as in doing natural-language processing.
If TACE were in the SMP, some problems are avoided -- the main blocker is dual encoding and stability.
Normalization cannot map old characters to new characters, because of stability constraints. If a new precomposed character were added, then it would be normalized back to its components.
Unicode operating systems (Windows, Mac, etc) convert to Unicode for rendering, etc.
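
As a rough illustration of transforming Unicode into an internal format for processing, the following sketch groups each base character with the combining marks that follow it, so that code can work on syllable-like units. This is a simplification under stated assumptions: the function name syllable_units is hypothetical, the rule is not the full Unicode grapheme cluster algorithm, and conjuncts that span more than one consonant are not handled.

```python
import unicodedata

def syllable_units(text):
    """Group each base character with the combining marks that follow it."""
    units = []
    for ch in text:
        # General categories Mn, Mc and Me cover vowel signs and the pulli.
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units

# TA, MA + VOWEL SIGN I, LLLA + PULLI: five code points, three units
word = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"
units = syllable_units(word)
print(len(word), len(units))
```

Downstream processing can then index by unit rather than by code point, without changing how the text is stored or interchanged.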

Step 2: Evaluation of possible approaches

We started from the bottom up:

Approach D: TACE-16 as a separate IANA-registered character set
Unicode programs would convert on input to Unicode, process, and emit TACE-16 on output. (Similar to GB 18030.) Non-Unicode programs could process natively.

Pros

No dependency on Unicode - Tamil Nadu government can do independently
TACE16 is very easy to implement; not stateful, easy conversion to and from Unicode
well-established path for charsets - implementations are used to using them
governments have strong sway
the Tamil Nadu government can do exactly what it wants
useful in any closed environment: examples: cell phone, natural-language processing, etc.
well-defined path for programs to support -- programs are used to doing conversion
if multilingual capabilities are required inside the same codepage, then additional repertoire would need to be added, eg, for English, French, Telugu, Malayalam, Sinhala, etc.
1. Example: GB 18030 (China) includes all Unicode characters, with an algorithmic mapping to Unicode for most characters.
2. The simpler the mapping to Unicode, the more likely implementations would pick it up.
See iana.org for the list of IANA charsets.
Other TACE advantages: eg Processing using syllables (eg NLP) would use single code points.
On Unicode system, where conversion is done, algorithms depending on Unicode properties would work: line-breaking, sorting, identifiers, etc.

Cons

whether it is added to products depends on companies adding the conversion tables.
for cell-phone environments, 8-bit encoding may be preferred
uptake by companies will depend on critical mass, so a bit of a chicken and egg problem
performance issues need investigation
Typically Unicode programs / OSs will convert to Unicode for rendering, etc. (Linux may not -- needs investigation.) However, typically performance is not substantially impacted for rendering.
Would need to evangelize key players
1. ICU, Windows, Java, PHP, Python, Perl, Linux,...
2. Many will pick up without further evangelization
3. Most are combination of data+algorithms
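
The convert-on-input, convert-on-output path described for Approach D could look roughly like the sketch below. The table is an assumption made for illustration: the code values are placeholders, NOT the real TACE-16 assignments, and the names TO_SYLLABLE_CODE, encode and decode are hypothetical.

```python
# Hypothetical table: Unicode sequence -> single 16-bit code value.
# The values are placeholders for illustration, NOT real TACE-16 assignments.
TO_SYLLABLE_CODE = {
    "\u0B95": 0xE100,         # KA
    "\u0B95\u0BCD": 0xE101,   # KA + PULLI (pure consonant)
    "\u0B95\u0BBE": 0xE102,   # KA + VOWEL SIGN AA
}
FROM_SYLLABLE_CODE = {code: seq for seq, code in TO_SYLLABLE_CODE.items()}
MAX_SEQ = max(len(seq) for seq in TO_SYLLABLE_CODE)

def encode(text):
    """Greedy longest-match conversion from Unicode text to code values."""
    out, i = [], 0
    while i < len(text):
        for length in range(min(MAX_SEQ, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in TO_SYLLABLE_CODE:
                out.append(TO_SYLLABLE_CODE[chunk])
                i += length
                break
        else:
            # Unmapped characters pass through as their own code point.
            out.append(ord(text[i]))
            i += 1
    return out

def decode(codes):
    """Map code values back to Unicode sequences; the table is stateless."""
    return "".join(FROM_SYLLABLE_CODE.get(c, chr(c)) for c in codes)

sample = "\u0B95\u0BBE\u0B95\u0BCD\u0B95 a"
codes = encode(sample)
print([hex(c) for c in codes], decode(codes) == sample)
```

Because the mapping is a plain table with no state, the round trip is lossless for any text the table covers, which is the property that makes a simple mapping attractive to implementers.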

Approach C: TACE-16 repertoire in the PUA
Pros

No dependency on Unicode - Tamil Nadu government can do independently
Encapsulated in Unicode, so no conversion necessary
SMP PUA is unencumbered - TACE-16 group could establish precedent (homesteading)
Compression of SMP works well
Rendering would be straightforward.
Other TACE advantages: eg Processing using syllables (eg NLP) would use single code points.

Cons

BMP PUA is in wide use for ideographs already, so it probably wouldn't be practical. (needs investigation, there might be enough room)
Overlap problem - some others could use code points for different purpose
Many implementations, & all old implementations, will treat them as unknown characters (impacting anything dependent on properties: line-breaking, sorting, identifiers, etc). No standard Unicode properties, so algorithms driven by them won't work
Conversions are needed for interfacing with standards that require standard Unicode. For example, IDNs will be in standard Unicode, requiring a conversion.
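
The lack of standard properties for PUA code points can be seen in a minimal sketch, assuming Python's standard unicodedata module. The helper name describe is hypothetical, and the two PUA code points are arbitrary examples.

```python
import unicodedata

def describe(ch):
    """Return (general category, character name) for one character."""
    # name() raises ValueError for unnamed characters unless a default is given.
    return unicodedata.category(ch), unicodedata.name(ch, "<no name>")

tamil_ka = describe("\u0B95")         # an encoded Tamil letter
bmp_pua = describe("\uE000")          # first code point of the BMP PUA
plane15_pua = describe("\U000F0000")  # first code point of the plane 15 PUA

print(tamil_ka, bmp_pua, plane15_pua)
```

Any algorithm that branches on the general category sees a letter in the first case and only "private use" in the other two, whatever the private agreement says the code point means.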

Discussion:

legal implications of PUA:
- If the Tamil Nadu government established a standard, then being a standard for legal purposes is not an issue.
- For legal purposes, people need to use final-form document with embedded fonts, for any language.
- Font issues are not specific to PUA - can have font-spoofing in either way.

Approach CD: Approach C, plus register it with IANA as a charset.

Mixture of advantages and disadvantages of above.
Examples:
1. No dependency on Unicode - Tamil Nadu government can do independently
2. In some cases, TACE would convert to and from Unicode; in others it could be interpreted natively.
3. Character properties would be available; all multilingual capabilities would be present;
4. IANA pros and cons from D.

Approach B: TACE-16 repertoire added to Unicode

TACE-16 task force investigated different approaches (listed above, and with full report attached). Major choices are BMP vs SMP. The TACE task force would like to see TACE in the BMP; failing that, the SMP would be an acceptable backup.

Pros

See attached document

Cons

Unless the current Unicode model can be shown to be unable to represent Tamil, the duplicate encoding and stability principles would prevent addition.
Accommodating TACE in the BMP would require moving the reserved RTL (U+0800 .. U+08FF) code point range. (Space is not an issue for the SMP.)
- The suggestion from the TACE group is to relocate the scripts planned for the reserved RTL area as follows:
  1. Arabic extensions to U+18B0 .. U+18FF
  2. Mandaic to U+A8E0 .. U+A8FF
  3. Samaritan to U+AB50 .. U+AB7F
  4. Sorang Sng to U+A4D0 .. U+A4FF

Approach B1: Add only "pure consonants" to Unicode

This would be adding what is currently represented as <consonant + pulli> as precomposed characters to the current Tamil block in Unicode.

Pros

Pure consonants represent 30% of the letter frequency in Tamil text
Possible performance benefits in collation, text size (for unnormalized text)

Cons

Would be introducing new precomposed characters
Normalization would replace the new characters with the current ones.
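
No precomposed pure consonant exists to test against, so the sketch below uses an existing analogue instead: U+0958 DEVANAGARI LETTER QA, a precomposed letter that is excluded from composition. It assumes Python's standard unicodedata module, and the variable names are illustrative only.

```python
import unicodedata

# An existing precomposed letter that normalization does not preserve.
devanagari_qa = "\u0958"
qa_after_nfc = unicodedata.normalize("NFC", devanagari_qa)

# The current Tamil representation of a pure consonant: KA + PULLI.
tamil_pure_k = "\u0B95\u0BCD"
k_after_nfc = unicodedata.normalize("NFC", tamil_pure_k)

print([hex(ord(c)) for c in qa_after_nfc])
print(k_after_nfc == tamil_pure_k, len(k_after_nfc))
```

After NFC the precomposed letter is replaced by KA + NUKTA, while the Tamil sequence stays as two code points; this is the outcome the second con describes for any newly added precomposed consonant.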

Key Areas where governments, industry, and Unicode can help

There is a natural frustration with programs not being able to handle Tamil, or having errors. Discussed common techniques companies use in prioritizing their work on different languages, and how to leverage improvements.

No matter what approach is taken, common need for the following (draft list)

Identify problems in key application programs and set up communication with vendors
Core set of open source (individual and commercial use) high-quality fonts
Freely available keyboard specifications and IMEs
Central place for developers to go for help with Tamil (on Unicode site or Government site, perhaps wiki?)
Up-to-date locale data (eg CLDR)
Need to investigate having standard ligature table for OpenType to map Unicode sequences to TACE syllables.

Side issue: the Tamil numbers are almost archaic, and offer opportunities for spoofing, so are discouraged for identifiers such as IDN.

Discussion of Unicode Locales Project (CLDR)

(not able to do for lack of time)

We wish to thank our hosts, the Tamil Virtual University and Government of Tamil Nadu

South Asia Charter for Tamil Discussion (L2/07-272, item 10)

Goal: ensure that Unicode meets the needs for representation and processing of Tamil.

This may or may not require the encoding of new characters. Any recommendation should exhaustively examine the implications, including on existing data, on existing software (processing, display, etc), on education about the standard, and on consistency of model for the Indic and other South Asian scripts.

The scope of the subcommittee is to review the issues and to make recommendations to the UTC.

Step 1: Identification of the issues

Identify the issues (problems or perceived problems) with the current representation. Determine whether they are issues with the standard itself (encoding, properties, or algorithms) or with implementations. Determine the nature of the issues: technical, perceptual or educational.

Candidate issues:
1. disconnect of the code chart with the user expectations
2. efficiency in storage/transmission
3. efficiency in processing
4. correctness of implementations
5. difficulty of implementation

Step 2: Evaluation of possible approaches

This enumeration of possible approaches does not preclude the examination of other approaches (which may extend on or combine the approaches below). The questions listed for each approach are illustrative of the kinds of questions that need to be answered for a proper evaluation of the approach; they are not exhaustive.

Approach A: current model

How would those issues be addressed with the current representation? Are there any enhancements (new characters, changes to properties, addition of properties, guidelines, documentation in the standard) that would alleviate those issues?

Approach B: TACE-16 repertoire added to Unicode

How would adding the TACE-16 repertoire to Unicode address those issues? And what would be the new problems created by the introduction of that repertoire?

For example:
• dual encoding and stability policy
• does it need to be in the BMP, and if so, how does it fit there?
• would encoding in a non-contiguous area help or hurt compression techniques?

Approach C: TACE-16 repertoire in the PUA

What are the issues that applications are faced with?

For example:
• collisions with other well-established PUA uses, such as CJK:
- there is not always an "official" mapping, different vendors do different things
- PUA conflicts:
HKSCS 9571 (U+2721B) → U+E78D
GB18030 A6D9 (,) → U+E78D
- PUA differentiation:
HKSCS 8BFA (U+20087) → U+F572
GB18030 FE51 (U+20087) → U+E816
• PUA characters cannot be used in IDN.
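
The PUA conflict listed above can be detected mechanically. The following is a minimal sketch that uses only the four mappings quoted in the examples; the names PUA_MAPPINGS and conflicts are hypothetical.

```python
from collections import defaultdict

# (source charset, source code) -> PUA code point, copied from the examples above.
PUA_MAPPINGS = {
    ("HKSCS", 0x9571): 0xE78D,
    ("GB18030", 0xA6D9): 0xE78D,
    ("HKSCS", 0x8BFA): 0xF572,
    ("GB18030", 0xFE51): 0xE816,
}

# Invert the table to see which sources claim each PUA code point.
users_by_code_point = defaultdict(list)
for source, pua in PUA_MAPPINGS.items():
    users_by_code_point[pua].append(source)

# A conflict is any PUA code point claimed by more than one source.
conflicts = {
    hex(cp): sources
    for cp, sources in users_by_code_point.items()
    if len(sources) > 1
}
print(conflicts)
```

A TACE-16 allocation in the PUA would need the same kind of check against every other private agreement in use on the target systems.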

Approach D: TACE-16 as a separate IANA-registered character set

How simple is it to add support for a new character set (with a well-defined mapping to the existing Tamil block) to existing Unicode-based applications? Can this be done in a timely manner, across enough products to achieve viable workflows? What are the implications for already shipped software?

U+0B82 ( ஂ ) TAMIL SIGN ANUSVARA
U+0B83 ( ஃ ) TAMIL SIGN VISARGA
U+0B85 ( அ ) TAMIL LETTER A
U+0B86 ( ஆ ) TAMIL LETTER AA
U+0B87 ( இ ) TAMIL LETTER I
U+0B88 ( ஈ ) TAMIL LETTER II
U+0B89 ( உ ) TAMIL LETTER U
U+0B8A ( ஊ ) TAMIL LETTER UU
U+0B8E ( எ ) TAMIL LETTER E
U+0B8F ( ஏ ) TAMIL LETTER EE
U+0B90 ( ஐ ) TAMIL LETTER AI
U+0B92 ( ஒ ) TAMIL LETTER O
U+0B93 ( ஓ ) TAMIL LETTER OO
U+0B94 ( ஔ ) TAMIL LETTER AU
U+0B95 ( க ) TAMIL LETTER KA
U+0B99 ( ங ) TAMIL LETTER NGA
U+0B9A ( ச ) TAMIL LETTER CA
U+0B9C ( ஜ ) TAMIL LETTER JA
U+0B9E ( ஞ ) TAMIL LETTER NYA
U+0B9F ( ட ) TAMIL LETTER TTA
U+0BA3 ( ண ) TAMIL LETTER NNA
U+0BA4 ( த ) TAMIL LETTER TA
U+0BA8 ( ந ) TAMIL LETTER NA
U+0BA9 ( ன ) TAMIL LETTER NNNA
U+0BAA ( ப ) TAMIL LETTER PA
U+0BAE ( ம ) TAMIL LETTER MA
U+0BAF ( ய ) TAMIL LETTER YA
U+0BB0 ( ர ) TAMIL LETTER RA
U+0BB1 ( ற ) TAMIL LETTER RRA
U+0BB2 ( ல ) TAMIL LETTER LA
U+0BB3 ( ள ) TAMIL LETTER LLA
U+0BB4 ( ழ ) TAMIL LETTER LLLA
U+0BB5 ( வ ) TAMIL LETTER VA
U+0BB6 ( ஶ ) TAMIL LETTER SHA
U+0BB7 ( ஷ ) TAMIL LETTER SSA
U+0BB8 ( ஸ ) TAMIL LETTER SA
U+0BB9 ( ஹ ) TAMIL LETTER HA
U+0BBE ( ா ) TAMIL VOWEL SIGN AA
U+0BBF ( ி ) TAMIL VOWEL SIGN I
U+0BC0 ( ீ ) TAMIL VOWEL SIGN II
U+0BC1 ( ு ) TAMIL VOWEL SIGN U
U+0BC2 ( ூ ) TAMIL VOWEL SIGN UU
U+0BC6 ( ெ ) TAMIL VOWEL SIGN E
U+0BC7 ( ே ) TAMIL VOWEL SIGN EE
U+0BC8 ( ை ) TAMIL VOWEL SIGN AI
U+0BCA ( ொ ) TAMIL VOWEL SIGN O
U+0BCB ( ோ ) TAMIL VOWEL SIGN OO
U+0BCC ( ௌ ) TAMIL VOWEL SIGN AU
U+0BCD ( ் ) TAMIL SIGN VIRAMA
U+0BD7 ( ௗ ) TAMIL AU LENGTH MARK
U+0BE6 ( ௦ ) TAMIL DIGIT ZERO
U+0BE7 ( ௧ ) TAMIL DIGIT ONE
U+0BE8 ( ௨ ) TAMIL DIGIT TWO
U+0BE9 ( ௩ ) TAMIL DIGIT THREE
U+0BEA ( ௪ ) TAMIL DIGIT FOUR
U+0BEB ( ௫ ) TAMIL DIGIT FIVE
U+0BEC ( ௬ ) TAMIL DIGIT SIX
U+0BED ( ௭ ) TAMIL DIGIT SEVEN
U+0BEE ( ௮ ) TAMIL DIGIT EIGHT
U+0BEF ( ௯ ) TAMIL DIGIT NINE
U+0BF0 ( ௰ ) TAMIL NUMBER TEN
U+0BF1 ( ௱ ) TAMIL NUMBER ONE HUNDRED
U+0BF2 ( ௲ ) TAMIL NUMBER ONE THOUSAND
U+0BF3 ( ௳ ) TAMIL DAY SIGN
U+0BF4 ( ௴ ) TAMIL MONTH SIGN
U+0BF5 ( ௵ ) TAMIL YEAR SIGN
U+0BF6 ( ௶ ) TAMIL DEBIT SIGN
U+0BF7 ( ௷ ) TAMIL CREDIT SIGN
U+0BF8 ( ௸ ) TAMIL AS ABOVE SIGN
U+0BF9 ( ௹ ) TAMIL RUPEE SIGN
U+0BFA ( ௺ ) TAMIL NUMBER SIGN