L2/08-085

Report of the South Asia Subcommittee Meeting in Chennai, 2008/01/23-24

draft

The following is a report on the South Asia Subcommittee Meeting in Chennai.

Initial discussions

The following are informal notes that summarize the address from the IT Secretary of the Tamil Nadu Government. The exact words will be attached later.

The government is concerned about some things:

1. Errors in the encoding, such as the presence of non-Tamil characters

2. Efficiencies - The Government of Tamil Nadu is undertaking a massive e-governance effort. Huge digital libraries are coming up, and the government does not want to have to migrate these massive databases in the future. The government relies on the experts in the task force; 13 meetings have been held to review and analyze the TACE encoding.

3. There may be legal issues as well and the government has to be very careful.

The government's position is that one block for TACE-16 in Unicode would be desirable. The feasibility and practicality need to be investigated. The Secretary urged the UTC/UC to look at the issues and suggest ways to resolve them. The problems are genuine. Tamil is an international language; it is used in official transactions in Sri Lanka, so there are international ramifications. We have proposed our solution. We would like the UTC's recommendations.

As for Tamil Nadu Government, it intends to accept the recommendations of the task force and declare a standard and expects the Government of India’s support as well.

The government of TN wants to hold a conference in coordination with INFITT.

Opening remarks from Mark Davis; main points in summary:

Discussion

Step 1: Identification of the issues

Discussion of TACE-16 - a proposal for 347 new characters replacing the Tamil block. The following issues were raised as part of that presentation.

In discussion, an issue was raised:


Discussion of Unicode principles relevant to above (Mark Davis, Michael Kaplan)
  1. Unicode character is coded entity ≠ what user thinks of as "letter" or "character". Many examples from variety of scripts. This is true for languages other than Tamil as well. (e.g. Swedish A with a ring character).
  2. Canonical equivalence establishes identity; normalization (NFC) used for unambiguous representation (specifically, the "broken" vowel pieces are combined). Used in important cases like IDNA.
  3. Code order ≠ collation order for any language: eg, Z < a
  4. Display ≠ character codes. Many scripts require more than linear layout. Some of the errors or inefficiencies may be triggered by problems in correctness of implementations. For example, collation, rendering, etc.  OpenType fonts with ligature tables for all of the 345 or so Tamil characters identified could be precomposed and mapped to existing Unicode quite efficiently.
  5. Storage is an issue, but not predominant (discussion of UTFs, history of UTF-16). (See also #9)
  6. Stability is a key issue. Unicode is like the banking system. People have to be able to trust that it won't change out from under them. Major clients of Unicode are very dependent on this -- some would much rather have stability than improvements.
  7. Similarity of models helps with implementation. Perceived difficulty of implementation only increases if the language deviates from a family model and stands by itself.  There is strength in being part of a family model where only slight modifications are needed to support a new language. (For example, other Indic scripts.)
  8. There may be minority letters in script, or "mistaken" characters like U+0B82 ( ஂ ) TAMIL SIGN ANUSVARA which is not used in any language using the Tamil script. Characters can be annotated (as Anusvara is), or deprecated (stronger), but never removed. (There was discussion of options for this character, as to whether to annotate or deprecate.) Even the name of the character cannot be changed. (There are separate data files in the Unicode Character Database with name annotations, more information, and with correction.) Note: localized names for Tamil characters can be supplied (eg Pulli vs "VIRAMA", or visarga), so that vendors can display the correct name in programs like CharMap.
  9. Results of efficiency vary dramatically according to the code used. Efficiency in storage/transmission is implementation dependent, and algorithms can be carefully optimized.  Efficiency in processing is a desirable goal, but if stability of implementation forces a hit on efficiency, that is acceptable. (See also #5)
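
Points 1 to 3 above can be checked with a short Python sketch that uses only the standard unicodedata module. The sample strings are chosen here for illustration and were not part of the meeting discussion.

```python
import unicodedata

# Point 2: the two-part vowel sign O (U+0BCA) is canonically equivalent
# to the sequence <U+0BC6 VOWEL SIGN E, U+0BBE VOWEL SIGN AA>.
ko_split = "\u0b95\u0bc6\u0bbe"   # KA + E + AA (the "broken" pieces)
ko_whole = "\u0b95\u0bca"         # KA + O

nfc_split = unicodedata.normalize("NFC", ko_split)  # pieces are combined
nfd_whole = unicodedata.normalize("NFD", ko_whole)  # O is split again

# Point 1: a user-perceived letter may be more than one coded character.
a_ring = unicodedata.normalize("NFC", "A\u030a")    # A + COMBINING RING ABOVE

# Point 3: binary code order is not collation order ('Z' sorts before 'a').
code_order = sorted(["apple", "Zebra"])

print(nfc_split == ko_whole, nfd_whole == ko_split)
print(f"U+{ord(a_ring):04X}", code_order)
```

Both spellings of the syllable compare equal after normalization, which is why processes such as IDNA normalize before comparing.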

Notes:

SMP vs BMP

  1. BMP is from 0000 to FFFF. Most common characters, widely supported
  2. SMP (used in these notes as shorthand for all the supplementary planes) is from 10000 to 10FFFF; strictly speaking, the SMP is plane 1 only, 10000 to 1FFFF. Infrequent characters, historic scripts. Support in major OS's began a few years ago, but many applications don't fully support them. (Examples: Vista supports plane 1 and 2 (only fonts for plane 2)).
  3. BMP code points from U+0800 upward (which includes Tamil) are transmitted as 3 UTF-8 octets, while SMP code points require 4.  In UTF-16, BMP characters take 2 bytes and SMP characters take 4. (So in UTF-8 the difference is not double, as might be expected.)
  4. Space in the SMP is not constrained, whereas space in the BMP is very confined at this point. In particular, certain areas are reserved for Right-Left characters, which cannot be changed without serious consequences.
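
The sizes in point 3 can be confirmed with a minimal Python sketch. The helper name encoded_sizes and the sample code points are choices made here for illustration.

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one code point."""
    # "utf-16-be" is used so that no byte order mark is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


tamil_ka = "\u0b95"            # BMP: TAMIL LETTER KA
supplementary = "\U00010000"   # first code point beyond the BMP

for label, ch in [("BMP Tamil", tamil_ka), ("supplementary", supplementary)]:
    utf8, utf16 = encoded_sizes(ch)
    print(f"{label}: U+{ord(ch):04X} UTF-8={utf8} UTF-16={utf16}")
```

A Tamil letter moved from the BMP to a supplementary plane would grow from 3 to 4 octets in UTF-8, and from 2 to 4 bytes in UTF-16.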

Discussion

Step 2: Evaluation of possible approaches

We started from the bottom up:
 


Approach D: TACE-16 as a separate IANA-registered character set
Unicode programs would convert on input to Unicode, process, and emit TACE-16 on output. (Similar to GB 18030.) Non-Unicode programs could process natively.


Pros

  1. No dependency on Unicode - Tamil Nadu government can do independently
  2. TACE16 is very easy to implement; not stateful, easy conversion to and from Unicode
  3. well-established path for charsets - implementations are used to using them
  4. governments have strong sway
  5. the Tamil Nadu government can do exactly what it wants
  6. useful in any closed environment: examples: cell phone, natural-language processing, etc.
  7. well-defined path for programs to support -- programs are used to doing conversion
  8. if multilingual capabilities are required inside the same codepage, then additional repertoire would need to be added, eg, for English, French, Telugu, Malayalam, Sinhala, etc.
    1. Example: GB 18030 (China) includes all Unicode characters, with an algorithmic mapping to Unicode for most characters.
    2. The simpler the mapping to Unicode, the more likely implementations would pick it up.
  9. See iana.org for the list of IANA charsets.
  10. Other TACE advantages: eg Processing using syllables (eg NLP) would use single code points.
  11. On Unicode system, where conversion is done, algorithms depending on Unicode properties would work: line-breaking, sorting, identifiers, etc.
Cons
  1. whether it is added to products depends on companies adding the conversion tables.
  2. for cell-phone environments, 8-bit encoding may be preferred
  3. uptake by companies will depend on critical mass, so a bit of a chicken and egg problem
  4. performance issues need investigation
  5. Typically Unicode programs / OSs will convert to Unicode for rendering, etc. (Linux may not -- needs investigation.) However, typically performance is not substantially impacted for rendering.
  6. Would need to evangelize key players
    1. ICU, Windows, Java, PHP, Python, Perl, Linux,...
    2. Many will pick up without further evangelization
    3. Most are combination of data+algorithms
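
The "convert on input, process, emit on output" path described for Approach D might look like the following Python sketch. The four code values in the table are hypothetical placeholders, not real TACE-16 assignments, and the function names are invented for this illustration; a real converter would carry the full repertoire.

```python
import unicodedata

# HYPOTHETICAL code values for illustration only; these are NOT the real
# TACE-16 assignments. Each value stands for one Tamil syllable, written
# as the Unicode sequence that already represents it.
TACE_TO_UNICODE = {
    0x0100: "\u0b95",          # ka
    0x0101: "\u0b95\u0bcd",    # k (ka + pulli)
    0x0102: "\u0b95\u0bbe",    # kaa
    0x0103: "\u0b95\u0bca",    # ko
}
UNICODE_TO_TACE = {seq: code for code, seq in TACE_TO_UNICODE.items()}
MAX_SEQ_LEN = max(len(seq) for seq in UNICODE_TO_TACE)


def tace_to_unicode(codes):
    """Decode a list of (hypothetical) TACE code values to a Unicode string."""
    return "".join(TACE_TO_UNICODE[code] for code in codes)


def unicode_to_tace(text):
    """Encode a Unicode string, trying the longest known sequence first."""
    text = unicodedata.normalize("NFC", text)  # combine "broken" vowel pieces
    codes, i = [], 0
    while i < len(text):
        for length in range(MAX_SEQ_LEN, 0, -1):
            chunk = text[i:i + length]
            if chunk in UNICODE_TO_TACE:
                codes.append(UNICODE_TO_TACE[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no mapping for U+{ord(text[i]):04X}")
    return codes


sample = "\u0b95\u0bc6\u0bbe\u0b95\u0bcd\u0b95\u0bbe"   # ko + k + kaa, unnormalized
encoded = unicode_to_tace(sample)
print([hex(code) for code in encoded])
```

The conversion is table driven and stateless, which matches the "data+algorithms" remark above: most of the work for an implementer is carrying the table.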

Approach C: TACE-16 repertoire in the PUA
Pros
 

  1. No dependency on Unicode - Tamil Nadu government can do independently
  2. Encapsulated in Unicode, so no conversion necessary
  3. SMP PUA is unencumbered - TACE-16 group could establish precedent (homesteading)
  4. Compression of SMP works well
  5. Rendering would be straightforward.
  6. Other TACE advantages: eg Processing using syllables (eg NLP) would use single code points.
Cons
 
  1. BMP PUA is in wide use for ideographs already, so it probably wouldn't be practical. (needs investigation, there might be enough room)
  2. Overlap problem - some others could use code points for different purpose
  3. Many implementations, & all old implementations, will treat as unknown characters (impacting anything dependent on properties: line-breaking, sorting, identifiers, etc). No standard Unicode properties, so algorithms driven by them won't work
  4. Conversions are needed for interfacing with standards that require standard Unicode. For example, IDNs will be in standard Unicode, requiring a conversion.
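
Con 3 can be seen directly in the character properties that a typical library exposes. The sketch below uses Python's standard unicodedata module; the helper name describe is chosen here for illustration.

```python
import unicodedata


def describe(ch):
    """Return (General_Category, character name or None) for one code point."""
    return unicodedata.category(ch), unicodedata.name(ch, None)


bmp_pua = describe("\ue000")       # first BMP private-use code point
smp_pua = describe("\U000f0000")   # first plane 15 private-use code point
ka = describe("\u0b95")            # TAMIL LETTER KA
pulli = describe("\u0bcd")         # TAMIL SIGN VIRAMA (pulli)

for label, info in [("BMP PUA", bmp_pua), ("plane 15 PUA", smp_pua),
                    ("ka", ka), ("pulli", pulli)]:
    print(label, info)
```

Private-use code points all report the generic category Co and have no name, so a property-driven algorithm cannot tell a private-use consonant from a private-use vowel sign, while the encoded Tamil characters carry distinct letter and mark categories.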

Discussion:



Approach CD: Approach C, plus register it with IANA as a charset.
 
  1. Mixture of advantages and disadvantages of above.
  2. Examples:
    1. No dependency on Unicode - Tamil Nadu government can do independently
    2. In some cases, TACE would convert to and from Unicode; in others it could be interpreted natively.
    3. Character properties would be available; all multilingual capabilities would be present;
    4. IANA pros and cons from D.
       


Approach B: TACE-16 repertoire added to Unicode


TACE-16 task force investigated different approaches (listed above, and with full report attached). Major choices are BMP vs SMP. The TACE task force would like to see TACE in the BMP; failing that, the SMP would be an acceptable backup.

Pros

  1. See attached document

Cons

  1. Unless current Unicode model can be shown to not be able to represent Tamil, the duplicate encoding and stability principles would prevent addition.
  2. Accommodating TACE in the BMP would require moving the reserved RTL (U+0800 .. U+08FF) code point range. (Space is not an issue for the SMP.)

Approach B1: Add only "pure consonants" to Unicode

This would be adding what is currently represented as <consonant + pulli> as precomposed characters to the current Tamil block in Unicode.

Pros

  1. Pure consonants represent 30% of the letter frequency in Tamil text
  2. Possible performance benefits in collation, text size (for unnormalized text)

Cons

  1. Would be introducing new precomposed characters
  2. Normalization would replace the new characters with the current ones.
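
Con 2 can be illustrated with an existing precedent from another Indic script, since no precomposed pure consonants exist for Tamil. The sketch assumes that a new precomposed character would be treated like U+0958 DEVANAGARI LETTER QA, which has a canonical decomposition and is excluded from composition; whether a Tamil addition would be handled the same way is an assumption made here.

```python
import unicodedata

qa_precomposed = "\u0958"        # DEVANAGARI LETTER QA
qa_sequence = "\u0915\u093c"     # KA + SIGN NUKTA

# Even the composing form NFC replaces the precomposed character with
# the sequence, because U+0958 is excluded from composition.
nfc_result = unicodedata.normalize("NFC", qa_precomposed)

print([f"U+{ord(c):04X}" for c in nfc_result])
```

Under that assumption, any text passed through normalization would lose the new characters, and with them the size and collation benefits listed under Pros.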

Key Areas where governments, industry, and Unicode can help

There is a natural frustration with programs not being able to handle Tamil, or having errors. Discussed common techniques companies use in prioritizing their work on different languages, and how to leverage improvements.

No matter what approach is taken, common need for the following (draft list)
  1. Identify problems in key application programs and set up communication with vendors
  2. Core set of open source (individual and commercial use) high-quality fonts
  3. Freely available keyboard specifications and IMEs
  4. Central place for developers to go for help with Tamil (on Unicode site or Government site, perhaps wiki?)
  5. Up-to-date locale data (eg CLDR)
  6. Need to investigate having standard ligature table for OpenType to map Unicode sequences to TACE syllables.
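
As a starting point for item 6, the units that such a table would cover can be found by splitting Unicode text into a base character plus its following marks. This is a simplified sketch, not the UAX #29 grapheme cluster algorithm, and it does not treat conjuncts such as ksha as a single unit; the function name split_syllables is invented here.

```python
import unicodedata


def split_syllables(text):
    """Group each base character with the combining marks that follow it."""
    text = unicodedata.normalize("NFC", text)
    units = []
    for ch in text:
        # General categories Mn and Mc cover the pulli and the vowel signs.
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"   # the word "Tamil": 5 code points
units = split_syllables(word)
print(len(word), len(units), units)
```

The word is five code points long but splits into three units, which is the count a user would perceive and the count a syllable-based table would assign glyphs to.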

Side issue: the Tamil numbers are almost archaic, and offer opportunities for spoofing, so are discouraged for identifiers such as IDN.

Discussion of Unicode Locales Project (CLDR)


We wish to thank our hosts, the Tamil Virtual University and Government of Tamil Nadu


South Asia Charter for Tamil Discussion (L2/07-272, item 10)


Goal: ensure that Unicode meets the needs for representation and processing of Tamil.

This may or may not require the encoding of new characters. Any recommendation should exhaustively examine the implications, including on existing data, on existing software (processing, display, etc), on education about the standard, and on consistency of model for the Indic and other South Asian scripts.

The scope of the subcommittee is to review the issues and to make recommendations to the UTC.

Step 1: Identification of the issues

Identify the issues (problems or perceived problems) with the current representation. Determine whether they are issues with the standard itself (encoding, properties, or algorithms) or with implementations. Determine the nature of the issues: technical, perceptual or educational.

Candidate issues:
1 disconnect of the code chart with the user expectations
2 efficiency in storage/transmission
3 efficiency in processing
4 correctness of implementations
5 difficulty of implementation

Step 2: Evaluation of possible approaches

This enumeration of possible approaches does not preclude the examination of other approaches (which may extend on or combine the approaches below). The questions listed for each approach are illustrative of the kinds of questions that need to be answered for a proper evaluation of the approach; they are not exhaustive.

Approach A: current model

How would those issues be addressed with the current representation? Are there any enhancements (new characters, changes to properties, addition of properties, guidelines, documentation in the standard) that would alleviate those issues?

Approach B: TACE-16 repertoire added to Unicode

How would adding the TACE-16 repertoire to Unicode address those issues? And what would be the new problems created by the introduction of that repertoire?

For example:
• dual encoding and stability policy
• does it need to be in the BMP, and if so, how does it fit there?
• would encoding in a non-contiguous area help or hurt compression techniques?

Approach C: TACE-16 repertoire in the PUA

What are the issues that applications are faced with?

For example:
• collisions with other well-established PUA uses, such as CJK:
  - there is not always an "official" mapping, different vendors do different things
  - PUA conflicts:
    HKSCS 9571 (U+2721B) → U+E78D
    GB18030 A6D9 (,) → U+E78D
  - PUA differentiation:
    HKSCS 8BFA (U+20087) → U+F572
    GB18030 FE51 (U+20087) → U+E816
• PUA characters cannot be used in IDN.
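
The two kinds of PUA problem listed above can be detected mechanically once vendor mappings are tabulated. The Python sketch below uses only the four mappings quoted in the examples; the table layout and function names are choices made here for illustration.

```python
from collections import defaultdict

# (charset, charset code, intended character as given above, PUA code point)
MAPPINGS = [
    ("HKSCS", 0x9571, "U+2721B", 0xE78D),
    ("GB18030", 0xA6D9, ",", 0xE78D),
    ("HKSCS", 0x8BFA, "U+20087", 0xF572),
    ("GB18030", 0xFE51, "U+20087", 0xE816),
]


def pua_conflicts(mappings):
    """PUA code points that different charsets use for different characters."""
    by_pua = defaultdict(set)
    for _charset, _code, intended, pua in mappings:
        by_pua[pua].add(intended)
    return {pua: chars for pua, chars in by_pua.items() if len(chars) > 1}


def pua_differentiation(mappings):
    """Characters that different charsets map to different PUA code points."""
    by_char = defaultdict(set)
    for _charset, _code, intended, pua in mappings:
        by_char[intended].add(pua)
    return {ch: puas for ch, puas in by_char.items() if len(puas) > 1}


conflicts = pua_conflicts(MAPPINGS)
differences = pua_differentiation(MAPPINGS)
print({hex(p): sorted(c) for p, c in conflicts.items()})
print({ch: sorted(hex(p) for p in puas) for ch, puas in differences.items()})
```

A TACE-16 assignment in the PUA would need the same check against every private-use convention that Tamil data might be mixed with.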

Approach D: TACE-16 as a separate IANA-registered character set

How simple is it to add support for a new character set (with a well-defined mapping to the existing Tamil block) to existing Unicode-based applications? Can this be done in a timely manner, across enough products to achieve viable workflows? What are the implications for already shipped software?

 

U+0B82 ( ஂ ) TAMIL SIGN ANUSVARA
U+0B83 ( ஃ ) TAMIL SIGN VISARGA
U+0B85 ( அ ) TAMIL LETTER A
U+0B86 ( ஆ ) TAMIL LETTER AA
U+0B87 ( இ ) TAMIL LETTER I
U+0B88 ( ஈ ) TAMIL LETTER II
U+0B89 ( உ ) TAMIL LETTER U
U+0B8A ( ஊ ) TAMIL LETTER UU
U+0B8E ( எ ) TAMIL LETTER E
U+0B8F ( ஏ ) TAMIL LETTER EE
U+0B90 ( ஐ ) TAMIL LETTER AI
U+0B92 ( ஒ ) TAMIL LETTER O
U+0B93 ( ஓ ) TAMIL LETTER OO
U+0B94 ( ஔ ) TAMIL LETTER AU
U+0B95 ( க ) TAMIL LETTER KA
U+0B99 ( ங ) TAMIL LETTER NGA
U+0B9A ( ச ) TAMIL LETTER CA
U+0B9C ( ஜ ) TAMIL LETTER JA
U+0B9E ( ஞ ) TAMIL LETTER NYA
U+0B9F ( ட ) TAMIL LETTER TTA
U+0BA3 ( ண ) TAMIL LETTER NNA
U+0BA4 ( த ) TAMIL LETTER TA
U+0BA8 ( ந ) TAMIL LETTER NA
U+0BA9 ( ன ) TAMIL LETTER NNNA
U+0BAA ( ப ) TAMIL LETTER PA
U+0BAE ( ம ) TAMIL LETTER MA
U+0BAF ( ய ) TAMIL LETTER YA
U+0BB0 ( ர ) TAMIL LETTER RA
U+0BB1 ( ற ) TAMIL LETTER RRA
U+0BB2 ( ல ) TAMIL LETTER LA
U+0BB3 ( ள ) TAMIL LETTER LLA
U+0BB4 ( ழ ) TAMIL LETTER LLLA
U+0BB5 ( வ ) TAMIL LETTER VA
U+0BB6 ( ஶ ) TAMIL LETTER SHA
U+0BB7 ( ஷ ) TAMIL LETTER SSA
U+0BB8 ( ஸ ) TAMIL LETTER SA
U+0BB9 ( ஹ ) TAMIL LETTER HA
U+0BBE ( ா ) TAMIL VOWEL SIGN AA
U+0BBF ( ி ) TAMIL VOWEL SIGN I
U+0BC0 ( ீ ) TAMIL VOWEL SIGN II
U+0BC1 ( ு ) TAMIL VOWEL SIGN U
U+0BC2 ( ூ ) TAMIL VOWEL SIGN UU
U+0BC6 ( ெ ) TAMIL VOWEL SIGN E
U+0BC7 ( ே ) TAMIL VOWEL SIGN EE
U+0BC8 ( ை ) TAMIL VOWEL SIGN AI
U+0BCA ( ொ ) TAMIL VOWEL SIGN O
U+0BCB ( ோ ) TAMIL VOWEL SIGN OO
U+0BCC ( ௌ ) TAMIL VOWEL SIGN AU
U+0BCD ( ் ) TAMIL SIGN VIRAMA
U+0BD7 ( ௗ ) TAMIL AU LENGTH MARK
U+0BE6 ( ௦ ) TAMIL DIGIT ZERO
U+0BE7 ( ௧ ) TAMIL DIGIT ONE
U+0BE8 ( ௨ ) TAMIL DIGIT TWO
U+0BE9 ( ௩ ) TAMIL DIGIT THREE
U+0BEA ( ௪ ) TAMIL DIGIT FOUR
U+0BEB ( ௫ ) TAMIL DIGIT FIVE
U+0BEC ( ௬ ) TAMIL DIGIT SIX
U+0BED ( ௭ ) TAMIL DIGIT SEVEN
U+0BEE ( ௮ ) TAMIL DIGIT EIGHT
U+0BEF ( ௯ ) TAMIL DIGIT NINE
U+0BF0 ( ௰ ) TAMIL NUMBER TEN
U+0BF1 ( ௱ ) TAMIL NUMBER ONE HUNDRED
U+0BF2 ( ௲ ) TAMIL NUMBER ONE THOUSAND
U+0BF3 ( ௳ ) TAMIL DAY SIGN
U+0BF4 ( ௴ ) TAMIL MONTH SIGN
U+0BF5 ( ௵ ) TAMIL YEAR SIGN
U+0BF6 ( ௶ ) TAMIL DEBIT SIGN
U+0BF7 ( ௷ ) TAMIL CREDIT SIGN
U+0BF8 ( ௸ ) TAMIL AS ABOVE SIGN
U+0BF9 ( ௹ ) TAMIL RUPEE SIGN
U+0BFA ( ௺ ) TAMIL NUMBER SIGN