L2/01-305

Title: Draft UTC Response to L2/01-304, "Feedback on Unicode Standard 3.0",  
an article published in Vishwabharat@tdil (Newsletter of the TDIL Programme of
Ministry of Information Technology, Government of India).

URLs:	http://vishwabharat.tdil.gov.in/newsletter1.htm
	http://www.unicode.org/L2/L2001/01304-feedback.pdf

Source: Rick McGowan

Date:	August 8, 2001


===== ABSTRACT

The document L2/01-304 asks for some quite reasonable additional characters,  
provides some annotations and information for block introductions, and also  
requests a number of codepoint changes.  This document is a detailed
preliminary analysis of all the requests and suggestions made in L2/01-304,  
with some suggested actions for UTC and/or the authors of L2/01-304.


===== INTRODUCTION

UTC would like to thank the authors of L2/01-304 for writing this detailed  
analysis of Indic script encoding within the Unicode standard, and looks  
forward to discussion of the various points raised by the document.

Herein pages of the document L2/01-304 will be referred to by number,  
beginning with "Page 15".


===== PAGE 15.

Several points on this page are numbered with small Roman numerals.

In point (iii) of page 15 the document apparently requests what amounts to a  
name change with regard to the terminology "virama" versus "halant", in  
various scripts.  This cannot be accommodated due to UTC and WG2 policy about  
name changes, but probably some explanatory text and/or annotations in the  
name list could be written to clarify the issue, and to discuss the two  
terms.

Point (iv) on page 15 appears to ask for a complete change in the rendering  
model so that "halant" would render conjuncts horizontally while "ZWJ" would  
be used for vertical rendering of conjuncts.  If that is indeed what is being  
suggested, it is not possible to accommodate because it would invalidate the  
entire existing model as well as all existing data and implementations --  
both font implementations and software.  Suggested UTC action: Seek  
clarification from the authors as to the intent of point (iv).


===== PAGE 16.

Point (v) of page 16 seems merely to point out that the last column of each  
128-character block is for language-specific letters.  This is already set  
aside for script-specific entries, both in ISCII and Unicode.

Point (vi) suggests that the authors of L2/01-304 will write up some  
detailed block descriptions for the remainder of the Indic scripts that are  
not already detailed.  This is a very good development, since UTC has not to  
date been able to write these block introductions.

Point (viii) expresses the desire that transliteration between scripts be
simple and one-to-one.  However, this is not possible without completely
invalidating the existing codes.  (The differences between North and South
Indian scripts would seem to make this a practical impossibility anyway, and
the goal is furthermore probably contradicted by the apparent plans for
Tamil; see below.)  Clearly it would be a desirable state of affairs, but the
document offers no further explanation or plan.

Point (ix) of page 16 is headed "Updating constraints of Unicode consortium  
regarding character encoding stability".  It seems to be merely a quotation  
of the Unicode policies as expressed in:

	http://www.unicode.org/unicode/standard/policies.html

In this case, UTC should inquire as to why, given rule (a) -- that  
characters once encoded will not be moved or removed -- the document goes on  
to propose removal or movement of a substantial number of characters. It  
would appear that this point may be asking that fundamental policies be  
changed to accommodate proposed incompatible changes in various scripts.   
This issue must be clarified.  Suggested action for UTC: obtain clarification  
from the authors on the intent of including point (ix), page 16; and  
re-iterate policies of both Unicode and WG2.

===== PAGE 17.

The remainder of L2/01-304 is divided into sections for each of several  
scripts, or languages, beginning with Devanagari.  Likewise, this document  
follows that structure with headers for clarity in matching responses to  
L2/01-304.

Codepoints given below, throughout the rest of this document, are in
hexadecimal and refer to codepoints as used or suggested in L2/01-304.   
Please note that many of the codepoints under discussion herein are not  
encoded in the Unicode standard, but are encoding suggestions made in  
L2/01-304.  Here, the suggested codepoint numbers are retained only for  
clarity in matching responses to L2/01-304, and do not imply any endorsement  
by UTC or final encoding in any way.


DEVANAGARI.

A chart is presented on page 16, which coincides precisely with the Unicode
chart, apart from some additions and the removal of several consonants with
nukta.

0904, 093A, 0955, 0956.  The document proposes a number of additions which  
appear to be fine candidates for encoding. As with all suggested additions,  
UTC needs detailed explanations of their usage, form, etc, and detailed WG2  
forms will be needed.  The document itself does not provide sufficient detail  
for adding these characters to be formally proposed.  Suggested UTC action:  
for these, and all other characters mentioned below, obtain further details,  
and work with MIT experts to draft proposed additions; then submit proposals  
with WG2 forms.

0958 - 095F.  The document proposes to discourage the use of these
precomposed characters with nuktas. By putting them into the composition
exclusions list, UTC has already excluded them from the Form C normalization.
Annotations and cautionary statements could also be added to that effect,
with whatever degree of strength is appropriate.
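
The effect of the composition exclusions can be illustrated with a minimal
sketch, assuming the normalization data shipped with Python's standard
unicodedata module.  The helper name nfc_code_points is hypothetical and is
used only for this illustration.

```python
import unicodedata

def nfc_code_points(text):
    """Return the code points of `text` after Normalization Form C."""
    return [f"{ord(c):04X}" for c in unicodedata.normalize("NFC", text)]

# Each precomposed nukta letter 0958..095F is decomposed by normalization
# and, being on the composition exclusions list, is not recomposed by NFC.
for cp in range(0x0958, 0x0960):
    print(f"{cp:04X} -> {' '.join(nfc_code_points(chr(cp)))}")
```

Under that assumption, each of the eight characters comes out of Form C as a
base consonant followed by 093C (nukta), never as the precomposed codepoint.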

094D.  A change in the character name is suggested.  UTC might want to make  
an annotation, since a name change is not possible.

0970.  The document suggests an annotation or explanation which is suitable.
Suggested UTC action: add explanation.

The document points out several representative experts who apparently were
consulted during the preparation of this document.  It also suggests a number
of explanations and details that could be added to the block introductions,
e.g., for the Konkani language, written with Devanagari.  Suggested UTC
action: add Konkani-specific remarks to the Devanagari block introduction, as
illustrated on page 17.

0974.  DEVANAGARI LETTER SHORT YA is proposed for use in Sindhi, which seems  
to be a fine addition.  UTC action: request further details and add to list  
of proposed items.

Some other Sindhi-related comments and explanations follow, e.g., for 0952.
Suggested UTC action: add these comments to the Devanagari block  
introduction.


===== PAGE 18.

BENGALI.

For Bengali, as well as a number of other scripts, the document proposes to  
add DANDA and DOUBLE DANDA clones.  It was long ago decided in UTC not to  
clone these punctuation characters.  Therefore all of the suggestions for  
adding DANDA and DOUBLE DANDA characters for all of the scripts must be  
declined.  However, the block introductions should specify where users are to  
look for the DANDA and DOUBLE DANDA characters (in the Devanagari block).
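
As a small illustration of where the shared punctuation lives, the following
sketch looks up the two characters by codepoint, assuming the character names
available through Python's standard unicodedata module.  The helper name
describe is hypothetical.

```python
import unicodedata

# The single DANDA and DOUBLE DANDA encoded in the Devanagari block, which
# the other Indic scripts are expected to share rather than clone.
DANDA = "\u0964"
DOUBLE_DANDA = "\u0965"

def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

print(describe(DANDA))
print(describe(DOUBLE_DANDA))
```

A Bengali, Gurmukhi, or Oriya text would terminate a sentence with these same
two codepoints.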

The document also proposes the addition of INVISIBLE LETTER in a number of
scripts.  INVISIBLE LETTER will be dealt with here, and not below under each
place it is proposed.  It was long ago decided that Unicode would not use
this INVISIBLE LETTER; the equivalent mechanism, using ZWJ, is well explained
in the text on Devanagari and other scripts.  This has been discussed
elsewhere, but the comments are not available.  Here is one note from Anupam
Saurabh, May 27 1998:

  "The INV is used to simulate a joining and display of resultant glyph with
  an invisible consonant. ZWJ as described on page 6-71 of Unicode 2.0 is
  used to alter the behavior of rendering process as if it had been joined
  with either preceding or following character, or both. It also mentions
  the function of ZWJ for Indian languages, and the explanation is
  identical to that of INV in ISCII. Apart from difference in the
  language of explanation, I do not find any other difference."
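
The ZWJ mechanism referred to in this note can be sketched as a plain
character sequence.  The example below is only an illustration: it assumes
the consonant + virama + ZWJ sequence that the Unicode Standard describes for
requesting a half-form in Devanagari, and the variable and function names are
hypothetical.  Whether a half-form is actually displayed depends on the font
and rendering software.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER

# Consonant + virama + ZWJ: asks the renderer to show the consonant as if
# it were joined to a following letter, without encoding an invisible letter.
half_form_request = KA + VIRAMA + ZWJ

def names(text):
    """Return the character names in `text`, in order."""
    return [unicodedata.name(c) for c in text]

print(names(half_form_request))
```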

09BD.  "Avagrah" for Bengali is proposed.  Suggested UTC action: add this  
and several other Avagraha characters that are proposed for other scripts to  
the list of suggested additions.  These are well-understood already, so no  
further information is required about each of them; they merely need to be  
added to the list of proposed additions.

09CE, 09CF, 09DE.  The document suggests adding signs for Bengali YA, RA,  
and LA.  This would apparently be a change to the model for Bengali (which  
would not be possible), and it needs to be considered in detail, with  
reasoning for the proposal outlined.  In any case, some detailed explanation  
is needed.  Proposed UTC action: request further clarification and detailed  
explanation of these proposed additions for Bengali.

09F4, 09F5, 09F6.  The document suggests changes in namelist annotations;
these seem fine and should be added to the list of proposed additions.


GURMUKHI.

0A01.  Gurmukhi sign "adak bindi" and a visarga are suggested.  These seem  
fine, and should be added to the list of proposed additions.  Further  
clarification or documentation would be useful.

0A50. This suggestion amounts to moving the existing U+0A74 to 0A50.  It is  
not possible to move it.  The suggested annotation is already made in the  
name list for U+0A74.  No action is needed.

0A4E, 0A4F.  These two proposed additions for RA and HA subjoined signs for  
Gurmukhi would possibly change the model, and UTC should request detailed  
explanation to decide whether they are reasonable additions.

0A64, 0A65.  More Danda, Double Danda.  See above.

0A78.  The document suggests addition of "Gurmukhi Sign Khanda" which is  
identical to the character already encoded as U+262C.  There is no need to  
add a clone of this character here, but the block introduction could point  
out that it is encoded at U+262C.

0A33.  A shape change in the chart is suggested.  This needs detailed  
documentation as to whether this is a simple mistake, or a font detail.   
Suggested UTC action: seek clarification and reasoning for the suggested  
change.

The document suggests moving 0A74 to 0A50, which will not be possible.  But  
it is also said to be "not used", so maybe an annotation is in order  
regarding its obsolescence?  Suggested UTC action: request clarification, and  
propose an annotation, if needed.


GUJARATI.

0A8C.  Gujarati vocalic L is suggested.  It is probably fine for UTC to add
this to the list of proposed additions without further explanation, as it is
a well-understood letter.

0AD1, 0AD2, 0AD3, 0AD4.  These are some new accent marks that appear similar
to, or even identical to, accents in the U+0300 area.  They need to be
looked at in detail and explanations provided.  But they should probably not
be added; the existing non-spacing marks should be used instead.  Suggested
UTC action: request further explanation of these marks, and suggest using the
existing U+0300, etc, where applicable.

0AE1, 0AE2, 0AE3. These are suggested additions of Vocalic L, LL. It is
probably fine for UTC to add these to the list of proposed additions without
further explanation, as they are well-understood letters.

0AF1.  A rupee sign for Gujarati.  Probably fine to add this, since it looks  
like a symbol that's not made from pieces that are already encoded Gujarati  
characters.  Since the form of this character is very Gujarati-like, it  
should probably be proposed for encoding at this location, rather than in the  
Currency Symbols block.


ORIYA.

0B0A, 0B0B, 0B48, 0B4C.  The document proposes changing shapes of these
codepoints, but there is no explanation.  Details are needed before a
determination can be made.  Are these simply font differences?  They may be
so.  Suggested UTC action: request clarifications as mentioned.

0B66. The shape/size of the representative glyph for the "zero" character is  
probably fine to change; the document gives some detail as to why it should  
be smaller than 0B20, avoiding confusion.  Suggested UTC action: amend the  
charts for this character.


===== PAGE 19.

(ORIYA, cont'd.)

The document suggests removing the annotation under character 0B2C; probably  
because it also suggests the addition of an Oriya "va" character at 0B35.   
Suggested UTC actions:  remove the annotation and put "va" on the list of  
proposed additions for Oriya.

0B64, 0B65, 0B3A.  The dandas and invisible character are again proposed to
be cloned here, which cannot be accommodated.  See the explanatory
information above under Bengali.


TAMIL.

0B83.  The document indicates that this is not a combining character at all,
but an independent character.  The dotted circle may need to be removed from
the chart glyph.  In any case, it needs investigation, since it has the "Mc"
category in the Unicode Character Database.  This may be one mistake that UTC
will have to work around by deprecating this character and adding an
appropriately SPACING character at another location.

0B82.  The document suggests that this is not used in Tamil.  Presumably,  
this means that the Tamil language itself does not use it.  Suggested UTC  
action: clarify this, and annotate the character "for use in Sanskrit" if  
appropriate.

Here for TAMIL, and in other places below for other scripts, the document  
strongly indicates that the Unicode encoding with respect to the two-part  
vowel signs is considered simply incorrect.  It is apparently desired that  
Unicode remove the explanation of using sequences like "<consonant>, 0BC6,  
0BBE" instead of the two-part vowel symbol 0BCA after the consonant.  This  
probably needs some detailed discussion in UTC.  The document suggests not  
using the split pieces, but the two-part vowel signs.  Normalization form NFC  
should then be preferred here, and UTC may want to annotate this, and/or  
deprecate the split-up pieces in some cases.
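
The relationship between the split pieces and the two-part vowel sign can be
checked with a minimal sketch, assuming the normalization data shipped with
Python's standard unicodedata module.  TAMIL LETTER KA (0B95) is chosen here
only as an example consonant, and the names split_o, two_part_o, to_nfc and
to_nfd are hypothetical.

```python
import unicodedata

KA = "\u0B95"                        # TAMIL LETTER KA, example consonant
split_o = KA + "\u0BC6" + "\u0BBE"   # <consonant, 0BC6, 0BBE>
two_part_o = KA + "\u0BCA"           # <consonant, 0BCA>

def to_nfc(text):
    """Normalize to Form C (composed)."""
    return unicodedata.normalize("NFC", text)

def to_nfd(text):
    """Normalize to Form D (decomposed)."""
    return unicodedata.normalize("NFD", text)

# The two spellings are canonically equivalent: Form C yields the two-part
# vowel sign, Form D yields the split pieces.
print(to_nfc(split_o) == two_part_o)
print(to_nfd(two_part_o) == split_o)
```

Because Form C already produces 0BCA from the split sequence, preferring NFC
text would give the two-part vowel signs the document asks for without
removing any encoded character.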

Under the heading "TAMIL" the item numbered (3) is a serious issue: "Tamil  
letter sequencing as in the Unicode Standard 3.0 is also not acceptable.  New  
code-set is being worked out."  This looks like groundwork to ask for a  
complete overhaul and replacement of Tamil encoding.  Suggested UTC action:  
work with the experts, MIT and INFITT to show the workability of the current  
Unicode Tamil encoding.  If MIT goes ahead with endorsing an entirely new  
Tamil encoding within India, UTC should propose working together to specify  
precise mapping between the existing Unicode Tamil encoding and whatever  
local Indian standard is proposed to replace the ISCII Tamil encoding.


TELUGU.

Again the document asks for invisible letter and change in halant/virama
naming; see above under Bengali and Devanagari, respectively.

0C3D, 0C3C, 0C0D, 0C11, 0C34.  These seem like five reasonable additions for  
Telugu, and UTC should probably add them to the list of proposed additions  
for Telugu.


KANNADA.

0CBC, 0CBD. The document requests Nukta and Avagraha to be added for  
Kannada.  These are well-understood additions and should be put onto the list  
of proposed additions without requiring any further information.

0CD2, 0CD1, 0CF9, 0CD3, 0CD4.  Several additional diacritics are suggested,  
but there is not enough explanation for UTC to come to a determination.   
Some are probably just clones of non-spacing marks in the U+0300 block, and  
need to be explained before UTC can determine what to do.  As with similar  
diacritical marks suggested in previous sections, UTC should request  
clarification and point out the existing marks.

Again, the document requests that Unicode "delete" the equivalences for  
split vowel signs for Kannada.  Suggested UTC action: discuss in conjunction  
with the similar requests above for Tamil.  (Note: The previous L2 document  
(L2/01-037) submitted by the Directorate of Information Technology,  
Government of Karnataka, also points out that the various "length marks" have  
no independent existence;  see L2/01-037 page 4 of 17.  Furthermore, 0CD5  
and 0CD6 as well as 0CE1 are therein suggested for deletion; see page 11 of  
17.)
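
The equivalences in question can be listed with a short sketch, assuming the
canonical decomposition data shipped with Python's standard unicodedata
module.  The list name and helper name are hypothetical.

```python
import unicodedata

# Kannada vowel signs that carry a canonical decomposition.
KANNADA_TWO_PART_SIGNS = ["\u0CC0", "\u0CC7", "\u0CC8", "\u0CCA", "\u0CCB"]

def canonical_decomposition(ch):
    """Return the decomposition field for `ch` as hex codepoints."""
    return unicodedata.decomposition(ch)

for sign in KANNADA_TWO_PART_SIGNS:
    print(f"{ord(sign):04X} = {canonical_decomposition(sign)}")
```

Several of these decompositions end in the length marks 0CD5 or 0CD6, so the
length marks and the split vowel sign equivalences need to be considered
together.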


===== PAGE 20.


MALAYALAM.

A number of additions are requested, which seem fine, but need to be
explained and documented before they can be added to the list of proposed
additions for Malayalam.  The document also suggests changing the  
representative shape of 0D4C, but UTC should request confirmation and an  
explanation of the motivation for the proposed changes before taking any  
action.

Then the document requests that names of seventeen consonants be changed,  
which cannot be accommodated.

The document also suggests removing the character U+0D57 as a duplicate of  
U+0D4C.  Suggested UTC action: the claim must be investigated, and possibly  
one of them deprecated; but 0D57 cannot be removed.


ARABIC.

The document suggests adding three characters at 0656, 0657, 0658, since  
they are used in Urdu.  Suggested UTC action: request further information and  
documentation before adding them to the list of proposed additions for  
Arabic.  As suggested in correspondence to UTC from IBM, it will probably be  
found expedient to make any such additions to the Extended Arabic block  
rather than at the proposed locations.

The document suggests removing one annotation for 0690, which could be  
reasonable, and several other annotations as well, which should be  
investigated.  Suggested UTC action: investigate the proposed annotations.

The document asks for a new diacritical for "Hamza" at U+0659, but this seems
to be already encoded at U+0654.  Suggested UTC action: request  
clarification.

Annotations are suggested for pointing out Sindhi shape differences for some  
numerals 0664 - 0667.  These are probably reasonable additions for the Arabic
block introduction, and UTC should probably just add this information.

