Comments on Public Review Issues (May 28 - August 4, 2004)

The sections below contain comments received on the open Public Review Issues as of August 4, 2004, since the previous cumulative document was issued prior to UTC #99 (June, 2004).

25 Proposed Update UTR #17 Character Encoding Model

Date/Time: Tue Jul 20 01:25:37 CDT 2004
Contact: Tex Texin Report

A brief mention of UTF-7 should be considered for completeness.


31 Cantonese Romanization

No feedback received this period. See L2/04-173.


33 UTF Conversion Code Update

No feedback received this period. (Prior feedback from Sandra O'Donnell will be incorporated into an update.)


34 Draft UTS #35 Locale Data Markup Language, Version 1.2

Feedback for this PRI is directed to the CLDR bug database.


37 Clarification of the Use of Zero Width Joiner in Indic Scripts

Date/Time: Wed Jul 28 06:15:00 CDT 2004
Contact: Peter Kirk

I am concerned about the proposed change of the representation of Bengali Reph and Ya-phalaa from to . The sequence is already a ligature, or at least most naturally implemented as such. Insertion of ZWJ into such a ligature should not, according to TUS section 15.2, affect the rendering of the ligature. Many current implementations provide no mechanism by which ZWJ can have such an effect. The more natural sequence is the already defined , in which ZWNJ breaks the three-component ligature and so allows rendering as separate RA and a (C2-conjoining) ligature.

The same objection would appear to apply to all or many of the solutions proposed in section 6.2 of the review document. It seems that in these the function of the ZWJ is to break the (C1-conjoining) conjunct ligature which would otherwise be formed between the base character and the virama. But ZWJ cannot and should not be used to break ligatures. The more appropriate control character for this function is ZWNJ, which is specified as breaking ligatures and which actually performs this function in existing implementations.

Date/Time: Fri Jul 30 02:22:24 CDT 2004
Contact: anir_udr@rediffmail.com

Showing the isolated C2-conjoining form is a necessity in Bengali for converting teaching aids (such as alphabet books that teach children conjunct formation) written using legacy fonts. According to the PRI #37 document, a below-base C2-conjoining form exists for Ra and Ba only. In fact, most Bengali consonants have a below-base C2-conjoining form in traditional typography. I feel that falling back on full C1 + C2-conjoining is better than half C1 + full C2 when the font has no C1-C2 conjunct.

Another requirement is ISCII compatibility. In ISCII-based applications, if I write INV+Virama+Ma I get the isolated below-base form of Ma (or Na or La, as the case may be). Unicode should support such behaviour, so that NBSP+ZWJ+VIRAMA+MA leads to formation of the isolated below-base form of Ma (called Ma-phala in Bengali alphabet books).
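
To make the requested sequence concrete, the following minimal sketch spells out the code points of NBSP+ZWJ+VIRAMA+MA for Bengali. It only builds and labels the string; whether a renderer shows an isolated below-base form is the behaviour being requested in this comment, not something the code demonstrates. The names `ma_phala_request` and `describe` are illustrative only.

```python
import unicodedata

# Code points for the sequence requested in the comment (Bengali script).
NBSP = "\u00A0"    # NO-BREAK SPACE, used in the comment where ISCII uses INV
ZWJ = "\u200D"     # ZERO WIDTH JOINER
VIRAMA = "\u09CD"  # BENGALI SIGN VIRAMA
MA = "\u09AE"      # BENGALI LETTER MA

# Hypothetical name for the requested sequence; rendering is up to the font/shaper.
ma_phala_request = NBSP + ZWJ + VIRAMA + MA


def describe(text):
    """Return one 'U+XXXX NAME' label per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for label in describe(ma_phala_request):
    print(label)
```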

Dr Anirban Mitra

Amateur Typographer and free software enthusiast

Date/Time: Mon Aug 2 13:28:36 CDT 2004
Contact: William Overington

I have been thinking about Public Review 37 for Unicode.

Firstly, I find the proposed additional mechanism very interesting and I support its introduction. I feel that the encoding method used is an imaginative use of the encoding system so as to provide a solution to a problem.

I wonder if the Unicode Technical Committee could consider the following please.

1. Are the following meaningful or erroneous?



I feel that it would be helpful, should the proposed additional mechanism become added into the Unicode Standard, for the meanings or erroneousness of the above to be stated in the documentation.

If these two encodings are presently unused, perhaps it might be interesting to consider whether there are any special situations which could be encoded by using them. It would be helpful for that to be done now as then a note that encountering C1 ZWJ VIRAMA could possibly be the start of a sequence C1 ZWJ VIRAMA ZWJ C2 rather than necessarily the start of a sequence C1 ZWJ VIRAMA C2 could be included in the Unicode Standard. Even if there is no present use for a ZWJ VIRAMA ZWJ sequence it could perhaps be noted as a distinctive sequence which is reserved for possible future use.

2. The Unicode Standard 4.0 Chapter 9 has the following in section 9.1 page 222.

Consonant Conjuncts. The Indic scripts are noted for a large number of consonant conjunct forms that serve as orthographic abbreviations (ligatures) of two or more adjacent letterforms.

I note the phrase "two or more".

In view of this, I wonder if the Unicode Technical Committee could please consider including some notes on what happens with three consonants.

For example, sequences such as C1 VIRAMA ZWJ C2 ZWJ VIRAMA C3 and so on.

It seems that the following sequences could all occur in a file with some having meaning and some perhaps being erroneous.

C1 P C2 Q C3

Each of P and Q could be any of the following.


3. In relation to this Public Review and more generally in Unicode. Now that ZWNJ, ZWJ and CGJ are being used within sequences to express various glyphs such that it is becoming necessary to have two views of a document, namely "source code view" and "finished display view", I wonder if the Unicode Technical Committee would please consider having standard display glyphs for how ZWNJ, ZWJ and CGJ should appear when displayed in source code view, where they would not have zero width. Some fonts do have such symbols, yet they are not standardized. I feel that standardized symbols would be a good basis for future development. For example, there are symbols for ZWNJ, ZWJ and CGJ in the Quest text font which I have published on the web, yet they are not standardized glyphs. I am aware that symbols could be constructed showing the letters ZWNJ and so on, based on the displays in the code charts, yet feel that specific symbols would be better, both generally for all fonts and particularly for fonts which are not based on the Latin alphabet.

The choice of the specific symbols to be used as standardized glyphs for ZWNJ, ZWJ and CGJ in source code view of plain text documents is a different issue from the issue of whether to define standardized glyphs for ZWNJ, ZWJ and CGJ as such.
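
As a rough illustration of what a "source code view" might do, the sketch below replaces ZWNJ, ZWJ and CGJ with visible placeholders. The bracketed labels are purely hypothetical stand-ins chosen for this example; they are not standardized glyphs, and a real implementation would more likely substitute dedicated glyphs at the font level.

```python
# Hypothetical visible placeholders; no standardized glyphs exist for these.
VISIBLE_LABELS = {
    "\u200C": "<ZWNJ>",  # ZERO WIDTH NON-JOINER
    "\u200D": "<ZWJ>",   # ZERO WIDTH JOINER
    "\u034F": "<CGJ>",   # COMBINING GRAPHEME JOINER
}


def source_code_view(text):
    """Return text with the three invisible controls made visible."""
    return "".join(VISIBLE_LABELS.get(ch, ch) for ch in text)


# Bengali KA, VIRAMA, ZWJ, KA shown with the joiner made visible.
print(source_code_view("\u0995\u09CD\u200D\u0995"))
```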

The Quest text font is available for free download from the web if you perhaps would like to look at the symbols which I have used for ZWNJ, ZWJ and CGJ in that font. The download page is as follows.


William Overington

2 August 2004

Date/Time: Tue Aug 3 20:53:44 CDT 2004
Contact: Antoine Leca

[The entirety of Antoine's document is found in L2/04-328 as it is too long to reproduce here. See also URL: http://antoine.leca.free.fr/devanagari/PR37.html ]

Date/Time: Thu Aug 5 12:00:49 CDT 2004
Contact: S. Madhu Sudan Singh
Subject: Clarification of the Use of Zero Width Joiner in Indic Scripts

With reference to the above-mentioned subject, I am to inform you that the proposal for use of the Unicode control characters ZWJ and ZWNJ in Indic scripts has been examined at our end, and the proposal appears to be suitable with respect to Manipuri (Bengali script as used in Manipuri).

S. Madhu Sudan Singh
Director (S&T)
Directorate of Science & Technology
Government of Manipur

38 Draft Unicode Technical Report #30 Character Foldings

Date/Time: Sat Jul 17 01:27:49 CDT 2004
Contact: Jony Rosenne

The new Hebrew Qamats Qatan should be folded to Qamats.


Date/Time: Sat Jul 17 16:43:18 CDT 2004
Contact: Peter Kirk

The draft UTR #30 is good. But the specification of diacritic folding in http://www.unicode.org/reports/tr30/datafiles/DiacriticFolding.txt is totally inappropriate.

The correct behaviour for diacritic folding is to remove diacritics regardless of their context - except perhaps in certain special cases. This implies that the folding specification need list only all Unicode combining marks. There is no need to specify any complex forms because the folding algorithm includes decomposition. The only additional forms which may need to be included are those, e.g. o with stroke, which do not have canonical decompositions but which (arguably) should be included in this folding.

But the effect of the folding specification as given is to remove only diacritics which appear in the specific contexts defined in the file, which are apparently only those which have precomposed forms. The paradoxical result of applying this specification to Hebrew is that dagesh and a few vowel points will be removed (although dagesh may not be removed if separated from the base character by a vowel point in canonical order) but most vowel points and accents will not be removed. But in fact all (apparently) of the mappings listed are redundant if all combining marks are mapped to null.
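
The context-free behaviour argued for above can be sketched as follows: decompose, then drop every combining mark. This is only an illustration under stated assumptions, not the UTR #30 algorithm: it uses canonical decomposition (NFD) and approximates "all Unicode combining marks" by the general categories Mn, Mc and Me. The function name `fold_diacritics` is made up for this example.

```python
import unicodedata


def fold_diacritics(text):
    """Decompose canonically, then remove all characters with a Mark category."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


# Latin: precomposed e-acute folds to plain e.
print(fold_diacritics("\u00E9"))
# Hebrew: BET + DAGESH + SHEVA folds to bare BET, whatever the mark order.
print(fold_diacritics("\u05D1\u05BC\u05B0") == "\u05D1")
# O WITH STROKE has no canonical decomposition, so it is left untouched here.
print(fold_diacritics("\u00D8") == "\u00D8")
```

The last case is the kind of character that would still need an explicit entry in the folding data, as the comment notes.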

Date/Time: Mon Jul 19 10:43:36 CDT 2004
Contact: Kent Spielmann (SIL)

Please remove "unintended" in section 2.2 P3 where it reads: "unintended false positive" and "unintended false negatives". Can't we assume a false positive or negative is unintended?

Date/Time: Tue Jul 27 13:35:24 CDT 2004
Contact: Mark Davis

The DiacriticFolding would be better expressed as an AncillaryDecompositions.txt, removing all the mappings that are merely duplicates of the canonical decompositions, and retaining only mappings like:

00D8; 004F 0335 # Ø LATIN CAPITAL LETTER O WITH STROKE

This reduces duplication of data, and retains the data in a form that is most useful. If someone wants to strip the combining marks, that is easy to do afterwards; but you can't add them if they are not in the data in the first place. The extra data is particularly useful for detecting spoofing: see my document on Unicode Security Considerations.
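
A minimal sketch of how such ancillary data might be consumed, assuming the `source; target # comment` line format of the sample mapping in this comment. The table holds only that one mapping, and all function names are hypothetical. Stripping marks is kept as a separate, optional step, which is the point being made: the marks can be removed afterwards, but not restored if the data never had them.

```python
import unicodedata


def parse_mapping_line(line):
    """Parse 'XXXX; YYYY ZZZZ # comment' into (source_char, target_string)."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    source, target = data.split(";")
    target_text = "".join(chr(int(cp, 16)) for cp in target.split())
    return chr(int(source.strip(), 16)), target_text


def decompose_with_ancillary(text, ancillary):
    """Apply the ancillary mappings, then canonical decomposition (NFD)."""
    expanded = "".join(ancillary.get(ch, ch) for ch in text)
    return unicodedata.normalize("NFD", expanded)


def strip_marks(text):
    """Optional later step: drop characters with a Mark general category."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("M")
    )


# Only the single mapping quoted in the comment; a real file would have more.
ANCILLARY = dict(
    [parse_mapping_line("00D8; 004F 0335 # LATIN CAPITAL LETTER O WITH STROKE")]
)

decomposed = decompose_with_ancillary("\u00D8", ANCILLARY)
print([f"U+{ord(ch):04X}" for ch in decomposed])
print(strip_marks(decomposed))
```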

Date/Time: 28 July 2004
Contact: Kent Spielmann

The text of Kent's document, submitted as PDF, is in L2/04-312.


39 Draft Unicode Technical Standard #31 Identifier and Pattern Syntax

No feedback received this period.


Other Hebrew Issues

Date: Fri, 16 Jul 2004 13:50:11 -0400
Contact: John Cowan
Subject: Hebrew puncta extraordinaria (submission for August 2004 meeting)

The annotation "punctum extraordinarium" should be added to 05C4, for the elimination of doubt, and in parallel to the annotation of the new 05C5. In addition, it would be better to use diamond-shaped glyphs rather than round ones; although this is not the universally used form for the puncta extraordinaria, it is the most distinctive form.

See http://qaya.org/academic/hebrew/Issues-Hebrew-Unicode_html_m46d6520f.png for a picture of the relevant word from Ps. 27:13 in Biblia Hebraica Stuttgartensia, the most widely used scholarly edition of the Hebrew Bible. There is an upper and lower dot on each base letter.

Date/Time: Wed Aug 4 10:16:21 CDT 2004
Contact: Peter Kirk
Subject: Hebrew Holam: a precedent from Khmer

In the proposal L2/04-307 of which I am a co-author, I wrote the following:

There is a precedent for such a sequence in the sequence defined for Bengali Reph and Ya-phalaa in TUS version 4.0.1.

I note that there were additional precedents already in TUS 4.0.0, in the Khmer sequences and, in a muul form, , which are defined on pp.282-283. These are closer parallels to than the Bengali example, because the combining characters must be rendered as separate marks properly positioned relative to the base characters, although separated from them in the character stream by ZWNJ.

From: Bearpecs@aol.com
Date: 2004-08-04 22:35:43 -0700
Subject: Re: [b-hebrew] Unicode Holam proposals  

To the Unicode Consortium:

The difference in Hebrew script between the consonant waw in combination with the vowel holem and the similar-looking vowel holem-male is a distinction often ignored, but nonetheless real. As a frequent user of Hebrew word processing, I have been delighted by the growing use of Unicode fonts, which simplify working in different software applications and encourage a vibrant Unicode standard for Hebrew. I am writing to endorse the proposal for distinguishing between Holam Male (full Holem) and Vav Haluma (consonantal Waw with defective Holem) in Unicode Hebrew texts (http://www.qaya.org/academic/hebrew/Holam3.pdf, or ...Holam3.html) submitted by Peter Kirk. This proposal is based on using the ZERO WIDTH NON-JOINER character to distinguish between Holam Male and Vav Haluma. This proposal is far superior to a proposal for a new character for use as Holam Haser (defective Holem) only when used with Vav, and the latter proposal should be rejected.

Hayyim Obadyah

From: Pere Casanellas
Date: 2004-08-05 07:07:49 -0700
Subject: UTC meeting 10th August: Hebrew Holam proposals. & Upper vocalization.

* Michael Everson and Mark Shoulson. Proposal to add HEBREW POINT HOLAM HASER FOR VAV to the BMP of the UCS. ISO/IEC/JTC1/SC2/WG2-N2840 L2/04-310
* Peter Kirk et al. New proposal on the Hebrew vowel HOLAM L2/04-307 (UTC meeting 10th August)

Dear Sirs,

I have read the two above proposals on the Hebrew vowel 'holam'. I want to state my preference for the proposal of Peter Kirk et al. The reasons are clearly explained in the last paragraphs of their proposal. As for the proposal of Michael Everson and Mark Shoulson, I especially dislike having two different characters for 'holam haser': one character when the 'holam haser' is used with the consonant vav and another when it is used with other consonants; it seems to me that this should be avoided.

Once this issue is resolved, Hebrew with Tiberian vocalization may be fully representable in Unicode. I therefore hope that in the coming months proposals will be submitted and approved for the upper vocalization of Hebrew (and Aramaic), that is to say, the Babylonian and Palestinian vocalizations. In the coming years the International Organization for Targumic Studies will begin an important project: to create a database recording the texts of all the relevant manuscripts of the Targums (the old versions of the Hebrew Bible in Aramaic), and the best of these manuscripts were written with upper (Babylonian) vocalization. It would be very useful to have a proposal to encode upper vocalization in Unicode when this project begins.

Best wishes,

Pere Casanellas
Treasurer of the Societat Catalana d'Estudis Hebraics
Member of the Associació Bíblica de Catalunya
Codirector of the Corpus Biblicum Catalanicum
Member of the International Organization for Targumic Studies
E-mail: pere.casanellas@btlink.net
Phone & fax: +34-934 179 000
Carrer Anna Piferrer, 11
E-08023 Barcelona