Comments on Public Review Issues

The sections below contain comments received on the open Public Review Issues as of June 7, 2004, since the previous cumulative document was issued prior to UTC #98 (Feb 5, 2004). Also included are other commentaries received via the public Reporting Form during this period, especially on Phoenician.

13 Unicode 4.0.1 Beta (already closed)

Date/Time: Tue Feb 10 04:31:01 EST 2004
Contact: Francis Boxho


This message concerns Public Review Issue Nr 13, Unicode 4.0.1 Beta. It was supposed to be closed on 2004.01.27, so this message is late. But I discovered your site only some weeks ago, and the concerned file only yesterday. The concerned file is PropertyAliases.txt (or, to be more precise, PropertyAliases-4.0.1d3b.txt). You say in beta.html that "The property aliases have also been rearranged into somewhat more meaningful categories." I do not want to discuss their meaningfulness, but the fact that they mix and confuse several concepts.

"Numeric", "String" and "Binary" is a classification based on the type of data of the values of these properties. "Miscellaneous" and "Catalog" describe the type of the properties, not the type of data of their values. "Enumerated" simply indicates that the list of possible values is limited and closed.

In fact:

- all enumerated properties are also either numeric (ccc) or string (all the others);
- binary properties are also enumerated, with the list of values limited to two members (Yes/No or True/False);
- catalog and miscellaneous properties are string properties, but Unicode_Radical_Stroke could also be considered as the combination of two numeric properties.
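
As a rough illustration of the overlap described in the list above, the following sketch queries three properties through Python's standard `unicodedata` module. The choice of sample characters and the variable names are assumptions made for this example only; they do not come from PropertyAliases.txt.

```python
import unicodedata

# Enumerated property with numeric values: Canonical_Combining_Class (ccc).
ccc_acute = unicodedata.combining("\u0301")   # COMBINING ACUTE ACCENT

# Enumerated property with string values: General_Category (gc).
gc_capital_a = unicodedata.category("A")

# Binary property: Bidi_Mirrored, an enumeration with exactly two values.
mirrored_paren = bool(unicodedata.mirrored("("))
mirrored_a = bool(unicodedata.mirrored("A"))

print(ccc_acute, gc_capital_a, mirrored_paren, mirrored_a)
```

The point of the sketch is only that "enumerated" and "binary" describe the set of possible values, while "numeric" and "string" describe their data type, so a single property can fall under more than one label.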

So, I think that the new set of categories is more confused than meaningful. You have already a better set of categories at your disposal, i.e. the convenient grouping of categories according to their usage, in UCD.html. Any kind of group or category can be used, even with two or more levels, but it should be conceptually accurate, and should not mix different concepts at the same level.

With my best regards.
Francis Boxho

P.S. Sorry for the typos and mistakes, but my mother tongue is French, not English.

20 Draft UTR #31 Identifier and Pattern Syntax

Date/Time: Tue Jun 1 01:46:53 CDT 2004
Contact: Martin Duerst

This is a comment on behalf of the W3C I18N WG.

The section on programming language identifiers should not only be moved to this report (from UAX 15), but should also be changed, so as not to give the (unrealistic) expectation that tools such as compilers should normalize identifiers, or eliminate formatting characters.
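
To make concrete what "normalize identifiers" and "eliminate formatting characters" would ask of a tool, here is a minimal sketch. The helper name `normalize_identifier` is hypothetical, and the choice of NFC and of General_Category Cf as the definition of "formatting character" are assumptions for illustration, not something the draft report is known to prescribe.

```python
import unicodedata

def normalize_identifier(name: str) -> str:
    """Hypothetical helper: NFC-normalize, then drop format (Cf) characters."""
    composed = unicodedata.normalize("NFC", name)
    return "".join(ch for ch in composed if unicodedata.category(ch) != "Cf")

# 'e' + COMBINING ACUTE ACCENT and precomposed 'é' differ as raw strings...
decomposed = "cafe\u0301"
precomposed = "caf\u00e9"
print(decomposed == precomposed)                        # raw comparison
print(normalize_identifier(decomposed) == precomposed)  # after normalization

# ...and a ZERO WIDTH JOINER inside a name is invisible but significant.
print(normalize_identifier("ab\u200dc"))
```

Every tool in a chain (editor, compiler, linker, debugger) would have to apply the same steps for identifiers to match consistently, which is one way to read the concern raised above.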

25 Proposed Update UTR #17 Character Encoding Model

Date/Time: Tue Jun 1 01:52:08 CDT 2004
Contact: Martin Duerst

Opt Subject: Issue 25: Proposed Update UTR #17 Character Encoding Model

This is a comment on behalf of the W3C I18N WG.

We have unfortunately not yet had the time for a full review of this document, but we plan to do such a full review. Here are the comments we already have: It would be very helpful if terminology were more aligned with that of other standards bodies. In particular, the term "Character Map" seems to need reconsideration, because nobody actually uses it. The reference to the W3C Character Model should be updated. You may be interested in looking at the Last Call comments that we got on the W3C Character Model, because some of them may apply to your document, too.


27 Joiner/Nonjoiner in Combining Character Sequences

Date/Time: Thu Feb 12 19:27:19 EST 2004
Contact: Peter Kirk

I am disappointed that Public Review Issue #27 has been only partially resolved, in that "The interpretation of joiner/nonjoiner between two combining marks is not yet defined." I strongly supported the original proposal, according to which ZWJ or ZWNJ between two combining marks would affect the rendering of those two marks. I did not formally express this support because I understood the public review issue as concerned only with the choice between options A and B (on which I had no particular opinion), and that the main principle was not being reviewed.

There are specific cases where ligatures may be made between combining marks associated with the same base character, analogous to ligatures between base characters, and there is a need for a mechanism to control ligation. One such example is that (in some typesetting traditions) the Hebrew mark meteg generally combines with certain Hebrew vowel marks, but is also sometimes written separately. Another possible example is with IPA contour tones written above the character; to avoid a possible proliferation of tone contours it might be sensible to define these contours as ligatures of acute, grave and macron.

I would like to encourage the UTC to reconsider the issue which was left "not yet defined" and to accept the principle that ZWJ and ZWNJ may be used to control ligation between combining marks in specific defined instances (and should be ignored when used between other combining marks). I intend to present to the UTC a proposal for at least one such specific instance.
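
For readers unfamiliar with what a joiner between two combining marks looks like in encoded text, the sketch below builds such a sequence for the meteg example. It says nothing about rendering or ligation, which is the actual subject of the request; it only shows the code point sequence and one side effect under normalization. The particular base letter and vowel are assumptions chosen for illustration.

```python
import unicodedata

ALEF, PATAH, METEG, ZWJ = "\u05D0", "\u05B7", "\u05BD", "\u200D"

plain = ALEF + METEG + PATAH
joined = ALEF + METEG + ZWJ + PATAH

# Canonical ordering sorts adjacent marks by combining class, so the
# meteg-before-patah order is not preserved by normalization...
plain_nfd = unicodedata.normalize("NFD", plain)

# ...whereas ZWJ has combining class 0 and keeps the two marks apart.
joined_nfd = unicodedata.normalize("NFD", joined)

print([f"U+{ord(ch):04X}" for ch in plain_nfd])
print([f"U+{ord(ch):04X}" for ch in joined_nfd])
```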

Peter Kirk

29 Normalization Issue

Date/Time: Tue Jun 1 01:54:52 CDT 2004
Contact: duerst@w3.org

This is a comment on behalf of the W3C I18N WG.

We fully support this. It's a pity there was a bug in the text, but fixing it is much better than leaving things inconsistent. The exact effect on various versions of the Unicode standard should be clarified, in a way that does not adversely affect third-party specs that are referring to previous versions of Unicode automatically.


30 Bengali Khanda Ta

Date/Time: Mon Feb 23 22:20:40 EST 2004
Contact: Ernest Cline

One consideration that the review document failed to address was the ease of converting Bengali data to Unicode from ISCII and vice versa. This clearly must be considered a con for both Models B and D.

In the case of Model B, it introduces a fourth hasant variant beyond the normal, explicit, and soft that is present in ISCII. Unless ISCII were prepared to introduce a fourth hasant form (perhaps with hasant, hasant, nukta) this would introduce a complication in conversion to ISCII.

In the case of Model D, the introduction of an extra character also presents difficulties in representing in ISCII as there is no khanda-ta character in ISCII. Even worse, unless the existing conventions of how to convert data from ISCII to Unicode are changed for this one exception, data converted to Unicode will not use the Khanda-Ta at all.

On the other hand both Models A and C have the advantage of ease of conversion to and from ISCII, as both use the three usual hasant variants.
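
The three usual hasant variants mentioned above can be written out as Unicode sequences. This is a sketch under the assumption that the explicit and soft forms correspond to hasant followed by ZWNJ and ZWJ respectively; the dictionary and helper names are invented for the example, and no ISCII byte values are shown.

```python
import unicodedata

TA, HASANT = "\u09A4", "\u09CD"   # BENGALI LETTER TA, BENGALI SIGN VIRAMA
ZWNJ, ZWJ = "\u200C", "\u200D"

# Assumed correspondence between the three hasant variants and Unicode text.
HASANT_VARIANTS = {
    "normal": TA + HASANT,
    "explicit": TA + HASANT + ZWNJ,
    "soft": TA + HASANT + ZWJ,
}

def code_points(text: str) -> str:
    """Format a string as a space-separated list of U+XXXX values."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

for label, sequence in HASANT_VARIANTS.items():
    print(f"{label:9}{code_points(sequence)}")
```

A model that stays within these three sequences needs no new conversion rule, which is the advantage claimed here for Models A and C.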

The issues raised in the review document were enough to convince me that models C and D were not desirable, but were not enough to cause me to favor either model A or B over the other. This additional point that I am raising is sufficient to cause me to favor Model A.

From: Manoj Jain
Date: 2004-05-19 23:33:00 -0700

Dear All,

Most of the Bengali experts recommend that "Bengali Khand Ta" should be encoded separately in the Unicode Standard.


Manoj Jain

Government of India
Ministry of Communications & IT
Department of Information Technology
New Delhi 110003
Phone +91-11-24301240 Fax +91-11-24363076

----- Forwarded by Manoj Jain/doe on 20-05-2004 09:57 -----

From: Bidyut Baran Chaudhuri <bbc@isical.ac.in>
To: mjain@mit.gov.in
Subject: Bengali Khanda-ta
Date: 19-05-2004 19:14

Dear Mr Jain

This is about your enquiry on Bengali Khanda-ta coding in UNICODE format. Myself and my colleagues here think that Khanda-ta should be encoded as a separate character. This is because it will help in both scientific (eg NLP, Computational linguistics) and commercial applications. The typist will find it convenient to type it in a single keystroke. Moreover, by alphabet convention of Bengali script, it is treated as a separate character.

You may kindly forward our views to the appropriate authority. Regards and best wishes.

B B Chaudhuri
Prof. B. B. Chaudhuri FIAPR, FIEEE
Jawaharlal Nehru fellow
Computer Vision and Pattern Recognition Unit
Indian Statistical Institute
203, B. T. Road
Kolkata - 700 108
E-mail: bbc@isical.ac.in
FAX: (91) (33) 2577 3035
Phone: Office:- (91) (33) 2578 1832
(91) (33) 2575 2852

Date/Time: Tue Jun 1 01:56:52 CDT 2004
Contact: Martin Duerst

This is a comment on behalf of the W3C I18N WG.

We do not think that we need to comment on this issue.

Date/Time: Thu Jun 10 21:04:19 CDT 2004
Contact: Stefan Baums

(formatted by editor in Arial Unicode MS due to need for various Indic letters & transliterations)

I agree with the conclusion of Peter Constable’s PRI paper “Encoding of Bengali Khanda Ta in Unicode”. The current encoding model should be kept, but the descriptive wording of the Standard improved (model A). Model A correctly captures the fact that khaṇḍa ta is equivalent to an overt‐virāma form, both in historical origin and in modern clustering behaviour.

Khaṇḍa Ta should not be represented with obligatory ZWJ (model C) because the choice between khaṇḍa ta glyph representation and conjunct glyph representation depends on the capabilities of the font. Khaṇḍa Ta is a matter of glyph presentation, not of encoding. In addition, this would result in very heavy use of ZWJ in the encoding of regular Bengali text, and it is my impression that the ZWJ character was meant for requesting special behaviour in exceptional situations, not for constant use in the regular course of character coding.

Khaṇḍa ta should also not be represented as a separate character, for the reasons given above: it is the equivalent of an overt‐virāma form. Based on what native speakers have said on the Indic mailing list, it would seem to be the case that khaṇḍa ta is taught as a separate letter in Bengali primary schools. While this could be used as one argument among others in building a case for encoding as separate character, it is not on its own decisive. We know that in ancient India and right up into the period of the modern scripts, the consonant clusters क्ष (kṣa) and ज्ञ (jña) were regarded as separate letters, but nobody has ever suggested encoding them separately in Unicode. The intuitions of native users can be misleading.
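
The point about kṣa and jña can be checked directly: each cluster is stored as consonant, virāma, consonant rather than as a character of its own. The sketch below lists the code points using Python's `unicodedata`; the helper name `describe` is an assumption for this example.

```python
import unicodedata

def describe(cluster: str) -> list[str]:
    """Return 'U+XXXX NAME' for each code point in the cluster."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in cluster]

KSSA = "\u0915\u094D\u0937"   # kṣa
JNYA = "\u091C\u094D\u091E"   # jña

for cluster in (KSSA, JNYA):
    print(cluster, describe(cluster))
```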

Another important argument against encoding khaṇḍa ta as a separate character is overall consistency in the Unicode representation of Indian writing systems. If khaṇḍa ta were encoded as a separate character, then it would be the only consonant character in the Bengali script that does not have a short a inherent, and indeed the only such consonant character in any Brāhmī‐derived script. Breaking with such a fundamental property of the Indic encoding model should not be done lightly, even if there were good overriding arguments in the khaṇḍa ta case, which in my opinion there are not.

Concerning the Background section of the PRI document, I have two suggestions:

1. There is a rather vague reference to Chatterji 1926 that claims that according to him “ta‐hasanta was preferred for indigenous Bengali words ... in contexts in which conjunct forms would occur for loans from Sanskrit, Persian or other languages.” That may well be so, but I was not able to locate this claim in Chatterji’s orthography chapter. One should add a page reference to this sentence of the PRI paper or any derivative thereof.

2. The sentence at the bottom of page 1 (“... khanda ta is not used in older texts, and would not normally be expected in Sanskrit‐language documents”) is not only wrong, but directly contradicts the sentence quoted from Chatterji 1926 as well as the immediately preceding paragraph. It is my understanding that khaṇḍa ta is used in Sanskrit texts wherever a conjunct was not available, i.e., the usage sphere of khaṇḍa ta would be the same as that of Nagari’s half‐form and overt‐virāma form of ta taken together. As a matter of fact, I would expect khaṇḍa ta to occur rather more frequently in Sanskrit words than in real (tadbhava) Bengali words. This is because at the Middle‐Indo‐Aryan stage of language development, wide‐ranging consonant cluster assimilation removed all instances of t + another consonant in favour of homorganic clusters.

Best regards,
Stefan Baums
Asian Languages and Literature
University of Washington

31 Cantonese Romanization

Date/Time: Thu Mar 25 12:57:34 EST 2004
Contact: hardie@oakthorn.com

I write to comment on Issue 31, Cantonese Romanization (2004.06.08).

I do not believe that adopting the jyutping romanization is in the interests of the largest number of users of the unihan data. The Yale romanization is widely used in the teaching of Cantonese, and shifting to a different romanization for the unihan data set will make it difficult for both teachers and students. It would be much clearer for the unihan data set to remain as close to the Yale romanization as possible.

You note that Cantonese linguists prefer jyutping; while this may be true, it is also a young romanization, without the decades of use that the Yale romanization has seen. Today's preference may fade as the warts of the new system begin to show; it may simply be because its elements are not as well known. Jyutping does have some interesting aspects and as a specialist tool for linguists it may be a good choice, but the romanization chosen for unihan should be striving for wide utility, not for interest. That wide utility is clearly in the Yale romanization.

best regards,
Ted Hardie

Date/Time: Fri Mar 26 04:18:48 EST 2004
Contact: John Clews

Dear Rick

You wrote via the JTC1/SC2 list:

> The Unicode Technical Committee has posted a new issue for public
> review and comment. Details are on the following web page:
> http://www.unicode.org/review/
> ... Briefly ... we plan to adopt a single, standard Cantonese
> romanization for use throughout the Unihan database.

I strongly agree with the recommendation of the Unicode Consortium that it would be better to adopt the new jyutping romanization developed by the Linguistic Society of Hong Kong <http://cpct92.cityu.edu.hk/lshk/>.

This would result in much more consistency at the expense of very few changes. I have also recorded these comments via http://www.unicode.org/reporting.html

John Clews,
Former Chair of ISO/TC46/SC2 (Conversion of Written Languages)
which deals with transliteration and transcription issues in ISO.

Date/Time: Mon Mar 29 06:23:08 EST 2004
Contact: Kent Karlsson

re review issue 31 (Cantonese romanisation) I know nothing about the issue... But the line matching aap to aan does look odd. I guess it's a typo.

Date/Time: Tue Jun 1 01:57:57 CDT 2004
Contact: Martin Duerst

This is a comment on behalf of the W3C I18N WG.

This seems like a good idea. It doesn't concern us.

Date/Time: Sun Jun 6 06:22:45 CDT 2004
Contact: Adam Sheik

Hi, I have read the public review page at: http://www.unicode.org/review/pr-31.html

I just wanted to add my support for adopting Jyutping instead of Yale. I run what I believe is the most popular Cantonese learning website on the Internet (www.cantonese.sheik.co.uk, ranked #1 in Google for "cantonese" and "cantonese learning") and I have had a lot of feedback from all over the world regarding which romanisation scheme to use. It seems Yale is slightly easier for English speakers to use, notably from the UK and USA, but jyutping is easier for most Europeans. In my experience, Jyutping generally only takes English speakers about 10-15 minutes to learn. Jyutping also offers a few more tangible advantages, as it distinguishes between certain sounds where Yale does not.

Finally, it is far easier to use tone numbers instead of diacritics on the Internet, so whether you choose Jyutping or Yale, please consider using tone numbers.

By the way, you may like to know that my site has been fully converted to unicode this weekend. I had a few difficulties but everything is now working well. You can read about it here:

http://www.cantonese.sheik.co.uk/phorum/read.php?f=1&i=11901&t=11901
http://www.cantonese.sheik.co.uk/phorum/read.php?f=1&i=11992&t=11992
http://www.cantonese.sheik.co.uk/phorum/read.php?f=2&i=1013&t=1013 (technical details)

Best Regards,
Adam Sheik - webmaster for www.cantonese.sheik.co.uk

Date/Time: Mon Jun 7 16:19:21 CDT 2004
Contact: Helmut Lalla

1.) The decision to adopt a standard romanisation for Cantonese is a good decision.

2.) Jyutping and Yale are both good romanisation systems, in fact the two best available, and so whichever of the two you choose, it will be a good decision.

3.) If you are looking for the most widespread romanisation, Yale is the choice. Especially concerning text books, Yale is widely used and Jyutping is not. Jyutping has gained some popularity on the Internet, though.

4.) If you are looking simply for the best romanisation, Jyutping is the choice. Advantages are:
* Yale has some inconsistencies, final -a instead of -aa, and dropping of initial y- if followed by the vowel -yu-.
* Jyutping distinguishes the vowels -eo- and -oe- which are very different.
* Yale's choice of -eu- for these vowels is unfortunate, because it prevents this letter combination from being used for the rare diphthong that results from speaking -e- followed by -u. While Jyutping makes the obvious choice to write -eu for this diphthong, there is no standard Yale way of writing it. I have seen -el and -ehu, which are both not at all straightforward.
* Rare syllables are generally defined only in the Jyutping standard.
* Jyutping's choices for representing Cantonese sounds follow international standards. Yale retains some specific English usages, especially j- and ch-.
* Standard Yale is defined as using the letter -h- and diacritics for indicating tones. Jyutping uses numbers which is more computer friendly.
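
The "computer friendly" point in the last item can be sketched as follows: a tone-numbered syllable is plain ASCII and splits into letters and tone with a one-line pattern. The function name `split_tones` and the sample input are assumptions for this example; the pattern only checks the shape "letters followed by a digit 1-6", not whether the syllable is valid Jyutping.

```python
import re

# Lowercase letters followed by a single tone digit from 1 to 6.
SYLLABLE = re.compile(r"([a-z]+)([1-6])")

def split_tones(text: str) -> list[tuple[str, int]]:
    """Return (letters, tone) pairs for space-separated syllables."""
    pairs = []
    for token in text.split():
        match = SYLLABLE.fullmatch(token)
        if match is None:
            raise ValueError(f"not a tone-numbered syllable: {token!r}")
        pairs.append((match.group(1), int(match.group(2))))
    return pairs

print(split_tones("jyut6 ping3"))
```

A diacritic-based notation would need Unicode-aware handling (combining marks or precomposed letters) to recover the same information.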


32 Proposed Update UTR #23 Character Property Model

Date/Time: Tue Jun 1 02:03:22 CDT 2004
Contact: Martin Duerst

This is a comment on behalf of the W3C I18N WG.

We did not yet have time for a full review, but we plan to do a full review soon. The definition of String as a sequence of code units (rather than characters) is rather strange. Either the report should change the definition, or use a different, more precise, term. Also, PD10 contains an extra comma after "that".
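
The difference between counting code units and counting characters can be seen in a short sketch. The sample string is an assumption chosen to include one character outside the Basic Multilingual Plane; Python's `str` happens to count code points, so the code-unit counts are derived from encoded forms.

```python
text = "a\u00e9\U00010400"   # 'a', 'é', and one supplementary-plane character

code_points = len(text)                            # Python str counts code points
utf16_units = len(text.encode("utf-16-le")) // 2   # 16-bit code units
utf8_units = len(text.encode("utf-8"))             # 8-bit code units

print(code_points, utf16_units, utf8_units)
```

Three different lengths for the same text are what makes an unqualified term "String" ambiguous.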


33 UTF Conversion Code Update

Date/Time: Tue Jun 1 02:05:14 CDT 2004
Contact: Martin Duerst

This is a comment on behalf of the W3C I18N WG.

We did not check this.


34 Draft UTS #35 Locale Data Markup Language

Date/Time: Tue Jun 1 02:18:37 CDT 2004
Contact: Martin Duerst

This is a comment on behalf of the W3C I18N WG.

We plan to review this report in time for the next review cycle. We are somewhat concerned about the dependency of the locale data hierarchy on the implicit hierarchy in the identifier.


35 Encoding of LATIN SMALL LETTER C WITH STROKE as a phonetic symbol

Date/Time: Thu Apr 29 04:18:20 EDT 2004
Contact: D. Starner

In addition to a LATIN SMALL LETTER C WITH STROKE, there's also a LATIN CAPITAL LETTER C WITH STROKE, found, for example, in the Bureau of American Ethnology reports. So the cent sign and the letter must be deunified so proper casing can be done.
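
The casing problem can be illustrated with Python's built-in string methods: U+00A2 CENT SIGN is a currency symbol with no case mappings, so it passes through case conversion unchanged. This is only a sketch of the existing behaviour of CENT SIGN, not of any proposed character.

```python
import unicodedata

CENT = "\u00A2"

category = unicodedata.category(CENT)   # General_Category of CENT SIGN
upper = CENT.upper()                    # no uppercase mapping exists
lower = CENT.lower()                    # no lowercase mapping exists

print(category, upper == CENT, lower == CENT, CENT.isalpha())
```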

Date/Time: Thu Apr 29 13:26:17 EDT 2004
Contact: John Cowan

The proposed C WITH STROKE should be encoded separately, despite the similar appearance to CENT SIGN. The use of a cent sign in place of a c-with-stroke is a simple font approximation, analogous to 7 for TIRONIAN SIGN ET, ? for GLOTTAL STOP, and my own use of CAPITAL LETTER OPEN E with COMBINING LONG SOLIDUS OVERLAY for handwritten AMPERSAND. Abusus non tollit usum.

Date/Time: Thu Apr 29 19:55:32 EDT 2004
Contact: Philippe Verdy

The document forgets to consider other possible legacy encodings of this character, notably if it is already used with case mappings.

The table of possible legacy encodings should include the possibility that it is already encoded with Unicode using Latin letter c or C, followed by a combining solidus overlay, which would not have the problem of the CENT sign.

However, as it is not clear whether the legacy encodings may have chosen a combining slanted solidus overlay or a combining vertical bar overlay, due to presentation forms for the same character, the proposal to encode the character separately may be useful to allow freedom in its presentation, without depending too much on the slanted or vertical presentation of the combining overlays.

The document exposes the case that phonetic characters used to write languages without an accepted orthography will sooner or later evolve to normal uses of the character as a plain Latin letter, including an uppercase version. Casing is a standard feature of the Latin script and is used very often as a matter of style for the presentation of book and chapter titles, or as an emphasis style (including the smallcaps style), or sometimes required for the presentation of some documents (notably for postal addresses on envelopes, and administrative forms).

So encoding the proposed character with gc=Ll will make it suitable for later additional support of an uppercase version. Still, not proposing the uppercase version of the character will not make it a true Latin letter for languages with an accepted orthography, as it would cause problems if one wants to use it properly for toponyms, trademarks, people names, etc., where uppercase would be needed. If this creates a problem immediately, people will start by rejecting the proposed Unicode character as it will complicate the case mappings (the lowercase would be encoded separately, but the uppercase would have to be emulated with C + a combining solidus overlay, or even worse with just a capital C).

For semantic preservation with case folding operations, it then seems reasonable to include both the lowercase and uppercase versions (and add a note so that the uppercase version will not even be unified with the CEDI currency sign also proposed recently).

The same remark would apply for the African R-barred and U-barred, or W-barred, which are used in Niger, Cameroun, and Congo (Kinshasa, former Zaire): some of them only exist in lowercase version and the lack of an uppercase version or their absence is already a problem. (Note the resemblance of W-barred with the Won currency sign... another hack that has been used to approximate the missing character, simply because there's no other workable solution to print this character)

Date/Time: Wed May 26 14:55:21 CDT 2004
Contact: John Koontz

I do not know if I understand all of the principles governing Unicode encoding well enough to offer an appropriate argument on the slashed c encoding issue (http://www.unicode.org/review/pr-35.pdf) either way. However, I can add that the US Bureau of American Ethnology, a predecessor to the US National Anthropological Archives, used slashed c to represent the edh character, and that this usage is embedded in BAE orthography in, e.g., the BAE and Contributions to North American Ethnography series, in the work of James O. Dorsey on the Siouan languages. For example, the Dhegiha (or Omaha-Ponca) language is referred to as C/egiha. Capitalized and lower case versions are used. I can provide more precise citations of examples if this is desired. BAE orthography is a dead issue at present, but an interesting and useful body of Americanist literature on American languages is encoded in it. Slashed c is not the only character employed there, though I think that most can be represented in Unicode with floating diacritics. Maybe not "turned" or "inverted" letters, e.g., ptksc, cent-sign, c-cedilla, h and perhaps a few others.

Date/Time: Wed May 26 15:58:27 CDT 2004
Contact: Julian Bradfield

I am a computer scientist with an interest in character encoding issues, and I also maintain a strong interest in phonetics and phonology, and will be working in the area shortly.

I wish to support the decision to make latin small letter c with stroke a separate character from cent sign.

The arguments in favour are valid; and moreover, the two characters are conceptually quite different. As a non-American, it would not even have occurred to me that slashed-c might be the same as the cent sign, although I am of course used to seeing slashed-c as one variant of the cent sign.

The legacy encoding argument seems weak to me. The use of this character is limited, as far as I know, and while there is almost certainly some data that codes c-slash as cent in some legacy encoding, I find it hard to believe that the amount of such data is sufficient to outweigh the future inconvenience caused by unifying c-slash and cent.

Julian Bradfield, School of Informatics, University of Edinburgh.

Date/Time: Wed May 26 21:53:12 CDT 2004
Contact: Albert Bickford (SIL)

The arguments for unification of LATIN SMALL LETTER C WITH STROKE with CENT SIGN are mostly concerned with data conversion issues. The users affected are primarily going to be a relatively small group of specialists (linguists using the symbol for phonetic transcription according to the Americanist tradition) who are familiar with the need for data conversion, and will have access to means of converting their data. Data conversion will not be a big problem for most of them, and so the arguments that there is a need to maintain compatibility with past encodings are not very strong.

As one of those users, I would prefer having a separate character which reliably had the correct glyph and character properties. I would not want to be forever hamstrung in use of this character by an attempt to maintain compatibility with what would clearly be regarded as our past makeshift representation in legacy encodings (even if those encodings were standard ones).

So, I would argue against unification, and for encoding this as a separate character. Let's do it right and discard the past in this case.

Date/Time: Thu May 27 12:52:06 CDT 2004
Contact: James L. Fidelholtz

Peter Constable (http://www.unicode.org/review/pr-35.pdf) gives arguments for and against using this symbol as distinct from the 'cent sign' and other possibilities already incorporated within Unicode. In this case, I consider the argument in favor of the existence of capital letter variants to be crucial for the adoption of the proposal.

More generally, I find it somewhat disturbing that the issue even arises, since in my understanding of Unicode as a sort of universal encoding for *all* letters and symbols for *all* writing systems, it seems to me that it should be generally inclusive, excluding symbols *only* if there are VERY strong arguments against them (which I cannot conceive for any case, but am prepared to admit could possibly exist). If the intention is truly to have a *universal* and *standard* coding system, which I strongly support, then it *must* be *inclusive*.

If the 'same' symbol is encoded in different sets in different ways, this can only make it easier to use in different ways for different people. There is no reason to arbitrarily exclude symbols, as far as I can see, for, really, *any* reason, and much less if there are even moderately strong arguments in their favor, as there are in the present case.

James L. Fidelholtz

Date/Time: Fri May 28 09:05:16 CDT 2004
Contact: Rory Larson

I work with the Omaha-Ponka language. A great deal of OP material was recorded in the 19th century by the missionary James Owen Dorsey. Dorsey used the c with slash character for a special phoneme in OP which I call "ledh". This is a non-continuous sound which starts as [l] with the tongue curled up to the alveolar ridge. The point of articulation then slides down the back of the upper front teeth and drops off the bottom as an edh sound. Thus, it is somewhere between an [l], an edh, and an apically rolled [r]. The modern written form of the language uses the "th" digraph, but it would be nice to have a single character to represent this. In that case, a capital form as well as a lower case form would be needed. I don't know if there is any other recognized phonetic symbol for ledh; perhaps the Japanese [l/r] sound is similar? In any case, a conversion from the Dorsey corpus would probably require manual retyping anyway, so legacy issues should not be a problem.


Date/Time: Mon May 31 14:58:47 CDT 2004
Contact: Doug Ewell

I support the separate encoding of LATIN SMALL LETTER C WITH STROKE (as well as its uppercase counterpart) and its implicit disunification from U+00A2 CENT SIGN.

The question of encoding this letter seems comparable to the questions years ago of encoding U+01BC LATIN CAPITAL LETTER TONE FIVE as distinct from '5', U+01C0 LATIN LETTER DENTAL CLICK as distinct from '|', and U+0222 and U+0223 LATIN * LETTER OU as distinct from '*'. In each case, the identity of the character as a letter, with letter properties, outweighed the potential for legacy transcoding problems.
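
The "letter properties" at stake in the first two precedents can be compared directly with Python's `unicodedata`. The pairing of each letter with its lookalike follows the paragraph above; the variable names are assumptions for this sketch.

```python
import unicodedata

# (letter, lookalike) pairs taken from the precedents cited above.
PAIRS = [("\u01BC", "5"), ("\u01C0", "|")]

categories = [
    (unicodedata.category(letter), unicodedata.category(lookalike))
    for letter, lookalike in PAIRS
]

for (letter, lookalike), (gc_letter, gc_lookalike) in zip(PAIRS, categories):
    print(f"U+{ord(letter):04X} gc={gc_letter}  vs  {lookalike!r} gc={gc_lookalike}")
```

In both pairs the dedicated character is classed as a letter, while its lookalike is a digit or a symbol.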

It seems unlikely that there are large amounts of legacy data for these Americanist transcriptions that use both CENT SIGN and LATIN SMALL LETTER C WITH STROKE such that disambiguation would become a problem.

-Doug Ewell
Fullerton, California

Date/Time: Tue Jun 1 02:21:39 CDT 2004
Contact: Martin Duerst

This is a comment on behalf of the W3C I18N WG.

Any consideration for encoding this with U+0338 (COMBINING LONG SOLIDUS OVERLAY) seems to be missing. Given the standing policy that no more precomposed letters are being encoded, there may be no need at all to encode this letter. The samples in the pdf document all show slanted strokes (rather than vertical), even in the Roman font example, but there is no discussion of the possibility of using a combining character.
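
A minimal sketch of the combining-character alternative raised here, using Python's standard library. It shows that a sequence of c followed by U+0338 uppercases letter by letter and is left untouched by NFC; how well fonts render such a sequence is a separate question that the sketch does not address.

```python
import unicodedata

OVERLAY = "\u0338"

lower = "c" + OVERLAY
upper = lower.upper()                       # base letter is cased, mark is kept
nfc = unicodedata.normalize("NFC", lower)   # no precomposed form to compose into

print(unicodedata.name(OVERLAY))
print([f"U+{ord(ch):04X}" for ch in upper])
print(nfc == lower)
```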


36 Draft Unicode Technical Report #30 Character Foldings

Date/Time: Tue May 25 16:52:07 CDT 2004
Contact: John Cowan

I am proposing a draft version of DiacriticFolding.txt, temporarily located at http://www.ccil.org/~cowan/DiacriticFolding.txt. A list of characters *not* appearing in the file is at http://www.ccil.org/~cowan/DiacriticFoldingExceptions.txt

Date/Time: Tue May 25 12:04:27 CDT 2004
Contact: dominikus at scherkl.de

The definition of [KD] in the Note to the folding-table (4.2) contains a duplicate half-sentence.

Phoenician (L2/04-141)

Date/Time: Thu Apr 29 08:28:50 EDT 2004
Contact: Peter Kirk

Michael Everson has made a proposal, N2746, for encoding the Phoenician script in the UCS. The principle of encoding Phoenician separately from Hebrew has been discussed at length e.g. on the Unicode Hebrew list, and remains highly controversial. Indeed it seems to have won little support in these discussions apart from that of the current proposer. The general scholarly practice is to encode Phoenician, paleo-Hebrew etc as Hebrew script with variant glyphs. A change to using a separate Phoenician script will be disruptive and will compromise existing encoded texts. The user community is apparently far from convinced that the negative effects of this change will be outweighed by any claimed benefits.

In section C point 2a of the proposal the proposer states that no contact has been made with the user community. In fact there has been some contact, at least on the Unicode Hebrew list, but the users contacted have not been in favour of the principle of the proposal.

Date/Time: Thu Apr 29 15:30:26 EDT 2004
Contact: John Cowan

I believe that it is inappropriate to encode Phoenician script at this time. The Roadmap provides for no less than 8 copies of the same 22-character West Semitic abjad (viz. Hebrew, Mandaic, Samaritan, North Arabic, Palmyrene, Nabataean, Phoenician, Aramaic). Before any of these other than Hebrew are encoded, we need to have a systematic justification for making precisely these cuts in the complex Semitic family tree and no others. Saying simply "Adherence to the Roadmap" does not cut it. (Greek, Arabic, Syriac, and Indic, though also descendants of Phoenician, are not relevant because they are no longer 22-character abjads).

In particular, if all of these are encoded using the Hebrew block, they will "just work" without any further implementation effort, since none of them require any treatment different from that applied to the subset of Hebrew characters represented by the base characters excluding final forms. This is a real advantage to users. An affirmative defense is needed for disunifying these scripts from Hebrew.

Date/Time: Mon May 10 12:24:59 CDT 2004
Contact: John Cowan

I wish to withdraw my remarks opposing the encoding of Phoenician as a separate script.

I also urge the UTC to collate Hebrew and Phoenician scripts jointly in the default collation, so that aleph and alaph are given the same primary weight, beth and beth, etc. etc.

Other Hebrew Issues (L2/04-213)

Date/Time: Wed Jun 9 09:42:40 CDT 2004
Contact: Peter Kirk

This is a response from Peter Kirk to Jony Rosenne's submission "Responses to Several Hebrew Related Items", L2/04-213 (http://www.unicode.org/L2/L2004/04213-rosenne.pdf).

I appreciate Jony Rosenne's comments on these Hebrew related items.

My position on the Phoenician proposal is already clear from L2/04-206. If the proposal for a new script is accepted despite the position against it of scholars of north-west Semitic script, then Rosenne's second paragraph becomes an important observation.

I agree with Rosenne's comments on Meteg. On Qamats Qatan, I agree that this is a glyph variant of Qamats and should be treated as such. The most appropriate mechanism would appear to be a Variation Selector, but this depends on an extension of the currently defined mechanism to support variant glyphs of combining characters. An acceptable alternative might be a new character with a compatibility decomposition to the existing Qamats.

On Holam, Rosenne rightly points out that this is an important plain text issue which must be addressed by the UTC. His support for Option B1 in my proposal (http://qaya.org/academic/hebrew/Holam.html) seems theoretically neat, but the UTC should not entirely avoid considerations of implementation feasibility. The difficulty is that implementation of this option requires the rendering engine to position a glyph according to the phonetic environment of the sound represented by the character, and not only the graphical environment of the glyph. This is well outside the intended scope of rendering engines, although it may just be feasible for some engines to distinguish the environments commonly encountered in practice. The encoding with ZWNJ which is my Option B2 avoids this requirement for the rendering engine to determine the phonetic environment, by distinguishing the two positions of the glyph by the presence or absence of ZWNJ. This also ensures that the glyph is positioned correctly even in some rare cases, e.g. the divine name as discussed below, where the method of determining it from the phonetic environment breaks down.

My most significant comments here are on the section of Rosenne's submission entitled "Qere and Ketiv". It seems to me that there is a basic misunderstanding in this section. Rosenne seems to hold that Unicode should not seek to represent the actual form of the pointed text of the Hebrew Bible, as it has been presented in manuscripts and printed editions for more than 1000 years, but only either the unpointed text (Ketiv) or the form which is to be pronounced, as reconstructed from marginal notes (Qere). But Unicode is supposed to represent written texts, not their pronunciation. So the forms which should take precedence are those which are actually found on paper.

Fortunately, there is in fact much less of a problem in representing these forms than Rosenne seems to suggest. There are some cases in which Hebrew points appear as spacing diacritics, either at the beginning of a word or in words which have no base characters, but the Unicode representation for this is well-known: the combining marks are combined with NBSP (or SPACE, but the latter is inappropriate here as the word must not be broken). There are also cases of two vowel points combined with one base character, but the issues here have already been considered by the UTC, in August 2003, and the principle was accepted that the vowel points can be separated by CGJ to avoid inappropriate canonical reordering. There are some challenges here for rendering, but none for Unicode representation.
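
The two representations just described can be illustrated with a minimal Python sketch using the standard unicodedata module. The choice of LAMED with PATAH and HIRIQ is only an illustrative assumption, not a form cited above, and the variable names are hypothetical. The sketch shows that two vowel points on one base character are reordered by canonical normalization according to their combining classes, that an intervening CGJ (U+034F, combining class 0) blocks this reordering, and that a point carried on NBSP survives NFC unchanged.

```python
import unicodedata

# Illustrative characters (the choice of letter and points is an assumption).
LAMED = "\u05DC"  # HEBREW LETTER LAMED
PATAH = "\u05B7"  # HEBREW POINT PATAH, canonical combining class 17
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ, canonical combining class 14
CGJ = "\u034F"    # COMBINING GRAPHEME JOINER, canonical combining class 0
NBSP = "\u00A0"   # NO-BREAK SPACE

# Two vowel points on one base character, in the intended written order.
without_cgj = LAMED + PATAH + HIRIQ
# The same sequence with CGJ separating the two points.
with_cgj = LAMED + PATAH + CGJ + HIRIQ
# A vowel point shown as a spacing diacritic, carried by NBSP.
spacing_point = NBSP + HIRIQ

nfc_without = unicodedata.normalize("NFC", without_cgj)
nfc_with = unicodedata.normalize("NFC", with_cgj)
nfc_spacing = unicodedata.normalize("NFC", spacing_point)

# Without CGJ the lower combining class (HIRIQ) is moved ahead of PATAH.
print([f"U+{ord(c):04X}" for c in nfc_without])
# With CGJ the written order is preserved.
print([f"U+{ord(c):04X}" for c in nfc_with])
```

Whether a given font and rendering engine then displays the CGJ-separated points correctly is a separate rendering question, which this sketch does not address.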

The rare forms of the divine name whose correct pointing causes a problem, as in the right-hand image in Figure 4 of my Holam proposal, are probably technically cases of "perpetual Qere" and so not pronounced as written, although some in fact hold that the pronunciation as written (YEHOVAH) is correct. Nevertheless, the form as written, complete with anomalous position of Holam, is printed in a standard scholarly text (Biblia Hebraica Stuttgartensia) and is of special religious significance to some. It should therefore be supported in plain text. This is further evidence that the simple form of Option B1 of my Holam proposal is inadequate.