Comments on Public Review Issues

L2/09-385

(August 4, 2009 - October 28, 2009)

The sections below contain comments received on the open Public Review Issues and other feedback as of October 28, 2009, since the previous cumulative document was issued prior to UTC #120 (August 2009).

147 Proposed Deprecation of U+0673 ARABIC LETTER ALEF WITH WAVY HAMZA BELOW

No feedback was received via the reporting form this period.

150 Draft UTS #46: Unicode IDNA Compatible Preprocessing (See: L2/09-390)

Date/Time: Thu Oct 22 10:40:29 CDT 2009
Contact: matial@il.ibm.com
Name: Matitiahu Allouche
Report Type: Public Review Issue

This is a comment about Public Review Issue #150: Draft UTS #46.

In the FAQ question about "What is "bidi label hopping"?", the example is wrong. The display of "B1.d" in a RTL paragraph is "d.1B".

A better example would be "B1.2d" which displays as "1.2dB" in a RTL paragraph and "1.2Bd" in a LTR paragraph.

Date/Time: Fri Sep 18 06:54:12 CDT 2009
Contact: tom@bluesky.org
Name:
Report Type: Public Review Issue
Opt Subject: Draft UTS #46

Re FAQ item, Why Allow ZWJ/ZWNJ at all

I totally defer to the expertise of the authors on this subject, but I thought the basic reason these needed to be allowed was that (whatever the original intentions) they now have semantic significance in certain scripts and languages, such as Farsi/Persian, Malayalam, and Sinhala.

Other Reports

Date/Time: Fri Aug 14 17:43:44 CDT 2009
Contact: andrewcwest@gmail.com
Name: Andrew West
Report Type: Error Report
Opt Subject: Typo in note for U+01A6

The note for U+01A6 (Latin Letter Yr) states that it comes "from German Standard DIN 31624 and ISO 5246-2"

ISO 5246-2 is a typo for ISO 5426-2.

Date/Time: Tue Aug 18 16:19:35 CDT 2009
Contact: moses@blugs.com
Name: Brian Hall
Report Type: Problems / Feedback about website
Opt Subject: Typo in French data for U+02AC

Hi,

In the French localized PDF chart U0250.pdf, and in the summary chart (http://unicode.org/fr/charts/charindex.html) there is a typo for U+02AC: "BILALIALE" which should be "BILABIALE".

Thank you for your attention.

Date/Time: Mon Sep 7 11:32:39 CDT 2009
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Other Question, Problem, or Feedback
Opt Subject: UTS18: more compact way to describe \p{blank}

Annex C: Compatibility Properties of UTS18 describes a fairly complicated way to derive the \p{blank} property. I believe it can be more simply stated as Zs + TAB. One also has to exclude ZWSP in earlier versions of TUS.
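
As an illustration of the simpler statement suggested above (Zs + TAB), here is a minimal Python sketch. The function name is_blank is hypothetical, and the result depends on the Unicode version bundled with the Python interpreter; it is not the definition given in UTS #18.

```python
import unicodedata


def is_blank(ch: str) -> bool:
    """Return True if ch is U+0009 TAB or has General_Category Zs.

    Hypothetical helper: a sketch of the "Zs + TAB" formulation only.
    """
    return ch == "\t" or unicodedata.category(ch) == "Zs"


if __name__ == "__main__":
    for ch in ["\t", " ", "\u00a0", "\n", "\u200b"]:
        print(f"U+{ord(ch):04X}", is_blank(ch))
```

In Unicode versions where ZWSP (U+200B) is no longer a Zs character, no explicit exclusion is needed; with older data the exclusion mentioned in the report would have to be added by hand.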

Date/Time: Tue Oct 27 09:56:16 CST 2009
Contact: lorna_priest@sil.org
Name: Lorna Priest
Report Type: Error Report
Opt Subject: annotation for U+A78B in code charts

Annotation says: * Me'phaa (Mexico)

It is using an apostrophe or U+02BC or something other than U+A78C. The language name should use U+A78C in the name rather than a curly apostrophe or whatever it is.

Feedback on Encoding Proposals

Date/Time: Fri Aug 14 20:07:59 CDT 2009
Contact: empu_pallawa@yahoo.co.id
Name: Budi Sayoga
Report Type: Feedback on an Encoding Proposal
Opt Subject: Review - Javanese Proposal N3319

Dear Team, I have been studying the Javanese (carakan) proposal N3319 (A980-A9DF). Javanese script has many dependent vowels, like Balinese, but in this proposal I can't find 'taling tarung' (Balinese: taling tedung). Its decomposition is taling (A9BA) and tarung (A9B4). In the rules of the Javanese writing system, taling tarung is a standard character, and 'tolong' is only used in cacarakan (Sundanese carakan). For accuracy and compatibility, this character must be registered at a different address, because this vowel is different from 'tolong' (A9B5). I think you must provide this character in the proposal; it would make it easier to develop smart fonts, render glyphs, and translate languages.

regards

Closed Public Review Issues

All the feedback in the section below that is shown in gray text is feedback for TUS 5.2 or UCA 5.2 and has already been dealt with by the editorial committee, prior to the TUS and UCA 5.2 releases.

Date/Time: Thu Aug 27 12:05:52 CDT 2009
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Public Review Issue
Opt Subject: Issues with 5.2 candidate files

I wouldn't say these are show stoppers, but you ought to be aware of them.

1) In two places, your comments don't use U+ notation, but instead use \u. DerivedBidiClass.txt lines 16 and 29.

2) DerivedAge.txt continues to have the comment that Ken Whistler agreed was at least misleading if not outright wrong. Does it take an act of the UTC to remove a comment? If not, shouldn't you remove it now?

3) And finally, I suspect that this comment in SpecialCasing.txt is wrong:

# IMPORTANT-when capitalizing iota-subscript (0345)
# It MUST be in normalized form--moved to the end of any sequence of combining marks.
# This is because logically it represents a following base character!
# E.g. <iota_subscript> (<Mn> | <Mc> | <Me>)+ => (<Mn> | <Mc> | <Me>)+ <iota_subscript>
# It should never be the first character in a word, so in titlecasing it can be left as is.

I asked on the public forum about this, and I thought the responses I got indicated it was wrong. I saw nothing in the standard to justify the comment, nor in the FAQ (and the link in the Greek FAQ to an outside page for more information isn't currently working; I don't know if it is temporary or not). It seems to me that this comment should be removed before 5.2 is published unless doing so requires UTC action.

Date/Time: Fri Aug 28 17:44:14 CDT 2009
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Public Review Issue
Opt Subject: TR44 should specify @missing syntax conventions

I don't remember if I made this comment earlier or not. The @missing comment lines are designed to be machine readable. Is there assurance that these will continue to be in the UCD and that their formats won't change in the future? And what are the conventions regarding that format? It may appear to be obvious, but I think it should be written down. Also, it is unclear to me why there needs to be an extra blank field in the one in DerivedNumericValues.txt. I presume it is there to correspond with the field that used to be the numeric type. But why that should have to extend to a comment is beyond me, and it is the odd one out of all the @missing lines in the UCD.

Date/Time: Sun Aug 30 11:38:59 CDT 2009
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Public Review Issue
Opt Subject: @missing line in 5.2 DerivedNormalizationProps.txt

# @missing: 0000..10FFFF; NFKC_CF; <codepoint>

This line has "codepoint" as one word; everywhere else it is <code point>.
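
To show why small variations in @missing lines matter to machine readers, here is a minimal Python sketch of a parser for such lines. The field layout is inferred only from the example quoted in this report; the names MISSING_RE and parse_missing are hypothetical, and nothing here is a documented UCD convention.

```python
import re

# Assumed shape: "# @missing: <start>..<end>; field; field; ..."
MISSING_RE = re.compile(
    r"^#\s*@missing:\s*([0-9A-Fa-f]{4,6})\.\.([0-9A-Fa-f]{4,6})\s*;(.*)$"
)


def parse_missing(line):
    """Return (start, end, fields) for an @missing line, else None."""
    m = MISSING_RE.match(line.strip())
    if m is None:
        return None
    start = int(m.group(1), 16)
    end = int(m.group(2), 16)
    # Empty fields are kept as empty strings so that callers can see them.
    fields = [f.strip() for f in m.group(3).split(";")]
    return start, end, fields


if __name__ == "__main__":
    print(parse_missing("# @missing: 0000..10FFFF; NFKC_CF; <code point>"))
    # A made-up line with an empty field, for illustration only:
    print(parse_missing("# @missing: 0000..10FFFF; ; NaN"))
```

A consumer that compares the last field against the literal string "<code point>" would miss the one-word spelling "<codepoint>", which is the kind of inconsistency a written-down convention would prevent.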

Date/Time: Mon Sep 7 22:05:50 CDT 2009
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Public Review Issue
Opt Subject: TR10-19: broken link for reference [SortAlg]

The link proposed for background on the names and characteristics of different sorting methods, (see [sortAlg]) is now broken: the domain name is no longer assigned and is cybersquatted.

Some other alternate links can provide such info:
- http://en.wikipedia.org/wiki/Sorting_algorithm# (notably the table found in the "Comparison of algorithms" section which summarizes some characteristics).
- http://www.dcc.uchile.cl/~rbaeza/handbook/sort_a.html "Sorting Algorithms", in "Handbook of Algorithms and Data Structures", chapter 4, by Gaston H. Gonnet (Informatik, ETH Zurich) and Ricardo Baeza-Yates (Dept. of Computer Science, Univ. of Chile).
- http://www.softpanorama.org/Algorithms/sorting.shtml many other useful links about sort algorithms and their implementation in various languages.

Date/Time: Tue Sep 8 00:03:37 CDT 2009
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Public Review Issue
Opt Subject: TR10-19: 3.2.1 File Format: collation weight values

The syntax for the default DUCET table (in plain text format) should be more appropriately documented, notably for these lines describing the entries:

<entry> := <charList> ';' <collElement>+ <eol>
<collElement> := "[" <alt> <char> "." <char> "." <char> ("." <char>)* "]"
<alt> := "*" | "."

The references to <charList> and <char> are not described, and the name <char> is even misleading. I would write instead:

<entry> := <charList> ';' <collElement>+ <eol>
<charList> := <char> (SPACE <char>)*
<collElement> := "[" <alt> <weight> "." <weight> "." <weight> ("." <weight>)* "]"
<alt> := "*" | "."
<char> := <hexdigit><hexdigit>*
<weight> := <hexdigit><hexdigit>*

making a clear distinction between <char> and <weight> values (the first one are standard, the others are subject to variation across versions of the DUCET and Unicode itself when new characters are added to this table, or their default relative collation order is updated/corrected).

Currently, all weight values found in the DUCET's "allkeys.txt" file use at least 4 digits (in fact always 4 for the first 3 weight values, and up to 6 digits for the fourth weight, to take into account the possible values of code points). I think it is a waste of space, notably because the current version already describes that the second and third weights are limited in range: 10 bits for the second weight (i.e. at most 3 hex digits), and 6 bits for the third weight (i.e. at most 2 hex digits). This means that too many unnecessary zeroes are present throughout this giant file.

It would then be good to specify that weight values in the text format of the DUCET will use variable width (this width may change in future versions, such as here where the secondary weights were extended from 8 to 10 bits, requiring 3 significant hex digits, and the third weights were reduced from 8 to 6 bits, without affecting the number of significant hex digits). These value range limits for weights in the DUCET should be reflected in the format.

So instead of writing:

0020 ; [*0209.0020.0002.0020] % SPACE
02DA ; [*0209.002B.0002.02DA] % RING ABOVE; COMPATSEQ

Why not simply reduce the lines to just what is needed to process the file according to these limits (i.e. just 3 digits for the secondary weight, and 2 digits for the third one)?

0020 ; [*0209.020.02.0020] % SPACE
02DA ; [*0209.02B.02.02DA] % RING ABOVE; COMPATSEQ

Or even suppress all the unnecessary leading zeroes in the collation weights (even if they are kept in the leading code point values)?

0020 ; [*209.20.2.20] % SPACE
02DA ; [*209.2B.2.2DA] % RING ABOVE; COMPATSEQ

Are there really applications that expect the fixed width format for parsing the "allkeys.txt" file, when they are in fact preferably using a compressed binary file, or an alternate representation (like in Java or ICU syntaxes) where the effective table will be built dynamically?

Is it also necessary to keep the extra spaces before the comments and before the semicolons? The following is equally readable:

0020;[*209.20.2.20]% SPACE
02DA;[*209.2B.2.2DA]% RING ABOVE; COMPATSEQ
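
To illustrate the point about fixed-width fields, here is a minimal Python sketch of a parser that follows the grammar proposed above and ignores digit width and optional spacing. The names ENTRY_RE, ELEMENT_RE, and parse_entry are hypothetical, and the comment delimiter handling ("%" or "#") is an assumption based on the examples in this report.

```python
import re

HEX = r"[0-9A-Fa-f]+"
# <entry> := <charList> ';' <collElement>+
ENTRY_RE = re.compile(
    rf"^\s*({HEX}(?:\s+{HEX})*)\s*;\s*((?:\[[^\]]*\]\s*)+)"
)
# <collElement> := "[" <alt> <weight> "." <weight> "." <weight> ("." <weight>)* "]"
ELEMENT_RE = re.compile(rf"\[([*.])({HEX}(?:\.{HEX}){{2,}})\]")


def parse_entry(line):
    """Return (code_points, elements) or None if the line is not an entry.

    Each element is (is_variable, weights); weights are parsed as integers,
    so leading zeroes and field width do not matter.
    """
    body = re.split(r"[%#]", line, maxsplit=1)[0]  # drop trailing comment
    m = ENTRY_RE.match(body)
    if m is None:
        return None
    chars = tuple(int(c, 16) for c in m.group(1).split())
    elements = [
        (alt == "*", tuple(int(w, 16) for w in weights.split(".")))
        for alt, weights in ELEMENT_RE.findall(m.group(2))
    ]
    return chars, elements


if __name__ == "__main__":
    variants = [
        "0020 ; [*0209.0020.0002.0020] % SPACE",
        "0020 ; [*0209.020.02.0020] % SPACE",
        "0020 ; [*209.20.2.20] % SPACE",
        "0020;[*209.20.2.20]% SPACE",
    ]
    for v in variants:
        print(parse_entry(v))
```

Under these assumptions all four spellings of the SPACE entry parse to the same code point and weights, so a parser written this way would not depend on the fixed-width layout.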

Anyway, I am still not convinced that all the entries in the allkeys.txt file are effectively sorted according to the SHIFTED order preference, as stated in the HTML documentation. Has it been really tested?

Why isn't there also documentation for the extra tags at the end of comments? Some of them have disappeared (like COMPATSEQ shown in the HTML documentation), and some others are used (using acronyms after a "QQ" prefix). I think that they are just there for manual editing or tracing changes (and that's why I suggest that the order of entries in that file may need some automated testing).

Finally, the TR10-19 document proposes a link to the updated (beta) version of the allkeys.txt file, but the link is broken (only the 5.1 version is available, including on the FTP site). Maybe this file is currently being regenerated due to bug submissions or other pending corrections.

Date/Time: Tue Sep 8 23:19:20 CDT 2009
Contact: doug@ewellic.org
Name: Doug Ewell
Report Type: Public Review Issue
Opt Subject: Ordering of Tamil and Malayalam in UCA

The announcement of the Public Review issue stated:

1. The data files contain weights for all new assigned characters.
b. The ordering for Tamil and Malayalam has been improved,
but would still need tailoring for the Tamil and Malayalam
languages.

I guess I'm puzzled why the default order for these two scripts wouldn't match the overwhelmingly dominant language written in those scripts. It's often stated that the default ordering for Latin also isn't appropriate for any language, but that's more understandable since so many languages are written in Latin.

I don't claim to be an expert in either Tamil or Malayalam.
