L2/11-154

Comments on Public Review Issues
(February 3, 2011 - May 5, 2011)

The sections below contain comments received on the open Public Review Issues and other feedback as of May 5, 2011, since the previous cumulative document was issued prior to UTC #126 (February 2011).

Contents:

177 Proposed Update UTS #46: Unicode IDNA Compatibility Processing
179 Changes to Unicode Regular Expression Guidelines
181 Changing General Category of Twelve Characters
182 Proposed Update UTS #18: Unicode Regular Expressions
Feedback on Encoding Proposals
Closed Public Review Issues
Other Reports


177 Proposed Update UTS #46: Unicode IDNA Compatibility Processing

No feedback was received via the reporting form this period.

179 Changes to Unicode Regular Expression Guidelines

See also L2/11-163 for more feedback on PRI #179.

Date/Time: Wed Mar 9 15:09:15 CST 2011
Contact: rsc@swtch.com
Name: Russ Cox
Report Type: Public Review Issue
Opt Subject: pri179 case-insensitive matching

A few comments about the example in PRI179.

The text given in section 3 of PRI179 is not explicit about the relative order of case folding vs class negation. It seems to suggest that the class is expanded via closing under case after it is otherwise computed, without reference to whether this is a negated character class or a regular one. This would mean that since /[\x{00}-@\[-\x{10FFFF}]/ and /[^A-Z]/ denote the same class, /(?i)[\x{00}-@\[-\x{10FFFF}]/ and /(?i)[^A-Z]/ would also denote the same class. Unfortunately, this is at odds with existing convention in tools like grep and libraries like PCRE and RE2, which treat the case folding of [\x{00}-@\[-\x{10FFFF}], which includes a-z, as adding A-Z, but treat the case folding of [^A-Z] as subtracting a-z.

This case should be treated explicitly to make clear what is expected.
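
The difference between the two orders can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the universe is restricted to ASCII, the closure is a plain upper/lower swap rather than full Unicode case folding, and the helper names (`case_closure`, `negate`) are hypothetical, not taken from PRI179 or from any regex library.

```python
# Illustrative sketch: ASCII-only universe and a simple upper/lower swap,
# not the full Unicode case-folding closure discussed in PRI179.
UNIVERSE = set(range(0x80))

def case_closure(cps):
    """Return cps plus the other-case form of every ASCII letter in it."""
    closed = set(cps)
    for cp in cps:
        ch = chr(cp)
        if ch.isalpha():  # within ASCII this is exactly A-Z and a-z
            closed.add(ord(ch.swapcase()))
    return closed

def negate(cps):
    """Complement of cps within the (ASCII-only) universe."""
    return UNIVERSE - cps

upper = set(range(ord("A"), ord("Z") + 1))

# Order 1: resolve [^A-Z] first, then close under case.
# The negated class contains a-z, so the closure adds A-Z back.
close_after_negation = case_closure(negate(upper))

# Order 2: close A-Z under case first, then negate.
# This subtracts a-z as well, the behavior described for grep, PCRE and RE2.
negation_after_close = negate(case_closure(upper))

print(ord("A") in close_after_negation)   # letters are matched
print(ord("a") in negation_after_close)   # letters are excluded
```

Under order 1 the class ends up matching every character, while under order 2 it matches everything except letters, so the two readings of /(?i)[^A-Z]/ genuinely differ.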

A similar problem arises for negated subclasses, both explicit ones like [[^A-Z]] and non-explicit ones like [\P{Lu}]. It is important to clarify cases like this one too. As a test case, what does [^[\p{Lu}&&\p{Greek}]||[\P{Lu}&&\p{Common}]] match, and why?

181 Changing General Category of Twelve Characters

No feedback was received via the reporting form this period.

182 Proposed Update UTS #18: Unicode Regular Expressions

See also L2/11-164 for more feedback on PRI #182.

Date/Time: Sat Apr 30 15:25:29 CDT 2011
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Public Review Issue
Opt Subject: Proposed Update Unicode Technical Standard #18

The "Script Extensions" property is mentioned in this draft as if it were a standard property. I can find no mention of it other than in ScriptExtensions.txt. It is not, for example, in PropertyAliases.txt, nor could I find the phrase in the Unicode 6.0 pdf. Perhaps this is because it is not a 1-1 mapping. In any event, it needs more clarification. I would like to see ScriptX be an alias for it.

The transformation of text to NFD or NFKD before regex matching only works when the case-insensitivity applies to the regex as a whole. Many engines allow parts to be matched caselessly while other parts are not. Sometimes this is the result of combining two patterns into a larger one. I think this issue should be mentioned in the document.

I agree with the idea that caseless matching should not apply to most properties, as that is more likely what the writer intended and does not introduce non-obvious security issues. Therefore I disagree with your proposed changes. For example, ASCII_Hex_Digit otherwise matches outside ASCII under caseless matching. I think that these considerations should trump the others.

There is the following statement in PRI179: "Also under that alternative approach, an implementation cannot fully resolve a character class containing properties, and then apply case-closure; instead, it must apply case-closure selectively as the character class is interpreted." This is unclear as to its intent. An implementation will need to do this selectively when first parsing the character class, yes; but this need only be done once per regular expression. The pattern can be compiled into a form that needs no further parsing. This is how Perl 5.14 works on these. It applies case closure selectively to just a few properties and compiles the result. The engine that performs matches knows only which code points it is supposed to match; how they got there, and whether caseless matching was involved, is all gone by then. It just has a list of code points that are to be matched at this point in the pattern.
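
The compile-once approach described above can be sketched roughly as follows. This is not Perl's implementation: the property table, the `Lower_Sample` stand-in property, the set of properties chosen for closure, and all function names are hypothetical and exist only to show case closure being applied selectively at compile time, with the matcher seeing nothing but a resolved set of code points.

```python
# Hypothetical sketch: the property table and all names are illustrative only.
PROPERTIES = {
    "ASCII_Hex_Digit": {ord(c) for c in "0123456789abcdefABCDEF"},
    "Lower_Sample": set(range(ord("a"), ord("z") + 1)),  # stand-in property
}
# Assumption: only these properties are closed under case when caseless.
CASE_CLOSED_PROPERTIES = {"Lower_Sample"}

def simple_case_closure(cps):
    """Add single-character upper/lower forms of each code point."""
    closed = set(cps)
    for cp in cps:
        ch = chr(cp)
        for other in (ch.lower(), ch.upper()):
            if len(other) == 1:
                closed.add(ord(other))
    return closed

def compile_class(items, caseless):
    """Resolve a character class once into a frozenset of code points.

    items is a list of ("lit", characters) or ("prop", property_name).
    """
    result = set()
    for kind, value in items:
        if kind == "lit":
            cps = {ord(c) for c in value}
            close = caseless
        else:
            cps = set(PROPERTIES[value])
            close = caseless and value in CASE_CLOSED_PROPERTIES
        result |= simple_case_closure(cps) if close else cps
    return frozenset(result)

def matches(compiled, ch):
    """The matcher only tests membership; no case logic remains here."""
    return ord(ch) in compiled

compiled = compile_class([("lit", "k"), ("prop", "ASCII_Hex_Digit")], caseless=True)
print(matches(compiled, "K"), matches(compiled, "g"))
```

Once `compile_class` has run, the result carries no trace of whether caseless matching was requested, which is the point made in the comment.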

Date/Time: Sat Apr 30 16:18:25 CDT 2011
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Public Review Issue
Opt Subject: Proposed Update Unicode Technical Standard #18

I sent an earlier comment saying that I could find no other mention of Script_Extensions. I have since found it in UAX24, in which it is listed as provisional, which was my recollection as well. This proposed update appears to treat it as non-provisional. That discrepancy should be addressed.

Date/Time: Sun May 1 12:57:35 CDT 2011
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Error Report
Opt Subject: Proposed Update Unicode Technical Standard #18

I realized that an earlier comment I made was wrong regarding /\p{ASCII_Hex_Digit}/i. Your proposed changes don't have it matching outside ASCII; I forgot that you were also retracting the multi-character fold recommendation. I still think it will lead to undesired matches for a case closure to be constructed for caseless property matches.

Feedback on Encoding Proposals

Date/Time: Tue Feb 15 15:23:12 PST 2011
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/11-052 Wingdings and Webdings symbols

Just to catch this early, for RIGHTT SPEAKER read RIGHT SPEAKER.

"No more FHTORAs!"

Date/Time: Sat Apr 23 17:10:43 CDT 2011
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/11-101 Signwriting

I propose that the international standard numbering of the fingers used by pianists and others be adopted (FIRST FINGER, SECOND FINGER, ... FIFTH FINGER) in the names of signs, rather than the idiosyncratic names used in the proposal, which are not even universal in the English-speaking world ("baby finger" is often "little finger" or "pinkie"; "index" is sometimes "pointer"). Repetition of FINGER can and should be avoided, as in 1D8E5 SIGNWRITING HAND-FIST INDEX THUMB FORWARD INDEX BENT, which could become SIGNWRITING HAND-FIST SECOND FIRST FINGERS FORWARD SECOND BENT. (Why INDEX before THUMB, anyway?)

Date/Time: Thu May 5 12:19:53 CDT 2011
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Error Report
Opt Subject: L2/11-149: Proposal to add Wingdings and Webdings Symbols

In my opinion, glyph w-2116 ought not to be unified with U+24FF, because w-2116 has a fixed shape and U+24FF does not. The latter could be made uniform with U+0030, for example, or with U+24EA CIRCLED DIGIT ZERO, or with the Zapf negative numbers U+2776..U+277F and/or the non-Zapf negative numbers U+24EB..24F4, according to the preferences of the font designer. This objection does not apply to the other unified glyphs.

Closed Public Review Issues

No feedback was received via the reporting form this period.

Other Reports

Date/Time: Sat Feb 12 14:28:18 PST 2011
Contact: rswihananto@gmail.com
Name: R.S. Wihananto
Report Type: Error Report
Opt Subject: Error in Javanese Unicode Code Chart

In the Javanese code chart (http://www.unicode.org/charts/PDF/UA980.pdf), the entry for U+A980 JAVANESE SIGN PANYANGGA has informative alias "= ardhacandra".

This is incorrect. The informative alias should be "= candrabindu". Please compare the code chart of the Javanese script with that of its sister Balinese script. The Balinese script has an 'ardhacandra' character, but the Javanese script doesn't.

Also, in Michael Everson's 'N3319R3 Proposal for encoding the Javanese script in the UCS' document, he wrote: "The characters PANYANGGA, CECAK, and WIGNYAN are analogues to Devanagari CANDRABINDU, ANUSVARA, and VISARGA and behave in much the same way." Therefore the Unicode code chart for Javanese script needs to be corrected.

Date/Time: Mon Feb 28 15:03:56 PST 2011
Contact: bobek@boxpl.com
Name: Michael Bobeck (via Rick McGowan)
Report Type: Error Report
Opt Subject: Collation error in UCA allkeys.txt

Note: This has already been answered by Ken Whistler

I noticed that among codepoints designated as GREEK LETTER in: http://www.unicode.org/Public/UCA/latest/allkeys.txt 

2129 ; [*04E5.0020.0002.2129] # TURNED GREEK SMALL LETTER IOTA

goes long before

03B1 ; [.18DC.0020.0002.03B1] # GREEK SMALL LETTER ALPHA

while it should go after

03B9 ; [.18E9.0020.0002.03B9] # GREEK SMALL LETTER IOTA

but before

03F3 ; [.18EA.0020.0002.03F3] # GREEK (small) LETTER YOT

With best wishes - Michael Bobeck
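
The ordering described in this report can be reproduced from the four allkeys.txt entries quoted above. The sketch below is a minimal illustration, assuming only the single-element line format shown in the report; the parser and its field names are hypothetical and it is not a general allkeys.txt reader. In the allkeys.txt format, a leading `*` inside the brackets marks a variable collation element, while `.` marks a non-variable one.

```python
import re

# The four entries quoted in the report, copied verbatim.
ENTRIES = """\
2129 ; [*04E5.0020.0002.2129] # TURNED GREEK SMALL LETTER IOTA
03B1 ; [.18DC.0020.0002.03B1] # GREEK SMALL LETTER ALPHA
03B9 ; [.18E9.0020.0002.03B9] # GREEK SMALL LETTER IOTA
03F3 ; [.18EA.0020.0002.03F3] # GREEK (small) LETTER YOT
"""

# Handles only single-element lines of the shape shown above.
LINE = re.compile(
    r"^([0-9A-F]{4,6}) ; "
    r"\[([.*])([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\.[0-9A-F]{4,6}\]"
    r" # (.+)$"
)

def parse(line):
    """Split one entry into code point, variable flag, weights and name."""
    cp, marker, primary, secondary, tertiary, name = LINE.match(line).groups()
    return {
        "cp": int(cp, 16),
        "variable": marker == "*",
        "key": (int(primary, 16), int(secondary, 16), int(tertiary, 16)),
        "name": name,
    }

entries = [parse(line) for line in ENTRIES.splitlines()]

# Sort by (primary, secondary, tertiary) weights, lowest first.
for entry in sorted(entries, key=lambda e: e["key"]):
    flag = "variable" if entry["variable"] else "non-variable"
    print(f"U+{entry['cp']:04X} primary={entry['key'][0]:04X} {flag} {entry['name']}")
```

U+2129 is the only one of the four entries carrying the `*` marker, and its primary weight 04E5 is far below the 18xx weights of the Greek letters, which is why it sorts ahead of them.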

Date/Time: Tue Mar 1 23:36:37 CST 2011
Contact: rv@rasmusvillemoes.dk
Name: Rasmus Villemoes
Report Type: Error Report
Opt Subject: Wrong classification of four symbols

Note: Ken Whistler responded to this already.

Dear Unicode

I hope this is the proper way to report this observation, and that I've chosen the right line in the drop-down box.

On page 218 of CodeCharts.pdf (version 6.0, of course), the four symbols U+2223, U+2224, U+2225, and U+2226 (DIVIDES, DOES NOT DIVIDE, PARALLEL TO, NOT PARALLEL TO) are listed under the heading "Operators", but they are clearly (binary) relations, not operators. I don't know if the classification of symbols implicitly defined by these headings is part of the Unicode standard.

Sincerely,
Rasmus Villemoes

Date/Time: Mon Mar 7 20:11:11 CST 2011
Contact: liancu@microsoft.com
Name: Laurentiu Iancu
Report Type: Error Report
Opt Subject: Inconsistent definition of IICore in the text of TUS

TUS 6.0 gives two slightly different definitions of what IICore stands for: "International Ideograph Core" on p. 397 in Chapter 12 and "Ideographic International Core" on p. 586 in Appendix E. Although the difference is minor, the same definition should be used consistently throughout the Standard.

This is not new in TUS 6.0; the same difference existed in TUS 5.0 and 5.2 online.

Regards,
L.

Date/Time: Sun Mar 13 03:38:58 CST 2011
Contact: vargavind@gmail.com
Name: Kess Vargavind
Report Type: Error Report
Opt Subject: U+1F6BB

Regarding U+1F6BB RESTROOM

I suggest that the first alias (“Man and woman symbol with divider”) as well as the referral (“U+1F46B MAN AND WOMAN HOLDING HANDS”) are removed.

The alias describes irrelevant visuals; the symbol is (at least in Sweden) just as common without the divider.

Both the alias and the referral suggest that there are only two genders, which is not true in all cultures and communities. The second alias (“Unisex restroom”) is more inclusive and should be enough.

If the first alias is deemed necessary then there might be a need for further aliases such as “Man and woman symbol without divider”, “Person without depicted gender symbol” and “Intergender symbol” depending on what this symbol is encoded as.

Again depending on the intended ‘semantics’ (sorry for lack of better word) of this symbol referrals to U+1F46C and U+1F46D may or may not be wanted.

In LGBT communities it is not uncommon for this restroom symbol to be depicted with (A) a single gender neutral person, (B) one or several persons with attributes which are typically seen as degendering or overgendering [e.g. transvestites, transgenderists, intersexuals].

Best regards,
Kess

Date/Time: Mon Mar 14 07:50:49 CST 2011
Contact: krunars@gmail.com
Name: Kristján Rúnarsson
Report Type: Error Report
Opt Subject: Invalid reading

U+5AD9 (嫙) has a registered Japanese on-reading "SEB" in the Unihan database. This is in all likelihood an error for "SEN".

Date/Time: Sun Mar 27 12:08:16 CST 2011
Contact: levi.vargas@upr.edu
Name: Levi L. Vargas
Report Type: Error Report
Opt Subject: Errata in UTR25 revision 12

Errata in Unicode Technical Report #25, "Unicode Support for Mathematics" (revision 12, date 2010-10-10):

Page 17, Table 2.3: missing glyph for U+23E1 BOTTOM TORTOISE SHELL BRACKET (not present in Cambria or Cambria Math fonts, but present in Mathcad UniMath, Quivira, and Symbola fonts).

Page 27, Section 2.18 Variation Selector: instead of displaying an empty box as the glyph for U+FE00 VARIATION SELECTOR 1 (VS1), it would be best to show the Private-Use-Area presentation glyph from the font SpecialsUCS4 included with Unibook (the same glyph used in the Code Charts).

Page 30, Table 3.1: wrong codepoint value for CIRCUMFLEX ACCENT (given as 006E, should be 005E).

Page 31, Table 3.2: wrong codepoint value for FULLWIDTH CIRCUMFLEX ACCENT (given as FF4E, should be FF3E).
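
The two code point corrections above can be checked against the character names in the Unicode Character Database. The following is a small sketch using Python's standard `unicodedata` module; the `printed_vs_corrected` mapping simply restates the values from this report.

```python
import unicodedata

# Value printed in UTR #25 revision 12 -> value proposed in this report.
printed_vs_corrected = {0x006E: 0x005E, 0xFF4E: 0xFF3E}

names = {
    cp: unicodedata.name(chr(cp))
    for pair in printed_vs_corrected.items()
    for cp in pair
}

for printed, corrected in printed_vs_corrected.items():
    print(f"printed   U+{printed:04X} {names[printed]}")
    print(f"corrected U+{corrected:04X} {names[corrected]}")
```

The printed values name the letter N and its fullwidth form, while the corrected values name the two circumflex characters the tables describe.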

Date/Time: Thu Mar 31 11:33:28 CST 2011
Contact: jam@massa.com
Name: Joe McDonald
Report Type: Error Report
Opt Subject: documentation errata?

Hi!

I think there's a mistake on p. 94 of the "Unicode Standard Version 6.0 - Core Specification." The 1st sentence below Table 3-6 on p. 94 says that 9F is not a well formed 2nd byte. If I'm not mistaken, it is.
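
Whether 9F is well-formed as a second byte depends on the first byte it follows, which may account for the apparent discrepancy. The sketch below uses Python's built-in strict UTF-8 decoder as a stand-in for the table of well-formed byte sequences; the sample sequences are chosen for illustration and are not taken from this report.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Each sample has 9F as its second byte, after a different first byte.
samples = {
    "C2 9F": bytes([0xC2, 0x9F]),                    # two-byte form of U+009F
    "E0 9F 80": bytes([0xE0, 0x9F, 0x80]),           # non-shortest form
    "ED 9F BF": bytes([0xED, 0x9F, 0xBF]),           # U+D7FF
    "F0 9F 98 80": bytes([0xF0, 0x9F, 0x98, 0x80]),  # U+1F600
}

results = {label: is_well_formed_utf8(data) for label, data in samples.items()}
for label, ok in results.items():
    print(f"<{label}>", "well-formed" if ok else "ill-formed")
```

Only the sequence beginning with E0 is rejected: after E0 the second byte must be in the range A0..BF, so 9F is not well-formed there even though it is accepted after C2, ED and F0.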

On another subject, would you know where I can find out how Microsoft Framework serializes a byte array using UTF-8 encoding?

Thanks,
Joe

Date/Time: Sun Apr 10 22:53:45 CDT 2011
Contact: zgilboa@indiana.edu
Name: Zvi Gilboa
Report Type: Error Report
Opt Subject: 05AA incorrect glyph image

Note: The Editorial Committee had an e-mail discussion of this on April 11, 2011.

Greetings!

In version 6.0 of Unicode_0590.pdf, the character 05AA has the wrong glyph associated with it. Specifically, there should be a vertical short line attached to the vertical curve in the middle.

NOTE: the character name of 05A2 (atnah hafukh), which is directed to 05AA, literally means "atnah up-side-down". Accordingly, 05AA should look like a vertical mirror image of the Atnah (0591).

WEB LINK: 05AA would be the accent at the image's LOWEST LEFT CORNER:

http://he.wikipedia.org/wiki/%D7%A7%D7%95%D7%91%D7%A5:TAMI_MIKRA_ASHKENAZ.png

Thank you for taking the time to look at this error report! Please let me know if I could be of any further assistance.

Sincerely yours,
Zvi Gilboa
Department of Germanic Studies
Indiana University

Date/Time: Mon May 2 00:10:35 CDT 2011
Contact: chrislit@berkeley.edu
Name: Chris Little
Report Type: Error Report
Opt Subject: Adriatic/South Picene confusion

The Unicode Standard 6.0, section 14.2, p. 454, makes reference to the "Adriatic" language, which should be corrected to "South Picene". Elsewhere in the section (pp. 453, 455), this language is correctly identified as "South Picene".

The use of "Adriatic" as a language name probably stems from reference to Bonfante (1996), which identifies South Picene as "Middle Adriatic" in table 23.3 on page 307. Within the text of this work, though, the language is always referred to as "South Picene"--the only term I have ever seen used in any other work discussing the language/script.

Bonfante, Larissa. 1996. “The Scripts of Italy.” In The World's Writing Systems, eds. Peter T. Daniels and William Bright. Oxford: Oxford University Press.

Date/Time: Mon May 2 00:17:51 CDT 2011
Contact: chrislit@berkeley.edu
Name: Chris Little
Report Type: Error Report
Opt Subject: U+1F633 glyph error

There is a missing vertex in the glyph for U+1F633 FLUSHED FACE in the code charts. It should be obvious when viewed, but the missing vertex is the lower right corner of the fifth line across the face, counting from the top.

Date/Time: Wed May 4 05:53:50 CDT 2011
Contact: i@kevincarmody.com
Name: Kevin Carmody
Report Type: Problems / Feedback about website
Opt Subject: Error on Indic scripts FAQ page

The answer to one of the questions on the Indic scripts FAQ page http://www.unicode.org/faq/indic.html makes an incorrect statement about Vedic accents. This page is accessible from the main Unicode page under "FAQ". On the page itself, click on the sixth question on the right side of the page, "Does Unicode cover Vedic accents?"

The answer to the question states that Unicode does not cover Vedic accents. This information is outdated. Unicode version 5.2 added Vedic accents in two new blocks, the Devanagari Extended block U+A8E0-U+A8FF and the Vedic Extensions block U+1CD0-U+1CFF.

Unicode 5.2 also added the Common Indic Number Forms block U+A830-U+A83F. It may be helpful to mention this in the answer to the 20th question, 'The Bangla "fullstop" is similar to the Devanagari danda ...', which mentions common Indic characters.