Comments on Public Review Issues

L2/12-347

(July 25, 2012 - October 31, 2012)

The sections below contain links to permanent feedback documents for the open Public Review Issues as well as other public feedback as of July 24, 2012, since the previous cumulative document was issued prior to UTC #131 (May 2012). This document does not include feedback on moderated Public Review Issues from the forum that have been digested by the forum moderators; those are in separate documents for each of the PRIs. Gray items in the Table of Contents do not have feedback here.

Issue Name (+ feedback links)

207 Proposed Draft UTR #50, Unicode Properties for Vertical Text Layout Moderated

228 Changing some common characters from Punctuation to Symbol

232 Proposed Update UAX #9, Unicode Bidirectional Algorithm (draft HERE)

233 Proposed Update UTR #20, Unicode in XML and other Markup Languages (draft HERE)

The links below go to locations in this document for feedback.

Feedback on Encoding Proposals
Closed Public Review Issues
Error Reports
Other Reports

Feedback on Encoding Proposals

Date/Time: Mon Oct 29 17:20:24 CDT 2012
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/12-333 Request to UTC to Propose 226 Characters for Inclusion in CJK Extension F

Several of the characters listed in this proposal are annotated "Variant form of ..."
These would seem to be candidates for encoding with a variation selector.  Indeed, some 
justification should be provided for not using variation selectors in all these cases.

Date/Time: Mon Oct 29 17:25:51 CDT 2012
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/12-309 Revised Proposal to add the Ahom Script in the SMP of the UCS

AHOM DIGITs 1-9 should be spelled AHOM DIGITs ONE-NINE, as is conventional 
in Unicode (digits are used in Unicode names only to represent shapes).  
Furthermore, AHOM DIGIT 10 and AHOM DIGIT 20 should be AHOM NUMBER TEN and 
AHOM NUMBER TWENTY.

It is not clear that Nd is appropriate for the digits of this system.

Error Reports

Date/Time: Mon Aug 13 22:24:39 CDT 2012
Contact: pedberg@apple.com
Name: Peter Edberg
Report Type: Error Report
Opt Subject: Incorrect kMandarin value for U+7565

Currently for 略 U+7565, Unihan has the following:
kHanyuPinyin	42541.110:lüè
kMandarin	è

The kMandarin value is incorrect; it should be lüè (lüe4), per Lee Collins and several others.

NOTE: Comments below from Richard Cook:

Peter's report reveals related bugs in kHanyuPinlu data, making four errata total:

WRONG:
略 [U+7565]
	kHanyuPinlu  ⇒ e4(445)
	kMandarin  ⇒ è
掠 [U+63A0]  
	kHanyuPinlu  ⇒ e4(62)
	kMandarin  ⇒ è

RIGHT:
略 [U+7565]  
	kHanyuPinlu  ⇒ lüe4(445)
	kMandarin  ⇒ lüè
掠 [U+63A0]
	kHanyuPinlu  ⇒ lüe4(62)
	kMandarin  ⇒ lüè

Maybe add this to errata/feedback pile?

The two kHanyuPinlu errors go back to 2003; fixes should be made all around when possible.

-Richard
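
For readers comparing the two notations in the report above (tone-numbered readings such as lüe4 in kHanyuPinlu, tone-marked readings such as lüè in kMandarin), the following is a minimal sketch of the conversion. The function name numbered_to_marked and the simplified mark-placement rule (mark "a" or "e" if present, else the "o" of "ou", else the last vowel) are assumptions made for illustration; this is not the tooling used to build Unihan.

```python
import unicodedata

# Combining marks for Mandarin tones 1-4 (macron, acute, caron, grave).
TONE_MARKS = {1: "\u0304", 2: "\u0301", 3: "\u030C", 4: "\u0300"}
VOWELS = "aeiou\u00FC"  # includes U+00FC (u with diaeresis)


def numbered_to_marked(syllable):
    """Convert a tone-numbered pinyin syllable (e.g. 'lüe4') to tone-marked form."""
    if not syllable or not syllable[-1].isdigit():
        return syllable
    body, tone = syllable[:-1], int(syllable[-1])
    mark = TONE_MARKS.get(tone)
    if mark is None:
        return body  # neutral tone: no mark
    lowered = body.lower()
    if "a" in lowered:
        pos = lowered.index("a")
    elif "e" in lowered:
        pos = lowered.index("e")
    elif "ou" in lowered:
        pos = lowered.index("ou")
    else:
        positions = [i for i, ch in enumerate(lowered) if ch in VOWELS]
        if not positions:
            return body
        pos = positions[-1]
    # Insert the combining mark after the chosen vowel, then compose with NFC.
    marked = body[: pos + 1] + mark + body[pos + 1 :]
    return unicodedata.normalize("NFC", marked)


print(numbered_to_marked("l\u00FCe4"))  # corrected kHanyuPinlu reading
print(numbered_to_marked("e4"))         # reading currently in the data
```

Under these assumptions the corrected kHanyuPinlu reading maps to the corrected kMandarin reading, and the erroneous e4 maps to the erroneous è, which is consistent with the initial "lü" having been dropped in both fields.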

Date/Time: Tue Aug 21 14:15:53 CDT 2012
Contact: cdutro@twitter.com
Name: Cameron Dutro
Report Type: Other Question, Problem, or Feedback
Opt Subject: Clarifying French Backwards Accent Sorting in TR-10

Note: This comment was passed along to the editorial committee after close of beta.

The TR-10 document is written as though French backwards accent sorting
applies to all French dialects, when in reality it only applies to Canadian
French.  Can the document be updated to mention this fact?  Relevant tickets:
http://unicode.org/cldr/trac/ticket/2905 and
http://unicode.org/cldr/trac/ticket/2984.  Thanks!

Date/Time: Wed Aug 22 18:51:58 CDT 2012
Contact: pedberg@apple.com
Name: Peter Edberg
Report Type: Error Report
Opt Subject: kMandarin error reports from CLDR

We have a couple of CLDR bug reports about pinyin errors for various
characters that are actually the result of errors in the Unihan kMandarin
field for these characters. Here are the CLDR tickets with further details:

 * http://unicode.org/cldr/trac/ticket/3866, Fix pinyin without tones. 

 * http://unicode.org/cldr/trac/ticket/5205, Pinyin errors noted by Åke Persson.

Date/Time: Thu Aug 23 18:19:58 CDT 2012
Contact: markus.icu@gmail.com
Name: Markus Scherer
Report Type: Error Report
Opt Subject: UTS #18 code for collation grapheme clusters vs. discontiguous contractions

In L2/12-250 "observation 94" Richard Wordingham points out that "The code in
UTS#18 Annex B does not appear to be able to handle interleaving discontiguous
grapheme clusters."

In UCA 6.2 we are making a fix to the algorithm in UCA section 6.9. Either UTS
#18 should be updated to match, or it should say that it's incomplete and
refer back to UCA.

UCA section 6.9 refers to UTS #18.

Date/Time: Tue Sep 4 16:21:16 CDT 2012
Contact: daniel.buenzli@erratique.ch
Name: Daniel Bünzli
Report Type: Error Report
Opt Subject: UAX 15 Wrong information about Quick_check and stable code points

Hello,

In Section 9.1, Stable Code Points, of UAX #15, it is said that "characters with
the Quick_Check=YES property value satisfy conditions 1-3".

Unless I'm completely mistaken, this is wrong. For every normalization form there is
at least one character with Quick_Check=YES and a canonical combining class
*different* from 0.

Here are examples:

U+030D ccc=230 && nfc_quick_check=YES
U+0301 ccc=230 && nfd_quick_check=YES
U+030D ccc=230 && nfkc_quick_check=YES
U+0301 ccc=230 && nfkd_quick_check=YES

Best,

Daniel
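
The observation in the report above can be spot-checked with a short sketch. Python's unicodedata module does not expose the Quick_Check property values directly, so this only confirms the combining classes of the cited code points and shows why a non-zero class matters for stability. The helper is_unchanged and the choice of U+0323 COMBINING DOT BELOW (ccc=220) as a neighbouring character are assumptions introduced for illustration.

```python
import unicodedata


def is_unchanged(form, text):
    """Return True if normalizing text to the given form leaves it as-is."""
    return unicodedata.normalize(form, text) == text


# Code points cited in the report; both have a non-zero combining class.
for cp in (0x030D, 0x0301):
    print(f"U+{cp:04X} ccc={unicodedata.combining(chr(cp))}")

# U+030D on its own is left alone by NFC ...
print(is_unchanged("NFC", "\u030D"))

# ... but followed by a mark with a lower combining class (U+0323, ccc=220)
# it is moved by canonical reordering, so it does not behave as a stable
# code point under concatenation.
print(is_unchanged("NFC", "\u030D\u0323"))
```

The first check holds and the second fails: canonical reordering places U+0323 before U+030D, which is the kind of change that a ccc=0 condition rules out.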

Date/Time: Fri Sep 14 12:27:07 CDT 2012
Contact: greg@chown.ath.cx
Name: Grigori Goronzy
Report Type: Error Report
Opt Subject: Error in description of Hangul decomposition

NOTE: This was handed to the editorial committee for action, but the "PS" was added and sent to UTC.

In chapter 3.12, on pages 109-110 of the 6.1.0 core specification it says
for the algorithmic decomposition:

> > If the precomposed Hangul syllable s with the index SIndex (defined above) has the
> > Hangul_Syllable_Type value LVT, then it has a canonical decomposition mapping into a
> > sequence of an LV_Syllable and a T jamo, :
> > LVIndex = (SIndex div NCount) * NCount

But "LVIndex = (SIndex div TCount) * TCount" is correct (the precomposed
LV Hangul syllables are spaced TCount code points apart).

----------

Thanks. I forgot to include this, here's an example:

Consider the codepoint U+AC23. The full LVT decomposition is 1100 1162
11AE. But we actually want to decompose into an LV part and a T part. So
we can simply recompose the first two codepoints, and we get the
decomposition pair AC1C 11AE. However, the simplified algorithm as
documented yields AC00 for the first character of the decomposition
pair.

Best regards
Grigori Goronzy
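
The U+AC23 example in the report above can be reproduced with a small sketch. The constants are the standard Hangul composition constants; the function name lvt_pair and its divisor parameter are assumptions introduced here so that the formula as documented (NCount) and the formula as proposed (TCount) can be compared side by side.

```python
import unicodedata

S_BASE = 0xAC00
T_BASE = 0x11A7
V_COUNT = 21
T_COUNT = 28
N_COUNT = V_COUNT * T_COUNT  # 588


def lvt_pair(code_point, divisor):
    """Split an LVT syllable into (LV syllable, T jamo) using the given divisor."""
    s_index = code_point - S_BASE
    lv_index = (s_index // divisor) * divisor
    t_index = s_index % T_COUNT
    return S_BASE + lv_index, T_BASE + t_index


documented = lvt_pair(0xAC23, N_COUNT)  # formula as printed in the text
corrected = lvt_pair(0xAC23, T_COUNT)   # formula proposed in the report

print([f"{cp:04X}" for cp in documented])  # ['AC00', '11AE']
print([f"{cp:04X}" for cp in corrected])   # ['AC1C', '11AE']

# Cross-check: fully decomposing the LV part and appending the T jamo should
# give the same result as fully decomposing the original syllable.
full = unicodedata.normalize("NFD", chr(0xAC23))
print(full == unicodedata.normalize("NFD", chr(corrected[0])) + chr(corrected[1]))
print(full == unicodedata.normalize("NFD", chr(documented[0])) + chr(documented[1]))
```

Only the TCount version passes the cross-check against the full decomposition 1100 1162 11AE, which agrees with the report.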

Date/Time: Wed Oct 3 18:54:22 CDT 2012
Contact: markus.icu@gmail.com
Name: Markus Scherer
Report Type: Error Report
Opt Subject: UAX #44 6.2 status of Script_Extensions


http://www.unicode.org/reports/tr44/

section 5.7.6 "Similarly, the provisional Script_Extensions property 
has values which ..."
(Please just remove "provisional".)

section 5.8 "The provisional property Script_Extensions consists of ..."
(Please change to "The Script_Extensions property consists of ...")

See the Changes section: "The status of the Script_Extensions property 
was changed from provisional to informative."

Other Reports

2012/10/02, from Ken Whistler

Rick,

This contribution to the unicode list back in June makes a point which was not
addressed in the 6.2 versions of UAX #14 and UAX #29. So that this doesn't get
lost completely, I suggest that you add it to the other feedback section for
consideration at the November UTC meeting.

--Ken

Subject: A question about the default grapheme cluster boundaries with U+0020 as the grapheme base
Date: Sat, 2 Jun 2012 07:22:01 +0300
From: Konstantin Ritt ritt.ks@gmail.com
To: unicode@unicode.org

It seems like there is an inconsistency between what the default
grapheme clusters specification says and what the test results are
expected to be:

UAX#29 says:

> Another key feature (of default Unicode grapheme clusters) is that
> default Unicode grapheme clusters are atomic units with respect to the
> process of determining the Unicode default line, word, and sentence
> boundaries.

This is also mentioned in UAX#14:

> Example 6. Some implementations may wish to tailor the line breaking
> algorithm to resolve grapheme clusters according to Unicode Standard Annex
> #29, “Unicode Text Segmentation” [UAX29], as a first stage. Generally,
> the line breaking algorithm does not create line break opportunities within
> default grapheme clusters; therefore such a tailoring would be expected
> to produce results that are close to those defined by the default algorithm.
> However, if such a tailoring is chosen, characters that are members of line
> break class CM but not part of the definition of default grapheme clusters
> must still be handled by rules LB9 and LB10, or by some additional
> tailoring.

However, <U+0020 (SP), U+0308 (CM)> in the line breaking algorithm is
handled by rules LB10+LB18 and produces a break opportunity, while
GB9 prohibits a break between <U+0020 (Other), U+0308 (Extend)>.
Section 9.2, "Legacy Support for Space Character as Base for Combining
Marks", in UAX#29 clarifies why a line break occurs there, but the
fact remains that the statements above are false and introduce some
ambiguity.
If the space character is no longer a grapheme base, the
grapheme cluster breaking rules need to be updated.

Kind regards,
Konstantin
