L2/12-248

Comments on Public Review Issues
(May 4, 2012 - July 24, 2012)

The sections below contain links to permanent feedback documents for the open Public Review Issues as well as other public feedback as of July 24, 2012, since the previous cumulative document was issued prior to UTC #131 (May 2012). This document does not include feedback on moderated Public Review Issues from the forum that have been digested by the forum moderators; those are in separate documents for each of the PRIs. Gray items in the Table of Contents do not have feedback here.

Contents:

The links below go directly to the feedback documents for open PRIs, as of July 24, 2012.

Issue Name
207 Proposed Draft UTR #50, Unicode Properties for Vertical Text Layout (moderated)
210 Proposed Update UAX #9: Unicode Bidirectional Algorithm (none)
211 Proposed Update UAX #11: East Asian Width (none)
212 Proposed Update UAX #14: Unicode Line Breaking Algorithm
213 Proposed Update UAX #15: Unicode Normalization Forms (none)
214 Proposed Update UAX #24: Unicode Script Property (none)
215 Proposed Update UAX #29: Unicode Text Segmentation (none)
216 Proposed Update UAX #31: Unicode Identifier and Pattern Syntax (none)
217 Proposed Update UAX #34: Unicode Named Character Sequences (none)
218 Proposed Update UAX #38: Unicode Han Database (Unihan) (none)
219 Proposed Update UAX #41: Common References for Unicode Standard Annexes (none)
220 Proposed Update UAX #42: Unicode Character Database in XML (none)
221 Proposed Update UAX #44: Unicode Character Database
222 Proposed Update UAX #45: U-Source Ideographs
223 Proposed Update UTS #10: Unicode Collation Algorithm
224 Proposed Update UTS #46: Unicode IDNA Compatibility Processing (none)
225 Use of Accented Pinyin for kHanyuPinlu in the Unihan database (none)
226 Deprecation of kCompatibilityVariant in the Unihan database (none)
227 Changes to Script Extensions Property Values
228 Changing some common characters from Punctuation to Symbol
229 Linebreaking Changes for Pictographic Symbols
230 Unicode 6.2.0 Beta
231 Bidi Parenthesis Algorithm

The links below go to locations in this document for feedback.

Feedback on Encoding Proposals (some held over from previous meeting)
Closed Public Review Issues
Error Reports
Other Reports

 


Feedback on Encoding Proposals

Items from John Cowan in May 2012 forwarded from previous meeting.

Date/Time: Tue May 1 21:51:47 CDT 2012
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/12-123 Proposal to Encode the Sign SIDDHAM for Devanagari


Since this sign is the same in form and function as its Tibetan look-alike
U+0FD3, I think the two should be unified, provided that the Tibetan sign
actually means the same thing (I can't find information about this).  It's a
little strange to incorporate a Tibetan character into Devanagari fonts, but
it does not seem to require any special Tibetan support.  U+0FD3 is Po rather
than So, but as we know that is not a hard and fast distinction.

Date/Time: Tue May 1 21:57:13 CDT 2012
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/12-124 Proposal to Encode Signs for Writing Kashmiri in Sharada


Since the proposed SHARADA SIGN NUKTA has exactly the same form, function, and
properties as the Devanagari version, I think unification should be strongly
considered.  In the words of the proposal, "these signs were used by Kashmiri
scribes in both Sharada and Devanagari", which implies that the Sharada sign
is borrowed from Devanagari.  In general, when a character is borrowed from a
related script, we don't double-encode it unless its range of forms in the
borrowing script is outside the bounds of the lending script, as with the
Kurdish Q.

The other two marks also have Devanagari look-alikes, but clearly don't share
function with them, so they should be encoded.

Date/Time: Wed May 2 10:10:24 CDT 2012
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/12-011 Preliminary Proposal to Encode Siddham in ISO/IEC 10646


The same issue I raised about DEVANAGARI SIGN SIDDHAM applies here also: this
should be unified with Tibetan U+0FD3, provided the semantics is the same.

Failing that it should at least be unified with the Devanagari sign, since
there is plenty of precedent for sharing Devanagari punctuation/symbols with
other Indic scripts.

Date/Time: Sat Jun 16 07:07:37 CDT 2012
Contact: jeno@hu.ibm.com
Name: Demeczky Jenő
Report Type: Feedback on an Encoding Proposal
Opt Subject: Supporting proposal N4183 of the Rovas encoding


Dear Madam/Sir,

WG2 is going to vote on the above mentioned subject. I would like to
contribute my opinion on it.

Ad N4268: The proposed name for the "rovas" script (Old Hungarian) is
incorrect and misleading. "Old Hungarian" is the name of the Hungarian
language spoken in a well defined period of time (896-1526), having another
kind of written form, the Latin alphabet. The "rovas" alphabet was used in a
wide time frame, and not only by Hungarian, but by almost all Middle-Asian
languages. It is definitely wrong to give the name of a special Hungarian
language variant to a kind of writing system used by many languages throughout
several thousand years. Proposal N4268 covers only one version of the "rovas"
script, which was used about 100 years ago in the then Austro-Hungarian
Empire. Because of this restricted approach, it is impossible to read "rovas"
scripts of former historical periods. Several letter names of proposal N4268
are in error; they do not comply with the Hungarian linguistic terminology.

Ad N4183: The proposed name “Rovas” for the “rovas” script rightly refers to a
writing system and not to a language. Moreover proposal N4183 covers several
historical periods, geographic areas and languages. It is open to include any
further discovery and enhancement either in time, or geography, or language
family. Proposal N4183 takes into account language variations and contemporary
communication needs as well. The character names of proposal N4183 are in tune
with the traditional terminology of Hungarian linguistics.

If there are no new proposals, and the workgroup has to decide between the
above mentioned two, I ask you to vote for N4183.

Best regards,
Demeczky Jenő
MSc in electronic engineering, BME
MA in general and applied linguistics, ELTE
IBM World Wide Translation Terminologist
IBM Translation Services Center Terminologist for Central and Eastern Europe
IBM Hungarian Terminologist
Phone:	+36 (1) 382 5827
Mobile:	+36 (20) 823 5560
EMail:	jeno@hu.ibm.com
International Business Machines Corporation Magyarországi Kft. Neumann János utca 1. Budapest 1117




Error Reports

Date/Time: Thu May 10 12:35:45 CDT 2012
Contact: rscook@unicode.org
Name: RC
Report Type: Error Report
Opt Subject: kMandarin


kMandarin for 晨 U+6668 is given only as "chen", the reading in the most common
polysyllable 早晨 zǎochen. But, why is there no tonal reading given? For
example, the reading is "chén" in the next most common polysyllable (清晨
qīngchén), and "chén" is in fact the only reading in other polysyllables. The
atonal "chen" reading occurs only in zǎochen.

A polysyllabic input method that uses "chen" to mean atonal (as opposed to
"unspecified tone") would probably be OK with this (since there are only a
couple atonal "chen" characters). But trying to input the character alone by
typing "chen" would effectively fail, because "chen" "unspecified tone" finds
too many characters to sort through. And typing other polysyllables (in which
the reading is "chén") will fail completely.

Since tr38 says: 
"The most customary pinyin reading for this character; that is, the reading
most commonly used in modern text, with some preference given to readings most
likely to be in sorted lists."

It seems that the emphasis is on common character-level (i.e. monosyllabic)
readings, so atonal readings (in polysyllables) would not be preferred, though
they probably should not be excluded.

In general, it would probably be best if kMandarin records containing only
atonal readings were updated to include the common tonal readings as well.
Compare, for example, kHanyuPinlu "chen5(124) chen2(28)" for 晨 U+6668, which
does it right.
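
To make the comparison with kHanyuPinlu concrete, here is a minimal sketch that parses a value such as "chen5(124) chen2(28)" and separates the tonal readings from the atonal one. It assumes, from the context of the report, that tone digit 5 marks the atonal reading and that the number in parentheses is a frequency count; the function names are hypothetical and not part of any Unihan tooling.

```python
import re

# One kHanyuPinlu entry: syllable, tone digit, frequency in parentheses.
# Assumption: tone 5 is the atonal (neutral) reading.
ENTRY = re.compile(r"([a-zü]+)([1-5])\((\d+)\)")

def parse_hanyu_pinlu(value):
    """Return a list of (syllable, tone, frequency) tuples."""
    return [(s, int(t), int(f)) for s, t, f in ENTRY.findall(value)]

def tonal_readings(value):
    """Readings with a real tone (1-4), most frequent first."""
    tonal = [e for e in parse_hanyu_pinlu(value) if e[1] != 5]
    return sorted(tonal, key=lambda e: -e[2])

value = "chen5(124) chen2(28)"
print(parse_hanyu_pinlu(value))
print(tonal_readings(value))
```

A check along these lines could flag kMandarin records that carry only an atonal reading while kHanyuPinlu also lists a tonal one.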

Date/Time: Mon May 14 15:41:44 CDT 2012
Contact: markus.icu@gmail.com
Name: Markus Scherer
Report Type: Error Report
Opt Subject: UCA "shifted" conformance test file bugs


(From the unicore list thread "Sterner Collation Test and Possible Conformance Test Bug")

On Sun, May 13, 2012 at 9:08 PM, Richard Wordingham <richard.wordingham@ntlworld.com> wrote:
> > I believe I have found a bug in the conformance test for the shifted
> > variable weighting.  The test expects <U+0FB2 TIBETAN SUBJOINED LETTER
> > RA, U+0F81 TIBETAN VOWEL SIGN REVERSED II, U+003F QUESTION MARK> to
> > order before <U+0F76 TIBETAN VOWEL SIGN VOCALIC R, U+0F71 TIBETAN VOWEL
> > SIGN AA, U+0334 COMBINING TILDE OVERLAY>
> > (and similarly <U+0FB3, U+0F81, U+003F> to order
> > before <U+0F78, U+0F71, U+0334>).
(Here is my own analysis.)

[The two adjacent lines from the "shifted" conformance test files:]

a) 0FB2 0F81 003F
b) 0F76 0F71 0334

where 0F76 == 0FB2 0F80
so when we apply NFD we get

a) 0FB2 0F81 003F
b) 0FB2 0334 0F71 0F80
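
The NFD form of line b) can be reproduced with a short sketch using Python's standard unicodedata module. The helper name hex_seq is hypothetical; the result depends on the Unicode data bundled with the interpreter, although the decomposition and combining classes involved here are long-standing.

```python
import unicodedata

def hex_seq(s):
    """Format a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(c):04X}" for c in s)

# Line b) of the test file: 0F76 0F71 0334
b = "\u0F76\u0F71\u0334"
nfd_b = unicodedata.normalize("NFD", b)
print(hex_seq(nfd_b))

# Show the canonical combining class of each character after NFD,
# which is what drives the reordering of the marks.
for c in nfd_b:
    print(f"{ord(c):04X} ccc={unicodedata.combining(c)}")
```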

allkeys.txt has the following contractions:
0FB2 0F81 ; [.2578.0020.0002.0F77]
0F71 0F80 ; [.2574.0020.0002.0F81]

and since there is no contraction of 0FB2+0F71, we don't get any discontiguous-contraction match.

So it looks like the collation elements would be
a) [.2578.0020.0002.0F77] [*0263.0020.0002.003F]
b) [.255A.0020.0002.0FB2] [.0000.007C.0002.0334] [.2574.0020.0002.0F81]

therefore, a) > b)

[...]

I assume that Mark's program to generate the conformance test files is adding
the two contractions that are missing from the DUCET: 0FB2+0F71 and 0FB3+0F71,
see our doc L2/12-131R.

When 0FB2+0F71 exists, then the whole 0FB2+0F71+0F80 will be found as well,
and then b) yields [.2578.0020.0002.0F77] [.0000.007C.0002.0334] and thus a) <
b).
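
The two outcomes above can be checked with a simplified sketch of sort-key comparison under the "shifted" option. It is not a full UCA implementation: it builds only the first three levels, omits the fourth level and the rule for ignorables that follow a variable element, and uses tuple comparison to stand in for the level separator. The function name and the tuple layout are assumptions made for illustration; the weights are copied from the collation elements quoted above.

```python
def shifted_key(elements):
    """Build a simplified 3-level key.

    Each element is (is_variable, primary, secondary, tertiary).
    Under 'shifted', variable elements are zeroed on levels 1-3;
    zero weights are dropped from the key.
    """
    levels = ([], [], [])
    for variable, p, s, t in elements:
        if variable:
            p = s = t = 0
        for level, weight in zip(levels, (p, s, t)):
            if weight:
                level.append(weight)
    return tuple(tuple(level) for level in levels)

# a) [.2578.0020.0002] [*0263.0020.0002]
a = [(False, 0x2578, 0x0020, 0x0002), (True, 0x0263, 0x0020, 0x0002)]
# b) without the 0FB2+0F71 contraction
b_without = [(False, 0x255A, 0x0020, 0x0002),
             (False, 0x0000, 0x007C, 0x0002),
             (False, 0x2574, 0x0020, 0x0002)]
# b) with the contraction present
b_with = [(False, 0x2578, 0x0020, 0x0002),
          (False, 0x0000, 0x007C, 0x0002)]

print(shifted_key(a) > shifted_key(b_without))  # primary 2578 vs 255A
print(shifted_key(a) < shifted_key(b_with))     # decided at the secondary level
```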

Date/Time: Fri May 18 23:14:50 CDT 2012
Contact: markus.icu@gmail.com
Name: Markus Scherer
Report Type: Error Report
Opt Subject: PropertyValueAliases.txt wrong for ccc=132

NOTE: This was discussed by the officers.


http://www.unicode.org/Public/6.1.0/ucd/PropertyValueAliases.txt
has:

ccc; 132; CCC133                     ; CCC133

I request that this be fixed without retaining a compatibility alias for "CCC133".
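
A check of the following kind could catch such entries mechanically. This is only a sketch: it assumes that an alias of the form CCCnnn is meant to repeat the numeric ccc value on the same line, the function name is hypothetical, and the first sample line is included purely as an illustrative well-formed entry.

```python
import re

def mismatched_ccc_aliases(lines):
    """Return (value, alias) pairs where a CCCnnn alias disagrees
    with the numeric ccc value on the same line."""
    bad = []
    for line in lines:
        line = line.split("#", 1)[0]          # drop trailing comments
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 4 or fields[0] != "ccc":
            continue
        value = int(fields[1])
        for alias in fields[2:]:
            m = re.fullmatch(r"CCC(\d+)", alias)
            if m and int(m.group(1)) != value:
                bad.append((value, alias))
    return bad

sample = [
    "ccc; 130; CCC130                     ; CCC130",
    "ccc; 132; CCC133                     ; CCC133",
]
print(mismatched_ccc_aliases(sample))
```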

Date/Time: Mon May 21 21:47:25 CDT 2012
Contact: samjnaa@gmail.com
Name: Shriramana Sharma
Report Type: Other Question, Problem, or Feedback
Opt Subject: Annotation to 0024 DOLLAR SIGN


Currently 0024 DOLLAR SIGN has an annotation:

• Other currency symbol characters 20A0 ₠ - 20B9 ₹

First of all, none of the other currency symbols in the ASCII set have an
annotation pointing to the larger set of currency symbols. I don't see why
only the dollar should have such an annotation.

Even if there is a need, just point the readers to the currency symbols block.
That way you will not have to update this item each time a currency symbol is
added. (Now the Turkish Lira sign is going to be published.)

Date/Time: Fri May 25 21:48:34 CDT 2012
Contact: stevendaniels88@gmail.com
Name: Steven Daniels
Report Type: Error Report
Opt Subject: Unihan Database: Codepoint missing kMandarin field


UnihanReadings.txt
U+3402 㐂 is missing pinyin fields. 
kMandarin should be xǐ

source: http://www.zdic.net/zd/zi2/ZdicE3Zdic90Zdic82.htm


The following codepoint is also missing pinyin, and I've been unable to find pinyin for it.
U+3427 㐧

Preliminary response from John Jenkins 2012-05-26: Well, this doesn't have a Mandarin reading because it isn't a Chinese character, it's Japanese. None of our Chinese sources or dictionaries include it. xǐ is arguably what its reading *would* be if it *were* Chinese, but we don't have a general policy for such cases.

In any event, we need to treat changes to kMandarin as UTC issues, so this should just be added to a general feedback document.

Date/Time: Fri Jun 1 16:07:37 CDT 2012
Contact: asmus@unicode.org
Name:
Report Type: Error Report
Opt Subject: Inconsistent documentation of Unihan sources


There's an inconsistency between the code charts and the UCD Database.

For example, for U+5655, the kIRG-GSource is listed as G1-7D69 in the Unihan
Database for 6.1, but as GHZ-10685.05 in the code charts for 6.1.

This is probably not the only such discrepancy.

The problem seems to be that there's a breakdown in the process of tracking
source updates. Note that the Unihan DB for version 6.1 is fully 5 months older
than the rest of the UCD (per date stamps on zip file).



Date/Time: Tue May 29 16:01:38 CDT 2012
Contact: rswihananto@gmail.com
Name: R.S. Wihananto
Report Type: Error Report
Opt Subject: Alias of U+A980 JAVANESE SIGN PANYANGGA


The informative alias for U+A980 JAVANESE SIGN PANYANGGA in the code chart
should be "candrabindu", not "ardhacandra" as in the current code chart. The
proof of this claim:

* In Proposal for Encoding the Javanese Script in the UCS
(http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3319.pdf), section 2.4, Michael
Everson wrote that the characters PANYANGGA, CECAK, and WIGNYAN are
analogues to Devanagari CANDRABINDU, ANUSVARA, and VISARGA.

Also, The Unicode Standard chapter 11 Southeast Asian Scripts, page 400, says
that the characters U+A980 JAVANESE SIGN PANYANGGA, U+A981 JAVANESE SIGN CECAK,
and U+A983 JAVANESE SIGN WIGNYAN are analogues to U+0901 DEVANAGARI SIGN
CANDRABINDU, U+0902 DEVANAGARI SIGN ANUSVARA, and U+0903 DEVANAGARI SIGN
VISARGA.

In the Javanese code chart, U+A981 CECAK and U+A983 WIGNYAN are correctly
labeled with alias "anusvara" and "visarga" respectively. But U+A980 PANYANGGA
is incorrectly labeled with alias "ardhacandra". It should be "candrabindu".

* The Javanese and Balinese scripts are closely related. The Balinese counterpart
of U+A980 JAVANESE SIGN PANYANGGA is U+1B01 BALINESE SIGN ULU CANDRA. U+1B01
ULU CANDRA has alias "candrabindu". Both PANYANGGA and ULU CANDRA are used
to write the sacred Hindu syllable "OM". So the alias of U+A980 PANYANGGA
should be corrected to "candrabindu".

Date/Time: Wed Jun 13 20:05:29 CDT 2012
Contact: katmomoi@gmail.com
Name: Kat Momoi
Report Type: Error Report
Opt Subject: kTotalStrokes info on U+8303


In Unihan-6.2.0d1/Unihan_DictionaryLikeData.txt,

the character U+8303 (范) is listed as having the following stroke count:

U+8303 kTotalStrokes 8

This is generally considered to be the stroke count for Simplified Chinese
where the grass radical is counted as "3". In Traditional Chinese, it is
customary to count the grass radical as "4".

The 3 stroke radical is: 艹 (U+8279)
The 4 stroke radical is: 艹 (U+FA5E)

Though this is usually not used in counting strokes, the older form of this 
radical is actually: 艸 (U+8278) 6 strokes

My ask is that we revise the data in "Unihan-6.2.0d1/Unihan_DictionaryLikeData.txt" to:

U+8303 kTotalStrokes 8 9

"8" followed by {sp} followed by "9"

to indicate that the Simplified count of this character is "8" but 
the traditional count is "9".
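
As a rough illustration of how a consumer might read the proposed two-value form, the sketch below parses a kTotalStrokes record with one or two counts. The reading "first value is the Simplified count, last value is the Traditional count" is the interpretation proposed in this report, not an established rule, and the function name is hypothetical.

```python
def parse_total_strokes(line):
    """Parse a record such as 'U+8303<TAB>kTotalStrokes<TAB>8 9'.

    Assumption (from this report): the first count is the Simplified
    stroke count and the last is the Traditional count; a single
    value serves for both.
    """
    code_point, prop, *counts = line.split()
    if prop != "kTotalStrokes" or not counts:
        raise ValueError(f"not a kTotalStrokes record: {line!r}")
    counts = [int(c) for c in counts]
    return {
        "char": chr(int(code_point[2:], 16)),
        "simplified": counts[0],
        "traditional": counts[-1],
    }

print(parse_total_strokes("U+8303\tkTotalStrokes\t8 9"))
print(parse_total_strokes("U+8303\tkTotalStrokes\t8"))
```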

Date/Time: Fri Jun 22 07:48:11 CDT 2012
Contact: kent.karlsson14@telia.com
Name: Kent Karlsson
Report Type: Error Report
Opt Subject: cuneiform glyph error


The chart glyphs for U+12423-U+12432 have the lower right wedge pointing in
the wrong direction. See
http://www.cdli.ucla.edu/tools/SignLists/KWU/HTML/HP0126.html .

The same is likely for U+12072 and U+121A0-U+121A3.

Chart glyphs for other characters whose glyphs are based on HI seem ok.

Date/Time: Fri Jun 22 14:01:22 CDT 2012
Contact: richard.wordingham@ntlworld.com
Name: Richard Wordingham
Report Type: Error Report
Opt Subject: Collation of LAO LETTER KHMU GO


When U+0EDE LAO LETTER KHMU GO was added to Unicode 6.1.0, the corresponding 
addition to the Default Unicode Collation Element Table was incomplete.  There 
is no reason to doubt that this consonant, like most Lao consonants, can 
co-occur with 'logical order exception' vowels, and therefore there should 
be reversing contractions for U+0EDE and these vowels. 
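
For illustration, the code point sequences that such reversing contractions would cover can be listed with a small sketch. It assumes the relevant vowels are the Lao characters U+0EC0..U+0EC4, which carry the Logical_Order_Exception property; the function name is hypothetical, and no collation weights are generated, only the <vowel, consonant> sequences as they would appear in text.

```python
# Lao vowels written before the consonant (Logical_Order_Exception).
LAO_PREVOWELS = range(0x0EC0, 0x0EC5)  # U+0EC0..U+0EC4

def reversing_contraction_sequences(consonant):
    """List '<vowel> <consonant>' sequences, in stored order, that a
    reversing contraction would need to match for this consonant."""
    return [f"{vowel:04X} {consonant:04X}" for vowel in LAO_PREVOWELS]

for seq in reversing_contraction_sequences(0x0EDE):
    print(seq)
```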
 

Date/Time: Mon Jul 2 09:25:44 CDT 2012
Contact: corbett.dav@husky.neu.edu
Name: David Corbett
Report Type: Error Report
Opt Subject: Typo in alias for U+1110E


In the Unicode 6.1 chart for Chakma
(http://www.unicode.org/charts/PDF/U11100.pdf) the alias for
U+1110E CHAKMA LETTER JAA is "dvipadalaa haa". Shouldn't that
be "dvipadalaa jaa"?


Other Reports

Date/Time: Thu Jul 19 23:29:09 CDT 2012
Contact: khw@cpan.org
Name: Karl Williamson
Report Type: Other Question, Problem, or Feedback
Opt Subject: New TR18 uses non-standard character name abbrs.


I don't think you should use abbreviations for code point names in an official 
Unicode document that aren't official name aliases.  I was reading the just 
released TR18 rev. 15, and noticed something I hadn't noticed before.  
Sorry.  It introduces the abbreviations PS (U+2029) and LS (U+2028).  
I believe these either should be added to NameAliases.txt if these are 
commonly used, or the full names should be spelled out in TR18, if not.