L2/12-160

Comments on Public Review Issues
(February 6, 2012 - May 4, 2012)

The sections below contain comments received on the open Public Review Issues and other feedback as of May 04, 2012, since the previous cumulative document was issued prior to UTC #130 (February 2012). This document does not include feedback on moderated Public Review Issues from the forum that have been digested by the forum moderators; those are in separate documents for each of the PRIs. Gray items in the Table of Contents do not have feedback here.

Contents:

182 Proposed Update UTS #18: Unicode Regular Expressions
207 Proposed Draft UTR #50, Unicode Properties for Vertical Text Layout  (moderated)
208 Proposed Update UTR #36: Unicode Security Considerations
209 Proposed Update UTS #39: Unicode Security Mechanisms
Feedback on Encoding Proposals
Closed Public Review Issues
Other Reports
Assamese


182 Proposed Update UTS #18: Unicode Regular Expressions

See also L2/12-162 and L2/12-187

Date/Time: Sun May 6 23:42:09 CDT 2012
Contact: unicode@norbertlindenberg.com
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: Proposed Update UTS #18 has incorrect example

The proposed update for UTS 18, Unicode Regular Expressions, section 1.5, 
Simple Loose Matches, includes an example showing the expansion of /Dåb/ 
into /(?:d|D)(?:å|Å|Å)(?:b|B)/ . There's no need to repeat Å in the 
expansion; I assume that instead Å (U+212B ANGSTROM SIGN), or more clearly \u212B, is meant 
since it also has å as its lower case mapping.
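
The set of characters involved can be checked mechanically. The following Python sketch lists every code point whose lowercase mapping is å, relying on the Unicode data bundled with the interpreter; the helper name `lowercases_to` is hypothetical and is not part of UTS #18.

```python
import sys


def lowercases_to(target: str) -> list[str]:
    """Return 'U+XXXX' labels for code points whose lowercase form is `target`."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        if chr(cp).lower() == target:
            hits.append(f"U+{cp:04X}")
    return hits


if __name__ == "__main__":
    # The result should include U+00C5 and U+212B alongside U+00E5 itself.
    print(lowercases_to("\u00e5"))
```

The three alternatives in the expansion would then be U+00E5, U+00C5 and U+212B, which look alike in most fonts but are distinct code points.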

Date/Time: Mon May 7 00:28:43 CDT 2012
Contact: unicode@norbertlindenberg.com
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: Proposed Update UTS #18 is unclear on default case conversion


The proposed update for UTS 18, Unicode Regular Expressions, 
section 2.4, Default Case Conversion, is not very clear on how 
full caseless matches are supposed to be handled in different situations.

The guidance provided seems to cover only the case of literals 
within patterns. It's not clear how, say, a class such as  /[äöüß]/i 
should be handled. Full mapping of "ß" results in "SS", but a 
two-letter string cannot be a member of a set of characters. So, 
should the "SS" be quietly dropped in this case (as the ICU implementation 
does)? Or should the class be rewritten as /(?:ä|ö|ü|ss)/i ? Going further, 
should /[a-ß]/i result in an error, or what does it mean?
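
The two readings can be compared with a small Python sketch. It assumes Python's `re` module, which performs simple (one-to-one) case-insensitive matching, and uses `str.casefold()` for full case folding; the helper `full_fold_alternation` is hypothetical and only illustrates the rewrite asked about above.

```python
import re

CLASS_MEMBERS = ["ä", "ö", "ü", "ß"]


def full_fold_alternation(members):
    """Build a non-capturing alternation from the full case folding of each member."""
    # Longest alternatives first, then code point order, so the pattern is stable.
    folded = sorted({m.casefold() for m in members}, key=lambda f: (-len(f), f))
    return "(?:" + "|".join(re.escape(f) for f in folded) + ")"


char_class = re.compile("[äöüß]", re.IGNORECASE)
alternation = re.compile(full_fold_alternation(CLASS_MEMBERS), re.IGNORECASE)

if __name__ == "__main__":
    print(bool(char_class.fullmatch("SS")))   # a class consumes exactly one character
    print(bool(alternation.fullmatch("SS")))  # the alternation can consume two
    print(bool(alternation.fullmatch("Ä")))
```

In this sketch the alternation accepts "SS" while the character class cannot. Matching a literal "ß" in the input against the folded alternation would additionally require folding the input text, which the sketch does not do.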

Date/Time: Mon May 7 11:39:36 CDT 2012
Contact: khw@cpan.org
Name: Karl Williamson
Report Type: Public Review Issue
Opt Subject: tr18

I was re-reading the draft, and noticed this minor problem 
that I had overlooked:

In section 2.5, it has these:

\p{HANGUL SYLLABLE GAG}
\p{BEL}
\p{BELL}

Did you mean to suggest that all character names should be 
considered properties?  I had never noticed anything like this 
before, and I worry about the possibility of collisions.  
Perl uses e.g., \N{BELL} to specify character names.

207 Proposed Draft UTR #50, Unicode Properties for Vertical Text Layout (moderated)

See the relevant forum. One item was received on the reporting form, see below.

Date/Time: Tue Mar 20 17:35:23 CDT 2012
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/12-102 Updated Proposal to Revise UTR #50

Note: A reply was already sent to John, pointing him to the forum.


Two issues:

The document uses the term "stacked" for horizontal cursive scripts
(Arabic, Syriac, etc.) written vertically so as to be read top-to-bottom.
This style is different from default vertical positioning, but conflating
it with the use of unrotated glyphs in horizontal non-cursive scripts
(Latin, Greek, etc.) is IMHO more confusing than helpful.

Something also needs to be said about Ogham in section 4. The tables
correctly give it an orientation property of Rotatable-only, but don't
mention that it is written bottom-to-top, and therefore Ogham embedded
in vertical scripts requires bidi handling in all cases.


208 Proposed Update UTR #36: Unicode Security Considerations

Date/Time: Wed Feb 22 21:55:41 CST 2012
Contact: jamadagni@gmail.com
Name: Shriramana Sharma
Report Type: Other Question, Problem, or Feedback
Opt Subject: Telugu confusables

Note: This was already sent to the editorial committee.


I notice in the latest meeting minutes:

A.5.2 Action item review.

[130-A1] Action Item for Lisa Moore: Follow up with Andhra Pradesh
on action 125-A17.

[130-A2] Action Item for Eric Muller: Take info for Indic TR and turn
into a document for the doc register.

Where 125-A17 is:

South Asian Subcommittee — TELUGU LENGTH MARK (D.3.1)

[125-A17] Action Item for Manoj Jain: Work with Andhra Pradesh Gov't to
determine what additional clarifications and annotations may be required
for the Telugu script. L2/10-339

[125-A18] Action Item for Eric Muller, Julie Allen, Editorial Committee:
Look for cases to be added to the confusable vowel representation tables
in the Indic chapter(s) for Unicode 6.0. Look at document L2/10-339 Telugu,
and other cases where documentation could be improved.

Since I was the one who submitted the document L2/10-339 requesting
deprecation of Telugu Length Mark, let me just give the list of confusables
I had in mind.  
VS-II  ీ = VS-I ి + LM ౕ 
VS-EE  ే = VS-E ె + LM ౕ
VS-OO ో = VS-O  ొ + LM ౕ
HA హ + VS-AA ా -> HAA హా = HA హ + LM ౕ

(VS = vowel sign; LM = length mark)

The people with the Action Item can incorporate this into what they write.

[Submitted via the form as per offlist suggestion of Markus Scherer to
ensure it doesn't get forgotten.]



209 Proposed Update UTS #39: Unicode Security Mechanisms

No feedback at this time.


Feedback on Encoding Proposals

Date/Time: Tue May 1 21:51:47 CDT 2012
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/12-123 Proposal to Encode the Sign SIDDHAM for Devanagari


Since this sign is the same in form and function as its Tibetan look-alike
U+0FD3, I think the two should be unified, provided that the Tibetan sign
actually means the same thing (I can't find information about this).  It's a
little strange to incorporate a Tibetan character into Devanagari fonts, but
it does not seem to require any special Tibetan support.  U+0FD3 is Po rather
than So, but as we know that is not a hard and fast distinction.

Date/Time: Tue May 1 21:57:13 CDT 2012
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/12-124 Proposal to Encode Signs for Writing Kashmiri in Sharada


Since the proposed SHARADA SIGN NUKTA has exactly the same form, function, and
properties as the Devanagari version, I think unification should be strongly
considered.  In the words of the proposal, "these signs were used by Kashmiri
scribes in both Sharada and Devanagari", which implies that the Sharada sign
is borrowed from Devanagari.  In general, when a character is borrowed from a
related script, we don't double-encode it unless its range of forms in the
borrowing script are outside the bounds of the lending script, as with the
Kurdish Q.

The other two marks also have Devanagari look-alikes, but clearly don't share
function with them, so they should be encoded.

Date/Time: Wed May 2 10:10:24 CDT 2012
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/12-011 Preliminary Proposal to Encode Siddham in ISO/IEC 10646


The same issue I raised about DEVANAGARI SIGN SIDDHAM applies here also: this
should be unified with Tibetan U+0FD3, provided the semantics is the same.

Failing that it should at least be unified with the Devanagari sign, since
there is plenty of precedent for sharing Devanagari punctuation/symbols with
other Indic scripts.


Other Reports

Date/Time: Tue Feb 7 14:10:56 CST 2012
Contact: unicode@farah.cl
Name: Miguel Farah
Report Type: Error Report
Opt Subject: Clarifications suggested for the DOLLAR SIGN and PESO SIGN code points.

Note: This was already sent to the editorial committee.


I'd like to suggest the following clarifications in the Unicode Names List:

1) To avoid confusion between the Latin-American Peso currencies and the
Filipino currency, add an alias "Filipino Peso Sign" to U+20B1.

2) Modify the comment for the U+20B1 code point to state something like
"Extant and discontinued Latin-American Peso currencies (Mexican, Chilean,
Colombian, Dominican, etc.) use the dollar sign.".

3) Change the spelling from "milreis" to "milréis" in the informative
aliases for U+0024.

4) Add a comment to U+0024 along the lines of "The dollar symbol is used for
many peso currencies in Latin America and elsewhere, except U+20B1, which is
used for the Philippine peso.".

For rationale and background for this request, please see the Unicode Forum
Discussion at http://www.unicode.org/forum/viewtopic.php?f=21&t=261 .

Please also use the provided background information to extend the description in
Chapter 15, Currency Symbols, where neither Dollar nor Peso (Philippine) is
currently discussed explicitly, while Yen/Yuan is.

Thank you.

Date/Time: Mon Feb 27 18:39:27 CST 2012
Contact: roozbeh@google.com
Name: Roozbeh Pournader
Report Type: Error Report
Opt Subject: U+0342 Combining Greek Perispomeni needs more info

Note: This was already sent to the editorial committee.


I was looking at the charts and just discovered U+0342 COMBINING GREEK
PERISPOMENI. It really confused me; I thought a glyph error had found
its way into the charts.

I think it would be a good idea if some minor explanation is added to
the NamesList, together with a reference to U+0303 COMBINING TILDE.

Date/Time: Thu Mar 8 09:59:07 CST 2012
Contact: loic.etienne@tech.swisssign.com
Name: Loïc Etienne
Report Type: Submission (FAQ, Tech Note, Case Study)
Opt Subject: Annex #15: Function composition rules

Note: This was already sent to the editorial committee.


http://unicode.org/reports/tr15/ , 7 Design Goals, 7.2 Stability
could state explicitly:

Compatibility NF is stronger than canonical NF:
 * toNFC(toNFKC(x)) = toNFKC(x)
 * toNFD(toNFKD(x)) = toNFKD(x)

More generally, compatibility is absorbing:
 * toNFC(toNFKD(x)) = toNFKC(x)
 * toNFD(toNFKC(x)) = toNFKD(x)
 * toNFKC(x) = toNFKC(toXXX(x))
 * toNFKD(x) = toNFKD(toXXX(x))
where toXXX is any of toNFD, toNFKD, toNFC, toNFKC.
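
These identities can be spot-checked with Python's standard `unicodedata.normalize`. The sketch below is a check over a handful of sample strings chosen for illustration, not a proof; the names `norm`, `identities_hold` and `SAMPLES` are hypothetical.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def norm(form: str, text: str) -> str:
    return unicodedata.normalize(form, text)


def identities_hold(x: str) -> bool:
    """Check the composition identities listed above for one string."""
    checks = [
        norm("NFC", norm("NFKC", x)) == norm("NFKC", x),
        norm("NFD", norm("NFKD", x)) == norm("NFKD", x),
        norm("NFC", norm("NFKD", x)) == norm("NFKC", x),
        norm("NFD", norm("NFKC", x)) == norm("NFKD", x),
    ]
    # Compatibility normalization absorbs any earlier normalization step.
    for form in FORMS:
        checks.append(norm("NFKC", norm(form, x)) == norm("NFKC", x))
        checks.append(norm("NFKD", norm(form, x)) == norm("NFKD", x))
    return all(checks)


# Ligature, ANGSTROM SIGN, e + combining acute, circled digit, Hangul syllable,
# and a long s with dot above followed by a dot below.
SAMPLES = ["\ufb01", "\u212b", "e\u0301", "\u2460", "\ud55c", "\u1e9b\u0323"]

if __name__ == "__main__":
    for s in SAMPLES:
        print(ascii(s), identities_hold(s))
```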

Date/Time: Fri Mar 30 17:11:10 CDT 2012
Contact: fantasai.lists@inkedblade.net
Name:
Report Type: Error Report
Opt Subject: Turkish casing applies also to crh/tt/ba


Mozilla received a report that the Turkish casing rules also apply
to Crimean Tatar (crh), Volga Tatar (tt), and Bashkir (ba):
  https://bugzilla.mozilla.org/show_bug.cgi?id=231162#c17
If so, the Unicode SpecialCasing.txt file needs updating.
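
What the requested change would mean for an implementation can be sketched in Python, which has no built-in language-sensitive casing. The function names and the tag set below are hypothetical; `tr` and `az` are the language conditions already used in SpecialCasing.txt, while `crh`, `tt` and `ba` are included only to illustrate the report's claim. The sketch handles the single-character dotted and dotless i mappings and ignores sequences with U+0307 COMBINING DOT ABOVE.

```python
# "tr" and "az" are conditioned in SpecialCasing.txt; "crh", "tt" and "ba" are
# the report's claim and are an assumption here, not confirmed Unicode data.
TURKIC_TAGS = {"tr", "az", "crh", "tt", "ba"}


def lower(text: str, lang: str) -> str:
    """Lowercase with dotted/dotless i handling for the assumed tag set."""
    if lang in TURKIC_TAGS:
        # Map the special pairs first, then fall back to default lowercasing.
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()


def upper(text: str, lang: str) -> str:
    """Uppercase with dotted/dotless i handling for the assumed tag set."""
    if lang in TURKIC_TAGS:
        text = text.replace("i", "\u0130").replace("\u0131", "I")
    return text.upper()


if __name__ == "__main__":
    print(lower("DIYARBAKIR", "tr"))
    print(lower("DIYARBAKIR", "en"))
    print(upper("i", "tt"))
```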

Date/Time: Tue Apr 3 16:52:11 CDT 2012
Contact: petercon@microsoft.com
Name: Peter Constable
Report Type: Error Report
Opt Subject: EastAsianWidth properties for new Hangul jamo

Note: This was already sent to the editorial committee.


When new Hangul characters were added in Unicode 5.2, it appears that they
were all given an EastAsianWidth property value of W. This is the case
regardless of the type of jamo. But that is not consistent with properties
that were assigned to jamo that predate TUS 5.2: choseong characters
(1100..1159) were given a width value W, but jungseong (1160..11A2) and
jongseong (11A8..11F9) were given a width value N. Thus, all of the newer
jungseong and jongseong characters have different width values than the
older jungseong and jongseong characters.

Unless there was a specific reason for setting these characters to W,
I suggest that the following have their East Asian Width values set to
N: 11A3..11A7, 11FA..11FF, D7B0..D7FB.
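
The values in question can be inspected with Python's `unicodedata.east_asian_width`. This is only a sketch: the result depends on the Unicode version bundled with the interpreter, so it may not reproduce the 5.2 values described above, and the helper `widths` and the `RANGES` labels are hypothetical.

```python
import unicodedata


def widths(first: int, last: int) -> set[str]:
    """Collect the East_Asian_Width values found in an inclusive code point range."""
    return {unicodedata.east_asian_width(chr(cp)) for cp in range(first, last + 1)}


# Ranges taken from the report. A range may include unassigned code points,
# which report a default value.
RANGES = {
    "older choseong 1100..1159": (0x1100, 0x1159),
    "older jungseong 1160..11A2": (0x1160, 0x11A2),
    "older jongseong 11A8..11F9": (0x11A8, 0x11F9),
    "newer 11A3..11A7": (0x11A3, 0x11A7),
    "newer 11FA..11FF": (0x11FA, 0x11FF),
    "newer D7B0..D7FB": (0xD7B0, 0xD7FB),
}

if __name__ == "__main__":
    for label, (first, last) in RANGES.items():
        print(label, sorted(widths(first, last)))
```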

Date/Time: Sat Apr 28 16:33:02 CDT 2012
Contact: richard.wordingham@ntlworld.com
Name: Richard Wordingham
Report Type: Other Question, Problem, or Feedback
Opt Subject: Storage Order of Decimal Digits


There is no declared policy on the storage sequence of decimal digits,
i.e. characters with general category Nd.  What is currently done could
be summed up as:

'The Bidi class of decimal digits shall be such that a sequence of digits
from the same set of 10 contiguous character points shall be stored in
order of decreasing significance when representing a number'.

This could be included in the stability guarantee at
http://www.unicode.org/policies/property_value_stability_table.html .

At present, all decimal digits have the Bidi class EN, AN or L except for
the N'ko decimal digits, which have the Bidi class R.  If this principle
were violated, a 'simplistic parser' could misinterpret values of digit
sequences. (Not that it would be likely to get the prime number 25₁₆ right either!)

The guarantee, converted to a statement of practice, could reasonably be
included in the TUS section on 'Numeric Value', currently Section 4.6.

It would be good to say there that this principle is and will generally be
followed for characters that primarily function similarly to 'decimal digits',
e.g. for other radices or for derived characters such as superscript numerals.
(The word 'primarily' allows the principle to be ignored for letters also
used as digits.)
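
The kind of 'simplistic parser' that depends on this storage order can be sketched in Python as follows. The function `parse_decimal` is hypothetical; it assumes the stored sequence runs from most to least significant digit and, as a simplification, does not check that all digits come from the same set of ten.

```python
import unicodedata


def parse_decimal(digits: str) -> int:
    """Interpret Nd characters in storage order, most significant digit first."""
    value = 0
    for ch in digits:
        if unicodedata.category(ch) != "Nd":
            raise ValueError(f"not a decimal digit: {ch!r}")
        value = value * 10 + unicodedata.decimal(ch)
    return value


if __name__ == "__main__":
    print(parse_decimal("123"))                 # ASCII digits
    print(parse_decimal("\u0661\u0662\u0663"))  # Arabic-Indic digits
    print(parse_decimal("\u07c1\u07c2"))        # N'ko digits
    print(unicodedata.bidirectional("\u07c1"))  # Bidi class of N'ko DIGIT ONE
```

Such a parser gives the intended value for N'ko only because the digits are stored in decreasing significance even though their Bidi class differs from that of the other digit sets.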

Date/Time: Wed May 2 13:52:23 CDT 2012
Contact: unicode@norbertlindenberg.com
Name: Norbert Lindenberg
Report Type: Error Report
Opt Subject: UTS 10: Case level should be between primary and secondary level


Section 5.1, Parametric Tailoring, of UTS 10 describes caseLevel as "If set to
on, a level consisting only of case characteristics will be inserted in front
of tertiary level. To ignore accents but take cases into account, set strength
to primary and case level to on."

I think "in front of tertiary level" should really be "between primary and
secondary level". "In front of tertiary level" is normally interpreted as
"between secondary and tertiary level", but then it would still distinguish
based on accents.
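
The difference between the two placements can be illustrated with a toy Python sort key. This is not the UCA algorithm: the `primary`, `secondary` and `case_level` functions below are hypothetical stand-ins that use base letters, combining marks and uppercase flags as the three kinds of weight.

```python
import unicodedata


def primary(text: str) -> str:
    """Base letters only: drop combining marks and case."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def secondary(text: str) -> tuple:
    """Accent information: the combining marks in order."""
    decomposed = unicodedata.normalize("NFD", text)
    return tuple(c for c in decomposed if unicodedata.combining(c))


def case_level(text: str) -> tuple:
    """Case information: one uppercase flag per character."""
    return tuple(c.isupper() for c in unicodedata.normalize("NFC", text))


def key_primary_plus_case(text: str) -> tuple:
    # Strength primary with case level on: case follows primary directly.
    return (primary(text), case_level(text))


def key_case_before_tertiary(text: str) -> tuple:
    # Literal "in front of tertiary" reading: accents still take part.
    return (primary(text), secondary(text), case_level(text))


if __name__ == "__main__":
    words = ["resume", "r\u00e9sum\u00e9", "Resume"]
    for w in words:
        print(w, key_primary_plus_case(w), key_case_before_tertiary(w))
```

With the first key, "resume" and "résumé" compare equal while "Resume" differs, which is the behavior the quoted text describes. With the second key, the accented and unaccented forms are still distinguished.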

I reported the same issue against UTS 35 as http://unicode.org/cldr/trac/ticket/4698 

Date/Time: Sat May 5 09:41:27 CDT 2012
Contact: kent.karlsson14@telia.com
Name: Kent Karlsson
Report Type: Error Report
Opt Subject: Numeric values for Cuneiform digits


Numeric values for Cuneiform digits (or digit parts).

For some Cuneiform characters the numeric value is missing or wrong.

Here is a list of proposed corrections.

Note that the character for 20 is currently missing in Unicode.

12079;CUNEIFORM SIGN DISH;Lo;0;L;;;;;N;;;;;			-> 1
1222B;CUNEIFORM SIGN MIN;Lo;0;L;;;;;N;;;;;			-> 2

1230B;CUNEIFORM SIGN U;Lo;0;L;;;;;N;;;;;			-> 10
								...20
1230D;CUNEIFORM SIGN U U U;Lo;0;L;;;;;N;;;;;			-> 30
1240F;CUNEIFORM NUMERIC SIGN FOUR U;Nl;0;L;;;;4;N;;;;;		4 -> 40
12410;CUNEIFORM NUMERIC SIGN FIVE U;Nl;0;L;;;;5;N;;;;;		5 -> 50
12411;CUNEIFORM NUMERIC SIGN SIX U;Nl;0;L;;;;6;N;;;;;		6 -> 60
12412;CUNEIFORM NUMERIC SIGN SEVEN U;Nl;0;L;;;;7;N;;;;;		7 -> 70
12413;CUNEIFORM NUMERIC SIGN EIGHT U;Nl;0;L;;;;8;N;;;;;		8 -> 80
12414;CUNEIFORM NUMERIC SIGN NINE U;Nl;0;L;;;;9;N;;;;;		9 -> 90
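
The currently assigned values for the characters listed above can be inspected with Python's `unicodedata.numeric`. The sketch reports whatever the Unicode version bundled with the interpreter contains, which may differ from the data quoted in this report; the helper `numeric_values` is hypothetical.

```python
import unicodedata

# Code points taken from the list above.
CODE_POINTS = [
    0x12079, 0x1222B, 0x1230B, 0x1230D,
    0x1240F, 0x12410, 0x12411, 0x12412, 0x12413, 0x12414,
]


def numeric_values(code_points):
    """Map each code point to (name, Numeric_Value or None) from the bundled data."""
    result = {}
    for cp in code_points:
        ch = chr(cp)
        name = unicodedata.name(ch, "<unnamed>")
        value = unicodedata.numeric(ch, None)  # None when no numeric value is assigned
        result[cp] = (name, value)
    return result


if __name__ == "__main__":
    for cp, (name, value) in numeric_values(CODE_POINTS).items():
        print(f"{cp:04X} {name}: {value}")
```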

There was also a notion, and glyph(s), for the digit 0 (even if not the concept 0). See http://www.jstor.org/discover/10.2307/593904?uid=3738984&uid=2129&uid=2&uid=70&uid=4&sid=21100772121331, http://en.wikipedia.org/wiki/Babylonian_numerals#Numerals, http://gwydir.demon.co.uk/jo/numbers/babylon/index.htm. I don't dare a guess, here, as to which character(s), if any of the currently encoded ones, that is/are.


Assamese

Date/Time: Sun Apr 1 22:48:13 CDT 2012
Contact: azihaque@yahoo.co.in
Name: Aziz-ul Haque
Report Type: Error Report
Opt Subject: Place of Assamese

Note: This was already sent to the editorial committee.


Dear Sir/Madam
Would you please inform me about the latest position of Assamese writing
system in Unicode? Earlier the Unicode said, Bengali script is used in
writing Assamese. We disagree, since we have our own script that has a
history of 1500 years and from which developed Bengali and Maithili.
Moreover, at least 15 characters of Assamese are different from modern
Bengali. With all documentary evidences and our state government's approval
we have been requesting the Unicode to provide a separate slot for Assamese.

Sincerely yours
A. Haque 

Note: This was already sent to the editorial committee.

Date/Time: Sat May 5 09:17:02 CDT 2012
Contact: ashok2001sarma@rediffmail.com
Name: Ashok Sarma
Report Type: Other Question, Problem, or Feedback
Opt Subject: Each and Every version carries wrong information regarding assamese scripts


  
  Sir/Madam,

            With due respect again I inform you that Assamese script is not 
            Bengali script. Historically also, the typeset prepared by British 
            was sampled from Assamese manuscript. Again, the oldest written form 
            of Assamese script was found in "Charyapad". The language of 
            "Charyapad" is Kamrupi. Even the book on Origin of Bangla Script 
            was written collecting the inscription, manuscripts of Assamese 
            writings. Then why your consortium repeated the same mistake 
            hurting the self esteem of Assamese people. If you need scientific 
            proofs in support of special identity of Assamese script, please 
            let us know the way to establish the truth. I respect your 
            consortium and I understand also the importance of your 
            consortium. But I never want being an Assamese person any 
            wrong information in your version underestimating any Assamese 
            scripts and language.

            I am eagerly looking forward for your valuable suggestion for 
            not hurting the sentiment of assamese people further.