Comments on Public Review Issues

L2/12-042

(October 27, 2011 - February 6, 2012)

The sections below contain comments received on the open Public Review Issues and other feedback as of February 06, 2012, since the previous cumulative document was issued prior to UTC #129 (October 2011). This document does not include feedback on moderated Public Review Issues from the forum that have been digested by the forum moderators; those are in separate documents for each of the PRIs. Gray items in the Table of Contents do not have feedback here.

182 Proposed Update UTS #18: Unicode Regular Expressions
207 Proposed Draft UTR #50, Unicode Properties for Vertical Text Layout (moderated)
208 Proposed Update UTR #36: Unicode Security Considerations
209 Proposed Update UTS #39: Unicode Security Mechanisms
Feedback on Encoding Proposals
Closed Public Review Issues
Other Reports

182 Proposed Update UTS #18: Unicode Regular Expressions

No feedback at this time.

207 Proposed Draft UTR #50, Unicode Properties for Vertical Text Layout (moderated)

See the relevant forum.

208 Proposed Update UTR #36: Unicode Security Considerations

Date/Time: Thu Jan 12 04:47:44 CST 2012
Contact: gerv@mozilla.org
Name: Gervase Markham
Report Type: Public Review Issue
Opt Subject: UTR#36: clarity suggestions


I have a couple of suggestions for improving the clarity of UTR #36, in particular section 2.9:
http://www.unicode.org/reports/tr36/proposed.html#Security_Levels_and_Alerts

"2. Highly Restrictive

    All characters in each identifier must be from a single script, or from the combinations:
    ASCII + Han + Hiragana + Katakana;
    ASCII + Han + Bopomofo; or
    ASCII + Han + Hangul
    No characters in the identifier can be outside of the Identifier Profile

Note that this level will satisfy the vast majority of Latin-script users.

3. Moderately Restrictive

    Allow Latin with other scripts except Cyrillic, Greek, Cherokee
    Otherwise, the same as Highly Restrictive"


My issues are:

A) It refers to ASCII as "a script"; this is confusing, and I assumed 
it was a typo for "Latin". I am told it is not. Therefore, it should 
be explicitly mentioned that this is intentional, and made clear how 
"ASCII" is defined in Unicode codepoint terms. (Is there a Unicode 
property for it?)

B) I am told that the intent of 3 is to allow Latin with any other 
_single_ script except Cyrillic, Greek or Cherokee - but this is not 
at all clear. I suggest using the following replacement text: "Allow 
Latin with any other single script except Cyrillic, Greek or Cherokee."

Hope that helps,

Gerv

Date/Time: Mon Jan 30 12:50:28 CST 2012
Contact: patrick.jones@icann.org
Name: Patrick Jones
Report Type: Public Review Issue
Opt Subject: UTR #36: Unicode Security Considerations


(Note: Filed in Edcom TRAC for Mark)

In the proposed update to UTR #36, some of the terminology and links 
need to be updated. The term for ICANN in the References section should 
be updated. The latest version of the IDN Guidelines is version 3.0, 
and can be found at http://www.icann.org/en/topics/idn/implementation-guidelines.htm. 
ICANN's informational page on IDNs is available at http://www.icann.org/en/topics/idn/.

IDNA2008 is referred to in this document as a draft; UTR #36 should 
delete "draft" before the IDNA2008 specification. An additional 
informational RFC is RFC 5895, Mapping Characters for Internationalized 
Domain Names in Applications (IDNA) 2008, located at 
http://tools.ietf.org/html/rfc5895. You should also include RFC 6452, 
The Unicode Code Points and Internationalized Domain Names for 
Applications (IDNA) - Unicode 6.0, located at http://tools.ietf.org/html/rfc6452.

UTR #36 may also want to reference the work currently underway in 
ICANN's IDN Variant Project. The terminology section of the 
Integrated Issues Report (http://www.icann.org/en/topics/new-gtlds/idn-vip-integrated-issues-23dec11-en.pdf), 
while based on Unicode, also contains Terminology Used in 
Internationalization in the IETF, RFC 6365, and several additional 
terms introduced in the examination of the Variant Project Issues Reports.

Please let me know if you need additional information.

Best regards,

Patrick

Patrick L. Jones
Sr. Mgr, Security
IDN team
ICANN

209 Proposed Update UTS #39: Unicode Security Mechanisms

Date/Time: Mon Jan 30 13:00:15 CST 2012
Contact: patrick.jones@icann.org
Name: Patrick Jones
Report Type: Public Review Issue
Opt Subject: UTS #39: Unicode Security Mechanisms


(Note: Filed in Edcom TRAC for Mark)

In the proposed update to UTS #39, some of the terminology and 
links need to be updated. As in the comments I submitted on UTR 
#36, in the references section at the bottom of the document, 
IDNA2008 is referred to in this document as a draft; UTS #39 should 
delete "draft" before the IDNA2008 specification. An additional 
informational RFC is RFC 5895, Mapping Characters for Internationalized 
Domain Names in Applications (IDNA) 2008, located at http://tools.ietf.org/html/rfc5895. 
You should also include RFC 6452, The Unicode Code Points and 
Internationalized Domain Names for Applications (IDNA) - Unicode 6.0, 
located at http://tools.ietf.org/html/rfc6452.

UTS #39 may also want to reference the work currently underway in 
ICANN's IDN Variant Project. The terminology section of the 
Integrated Issues Report (http://www.icann.org/en/topics/new-gtlds/idn-vip-integrated-issues-23dec11-en.pdf), 
while based on Unicode, also contains Terminology Used in 
Internationalization in the IETF, RFC 6365, and several additional 
terms introduced in the examination of the Variant Project Issues 
Reports. This report also contains a section on visual similarity 
cases and whole-string issues which may be of use with UTS #39.

Please let me know if you need additional information.

Best regards,

Patrick

Patrick L. Jones
Sr. Mgr, Security
IDN team
ICANN

Other Reports

Date/Time: Thu Nov 10 02:33:11 CST 2011
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Public Review Issue
Opt Subject: UAX#29: word breaks with hiragana and voiced marks


I'd like to renew old feedback I submitted about word breaks with
hiragana and voiced marks in a UAX#29 PRI in... 2007, because
absolutely nobody seems to have replied to it, and some characters
that are used in both hiragana and katakana are visibly not treated
as consistently as they should be (for example, with differences
between normal and halfwidth variants).

See http://unicode.org/mail-arch/unicode-ml/y2007-m08/0091.html

Quoting the message:
This UAX treats KATAKANA specially, to avoid breaking between two
Katakana letters, but still breaks between hiragana. However, this
is probably not true for everything, notably in the sequence of
a Hiragana letter and a voiced/semi-voiced mark:
U+309B (゛) KATAKANA-HIRAGANA VOICED SOUND MARK
U+309C (゜) KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
and possibly other characters currently listed in the Katakana value in table 3:
U+3031 (〱) VERTICAL KANA REPEAT MARK
U+3032 (〲) VERTICAL KANA REPEAT WITH VOICED SOUND MARK
U+3033 (〳) VERTICAL KANA REPEAT MARK UPPER HALF
U+3034 (〴) VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HALF
U+3035 (〵) VERTICAL KANA REPEAT MARK LOWER HALF
U+30A0 (゠) KATAKANA-HIRAGANA DOUBLE HYPHEN
U+30FC (ー) KATAKANA-HIRAGANA PROLONGED SOUND MARK
U+FF70 (ｰ) HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
U+FF9E (ﾞ) HALFWIDTH KATAKANA VOICED SOUND MARK
U+FF9F (ﾟ) HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
Do word breaks really occur between Hiragana letters and these
marks coded after them (note that Hiragana letters are excluded
from "ALetter" in table 3)? If not, then
(1) the list of characters above should better be listed under a
    separate value (say "ExtendKana"), and removed from Katakana in table 3.
(2) a new value "Hiragana" should be created for Hiragana letters in table
    3, like this:
        Katakana    script="KATAKANA" (rewritten first row in table 3)
        Hiragana    script="HIRAGANA" (new inserted row in table 3)
        ExtendKana    (the list of characters above) (new row in table 3)
(3) the existing rule WB13 (Katakana × Katakana) should be rewritten
    equivalently as:
        WB13. (Katakana | ExtendKana) × (Katakana | ExtendKana)
(4) the following subrules WB13a and WB13b rewritten equivalently as:
        WB13a. (ALetter | Numeric | Katakana | ExtendKana | ExtendNumLet)
                × ExtendNumLet
        WB13b. ExtendNumLet × (ALetter | Numeric | Katakana | ExtendKana)
(5) Another subrule should be added:
        WB13c. (Hiragana | ExtendKana) × ExtendKana
No other change is needed, because word break will still occur either
between two Hiragana letters, or after an ExtendKana and before a
Hiragana letter, in the next rule:
        WB14. Any ÷ Any
Or am I missing something?
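
The proposed rules WB13 and WB13c above can be tried out with a small sketch. This is a much-simplified illustration, not an implementation of UAX#29: the function names are hypothetical, the Katakana and Hiragana classes are approximated from character names rather than from the Script property, and WB13a/WB13b are left out because ALetter, Numeric and ExtendNumLet are not modelled here.

```python
import unicodedata

# The "ExtendKana" candidates, exactly as listed in the message above.
EXTEND_KANA = {0x309B, 0x309C, 0x3031, 0x3032, 0x3033, 0x3034, 0x3035,
               0x30A0, 0x30FC, 0xFF70, 0xFF9E, 0xFF9F}


def word_class(ch):
    """Rough word-break class for one character (assumption: name-based)."""
    if ord(ch) in EXTEND_KANA:
        return "ExtendKana"
    name = unicodedata.name(ch, "")
    if "KATAKANA" in name:
        return "Katakana"
    if "HIRAGANA" in name:
        return "Hiragana"
    return "Other"


def no_break_between(left, right):
    """True if the proposed WB13 or WB13c forbids a break between the pair."""
    a, b = word_class(left), word_class(right)
    # Proposed WB13: (Katakana | ExtendKana) x (Katakana | ExtendKana)
    if a in {"Katakana", "ExtendKana"} and b in {"Katakana", "ExtendKana"}:
        return True
    # Proposed WB13c: (Hiragana | ExtendKana) x ExtendKana
    if a in {"Hiragana", "ExtendKana"} and b == "ExtendKana":
        return True
    # Otherwise fall through to WB14: Any / Any (break allowed)
    return False


print(no_break_between("\u304B", "\u309B"))  # hiragana KA + voiced sound mark
print(no_break_between("\u304B", "\u304D"))  # hiragana KA + hiragana KI
```

With these rules a Hiragana letter stays attached to a following mark, while two Hiragana letters, or a mark followed by a Hiragana letter, can still be separated.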

Date/Time: Mon Nov 7 17:36:02 CST 2011
Contact: markus.icu@gmail.com
Name: Markus Scherer
Report Type: Error Report
Opt Subject: Bad @missing line in DerivedNumericValues.txt


DerivedNumericValues.txt has the following @missing line:

# @missing: 0000..10FFFF; ; NaN

It should be corrected to

# @missing: 0000..10FFFF; NaN; ; NaN

The format is documented as
  range;nv-as-decimal;nt-was-removed;nv-as-fraction
and the current @missing line is missing the nv-as-decimal field, 
placing the NaN into the nt-was-removed field.
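
The field shift described above can be made visible by labelling the fields of both lines with the documented format. This is only a sketch: the helper name is hypothetical and the parsing is deliberately naive (split on semicolons after the "# @missing:" prefix).

```python
FIELD_NAMES = ["range", "nv-as-decimal", "nt-was-removed", "nv-as-fraction"]


def label_missing_fields(line):
    """Pair each field of an '# @missing:' line with the documented name."""
    prefix = "# @missing:"
    if not line.startswith(prefix):
        raise ValueError("not an @missing line")
    fields = [f.strip() for f in line[len(prefix):].split(";")]
    # zip() stops at the shorter sequence, so a missing field simply
    # leaves the last documented name without a value.
    return dict(zip(FIELD_NAMES, fields))


current = label_missing_fields("# @missing: 0000..10FFFF; ; NaN")
proposed = label_missing_fields("# @missing: 0000..10FFFF; NaN; ; NaN")

print(current)
print(proposed)
```

For the current line the NaN lands under nt-was-removed and nv-as-fraction has no value at all; for the corrected line every documented field is populated as intended.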

This bug is in every version since 5.1.0 when the @missing field was 
first added. It is still in 6.1 beta (DerivedNumericValues-6.1.0d9.txt).

Rather than (or in addition to) changing it here, it would be best to 
follow the suggestion in L2/11-358 "Parsing the UCD" A.1.a and add all 
@missing lines for properties with non-null defaults into 
PropertyValueAliases.txt, with consistent syntax. (See current examples in that file.)

If DerivedNumericValues.txt does not get fixed, it should be noted 
in the errata and documented in the file's own header.

Date/Time: Tue Nov 8 00:30:19 CST 2011
Contact: markus.icu@gmail.com
Name: Markus Scherer
Report Type: Error Report
Opt Subject: more on UCD @missing & L2/11-358


1. I cannot find a @missing line with the default value for 
gc=General_Category, not in the UCD files nor in 
http://www.unicode.org/L2/L2011/11358-ucd-parsing/ExtraPropertyValueAliases.txt

2. One more comment on L2/11-358 "Parsing the UCD" A.1.a: I would 
really like to see all of the @missing lines in PropertyValueAliases.txt. 
Reason: Some properties (e.g., dt & nt) can be easily parsed from other 
files (e.g., UnicodeData.txt) which makes it pointless to parse their 
dedicated DerivedXyz.txt files except to get the @missing value. For 
backward compatibility, lines that are already elsewhere could be 
duplicated here. Some properties already have two @missing lines 
in the UCD (e.g., ea & lb).

3. [MOOT - Editorial Committee already handled point #3.]

4. L2/11-358 A.8 says "In PropertyValueAliases, all but ccc have the 
same field order. Not sure how to do this, but it would be less ugly 
to parse if it had the same format!"
-> I recommend against changing this. ICU, and likely other 
implementations, has always ignored the ccc-specific syntax comment 
and treated the numeric values as the short names, and the short 
words as the "long names". The numeric value is used and listed 
practically everywhere anyway, especially since the "fixed" values 
do not have any names listed.

5. L2/11-358 B & C
As a parser implementer, I much prefer fewer files and each with 
range;property-name;value like in DerivedCoreProperties.txt rather 
than property-specific files. And UnicodeData.txt does not fit the 
newer files' format but it's still pretty easy to parse so I don't 
see much need to change that at this point.
FYI: For ICU, I just wrote a Python script that pre-parses the UCD 
(yes, yet another UCD parser) and generates a combined .txt file with 
all of the data relevant for ICU (using key-value pairs) so that I 
can then substantially simplify the binary-generating C code. Therefore, 
I have very recent experience writing yet another parser.

6. I don't see how the CaseFolding.txt "T" mappings can be represented 
in properties; C+S go into scf, C+F go into cf, but where do T mappings 
go? (ICU so far has parsed CaseFolding.txt without worrying about formal 
properties for its values. It's getting interesting when recasting the 
data into a different form. I don't see these in the UCD XML either.)

0049; T; 0131; # LATIN CAPITAL LETTER I
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE

Maybe add tcf=Turkic_Case_Folding?
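
As a sketch of the grouping described in point 6, the following reads a few CaseFolding.txt lines and splits them by status. The two T lines are the ones quoted above; the C and F lines for the same characters are included for contrast. The name tcf follows the suggestion above and is hypothetical, not an existing property.

```python
from collections import defaultdict

SAMPLE = """\
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
"""


def parse_case_folding(text):
    """Return {status: {code point: folded string}} for CaseFolding lines."""
    by_status = defaultdict(dict)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
        by_status[status][int(code, 16)] = folded
    return by_status


tables = parse_case_folding(SAMPLE)
scf = {**tables["C"], **tables["S"]}  # simple case folding: C + S
cf = {**tables["C"], **tables["F"]}   # full case folding: C + F
tcf = dict(tables["T"])               # hypothetical home for the T mappings

print(sorted(tcf.items()))
```

The T entries end up in neither scf nor cf, which is the gap the question above points at.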

7. Similarly, there does not seem to be any way to express conditional 
case mappings from SpecialCasing.txt in formal properties.

FYI: ICU has always just stored an "is conditional" bit for characters 
with Turkic case foldings or conditional case mappings, and the runtime 
code has hardcoded conditions and mappings corresponding to the data files.

Date/Time: Mon Dec 19 06:33:37 CST 2011
Contact: ikeda@conversion.co.jp
Name: IKEDA Soji
Report Type: Public Review Issue
Opt Subject: Hangul tone marks

This was answered by the Ed Committee 2011/12/19


I noticed that the general category of the Hangul tone marks was changed 
from Mn (combining nonspacing) to Mc (combining spacing).  I propose that 
they should remain Mn.

UnicodeData.txt of 6.0.0:
302E;HANGUL SINGLE DOT TONE MARK;Mn;224;NSM;;;;;N;;;;;
302F;HANGUL DOUBLE DOT TONE MARK;Mn;224;NSM;;;;;N;;;;;

UnicodeData-6.1.0d9.txt:
302E;HANGUL SINGLE DOT TONE MARK;Mc;224;L;;;;;N;;;;;
302F;HANGUL DOUBLE DOT TONE MARK;Mc;224;L;;;;;N;;;;;

They are tone marks used in Old Korean texts, which are written in vertical 
lines.  These dots are placed on the left side of each character, not between 
the characters.

Analogously, in horizontal text, the cedilla (U+0327) protrudes below the 
base character, but it would not be considered spacing.


Thank you.

Date/Time: Fri Dec 9 02:11:59 CST 2011
Contact: jan.nijtmans@gmail.com
Name: Jan Nijtmans
Report Type: Error Report
Opt Subject: Four characters have their TOTITLE character set to its default value


Consider the characters 01C5, 01C8, 01CB and 01F2 in UnicodeData-6.1.0d9.txt

01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;01C4;01C6;01C5
01C8;LATIN CAPITAL LETTER L WITH SMALL LETTER J;Lt;0;L;<compat> 004C 006A;;;;N;LATIN LETTER CAPITAL L SMALL J;;01C7;01C9;01C8
01CB;LATIN CAPITAL LETTER N WITH SMALL LETTER J;Lt;0;L;<compat> 004E 006A;;;;N;LATIN LETTER CAPITAL N SMALL J;;01CA;01CC;01CB
01F2;LATIN CAPITAL LETTER D WITH SMALL LETTER Z;Lt;0;L;<compat> 0044 007A;;;;N;;;01F1;01F3;01F2

They all have their TOTITLE entry set to the character itself. No other
characters do that: It's the default value anyway. This change, which
was done in Unicode 3 (In Unicode 2.1-update4 it was correct), is
what caused Tcl bug 3444754, because Tcl's tooling was not adapted
to that. See:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3444754&group_id=10894

However, because it is not consistent with other characters, which never list
toupper/tolower/totitle entries pointing to themselves, I would like to report
it here anyway, proposing to replace those 4 entries with:

01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;01C4;01C6;
01C8;LATIN CAPITAL LETTER L WITH SMALL LETTER J;Lt;0;L;<compat> 004C 006A;;;;N;LATIN LETTER CAPITAL L SMALL J;;01C7;01C9;
01CB;LATIN CAPITAL LETTER N WITH SMALL LETTER J;Lt;0;L;<compat> 004E 006A;;;;N;LATIN LETTER CAPITAL N SMALL J;;01CA;01CC;
01F2;LATIN CAPITAL LETTER D WITH SMALL LETTER Z;Lt;0;L;<compat> 0044 007A;;;;N;;;01F1;01F3;
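
Whether the current and proposed lines are equivalent depends on what a parser does with an empty titlecase field (field 14). The sketch below, with a hypothetical function name, compares two conventions for U+01F2: an empty field defaulting to the character itself, and an empty field falling back to the simple uppercase mapping, which is the reading UAX #44 appears to document.

```python
CURRENT_LINE = ("01F2;LATIN CAPITAL LETTER D WITH SMALL LETTER Z;Lt;0;L;"
                "<compat> 0044 007A;;;;N;;;01F1;01F3;01F2")
PROPOSED_LINE = ("01F2;LATIN CAPITAL LETTER D WITH SMALL LETTER Z;Lt;0;L;"
                 "<compat> 0044 007A;;;;N;;;01F1;01F3;")


def simple_case_mappings(line, empty_title="self"):
    """Return (upper, lower, title) code points from a UnicodeData.txt line.

    empty_title selects how an empty field 14 is interpreted:
    "self" maps to the character itself, "upper" to the uppercase mapping.
    """
    fields = line.split(";")
    code = int(fields[0], 16)
    upper = int(fields[12], 16) if fields[12] else code
    lower = int(fields[13], 16) if fields[13] else code
    if fields[14]:
        title = int(fields[14], 16)
    elif empty_title == "upper":
        title = upper
    else:
        title = code
    return upper, lower, title


for label, line, mode in [("current", CURRENT_LINE, "self"),
                          ("proposed, empty=self", PROPOSED_LINE, "self"),
                          ("proposed, empty=upper", PROPOSED_LINE, "upper")]:
    print(label, [f"{cp:04X}" for cp in simple_case_mappings(line, mode)])
```

Under the first convention the proposed line yields the same mappings as the current one; under the second, the titlecase of U+01F2 would become U+01F1, so the outcome of the proposal depends on which default parsers are expected to apply.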

Regards,
          Jan Nijtmans

Date/Time: Thu Dec 29 17:20:48 CST 2011
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Public Review Issue
Opt Subject: Potential issue with 6.1 NameAlias.txt


The proposed NameAlias.txt file omits 4 aliases that RL2.5 of UTS#18 says 
should be created.  These all have parentheses in their names, so there 
is no danger of them accidentally being introduced as conflicting names.  
I don't know if the file should include these aliases that have long been 
called for in UTS#18.  But I was surprised that they weren't there.  The 
aliases are:
000A   LINE FEED (LF)
000C   FORM FEED (FF)
000D   CARRIAGE RETURN (CR)
0085   NEXT LINE (NEL)

-------

The follow-up should be that the UTC clarifies that the UTS#18 
specification was broken when it asked for support of the Unicode 1.0 
name field. What was meant, and what has now been created, is not 
literal support for a

LONG (short)

format; rather, both the LONG and the short alias were intended to be 
individually supported.
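
As an illustration of that reading, a matcher could split a mixed "LONG (short)" name into its two individual aliases. This is a sketch only; the function name and the regular expression are assumptions, not text from UTS#18.

```python
import re

# Matches names of the form "LONG NAME (SHORT)", e.g. "LINE FEED (LF)".
_MIXED = re.compile(r"^(?P<long>[A-Z ]+?) \((?P<short>[A-Z]+)\)$")


def expand_alias(name):
    """Return the individual aliases contained in a name string."""
    match = _MIXED.match(name)
    if match:
        return [match.group("long"), match.group("short")]
    return [name]


for mixed in ["LINE FEED (LF)", "FORM FEED (FF)",
              "CARRIAGE RETURN (CR)", "NEXT LINE (NEL)"]:
    print(mixed, "->", expand_alias(mixed))
```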

In addition, it could be pointed out that, given that parentheses don't 
enter into aliases, implementations are free to support this mixed format 
for compatibility with past bugs, without running the risk of introducing 
incompatibilities with future aliases.

This could take the form of a note for RL2.5 in a future revision of UTS#18.

A./

Date/Time: Fri Jan 6 06:36:05 CST 2012
Contact: kent.karlsson14@telia.com
Name: Kent Karlsson
Report Type: Public Review Issue
Opt Subject: aliases in 6.1.release-cand.


I think that

1) The aliases should be listed thus:
First the (one!?!) name used in the other datafiles, followed by 
other names.

2) Each alias name should be listed only once, no two (or more) 
identical (modulo the matching rules) names in the list.

Date/Time: Fri Jan 6 06:38:09 CST 2012
Contact: kent.karlsson14@telia.com
Name: Kent Karlsson
Report Type: Public Review Issue
Opt Subject: aliases in 6.1.release-cand.


For the diff files, one should have the principle of making them as 
small as possible, thus using "old" names rather than "new" names (aliases).

Date/Time: Thu Jan 26 12:17:35 CST 2012
Contact: markus.icu@gmail.com
Name: Markus Scherer
Report Type: Error Report
Opt Subject: difference between UCA_Rules_SHORT.txt & FractionalUCA_SHORT.txt: prefixes vs. contractions


FractionalUCA_SHORT.txt has the following 4 weights conditional on 
*prefixes* (see the lines with the | symbol):

...
0141; [3D, 05, 8F][, D0 3D, 05]
006C | 00B7; [, DB A9, 05]
006C | 0387; [, DB A9, 05]
0140; [3D, 05, 05][, DB A9, 05]
004C | 00B7; [, DB A9, 05]
004C | 0387; [, DB A9, 05]
013F; [3D, 05, 8F][, DB A9, 05]
...

UCA_Rules_SHORT.txt has *contractions* for these instead:

<<<     ㋏ / Td
<<     l·
    =     l·
    =     ŀ
<<<     L·
    =     L·
    =     Ŀ

The two representations should be equivalent. Therefore, these collation 
elements should rather be prefix-conditional in the rule form as well, as follows:

... (sequence of primary ignorables, up to the last one)
<< \u006C | \u00B7
  = \u006C | \u0387
  = \u004C | \u00B7
  = \u004C | \u0387

and then expansions for the compatibility composites like this (or something equivalent)
&\u006C\u00B7 = \u0140
&\u004C\u00B7 = \u013F

Correspondingly, it would also be better to list these weights in 
FractionalUCA_SHORT.txt at the end of the primary ignorables rather than among the U+006C variations.
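
For readers comparing the two files, the prefix-conditional lines quoted above can be told apart from ordinary entries mechanically. The sketch below assumes that "P | C; weights" means that C receives the weights when preceded by P; the function name is hypothetical and the weight strings are kept uninterpreted.

```python
def parse_fractional_line(line):
    """Split a FractionalUCA-style line into (prefix, code points, weights)."""
    key, _, weights = line.partition(";")
    prefix, bar, rest = key.partition("|")
    if bar:
        # "006C | 00B7": prefix context on the left, mapped character on the right
        prefix_cps = [int(cp, 16) for cp in prefix.split()]
        cps = [int(cp, 16) for cp in rest.split()]
    else:
        prefix_cps = []
        cps = [int(cp, 16) for cp in key.split()]
    return prefix_cps, cps, weights.strip()


SAMPLE = [
    "006C | 00B7; [, DB A9, 05]",
    "006C | 0387; [, DB A9, 05]",
    "0140; [3D, 05, 05][, DB A9, 05]",
]

for entry in SAMPLE:
    print(parse_fractional_line(entry))
```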

Feedback on Encoding Proposals

This feedback from John Cowan is carried forward from last time:

Date/Time: Thu Oct 27 02:23:23 CDT 2011
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/11-373 Proposal to encode Linguistic Doubt Marks in the UCS


The proposal says:

"In theory, [the proposed COMBINING QUESTION MARK ABOVE and BELOW]
could be considered as glyph variant[s] of the same underlying
character. However, there is no precedent of a combining character
which has no fixed placement relative to the base letter, and
especially there is no combining class indicating such a placement
variation."

Cedilla is a combining mark below, but U+0123 LATIN SMALL LETTER G
WITH CEDILLA is rendered with an inverted cedilla above, despite its
decomposition into "g" and COMBINING CEDILLA (not *COMBINING INVERTED
CEDILLA ABOVE, which does not exist).
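
The decomposition mentioned above can be checked with Python's standard unicodedata module; this is just a quick verification sketch, and the variable names are arbitrary.

```python
import unicodedata

g_cedilla = "\u0123"  # LATIN SMALL LETTER G WITH CEDILLA

decomposition = unicodedata.decomposition(g_cedilla)
mark_name = unicodedata.name("\u0327")
mark_ccc = unicodedata.combining("\u0327")

print(decomposition)  # canonical decomposition as hex code points
print(mark_name)      # name of the combining mark in the decomposition
print(mark_ccc)       # its canonical combining class
```

The decomposition is "0067 0327", that is "g" followed by COMBINING CEDILLA, a mark whose combining class (202) denotes attachment below, even though the glyph for U+0123 is commonly drawn with the mark above.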

Similarly, the IPA does not distinguish between diacritics above and
below, and leaves it up to font designers when to use exceptionally
placed diacritics.

Date/Time: Thu Jan 19 04:23:26 CST 2012
Contact: satai@akauri.com
Name: Alex Ostrovsky
Report Type: Feedback on an Encoding Proposal
Opt Subject: Encoding Georgian and Nuskhuri letters for Ossetian and Abkhaz


The document N3775 (L2/10-072, 2010-02-17) "Proposal for encoding 
Georgian and Nuskhuri letters for Ossetian and Abkhaz" proposes to 
add (among others) YN and AEN letters to the Mkhedruli chart of the 
Georgian block as well as to both Khutsuri charts (in the Georgian and 
the Georgian Supplement blocks).

Since both Khutsuri YN and AEN letters are attested for Ossetian 
[Bible publications] only, the Khutsuri sections are named "Additional 
letters for Ossetian", while corresponding Mkhedruli sections are 
called "Additional letters for Mingrelian and Svan" (YN) and 
"Additional letters for Ossetian and Abkhaz" (AEN). However, 
Georgian YN is used for Mingrelian, Svan and Abkhaz as well, and 
nowadays Khutsuri is used by the Georgian Orthodox Church. Thus, 
Khutsuri YN could well be used in Mingrelian or Svan texts in the 
future, which is much more probable than use of the Khutsuri YN letter 
for Ossetian.

Because of the above, I would like to propose more neutral solutions:
1) Either rename the "Additional letters for Ossetian" subhead to 
"Additional letters" for both upper- and lower-case Khutsuri charts;
2) or split the "Additional letters for Ossetian" subhead into "Additional 
letters" with YN and "Additional letters for Ossetian" with AEN.
Personally I would incline to the second solution, because it keeps 
things arranged better and eliminates the need for "reserved" codes in subheads.

Thank you,
Alex.

Date/Time: Thu Jan 26 13:41:45 CST 2012
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: Proposal to Add Three Characters to UTR #45


Perhaps these should be encoded not as ideographs but as symbols,
in the manner of circled and parenthesized ideographs?

Date/Time: Thu Jan 26 15:07:44 CST 2012
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: Proposal to Encode Medieval East-Slavic Musical Notation in Unicode


The name TSEFAUT CLEF is dreadful.  If it must be retained, at 
least make it CE-FA-UT CLEF.  However, C-CLEF would be far better in my opinion.

Date/Time: Sun Feb 5 22:01:30 CST 2012
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/12-072 Proposed UCD property: Script Identifier Status


Just to make sure this is fixed before it's frozen forever: 
it's "Aspirational", not "Asperational".

Date/Time: Sun Feb 5 22:19:17 CST 2012
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/12-067 Characters with Multiple Accents (e.g. Lithuanian), Recent Keyboard Standards, and Microsoft’s MSKLC


While it's true that the Windows dead-key model only allows the 
generation of a single diacritic, it is also true that MSKLC allows 
the creation of keys which directly generate Unicode combining characters.  
In this style, it would be possible to generate e-ogonek-acute by pressing 
the e key, the combining (not dead) ogonek key, and the combining 
(also not dead) acute key.  This would send three characters to the 
application, which could then appropriately normalize them.
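
The normalization step mentioned above can be sketched with Python's standard unicodedata module. The key sequence is the one described in the comment; the variable names are arbitrary.

```python
import unicodedata

# e key, combining ogonek key, combining acute key
typed = "e" + "\u0328" + "\u0301"
nfc = unicodedata.normalize("NFC", typed)

# The same marks typed in the opposite order
typed_swapped = "e" + "\u0301" + "\u0328"
nfc_swapped = unicodedata.normalize("NFC", typed_swapped)

print([f"U+{ord(c):04X}" for c in nfc])
print(nfc == nfc_swapped)
```

Both input orders normalize to the same two code points, U+0119 (e with ogonek) followed by U+0301 (combining acute), since there is no single precomposed character for e with ogonek and acute.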
