Comments on Public Review Issues

L2/08-388

Comments on Public Review Issues
(August 8 - October 28, 2008)

The sections below contain comments received on the open Public Review Issues as of October 28, 2008, since the previous cumulative document was issued prior to UTC #116 (August 2008).

123 Bengali Currency Numerator Values
124 Proposed Update UTR #23: The Unicode Character Property Model
125 Proposed Update UTR #33: Unicode Conformance Model
126 Proposed Update UTR #17: Unicode Character Encoding Model
127 Proposed Update UAX #44: Unicode Character Database
128 Proposed Update UTS #37: Unicode Ideographic Variation Database
129 Code Point Labels: Suggested Wording Details
Other Reports
Feedback on Encoding Proposals
Closed Public Review Issues

123 Bengali Currency Numerator Values

No feedback was received via the reporting form this period.

124 Proposed Update UTR #23: The Unicode Character Property Model

No feedback was received via the reporting form this period.

125 Proposed Update UTR #33: Unicode Conformance Model

No feedback was received via the reporting form this period.

126 Proposed Update UTR #17: Unicode Character Encoding Model

No feedback was received via the reporting form this period.

127 Proposed Update UAX #44: Unicode Character Database

Date/Time: Mon Oct 13 20:47:50 CDT 2008
Contact: adamsmd@cs.indiana.edu
Name: Michael D. Adams
Opt Subject: Improvement for UAX #44

Regarding "UAX #44", version "Unicode 5.2 draft 2", revision 3:

(Disclaimer: I am a stranger to Unicode process so I apologize if anything in the form or content of this note offends or is out of style.)

Under Section 4.2.3 bullet 3, it is noted that when specifying a range in UnicodeData.txt (i.e. "First" and "Last" in angle brackets) "the names of all characters in the range are algorithmically derivable. See [Unicode] for more information on derivation of character names for such ranges."

I submit that this could be improved. Currently it is very unfriendly to implementers. Because the citation "[Unicode]" does not name a specific section of the Unicode Standard, it is difficult to find out how to derive character names. As it is, one must scour the entire standard in hopes of finding the thing being referenced.

Simply referencing the specific section that covers derived character names would be an improvement in my opinion. However, it would be even better if that information could be given directly in UAX #44, so that UAX #44 is more self-contained and provides all of the information necessary to extract character names from UnicodeData.txt.

I realize that there may be issues with keeping UAX #44 in sync with the Unicode Standard that make my specific suggestions impractical, but anything to help someone implementing a parser for UnicodeData.txt find the information on derived character names would be appreciated.
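
As a rough illustration of what an implementer of a UnicodeData.txt parser has to reconstruct, the following Python sketch derives names for two kinds of ranges. The sample lines, the helper names (parse_ranges, derived_name), and the decision to handle only CJK ideograph and Hangul syllable ranges are assumptions made for this illustration; the range endpoints are illustrative and not tied to a particular version of the file. The results are cross-checked against Python's own unicodedata module.

```python
import unicodedata

# Jamo short names used when composing Hangul syllable names.
L_NAMES = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S", "SS", "",
           "J", "JJ", "C", "K", "T", "P", "H"]
V_NAMES = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
           "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
T_NAMES = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
           "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
           "K", "T", "P", "H"]

# Illustrative lines shaped like the First/Last entries in UnicodeData.txt.
SAMPLE = """\
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FC3;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
"""

def parse_ranges(text):
    """Collect (first, last, label) for each First/Last pair of lines."""
    ranges, pending = [], None
    for line in text.splitlines():
        fields = line.split(";")
        cp, name = int(fields[0], 16), fields[1]
        if name.endswith(", First>"):
            pending = (cp, name[1:-len(", First>")])
        elif name.endswith(", Last>") and pending:
            ranges.append((pending[0], cp, pending[1]))
            pending = None
    return ranges

def derived_name(cp, ranges):
    """Return the derived name for cp, or None if no handled range covers it."""
    for first, last, label in ranges:
        if first <= cp <= last:
            if label.startswith("CJK Ideograph"):
                return "CJK UNIFIED IDEOGRAPH-%04X" % cp
            if label == "Hangul Syllable":
                s = cp - 0xAC00
                l, v, t = s // 588, (s % 588) // 28, s % 28
                return "HANGUL SYLLABLE " + L_NAMES[l] + V_NAMES[v] + T_NAMES[t]
    return None

RANGES = parse_ranges(SAMPLE)
# Cross-check a few code points against the names Python already knows.
for cp in (0x82A6, 0xAC00, 0xD4DB):
    assert derived_name(cp, RANGES) == unicodedata.name(chr(cp))
```

Other First/Last ranges in the file (surrogates, private use) have no character names at all, which is part of why a pointer to the exact rules would help.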

128 Proposed Update UTS #37: Unicode Ideographic Variation Database

Date/Time: Fri Oct 17 00:57:12 CDT 2008
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Opt Subject: Issue #128 Update UTR #37: IVD : content of the description

[quote]

B.2 Describing the collection and its content

Assuming that some party "Example" wishes to express the restriction above in plain text, they would create a collection for this kind of situation. This collection could be targeted at the representation of person and place names, for example. They would put a description of this collection on their web site, at "http://www.example.com/names", which could look like this:

This collection of glyphic subsets is intended for the representation of person and place names in Japanese. Elements in this collection are identified by an integer (i.e. match the regular expression “[0-9]+”).

It currently contains a single glyphic subset:

Base Unified Ideograph: U+82A6

Identifier in this collection: 23

Glyphic subset: there is a single horizontal stroke in the radical; the top stroke below the radical is attached and slanting up. Thus is in the glyphic subset, and are not. [/quote]

This is not very elegant: the linked page could contain any kind of HTML, including active scripts and access cookies, and could make various assumptions about the type of browser used to render it.

It seems that the link should not point to an HTML page but to an XML document in which the various pieces of information describing the collection are given parsable semantics.

The XML document should then contain these fields, in a document schema, like

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE ivd-collection SYSTEM "http://www.unicode/schemas/ivd-collection">
<ivd-collection>
  <descriptions>
    <description xml:lang="en">This collection of glyphic subsets is
    intended for the representation of person and place names in Japanese. 
    Elements in this collection are identified by an integer (i.e. match the 
    regular expression “[0-9]+”).
	It currently contains a single glyphic subset.</description>
  </descriptions>
  <subset id="23">
    <base>芦</base>
    <sampleglyphs>
      <base name="images">
        <mirror>http://mirror1.example.com/ivd/23/</mirror>
        <mirror>http://mirror2.example.com/ivd/23/</mirror>
        <source>http://www.example.com/ivd/23/</source>
      </base>
      <glyph id="82A6-glyph1">
        <img base="images">82A6-glyph1.svg</img>
        <img base="images">82A6-glyph1.png</img>
        <img base="images">82A6-glyph1.gif</img>
      </glyph>
      <glyph id="82A6-glyph2">
        <img base="images">82A6-glyph2.svg</img>
      </glyph>
      <glyph id="82A6-glyph3">
        <img base="images">82A6-glyph3.svg</img>
      </glyph>
      <glyph id="82A6-glyph4">
        <img base="images">82A6-glyph4.svg</img>
      </glyph>
    </sampleglyphs>
    <variations>
      <variation id="var-1">
        <ivd>󠄁</ivd><!-- filled when the IVD is officially registered -->
        <members>82A6-glyph1</members>
        <non-members>82A6-glyph2, 82A6-glyph3, 82A6-glyph4</non-members>
        <descriptions>
          <description xml:lang="en">there is a single horizontal stroke in the radical;
          the top stroke below the radical is attached and slanting up. Thus the glyph 
          <img target="glyph-1" /> is in the glyphic subset, and others are not.</description>
        </descriptions>
      </variation>
    </variations>
  </subset>
</ivd-collection>

It is questionable whether the <description> elements should contain any kind of XHTML, or just a minimal subset of it. I don't think it should contain any CSS styling, or reference to fonts, but basic presentation may be possible (using ..., ..., and . The <img> tag refers to the sampleglyphs that are part of the subset (both members or non-members of the variation).

The format for the glyphs should be limited to standard image types: GIF, JPG, PNG, SVG, excluding everything else (no animation in GIF, PNG or SVG as in this example).

This would allow a user-interface to retrieve the information, and display it conveniently in a UI, allowing the user to choose the IVD appropriately. The XML document may contain several descriptions for the collection, for the subsets, or the variations in the subset, permitting several languages.
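
To make the retrieval idea concrete, here is a minimal Python sketch of how a user interface might read such a document. It assumes a trimmed-down version of the sample above; the French description, the function names (pick_description, variations), and the fallback-to-English rule are hypothetical and only serve the illustration, since the schema itself is left open.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Trimmed-down, hypothetical collection description (not a real registration).
DOC = """<ivd-collection>
  <descriptions>
    <description xml:lang="en">Glyphic subsets for person and place names.</description>
    <description xml:lang="fr">Sous-ensembles de glyphes pour les noms propres.</description>
  </descriptions>
  <subset id="23">
    <variations>
      <variation id="var-1">
        <members>82A6-glyph1</members>
        <non-members>82A6-glyph2, 82A6-glyph3, 82A6-glyph4</non-members>
      </variation>
    </variations>
  </subset>
</ivd-collection>
"""

def pick_description(parent, lang, fallback="en"):
    """Return the description in the requested language, else the fallback."""
    found = {d.get(XML_LANG): " ".join((d.text or "").split())
             for d in parent.findall("descriptions/description")}
    return found.get(lang, found.get(fallback))

def variations(root):
    """Map each variation id to (member glyph ids, non-member glyph ids)."""
    def ids(var, tag):
        return [s.strip() for s in var.findtext(tag, "").split(",") if s.strip()]
    return {var.get("id"): (ids(var, "members"), ids(var, "non-members"))
            for var in root.iter("variation")}

root = ET.fromstring(DOC)
print(pick_description(root, "fr"))
print(variations(root))
```

An aggregator could apply the same two functions to every registered collection without rendering any HTML.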

It would also allow aggregator services to collect the description.

Maybe extra elements could be included for the collection:

a copyright notice
the author or source name
a contact address (or a link to visit with a web browser)
distribution permissions and licencing info (allowing or disallowing the aggregation of many IVD collections by external services, which would be caching or archiving them, or allowing them to be reproduced in external websites or books). If the permission is denied, because it requires fees or a signed licence, then external aggregator services will not be able to mirror the XML file, but could still reference it. (Generally the permission could be needed for the glyph images, but not the descriptions in the document).

When mirroring the collection, the <base> element for sample glyphs should be kept (as a source reference, unless the permission allows full copies and modification, something that would not be very acceptable if the IVD must be registered and stable), but one or more <mirror> elements could be inserted before it and the <glyph> elements, within the <sampleglyphs> element (and it would be up to the application to choose which mirror to use for retrieving the images).

Some glyphs may also have several variants.

The above example is just a sample, not a reference. The exact XML schema is left to discussion, as is the policy for using the referenced descriptions that the proposed IVD database would link to.

I am also wondering if the assigned IVD should be part of the XML document itself, or if it should be part of the IVD database referencing it. And the terms are not clear about what a "subset" is supposed to designate: can it reference all the variations and sample glyphs documented by a source, including those for several IVDs related to distinct base characters?

Date/Time: Mon Oct 27 03:18:01 CST 2008
Contact: mpsuzuki@hiroshima-u.ac.jp
Name: suzuki toshiya
Opt Subject: Comment on PRI#128

Dear Sirs,

Following is my comment on Public Review Issue #128, Proposed Update of UTS#37 "Ideographic Variation Database".

Comment 1 (editorial) Note for the revision 4: "Udpate" is probably a typo for "Update".

Comment 2 (technical) "The same considerations apply to other traditional glyph variants, which may or may not be distinguished by the unification rules."

I think "traditional glyph variant" is a slightly ambiguous description, because this sentence is the first one to use the term. It may remind readers of "kTraditionalVariant" in Unihan.txt, but I guess "traditional" here has a more generic meaning. In the following, I use traditional/simplified to distinguish PRC's simplified shapes from their original shapes.

In ISO/IEC 10646 Annex S.1.4.3, some ununifiable pairs can be understood as simplified/traditional (e.g. U+533A/U+5340), and others cannot be (e.g. U+62E1/U+64F4).

For example, according to Unihan.txt, U+342E is a semantic variant of U+8944. The difference in glyph shapes is too large to be unified under ISO/IEC 10646 Annex S, and Unihan.txt defines no kTraditional/SimplifiedVariant relationship for this pair. There are similar pairs: U+58CC/U+58E4, U+5B22/U+5B43, U+7A63/U+7A70, etc.

On the other hand, there are ideographs whose possible counterparts (synthesized by substituting U+342E for U+8944) are not coded: U+6518, U+703C, U+79B3, etc. If somebody wants to handle the variants of these ideographs, should they be registered as new ideographs because they are not unifiable? Or is it acceptable to register an IVS for these variants as glyphic variants of existing characters?

There are several levels how strong unification rule should be applied. Here I list 5 levels, from weak to strong:

1) When a base character is given, there's no restriction about the glyphic differences of its variant shapes in IVS.

2) When a base character is given, the glyph differences of its variant shapes in IVS should not cause semantic differences.

3) When a base character is given, the glyph differences of its variant shapes in IVS should be unified by ISO/IEC 10646 Annex S, or by regional standards listed in ISO/IEC 10646 Annex S.1.6.

4) When a base character is given, the glyph differences of its variant shapes in IVS should be unified by ISO/IEC 10646 Annex S.

5) When a base character is given, the glyph differences of its variant shapes in IVS should be unified by ISO/IEC 10646 Annex S. If the glyph difference makes the variant shape as an intermediate shape among the characters separately coded by source code separation rule (ISO/IEC 10646 Annex S.3), the base character should be chosen as the character that provides the nearest shape.

I think the proposed update of UTS#37 is intended to state that the application of ISO/IEC 10646 Annex S for IVS registration is LESS THAN level 5. I guess the previous registration of IVS Adobe-Japan1-6 was designed for level 4.

Considering the difficulty to identify the glyphic differences among the characters separated by source code separation rule, I think the level 4 would be the appropriate strength.

As my personal opinion, I wish level 3 application were permitted. In Japanese regional standards, some characters unify multiple characters that are distinguished in ISO/IEC 10646 Annex S. For example, U+8346/U+834A (S.1.4.2) are not unifiable in ISO/IEC 10646 at all, but they are unified in JIS X 0208 and 0213 (one might call it "source code unification"). To interchange legacy coded text with concrete glyph specification, level 3 is better than level 4.

Apparently, the proposed update clarifies that an IVS registration submission is NOT required to conform to the source code separation rule. I wish it were clarified whether an IVS registration submission is permitted, or not permitted, to rely on a "regional or legacy source code unification" rule.

The pair U+8944/U+342E is basically NOT unified in JIS X 0208, although an exception arose in the transition from JIS C 6226-1978 to JIS X 0208-1983. To permit such substitution, level 2 strength would be appropriate. But it is questionable whether level 2 is realistic, because level 2 requires some database of the semantics of each character/glyph.

Regards,
suzuki toshiya @ Hiroshima University

129 Code Point Labels: Suggested Wording Details

Date/Time: Wed Oct 15 03:04:20 CDT 2008
Contact: kent.karlsson14@comhem.se
Name: Kent Karlsson
Opt Subject: PR129

To be a little bit more minimalistic, <U+(xx)xxxx> (with the code point number) should be sufficient as "code point labels" when there is no assigned name. But I can agree that giving a classification (non-character, private-use, ...) is a little bit more helpful. I'm not so sure it should be applied to the unstable 'reserved' codepoints (they may get assigned in a future version).

However, for the C0 and most of C1 control characters, there are assigned names, albeit not in Unicode per se. At least for the C0 control characters those names should be used as code point labels. For instance, <CARRIAGE RETURN> is more helpful than <control-000D> or <U+000D>.

Date/Time: Tue Oct 28 14:44:24 CST 2008
Contact: markus.icu@gmail.com
Name: Markus Scherer
Opt Subject: Issue #129 Code Point Labels

"Outstanding issue: The UTC will need to determined whether Code Point Labels, as defined here, will be considered immutable. That is, would such labels be considered formally a Unicode code point property, and if so, be unchangeable once assigned."

I suggest to not treat code point labels as a formal code point property (I see no need), and not declare them immutable. (It might be useful to later introduce further code point types, such as for unassigned code points that have properties like "ignorable" or RTL.)

In my view, aside from editorial considerations for the Unicode Standard itself, defining code point labels in the standard is mildly useful for character name APIs as a guideline for names of unnamed code points.

FYI: ICU has for many years supported similar labels in its character name API. When asking for an "extended character name", and a standard name is not defined for a code point, then a pseudo-name is constructed where the prefix is more or less the name of the general category of the code point (using "unassigned" for Cn and "private use area" for Co), or "noncharacter", "lead surrogate", or "trail surrogate". Then a dash and the hexadecimal code point is appended just like in the proposal. In other words, for code points without normal character names, there are (only) minor spelling differences between the proposed labels and ICU's "extended character names".
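
The scheme described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ICU's implementation and not the wording of the proposal: the function name code_point_label and the exact prefix spellings other than "control" (which appears in the <control-000D> example above) are assumed here, and the result depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# Prefixes keyed by General_Category; spellings are assumptions of this sketch.
PREFIX_BY_CATEGORY = {"Cc": "control", "Co": "private-use", "Cs": "surrogate"}

def is_noncharacter(cp):
    """U+FDD0..U+FDEF plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def code_point_label(cp):
    """Return the character name, or a bracketed label when there is none."""
    ch = chr(cp)
    name = unicodedata.name(ch, None)
    if name is not None:
        return name
    if is_noncharacter(cp):
        prefix = "noncharacter"
    else:
        # Anything else without a name is treated as reserved (unassigned).
        prefix = PREFIX_BY_CATEGORY.get(unicodedata.category(ch), "reserved")
    return "<%s-%04X>" % (prefix, cp)

for cp in (0x0041, 0x000D, 0xE000, 0xD800, 0xFFFF):
    print("U+%04X" % cp, code_point_label(cp))
```

Because the label is computed from the general category on demand, nothing here needs the labels to be immutable: a reserved code point simply starts returning its real name once it is assigned.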

Other Reports

Date/Time: Thu Aug 7 06:28:51 CDT 2008
Contact: razvan.sandu@zando.ro
Name: Răzvan Sandu
Report Type: Error Report
Opt Subject: [RO] Invalid characters for Romanian language

Hello,

Regarding the following documents that appear on Unicode webpage:

a. http://www.unicode.org/charts/PDF/U0100.pdf
b. http://www.unicode.org/charts/PDF/U0180.pdf

please note a subtle error (which is not obvious to non-Romanian speakers).

The following Unicode characters (cedilla-below):

- "S with cedilla below" (Unicode 015E)
- "s with cedilla below" (Unicode 015F)
- "T with cedilla below" (Unicode 0162)
- "t with cedilla below" (Unicode 0163)

are listed as acceptable for Romanian language; in fact, these characters are simply NOT A PART of the Romanian alphabet (according to Romanian Academy official rules, as every Romanian pupil learns in school's first grade).

Actually, the acceptable characters for the Romanian language are the "comma-below" ones, namely:

- "S with comma below" (Unicode 0218)
- "s with comma below" (Unicode 0219)
- "T with comma below" (Unicode 021A)
- "t with comma below" (Unicode 021B)

Please correct the Unicode standard so "cedilla-below" characters DO NOT APPEAR AS ACCEPTABLE FOR ROMANIAN LANGUAGE anywhere in the text.
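
For existing Romanian text that was typed with the cedilla forms, the correspondence between the two lists above can be expressed as a simple translation table. The following Python sketch is only an illustration; the function name to_comma_below is hypothetical, and the sketch assumes the text is known to be Romanian, since the cedilla letters are legitimate in other languages.

```python
# Cedilla-below letters mapped to the comma-below letters listed above.
CEDILLA_TO_COMMA = str.maketrans({
    0x015E: 0x0218,  # S with cedilla -> S with comma below
    0x015F: 0x0219,  # s with cedilla -> s with comma below
    0x0162: 0x021A,  # T with cedilla -> T with comma below
    0x0163: 0x021B,  # t with cedilla -> t with comma below
})

def to_comma_below(text):
    """Replace cedilla-below S/T with comma-below S/T in Romanian text."""
    return text.translate(CEDILLA_TO_COMMA)

# "Stiinta" spelled with cedilla letters becomes the comma-below spelling.
print(to_comma_below("\u015etiin\u0163\u0103"))
```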

Thanks a lot, Răzvan

Date/Time: Sun Aug 10 10:30:41 CDT 2008
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Technical Report or Tech Note issues
Opt Subject: IDNAbis preprocessing

With the transition from IDNA2003 to IDNAbis, I think it's time to break the backwards-compatibility features of IDNA2003 that are just errors or infelicities, as opposed to the ones that actually help users. Thus the newer default ignorables should be removed, the five uppercase-only letters should be properly lowercased, and the five bad hanzi should get their proper decompositions. The likelihood of this breaking any actual domain names is vanishingly small.

Date/Time: Mon Aug 25 21:44:12 CDT 2008
Contact: baturu@daicing.com
Name: MA Xudong
Report Type: Error Report
Opt Subject: Errors about Manchu and Sibe Letters

http://unicode.org/charts/PDF/U1800.pdf

The Sibe letters were modified/created based on Manchu letters and most of the Sibe letters and Manchu letters are the same. But you have defined the same letters twice and made them look different.

The following is not correct:

========================================
Manchu R = uni1875 -> Sibe R = uni1837
Manchu J = uni1835 -> Sibe J = uni186A
Manchu J’ = uni1877 -> Sibe J’ = uni1872
========================================
Table 1

In fact, Manchu and Sibe share the same R, J and J’. My opinion is in Table 2.

========================================
Manchu R = Sibe R = uni1837
Manchu J = uni1835 = Sibe J
Manchu J’ = Sibe J’ = uni1872
========================================
Table 2

My suggestions are:

1: Remove the current uni1875, uni186A and uni1877.

2: At the same time, change the forms of uni1872 to make it look like the current 1877.

3: Add finavar1 to uni1837 (this one should look like the final form of current uni1875). This is optional.

Thank you very much. Hope to hear from you soon.

baturu@daicing.com
xudong.ma@gmail.com

Date/Time: Tue Aug 26 11:44:17 CDT 2008
Contact: chris@casabasecurity.com
Name: Chris Weber
Report Type: Error Report
Opt Subject: BOMs occurring in the middle of HTML files and javascript

I keep reading the specification, and it seems clear that a BOM occurring midway through a file should be treated as a ZWNBSP or WORD JOINER. But in the case of a markup language or data protocol, you make an exception that this scenario can be 'ignored, or treated as an error.' That seems too open to interpretation.

In the case of 'ignored', that means that HTML with a <sc[U+FEFF]ript> could be legally interpreted as <script>. The same kind of thing applies in javascript. I'm happier with the error condition statement, but I'm finding too many security bugs having to do with 'ignore'. I've reported security issues to several browser vendors for their products, and I'm wondering if the Unicode specification here might need more clarity.

Thanks,
Chris

From http://unicode.org/faq/utf_bom.html#38

Q: What should I do with U+FEFF in the middle of a file?

A: In the absence of a protocol supporting its use as a BOM and when not at the beginning of a text stream, U+FEFF should normally not occur. For backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the file or string. The use of U+2060 WORD JOINER is strongly preferred over ZWNBSP for expressing word joining semantics since it cannot be confused with a BOM. When designing a markup language or data protocol, the use of U+FEFF can be restricted to that of Byte Order Mark. In that case, any U+FEFF occurring in the middle of the file can be ignored, or treated as an error. [AF]
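
The "treated as an error" reading preferred in the report could look like the following Python sketch. It is an assumption-laden illustration, not a recommendation from the FAQ: the function name strip_bom_strict is hypothetical, and the input is taken to be already-decoded text for a protocol that restricts U+FEFF to its Byte Order Mark role.

```python
BOM = "\ufeff"

def strip_bom_strict(text):
    """Drop one leading U+FEFF; reject any U+FEFF elsewhere in the content."""
    if text.startswith(BOM):
        text = text[1:]
    pos = text.find(BOM)
    if pos != -1:
        # Silently ignoring it would let "<sc\ufeffript>" become "<script>".
        raise ValueError("U+FEFF at content offset %d is not a leading BOM" % pos)
    return text

print(strip_bom_strict("\ufeff<p>ok</p>"))
try:
    strip_bom_strict("<sc\ufeffript>")
except ValueError as exc:
    print("rejected:", exc)
```

Rejecting the input keeps a filter that scans for "<script>" and a parser that consumes the same text in agreement about what the text contains.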

Date/Time: Fri Oct 3 07:16:29 CDT 2008
Contact: freed.design@gmail.com
Name: Daniel Freedman
Report Type: Submission (FAQ, Tech Note, Case Study)
Opt Subject: Request for inclusion in Hebrew Unicode Standard

HEBREW PROBLEMS

There are a number of letters that, when combined with a diacritic mark, do not conform to Unicode Standard principles.

1. I posit that there needs to be a correction when it comes to a metheg (05BD) together with Chatuf letters (05B1, 05B2, 05B3), as well as conflicts with other vowel-pointers (05B0, 05BB, 05B4, 05B5, 05B6, 05B7, 05B8, 05C7). There need to be extra definitions of a metheg: a left metheg, a far left metheg, and a right metheg.

2. There is a difference between a Shewa-Na (05B0) and a Shewa Nach (05B0). They are used differently, and each should have a different code. Many do not make a distinction, but there is one. Others use a squiggle-like structure (like 05A7, but narrower) next to the Shewa-Na.

3. There is a difference between a Dagesh Kal and a Dagesh Chazak (05BC); this is not indicated in the codes. Many do not make the distinction, but there is one.

4. The Lamed (05DC) has a top ligature that is often removed because of space constraints.

5. The upper dot (05C4) is not placed correctly when in use with a shin - it conflicts with the other dots.

6. The cantillation markings often conflict with the vowel pointers etc.

Date/Time: Thu Oct 16 10:45:45 CDT 2008
Contact: sarah_gregory@paradise.net.nz
Name: Gregory
Report Type: Problems / Feedback about website
Opt Subject: small typo on Normalization Charts Instruction page

Hi there,

There's a small typo on the second to last bullet on the Normalization Charts Instruction page, it says: "...browser supports tool-tops, then hovering...". Perhaps it should say: "...browser supports tool-tips, then hovering...". (tool-tips, not tool-tops).

Thanks.

Feedback on Encoding Proposals

Date/Time: Mon Aug 11 14:23:02 CDT 2008
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/08-081

One trouble with the FLAG XX proposal is that ISO 3166/MA *does* reassign codes to new countries, and changes codes when countries change names. FLAG CS would be ambiguous between "Flag of former country Czechoslovakia" and "Flag of former country Serbia and Montenegro". The current 3166/MA policy is not to reassign a code until fifty years have passed, but that is a short time in the life expectancy of the Unicode Standard.

For this and other reasons I oppose it.

Date/Time: Mon Oct 20 19:16:03 CDT 2008
Contact: leob@mailcom.com
Name: Leo Broukhis
Report Type: Feedback on an Encoding Proposal
Opt Subject: N3469 notes

Typo: the property line for 26F3 says "FLAG IN POLE" instead of "FLAG IN HOLE" (page 13).

The canonical names of the characters are inconsistent wrt descriptiveness vs semantics:

26D2 "CIRCLED CROSSING LINES = road closed" vs 26D4 "NO ENTRY" (rather than, say, "black circle with heavy white horizontal bar")

26EE "GEAR WITH HANDLES = power plant or power substation" vs 26ED "FACTORY" (rather than, say, "gear with long cogs") - and why is it disambiguated from 2699 "GEAR"? Isn't that just a presentation difference (cf. equating ARIB 9383 with U+2603)

26F6 "SQUARE FOUR CORNERS = intersection" vs 26DA "DRIVE SLOW" Maybe 26DA "SQUARE FOUR CORNER DOTS = drive slow"?

26F0 "MOUNTAIN"? Shouldn't it be "LARGE BLACK UP-POINTING TRIANGLE = mountain" -> 25B2? (Cf. 26DB)

All traffic symbols implying left way traffic should be named making it explicit - preparing for addition of the corresponding right way traffic symbols; given that the right way traffic is much more widespread, this will avoid future confusion.

Also 26DC may be better named "WHITE SEPARATED LEFT LANE MERGE", and 26EF along the lines of "WHITE CIRCLE WITH CENTER DOT AND RAYS". 26E0 and 26E1, unfortunately, do not lend themselves to a short description.

26FB "JAPANESE BANK SYMBOL" seems to violate the rule of avoiding references to nations in character names (cf. U+262B)? "VERTICAL BOBBIN WITH ROUNDED ENDS" may be more appropriate.

Closed Public Review Issues

Date/Time: Mon Oct 6 06:32:58 CDT 2008
Contact: ake.persson@mimer.se
Name: Åke Persson
Report Type: Error Report
Opt Subject: Error in /Public/UCA/5.1.0/allkeys.txt

The following two lines in /Public/UCA/5.1.0/allkeys.txt are erroneous:

1EFA ; [.1262.0020.0004.1EFA][.1262.0020.0004.1EFA] # LATIN CAPITAL LETTER MIDDLE-WELSH LL; QQKN
1EFB ; [.1262.0020.000A.1EFB][.1262.0020.000A.1EFB] # LATIN SMALL LETTER MIDDLE-WELSH LL; QQKN

They should be:

1EFB ; [.1262.0020.0004.1EFB][.1262.0020.0004.1EFB] # LATIN SMALL LETTER MIDDLE-WELSH LL; QQKN
1EFA ; [.1262.0020.000A.1EFA][.1262.0020.000A.1EFA] # LATIN CAPITAL LETTER MIDDLE-WELSH LL; QQKN
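
A check of this kind can be automated. The Python sketch below parses the two lines reported as erroneous and compares their tertiary weights. The function names and the regular expression are assumptions of this illustration, it only handles entries keyed by a single code point, and it relies on the assumption that in the default table a small letter should receive a lower tertiary weight than its capital.

```python
import re

# One collation element: [.pppp.ssss.tttt.cccc], or [*...] for variable ones.
ELEMENT = re.compile(
    r"\[[.*]([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\.[0-9A-F]{4,6}\]")

def parse_allkeys_line(line):
    """Return (code point, [(primary, secondary, tertiary), ...], comment)."""
    data, _, comment = line.partition("#")
    code, _, elements = data.partition(";")
    weights = [tuple(int(w, 16) for w in m.groups())
               for m in ELEMENT.finditer(elements)]
    return int(code.strip(), 16), weights, comment.strip()

def tertiary(line):
    """Tertiary weights of all collation elements on the line."""
    return [t for _, _, t in parse_allkeys_line(line)[1]]

def capital_sorts_first(capital_line, small_line):
    """True when the capital letter gets the lower tertiary weights."""
    return tertiary(capital_line) < tertiary(small_line)

# The two lines quoted above as erroneous.
REPORTED = [
    "1EFA ; [.1262.0020.0004.1EFA][.1262.0020.0004.1EFA] # LATIN CAPITAL LETTER MIDDLE-WELSH LL; QQKN",
    "1EFB ; [.1262.0020.000A.1EFB][.1262.0020.000A.1EFB] # LATIN SMALL LETTER MIDDLE-WELSH LL; QQKN",
]

if capital_sorts_first(*REPORTED):
    print("case order looks swapped for U+1EFA / U+1EFB")
```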
