Comments on Public Review Issues
(May 5, 2011 - July 27, 2011)

The sections below contain comments received on the open Public Review Issues and other feedback as of July 27, 2011, since the previous cumulative document was issued prior to UTC #127 (May 2011).


177 Proposed Update UTS #46: Unicode IDNA Compatibility Processing
179 Changes to Unicode Regular Expression Guidelines
183 Supplementary Registration of the AdobeJapan1 Collection
184 Proposed Update UTS #37: Unicode Ideographic Variation Database (IVD)
185 Revision of UBA for improved display of URL/IRIs
186 Word-Joining Hyphen
187 Second registration of sequences for the Hanyo-Denshi collection
188 Proposed Update UAX #9: Unicode Bidirectional Algorithm
190 Proposed Update UAX #14: Unicode Line Breaking Algorithm
191 Proposed Update UAX #15: Unicode Normalization Forms
192 Proposed Update UAX #24: Unicode Script Property
193 Proposed Update UAX #29: Unicode Text Segmentation
194 Proposed Update UAX #31: Unicode Identifier and Pattern Syntax
196 Proposed Update UAX #38: Unicode Han Database (Unihan)
197 Proposed Update UAX #41: Common References for Unicode Standard Annexes
198 Proposed Update UAX #42: Unicode Character Database in XML
199 Proposed Update UAX #44: Unicode Character Database
200 Draft UTR #49: Unicode Character Categories
201 Draft UTR #45: U-Source Ideographs
Feedback on Encoding Proposals
Closed Public Review Issues
Other Reports - UTS #10 (Sorting)
Other Reports - UTR #25 (Math)
Other Reports

177 Proposed Update UTS #46: Unicode IDNA Compatibility Processing

No feedback was received via the reporting form this period.

179 Changes to Unicode Regular Expression Guidelines

No feedback was received via the reporting form this period.

183 Supplementary Registration of the AdobeJapan1 Collection

No feedback was received via the reporting form this period.

184 Proposed Update UTS #37: Unicode Ideographic Variation Database (IVD)

Date/Time: Fri May 27 04:09:10 CDT 2011
Contact: nobuyoshi.mori@sap.com
Report Type: Error Report, UTS #37
Opt Subject:

It seems to me that the Appendix B Hypothetical Example in http://www.unicode.org/reports/tr37/tr37-6.html is not quite convincing, for the following reasons:

1) The official site of Ashiya city uses variant 3 (http://www.city.ashiya.lg.jp/). There are many Ashiyas, though, and there might be another city that is customarily written with variant 4. Ashiya city itself also seems to mix in variant 4 in its texts.

2) The red coloring of Ashi-da and Ashi-ya in the example suggests that experienced Japanese readers read character by character, while in fact they recognize words in Japanese texts as well.

3) I see similar examples at http://ja.wikipedia.org/wiki/%E5%85%B5%E5%BA%AB%E7%9C%8C%E5%87%BA%E8%BA%AB%E3%81%AE%E4%BA%BA%E7%89%A9%E4%B8%80%E8%A6%A7 where U+82A6 is used various times within one page, both for personal names and city names.

My question is whether Appendix B could be dropped. My alternative suggestion would be to take the more classical example of U+9435 and U+9244.

Old: 鐵工所 U+9435 U+5DE5 U+6240
New: 鉄工所 U+9244 U+5DE5 U+6240

And the older variant is explicitly in use as a proper name.

Example :


But I assume it could be controversial to treat U+9435 and U+9244 as variants. Therefore dropping Appendix B would be less controversial.

My apologies for posting this feedback so late.
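A minimal sketch of how such a variant could be expressed as an ideographic variation sequence under UTS #37 (assuming only the published structure of an IVS: base ideograph plus a variation selector in U+E0100..U+E01EF; the selector used below is illustrative, not a registered sequence for U+9435):

```python
# Sketch: an IVS is a base ideograph followed by one of the 240
# variation selectors U+E0100..U+E01EF (VS17..VS256). The selector
# index used in the example call is illustrative only, not a
# registered sequence.

def make_ivs(base: str, selector_index: int) -> str:
    """Append VARIATION SELECTOR-17..256 to a base ideograph."""
    if not (0 <= selector_index <= 0xEF):
        raise ValueError("selector index out of range")
    return base + chr(0xE0100 + selector_index)

ivs = make_ivs("\u9435", 0)   # U+9435 + VS17 (illustrative)
assert len(ivs) == 2          # two code points: base + selector
```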


Date/Time: Sun Jul 24 07:12:55 CDT 2011
Contact: mpsuzuki@hiroshima-u.ac.jp
Name: suzuki toshiya
Report Type: Public Review Issue
Opt Subject: Comment on PRI#184 (Proposed Update UTS#37)

I have two comments on the proposed change to UTS #37, in connection with the idea of shared sequences.

P1) More detailed text is required to clarify what similarity of glyphic subsets means. Many people care about the similarity of the representative glyphs, but not so much about the similarity of the glyphic subsets.

P2) To strongly encourage the sharing of sequences, the registration authority is expected to reconsider the registration fee system, to motivate registrants to share an existing sequence rather than register a new one.

Before my comments: the proposed change for section 4.2, "The registrant may also supply additional representative glyphs for registered sequences of an existing collection.", is expected to clarify the ownership of the representative glyph and the sequence. I take it to mean that the registrant may also supply additional representative glyphs *owned by the registrant* for registered sequences of an existing collection *owned by the registrant*.

Of course, if there is a mutual agreement between the owner of the collection (A) and another owner of the sequence (B), registrant B may supply a glyph owned by A for a sequence owned by B.

P1) requirement of more detailed text about "glyphic subset"

The proposed change has a new paragraph saying: "If there are sequences that correspond to the same glyphic subset, ...registrants are strongly encouraged to share sequences where sequences in a submission are similar to those in an existing collection".

It introduces the idea of comparing glyphic subsets across different registrations (that is, between an existing registration and a planned registration in preparation). The previous versions had no intention of doing so; they were (and are) written as: "it does not guarantee that two different IVSes on the same unified ideograph have non-overlapping or even distinct glyphic subsets".

Although Appendix B shows an example of how to define a glyphic subset by descriptive text, such as "there is a single horizontal stroke in the radical; the top stroke below the radical is attached and slanting up", neither Adobe-Japan1 nor Hanyo-Denshi has included such text in the uncancellable part of its registration.

Thus, previous versions of UTS #37 and the registrants of existing IVD/IVSes presumably had no (strong) intention of supporting the comparison of glyphic subsets. As a result, more detailed text with examples is needed to encourage the comparison of glyphic subsets.

P2) proposal of reconsideration of registration fee system

I guess the explicit permission to assign multiple representative glyphs to a single sequence was introduced to help the comparison of glyphic subsets (by showing various instances of a glyphic subset). But clarifying the glyphic subset of a sequence by adding marginal glyph instances requires effort, and sometimes the cost of that clarification would be too expensive for the registrants. Thus, merely noting that sharing is "strongly encouraged" is insufficient to encourage it strongly, especially for registrants who cannot afford to investigate the intersection of the glyphic subsets. Many users/developers are willing to discuss the similarities of representative glyphs, but not of glyphic subsets. Sometimes the discussion of glyphic subsets is difficult even for the registrants themselves (just as the discussion of unifications in the JIS X 0208 charset led to another charset, JIS X 0213, with incompatible unification rules).

Considering such difficulties, the IVD would end up with two different kinds of registrations: collections whose glyphic subset information is continuously and promptly maintained and clarified, and collections whose glyphic subset information is effectively frozen or cannot be clarified in a timely way. Future registrants should consider sharing sequences with the former collections, not the latter.

Often the designers of new registrations have their own representative glyph collections, and they may want to ensure that their representative glyphs are included in the glyphic subsets of the sharable sequences. At present, it is noted that a registration fee is charged for the registration of both collections and sequences. Thus, even when a new registrant finds that no new sequence is required and merely wants to include their representative glyphs in an existing, shared sequence, the registration fee is still required.

Is there a possibility of a registration fee exemption for registering additional representative glyphs for an existing sequence? If there is, what about the case where the registrant of the additional representative glyph is not the owner of the sequence (but has a mutual agreement)? If the registration of additional representative glyphs solely to clarify the glyphic subset of existing sequences could be done for a smaller fee, or no fee, sharing sequences would be strongly motivated, and the glyphic subsets of shared sequences would become clearer and more reliable.

suzuki toshiya, Hiroshima University, Japan

Date/Time: Sun Jul 24 11:26:43 CDT 2011
Contact: fantasai@inkedblade.net
Report Type: Public Review Issue
Opt Subject: Official W3C CSSWG Comment on PRI184

This comment is being sent officially on behalf of the W3C CSS Working Group with respect to Public Review Issue 184: http://www.unicode.org/review/pri184/ on the topic of the proposed updates to Unicode Technical Standard #37: http://www.unicode.org/reports/tr37/tr37-6.html 

Overall, we are very happy with the direction the edits to UTS37 are taking. However, we don't believe they go far enough. The draft states:

    # If there are sequences that correspond to the same glyphic subset, it
    # becomes a burden for implementers, which can make a collection less
    # likely to be implemented. As a result, in an effort to minimize the
    # number of sequences that correspond to the same glyphic subset,
    # registrants are strongly encouraged to share sequences where sequences
    # in a submission are similar to those in an existing collection. As part
    # of the registration process, the registrar will encourage the sharing
    # of sequences. The sharing of sequences across collections requires
    # mutual agreement of the registrants for the affected collections.

Having multiple representations for the exact same text is not just a burden for implementations, but an obstacle to interoperability. Neither plain text nor fonts can be reliably exchanged among systems if some of them implement one set of IVS mappings for a particular glyph and others implement another. Such a closed-system approach is counter to the goals of Unicode and breaks down with real negative consequences for users on an open system such as the Web.

To mitigate this problem, we would like the draft to state that sequences *must* be shared where the glyphic subsets are known to be identical. Specifically, if the registrant cannot explain (in prose) how the new glyph being registered differs from all existing variants in the database, it should not be possible to register a new IVS. Note that we are not suggesting that any judgement be made as to the significance of the differences, only that a difference can be objectively described.

Furthermore, this prose should be a required part of the variant's registration. Requiring this explanation in the database will not only prevent duplicates but also help font designers understand which variations among glyph outlines in the database are significant and which are merely stylistic (due to the typeface of the submitted representative glyph). Since many of the significant differences are subtle, these differences can escape notice; and incidental differences can be mistaken for significant ones. So only with explicit information can font designers be expected to accurately and correctly represent the glyphic variations intended by the registrants.

We also suggest that Unicode take responsibility for creating and maintaining a mapping table for all existing codepoint representations of the same glyph. Requiring each individual font vendor to come up with its own mapping table, using its own interpretation of which glyphs should be identical, is a recipe for non-interoperability. Such an equivalency table should be standardized, and as such should be the responsibility of the Unicode Consortium to maintain.

Lastly we request that a single, canonical IVS registration be made available for each glyphic subset represented in the CJK Compatibility Ideographs and the appropriate mappings added to the duplicate-glyph mapping table. This will allow migration from the normalization-sensitive compatibility ideographs to the normalization-stable IVS solution and make the deprecation and eventual obsolescence of the compatibility ideographs a practical reality.

Thank you for your consideration,

Elika J. Etemad
Invited Expert
W3C CSS Working Group
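A sketch of what the duplicate-glyph mapping table requested in the comment above might look like, assuming a simple table from each representation to one canonical IVS. Every entry is a hypothetical placeholder, not a registered sequence (U+FA10 is the CJK compatibility ideograph corresponding to U+585A, but the variation selectors chosen here are illustrative only):

```python
# Sketch of a consortium-maintained duplicate-glyph mapping table.
# All entries are hypothetical placeholders, not registered sequences.

CANONICAL_IVS = {
    # duplicate representation      -> canonical (base, selector)
    "\uFA10":           ("\u585A", "\U000E0100"),  # hypothetical
    "\u585A\U000E0101": ("\u585A", "\U000E0100"),  # hypothetical
}

def canonicalize(seq: str) -> str:
    """Map a known duplicate representation to its canonical IVS;
    pass through anything not in the table unchanged."""
    if seq in CANONICAL_IVS:
        base, selector = CANONICAL_IVS[seq]
        return base + selector
    return seq
```

With such a table standardized, a font or layout engine could normalize a normalization-sensitive compatibility ideograph to a stable IVS before glyph lookup.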

Date/Time: Sun Jul 24 20:42:22 CDT 2011
Contact: seki@jp.fujitsu.com
Name: Masahiro Sekiguchi
Report Type: Public Review Issue
Opt Subject: Comments on PRI #184

Comments from Japan SC2 committee on Proposed Update UTS #37, Unicode Ideographic Variation Database (IVD)
====================================================================

First of all, we thank you for giving us the chance to review the proposed draft of UTS #37 "Unicode Ideographic Variation Database (IVD)".

We, the Japan SC2 committee (JSC2), are concerned about the following new paragraph in the proposed draft.

"If there are sequences that correspond to the same glyphic subset, it becomes a burden for implementers, which can make a collection less likely to be implemented. As a result, in an effort to minimize the number of sequences that correspond to the same glyphic subset, registrants are strongly encouraged to share sequences where sequences in a submission are similar to those in an existing collection. As part of the registration process, the registrar will encourage the sharing of sequences. The sharing of sequences across collections requires mutual agreement of the registrants for the affected collections."

JSC2 proposes removing this paragraph from the proposed update UTS #37 for two reasons:

(1) The sharing of an ideographic variation sequence (IVS) is not a matter that the registrar encourages or should encourage. It should be decided at the registrant's own responsibility, if needed.

(2) In a document that is a normative reference for an international standard, it should not be stated what any organization other than ISO encourages.

Still, if you think UTS #37 needs to say something about the sharing of sequences, it can be covered by a simple statement that the registrant may share a sequence that is already registered in some other IVD.

The position of JSC2 is that a sequence should not be shared merely because the shapes of the glyphs printed in one IVD are similar to the shapes of the glyphs printed in some other IVD. Whether or not sequences should be shared should be decided on the broader condition that all of the other (unprinted) glyphs represented by the glyphs in the IVD are also similar to those of the other IVD. However, a large effort is required to identify this relationship among IVDs, and assessing the degree of relationship requires deciding on complicated rules and procedures. The scale of this effort could be similar to that of the standardization of the CJK unified ideographs, and it would greatly complicate the registration process for a new IVD.

End of document

185 Revision of UBA for improved display of URL/IRIs

Date/Time: Tue Jul 5 10:58:44 CDT 2011
Contact: tomerm@il.ibm.com
Name: Tomer Mahlin
Report Type: Public Review Issue
Opt Subject: 185 Revision of UBA for improved display of URL/IRIs

This comment relates to issue #185: Revision of UBA for improved display of URL/IRIs

The UBA assures proper display of plain bidirectional text. Unfortunately, not all text is plain. There are numerous cases in which we have text that is "... no ordinary text; instead it is syntactically complex in ways that don’t work well with the UBA...". Let us call such text samples/types structured text. When structured text contains "... right-to-left text (such as Arabic or Hebrew) it appears jumbled, to the point where it is either uninterpretable, misleading, or ambiguous...".

URL/IRIs are just one example of a structured text type. Below you will find additional categories with samples using pseudo-Bidi text. Structured text is ubiquitous in software; every concatenation that implies structure falls into this category, and addressing all of them is necessary. Enhancing the UBA to address URL/IRIs might help, but it does not address the vast majority of types and cases. An approach to the proper display of structured text should take into account the following factors, which affect the expected display:

1. GUI direction (a.k.a. component orientation in Java) - on mirrored and not mirrored GUI the direction of flow of structured text tokens may be different (e.g. bread crumb)

2. National preferences (e.g. mathematical formulas are expected to be displayed differently for Arabic users than for Hebrew users).

3. Content of text - the content may affect the expected order of structured text tokens. For example, a date stamp including Bidi characters would be expected to be displayed with RTL order of tokens, while a date stamp including only Latin characters would be expected to be displayed with LTR order of tokens.

4. Identification of structured text type (e.g. URL, file path, email, regular expression etc.) - display rules (which are dependent on GUI direction, national preferences and text content) for different types of structured text are different.

Some additional types of structured text

All examples use capital Latin letters to represent Bidi (e.g. Hebrew) characters. Logical order is the order in which characters were typed and stored in the text buffer; this order is also known as typing or chronological order. "Actual display with UBA" is the text as reordered by the UBA.

1. File path:
Logical order:           c:\FOLDERA\FOLDERB\123\FOLDERd
Actual display with UBA: c:\REDLOF\123\BREDLOF\AREDLOFd
Expected display:        c:\AREDLOF\BREDLOF\123\REDLOFd

2. Email address ("[display name]"<[dot-atom]@[Internet domain]> - RFC2822)
Logical order:           "TOMER MAHLIN"<TARAS.ABRAMOVICH@il.ibm.com>
Actual display with UBA: "HCIVOMARBA.SARAT>"NILHAM REMOT@il.ibm.com>
Expected display:        "NILHAM REMOT"<SARAT.HCIVOMARBA@il.ibm.com>  

3. Math formula
Logical order:           1 + 2 + ABC - DEF = 45
Actual display with UBA: 1 + 2 + 45 = FED - CBA
Expected display:        1 + 2 + CBA - FED = 45

4. SQL query
Logical order:           select ABC, DEFdef from SCHEMA.TABLEabc
Actual display with UBA: select DEF,CBAdef from ELBAT.AMEHCSabc
Expected display:        select CBA, FEDdef from AMEHCS.ELBATabc

5. Java code
Logical order:           int VAR3 = VAR1 - VAR2; /* THIS IS A COMMENT */
Actual display with UBA: int TNEMMOC A SI SIHT*/ ;2RAV - 1RAV = 3RAV */
Expected display:        int 3RAV = 1RAV - 2RAV; /* TNEMMOC A SI SIHT */

6. Regular expression
Logical order:           ([AD-LM-TXYZ]{1,23})|(HELLOworld)
Actual display with UBA: ([OLLEH)|({1,23}[ZYXT-M-L-DAworld)
Expected display:        ([AD-LM-TXYZ]{1,23})|(HELLOworld) - for Hebrew users according to Standard Institute of Israel

7. XML content (a.k.a. source view)
Logical order:           THIS <font NAME="DAVID">IS</font> HEBREW.
Actual display with UBA: SIHT <font SI<"DIVAD"=EMAN</font> WERBEH.
Expected display:        SIHT <font EMAN="DIVAD">SI</font> WERBEH.

8. Date/time stamp
Logical order:           24 MARCH 1935 13:20:12
Actual display with UBA: 24 13:20:12 1935 HCRAM 
Expected display:        13:20:12 1935 HCRAM 24

9. Bread crumb
Logical order:           ROOT > SECOND_LEVEL > LOWESTlevel > new_level
Actual display with UBA: TSEWOL < LEVEL_DNOCES < TOORlevel > new level  (LTR base text direction)
                         level > new_levelTSEWOL < LEVEL_DNOCES < TOOR  (RTL base text direction)
Expected display:        TOOR > LEVEL_DNOCES > TSEWOLlevel > new_level  (not mirrored GUI)
                         new_level < TSEWOLlevel < LEVEL_DNOCES < TOOR  (mirrored GUI)

10. Concatenated text

For example: '{0}' - {1} matches in workspace {2} where
{0} - regular expression
{1} - integer
{2} - regular expression

Logical order:           '*HELLOWORLD*' - 6 matches in workspace (*TOMER*)
Actual display with UBA: '*6  - '*DLROWOLLEH matches in workspace (*REMOT*)
Expected display:        '*DLROWOLLEH*' - 6 matches in workspace (*REMOT*)
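A common mitigation for cases like those above (a sketch of a general technique, not part of the PRI proposal; the separator set is illustrative) is to surround each syntactic separator with LEFT-TO-RIGHT MARK (U+200E), so the UBA lays the segments out left to right while each RTL segment still reads RTL internally:

```python
# Sketch: force LTR ordering of structured-text segments by inserting
# LEFT-TO-RIGHT MARK (U+200E) around each separator. The separator set
# is illustrative; a real implementation would pick separators per
# structured-text type (file path, email, SQL, ...).

LRM = "\u200E"
SEPARATORS = set("\\/.:@?=&")   # illustrative

def force_ltr_segments(text: str) -> str:
    """Surround every separator character with LRMs."""
    return "".join(LRM + ch + LRM if ch in SEPARATORS else ch
                   for ch in text)

processed = force_ltr_segments("c:\\FOLDERA\\FOLDERB")
# ':' and the two backslashes each gain a pair of LRMs
assert processed.count(LRM) == 6
```

This keeps the token order stable regardless of segment content, which is the behavior the "Expected display" column assumes for a non-mirrored GUI.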

I can share a much more detailed document with many graphical examples and analysis. Please drop me a note at tomerm@il.ibm.com

Many thanks in advance.

Date/Time: Thu Jul 7 08:37:58 CDT 2011
Contact: shai@platonix.com
Name: Shai Berger
Report Type: Public Review Issue
Opt Subject: pri185: Separators, the % character, TLDs

This report supersedes my report of same subject dated July 4th; my apologies for any inconvenience. [NOTE: that superseded report has been commented out of this feedback document, Ed.]

Dear Unicode Technical Committee,

In response to PRI185, I would like to submit the following notes:

0. Field ordering: I am a native speaker of Hebrew, an RTL language. As such, I ask you to adopt a variation of "Content order" that mostly coincides with "Constant order": namely, only order the fields RTL if the scheme starts with an RTL character. BiDi users are accustomed to LTR fragments in (mostly) RTL texts, and read them LTR; so if a Latin scheme like "http" displays at the right end of an IRI, it will appear to be at the end rather than the beginning.

1. It seems the scope of your issue is a little too narrow; there is, in real life, a close correspondence between IRI/URLs and file paths (even ignoring the "file://" IRI scheme). Creating a discrepancy between the presentations of the two would be, IMHO, most unfortunate, so I would urge you to consider widening the scope of the UBA extension to include file paths as well as IRIs.

One such discrepancy can be caused by file name extensions (such as ".c" or ".exe"), which PRI185 ignores. File name extensions are semantically significant in file names, and therefore must be dealt with carefully in that domain; they are also common in URLs (e.g. the URL for this reporting page ends with ".html"). Taking this into account (and, IMHO, to an extreme), a recent Israeli Standard proposal (SI 5857) treats the period character (U+002E) as a field separator in all file paths and URLs. I wouldn't go that far; I think a more elaborate rule is required, so that periods are only considered field separators where they are used to mark an extension. I can give a more detailed proposal for such a rule if the committee is interested.

As an example of the problem, consider files compressed with the 7-Zip compression utility. An IRI stored as "http://example.com/FILE.7z" will be presented as "http://example.com/7.ELIFz" according to the current proposition.

2. The percent sign (%, U+0025) has to be dealt with specially in URLs, where it is used as an escape character (e.g. "?" is written as "%3f"); the percent sign and the following two (hexadecimal) characters must form an unbreakable unit, as if surrounded by LRE/PDF. As the proposal stands, a path component "QUESTION%3f" would be presented as "%3NOITSEUQf" (assuming LTR base direction). I am not sure about the ordering of escape sequences within a larger scope; that is, I am not sure whether the right presentation for the example is "%3fNOITSEUQ" or "NOITSEUQ%3f". I tend towards the former (treating the escape as neutral), to make sure "Q%26A" (%26 is "&") is presented as "A%26Q" rather than "Q%26A". However, I am not sure whether "QUESTION%3f%21" (%21 is "!") should be displayed as "%3f%21NOITSEUQ" or "%21%3fNOITSEUQ".
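The "unbreakable unit" requirement can be sketched as a tokenization step, so that each percent-escape travels through any reordering as a single token (a sketch under the commenter's assumption; the function name is illustrative):

```python
import re

# Sketch: split an IRI component into display tokens, keeping each
# %XX percent-escape intact as one token. The regex tries the escape
# alternative first, falling back to single characters.

PERCENT_OR_CHAR = re.compile(r"%[0-9A-Fa-f]{2}|.", re.DOTALL)

def tokenize(component: str) -> list:
    """Return the component as tokens, with %XX escapes unbroken."""
    return PERCENT_OR_CHAR.findall(component)

assert tokenize("Q%26A") == ["Q", "%26", "A"]
```

Any subsequent bidi reordering would then operate on these tokens, never on the characters inside an escape.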

3. Recent decisions at ICANN (http://www.icann.org/en/announcements/announcement-20jun11-en.htm) are likely to make the list of top-level domains (TLDs) much bigger and more dynamic than it has been, undermining the technique of recognising TLDs mentioned in the PRI.

Date/Time: Thu Jul 14 07:49:00 CDT 2011
Contact: aharon@google.com
Name: Aharon Lanin
Report Type: Public Review Issue
Opt Subject: pri185 and email addresses

It sounds like internationalized email addresses should be easy to include in this proposal, since they too end in a TLD. In fact, they should be easier to do, since they have a much simpler syntax than IRIs.

Date/Time: Thu Jul 14 08:42:02 CDT 2011
Contact: aharon@google.com
Name: Aharon Lanin
Report Type: Public Review Issue
Opt Subject: PRI #185: is the browser address bar a special case?

If I understand correctly, PRI #185 proposes modifying the UBA itself, so all text to which the UBA is applied will reflect it. It also acknowledges that it will not catch all IRIs, but only a useful subset. It also aims to improve security, in particular making it more difficult to create malicious IRIs that mislead the user into thinking that a different entity is being addressed.

Unfortunately, I see a problem satisfying all three of these simultaneously. A malicious author could find an IRI that PRI #185 either does not recognize as an IRI or treats as an IRI followed by some normal text. Since the ordering rules proposed by PRI #185 would then fail to apply to all or part of the IRI, the IRI as displayed by the UBA would still mislead the user into thinking that a different entity was being addressed. The malicious IRI would look the same misleading way both in the text of a document or email and in the browser's address bar, since the address bar does nothing more than call the usual UBA implementation on its contents.

Perhaps a partial solution lies in requiring UBA implementations to allow the caller to explicitly request that certain parts of the input text be displayed under IRI rules (i.e. that field separators have a strong direction), whether or not they look like an IRI to the UBA. Thus, the browser address bar could explicitly tell the UBA that its entire text is an IRI. The result would be that the malicious IRI would lose its misleading effect at least when displayed in the browser address bar, hopefully being noticed by the user before typing sensitive information into the page's inputs.

Date/Time: Sun Jul 17 12:22:23 CDT 2011
Contact: mohiem@eg.ibm.com
Name: Mohamed Mohie
Report Type: Public Review Issue
Opt Subject: Feedback on PRI #185 Revision of UBA for improved display of URL/IRIs

This feedback addresses the open issues listed in the PRI. I vote for the option of having the ordering depend on whether there are any RTL characters in the IRI. The advantage is that the IRI remains readable when it contains Arabic RTL text; presenting an IRI in LTR format while it contains Arabic RTL text makes it unreadable and makes it difficult for the user to work out the proper order. Thus I suggest that, for Arabic, an IRI be presented with an RTL embedding level if it contains Arabic characters.

Date/Time: Sun Jul 24 09:28:08 CDT 2011
Contact: matial@il.ibm.com
Name: Matitiahu Allouche
Report Type: Public Review Issue
Opt Subject: PRI#185 Revision of UBA for improved display of URL/IRIs

I want to address the second part of PRI#185, "Proposed extension of UBA for bidi_IRIs". For me, the litmus test of appropriate presentation of IRIs is the bus side case: may an IRI appearing on a bus side (or anywhere else which cannot be mouse-clicked) be understood in correct order by the man in the street? For this to happen, the rules must be very simple, even if simplicity comes at the cost of some intuitiveness. The current proposal in PRI#185 is to have all fields composing the IRI laid out from left to right. This is the simplest rule possible, thus it gets my vote.

There is a variant where the order of fields would follow the current embedding level. To which I reply: what is the embedding level of a bus side? Of a napkin? The concept of embedding level, while not very complex, is more than should be expected of the "man in the street", and it leads to the same domain being displayed differently according to context. Confusing!

The third option, having the order of fields depend on whether there is any RTL character in the IRI, is still more problematic. Having or not having an RTL character in the path, query, or fragment part would turn around the whole IRI display, including the domain part. This would be very confusing, IMHO.

Bottom line: the fields should be displayed always in LTR order. This is also quite straightforward to implement once an IRI has been identified, which should appeal to implementers and thus facilitate adoption by applications.

Date/Time: Mon Jul 25 22:33:18 CDT 2011
Contact: behnam@esfahbod.info
Name: Behnam Esfahbod
Report Type: Public Review Issue
Opt Subject: Comments on PRI185 (URL/IRI in UBA)


First, we know that scheme-less URI/IRIs (in many cases only domain names) are used a lot in plain text. The fact that a UBA implementation would depend, at some level, on the list of TLDs is very dangerous. The list of TLDs is updated a few times each year (and will be updated more often in the near future because of ICANN's New gTLDs program). This would result in noticeable inconsistencies between the presentation of URI/IRIs in different applications and systems, depending on the version of the TLD list their implementation uses.

Second, with the introduction of IDN TLDs, soon we will see people use RTL domain names under RTL TLDs. Most probably, mixed-direction domain names will no longer be a big issue. Of course, the problem persists for the "path" and "query" parts of the URL/IRI.

Third, I believe the best method to set the base direction for a URL/IRI should be based on the direction of the TLD, not the first label of the domain name.

Because of the above issues, I would like to ask for reconsidering the general idea and further research in the feasibility and security/usability concerns.

Best regards,
-Behnam Esfahbod
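The third suggestion above, taking the base direction from the TLD rather than the first label, can be sketched as follows (the function name is illustrative, and the LTR fallback for direction-less TLDs is an assumption, not part of the comment):

```python
import unicodedata

# Sketch: derive an IRI's base direction from the first strong
# character of the TLD (the last label of the domain name).

def tld_direction(domain: str) -> str:
    """Return 'rtl' or 'ltr' based on the TLD's first strong character."""
    tld = domain.rstrip(".").rsplit(".", 1)[-1]
    for ch in tld:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            return "rtl"
        if bidi == "L":
            return "ltr"
    return "ltr"   # assumed default when no strong character exists
```

Under this rule a domain under an Arabic-script TLD would be laid out RTL even if its first label is Latin, and vice versa.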

Date/Time: Thu Jul 28 09:28:18 CDT 2011
Contact: aharon@google.com
Name: Aharon Lanin
Report Type: Public Review Issue
Opt Subject: PRI #185

It has been suggested that the overall direction in which an IRI is displayed should depend on its text. In the most limited sense, if an IRI contains RTL (R or AL) characters but no LTR (L) characters, then it should be displayed RTL overall.

The problem with this limited proposition is that such an IRI would have to be displayed LTR as soon as it was extended to include a filename extension (e.g. .html) or a query containing LTR characters (e.g. ?x=%4C). (There is also the problem of scheme names (e.g. http:), which are always LTR.) Such flipping would be very confusing to the user, and the benefit of seeing all-RTL IRIs RTL overall would be more than offset by this confusion.

So, let's extend the proposal to limit the all-RTL condition to the domain name only. For a while, I was convinced that this would be best. Unfortunately, this approach suffers from a fatal ambiguity that is sure to be exploited with malicious intent: WWW.HACKERS.COM/com.bank.www would be displayed as www.bank.com/MOC.SREKCAH.WWW, the same as www.bank.com/COM.HACKERS.WWW. Even IRIs that include the scheme name suffer from this problem: http://WWW.HACKERS.COM?path/boring/and/long/very/a/com.bank.www//:http would be displayed as http://www.bank.com/a/very/long/and/boring/path?MOC.SREKCAH.WWW//:http, the same as http://www.bank.com/a/very/long/and/boring/path?COM.HACKERS.WWW//:http.

For these reasons, I now oppose displaying an IRI in a direction that is a function of its content.

Another direction that has been proposed is to display IRIs in the same overall direction as the embedding level in which it appears. I think that this approach is strictly worse than the always-LTR approach. Reading an all-LTR IRI displayed RTL overall, e.g. www.acompany.com/files/a.txt displayed as txt.a/files/com.acompany.www is just as unnatural as reading an all-RTL IRI when it is displayed LTR overall. Even when all-RTL IRIs reach their full potential, my guess is that RTL users will still see all-LTR IRIs at least as much as all-RTL IRIs. Thus, displaying all-RTL IRIs well at the cost of displaying all-LTR IRIs poorly is not beneficial.

My conclusion is that IRIs should always be displayed LTR overall.

Date: Thu, 28 Jul 2011 14:48:37 -0400
From: Behdad Esfahbod <behdad@behdad.org>
Subject: Feedback re PRI #185 Revision of UBA for improved display of URL/IRIs

I dislike the proposed change for various reasons:

1. UBA is complex enough for users to make sense of. The proposed changes considerably increase complexity.

2. It has just been long enough since the last changes to the UBA and Bidi properties that one can be fairly sure that all implementations do a decent job of implementing the standard. Such drastic changes to the algorithm would take years to ship across the industry and, as Jonathan also mentioned, may never find their way into many implementations. What's the point?

3. From an implementation point of view, the proposed changes are a drastic departure from the existing algorithm. For example, the current algorithm works solely based on a sequence of bidi types, whereas the proposed changes work with specific characters.

4. I'd like to emphasize my interpretation of this point from Aharon's feedback: the proposal attempts to make phishing harder. However, from what I understand, it leaves room for malicious IRIs that do not match the BNF, and hence get reordered as non-IRIs, to render the same as other, legitimate IRIs. Hence, one could say that the proposal fixes one set of phishing possibilities but opens another.


186 Word-Joining Hyphen

Date/Time: Thu Jul 7 01:01:06 CDT 2011
Contact: karl-pentzlin@acssoft.de
Name: Karl Pentzlin
Report Type: Public Review Issue
Opt Subject: PRI #186: Support of the property "MidLetter" for the non-breaking hyphen

I consider it an advantage to give U+2011 NON-BREAKING HYPHEN the word-break property MidLetter. E.g. for Swiss place names like "S‑chanf" or "Chamues‑ch", containing the Rhaeto-Romanic (Putèr dialect) tetragraph "s‑ch" (denoting a sound distantly similar to Russian щ), it is annoying when a double click does not mark the whole name. When the non-breaking hyphen is used for typographic purposes (like avoiding a line break in "U-Bahn" or "over-a-dozen"), getting the whole word marked is not mandatory but useful, indicating to the user that the hyphen is in fact non-breaking.
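The effect being requested can be illustrated with a toy segmenter (a sketch only, approximating UAX #29 rules WB6/WB7; the `words` function and its MidLetter set are hypothetical stand-ins, not the real property data):

```python
def words(text, midletter):
    """Toy word segmenter: letters join into a word; a character in
    `midletter` joins only when surrounded by letters on both sides
    (a simplified version of word-break rules WB6/WB7)."""
    out, cur = [], ""
    for i, c in enumerate(text):
        if c.isalpha():
            cur += c
        elif c in midletter and cur and i + 1 < len(text) and text[i + 1].isalpha():
            cur += c  # do not break across a MidLetter between letters
        else:
            if cur:
                out.append(cur)
            cur = ""
    if cur:
        out.append(cur)
    return out

# With U+2011 treated as MidLetter, a double click can select the whole name:
print(words("S\u2011chanf liegt im Engadin", {"\u2011"})[0])  # S‑chanf
print(words("S\u2011chanf liegt im Engadin", set())[:2])      # ['S', 'chanf']
```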

By the way, WG2 N3770 requests the same line-breaking behavior for the newly encoded U+2E3A TWO-EM DASH as for the non-breaking hyphen. It would be appropriate to extend this request to the word-break property MidLetter for that character as well, in the same way as proposed for the non-breaking hyphen. E.g., in the text sample shown on p. 1 of WG2 N3770, "David H—h voted aye" (the two-em dash, not yet available, is replaced by an em dash in this citation), the word to be marked is unambiguously "H—h".

Date/Time: Thu Jul 28 05:04:49 CDT 2011
Contact: per.starback@lingfil.uu.se
Name: Per Starbäck
Report Type: Public Review Issue
Opt Subject: PRI #186

A reason given for U+2011 NON-BREAKING HYPHEN getting the word-break property MidLetter is that some languages use a hyphen character between syllables within a word where word breaking, such as by word-selection or move-to-next-word commands, should ignore these hyphens.

# The advantage of making this change is that U+2011 NON-BREAKING HYPHEN could be used in orthographies that contain interior hyphens. This would avoid a requirement to encode yet another confusable hyphen/dash/minus character to the over-a-dozen already in Unicode.

The implication is that the alternative to the suggestion is to add a new character. I don’t see such a requirement! Yes, it’s sometimes hard to know where word boundaries are, and Unicode certainly helps, but that doesn’t mean the characters on their own have to completely solve that problem. Knowledge about the language being used can also be useful, for example.

Compare this with RIGHT SINGLE QUOTATION MARK, used as quotation mark and apostrophe such that extra knowledge can be needed to know where the word divisions are:

’Tis just a highfalutin’ idea, reminding me of that ‘sublime masterwork’ L’Étranger that I don‘t approve of.

For instance, a mark-word operation on "highfalutin’" should ideally include the apostrophe, but one on "masterwork" should not include the closing quotation mark. In this case it could be done by keeping track of starting and ending quotes. Other cases are even harder. When my native Swedish uses single quotation marks, it traditionally uses RIGHT SINGLE QUOTATION MARK both before and after a quotation. Then if you read

	Jag såg ’na när hon spela’ piano.
      = Jag såg henne när hon spelade piano.           
      = I saw her when she played the piano.

no simple algorithm would know that there isn’t a quote "na när hon spela" in there.

My argument is that finding word boundaries is a hard problem, that isn’t totally solvable by only Unicode anyway. Yes, it would help if the quotation mark and the apostrophe were seen as different characters here, even though they look the same, but for good reasons they are seen as the same character in Unicode. And certainly no one is suggesting different "characters" for joining and splitting apostrophes (using terminology from http://unicode.org/mail-arch/unicode-ml/y2002-m08/att-0428/01-cimaUTR29.html). Or for "starting" and "ending" RIGHT SINGLE QUOTATION MARK!

In the same way the suggestion in PRI #186 would help with finding word boundaries in some particular languages/orthographies, but at the cost of "lying". Hyphens *are* hyphens, even when they are used for slightly different reasons in different orthographies.

I don’t know about the Iu Mien language mentioned in the PRI, but would it even be correct to disallow *line* breaks with NON-BREAKING HYPHEN in many of these cases? Wouldn’t it be acceptable to hyphenate some of these words?

[This feedback is constructed from a couple of postings by me to the unicode mailing list, but with some errors corrected.]

187 Second registration of sequences for the Hanyo-Denshi collection

No feedback was received via the reporting form this period.

188 Proposed Update UAX #9: Unicode Bidirectional Algorithm

No feedback was received via the reporting form this period.

190 Proposed Update UAX #14: Unicode Line Breaking Algorithm

No feedback was received via the reporting form this period.

191 Proposed Update UAX #15: Unicode Normalization Forms

No feedback was received via the reporting form this period.

192 Proposed Update UAX #24: Unicode Script Property

No feedback was received via the reporting form this period.

193 Proposed Update UAX #29: Unicode Text Segmentation

Date/Time: Mon Jul 25 23:40:10 CDT 2011
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Public Review Issue
Opt Subject: UAX #29: Updated grapheme cluster boundaries for Thai/Lao/Tai Viet

2011/7/26 <announcements@unicode.org> wrote:
> > The proposed update documents for some Unicode Standard Annexes have been
> > updated. These updates include:
> >
> > UAX #29: Updated the discussion of legacy grapheme clusters for Thai. Moved
> > the section on Hangul syllable boundary determination to a new section in
> > this UAX, from Chapter 3 of the Core Specification. Made other small
> > editorial fixes.

It looks like this almost reverses the recommendation in favor of extended grapheme clusters, just for the needs of Thai, Lao, and Tai Viet (which are the only scripts encoded with a logical order exception).

But unfortunately, "legacy grapheme clusters" do not extend to most "SpacingMark" characters.

Maybe it would be more convenient to split the "SpacingMark" category into two parts:

- (1) create an "Append" category for the listed Thai and Lao appended vowels:


- (2) Exclude the "Append" category from the definition of "SpacingMark" (removing the list above), so that it becomes: Grapheme_Cluster_Break ≠ Extend, and Grapheme_Cluster_Break ≠ Append, and General_Category = Spacing_Mark.

Then deprecate both the "legacy grapheme cluster boundaries" and the "extended grapheme cluster boundaries", and create an intermediate definition, the "default grapheme cluster boundaries".

The "default grapheme clusters" will extend the "legacy grapheme clusters" only to the reduced "SpacingMark" category (but not to the existing "Prepend" category or the new "Append" category):

default_grapheme_cluster ::=
    ( Hangul-syllable | !Control )
      ( Grapheme_Extend | SpacingMark )*
  | .

For compatibility, the existing "extended grapheme clusters" (not recommended) will be redefined to be the new "default grapheme cluster boundaries", extended to also include the existing "Prepend" category and the new "Append" category. This won't change its generated boundaries:

extended_grapheme_cluster ::=
    Prepend* ( Hangul-syllable | !Control )
      ( Grapheme_Extend | SpacingMark | Append )*
  | .

This way, Thai, Lao, and Tai Viet will be correctly handled in the preferred way using the new "default grapheme cluster boundaries", which can also be recommended for all other scripts.
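A rough illustration of base-plus-marks clustering of this kind (a sketch only: it approximates Grapheme_Extend with general categories Mn/Me and SpacingMark with Mc via Python's `unicodedata`, and omits the Hangul, Control, Prepend, and Append cases entirely):

```python
import unicodedata

def simple_clusters(text):
    """Approximate clustering: a base character plus any following
    marks (gc Mn/Me standing in for Grapheme_Extend, Mc for
    SpacingMark). Not a conformant UAX #29 implementation."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch) in ("Mn", "Me", "Mc"):
            out[-1] += ch  # attach the mark to the preceding base
        else:
            out.append(ch)
    return out

print(simple_clusters("e\u0301"))       # e + combining acute: one cluster
print(simple_clusters("\u0915\u093F"))  # Devanagari KA + vowel sign I (Mc): one cluster
```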

-- Philippe.

194 Proposed Update UAX #31: Unicode Identifier and Pattern Syntax

No feedback was received via the reporting form this period.

196 Proposed Update UAX #38: Unicode Han Database (Unihan)

Date/Time: Wed Jun 29 13:04:43 CDT 2011
Name: K------e
Report Type: Error Report
Opt Subject: kMandarin field documentation for Unihan


The regular expression for the kMandarin field has the wrong special character. U+0308 COMBINING DIAERESIS is *not* used. The correct additional character is U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS.
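The distinction matters in practice: a character class containing U+00DC matches the precomposed letter, while one containing U+0308 would only match a bare combining mark. A hypothetical sketch (the character class and tone digits here are illustrative, not the actual UAX #38 syntax):

```python
import re

# Illustrative pattern: uppercase pinyin letters with U+00DC (Ü) as a
# literal letter, followed by a tone digit.
syllable = re.compile(r"[A-Z\u00DC]+[1-5]")

print(bool(syllable.fullmatch("L\u00DC4")))   # True: precomposed Ü matches
print(bool(syllable.fullmatch("LU\u03084")))  # False: U+0308 is not in the class
```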

197 Proposed Update UAX #41: Common References for Unicode Standard Annexes

No feedback was received via the reporting form this period.

198 Proposed Update UAX #42: Unicode Character Database in XML

Date/Time: Mon Jun 27 13:49:45 CDT 2011
Contact: ernestvandenboogaard@hotmail.com
Name: Ernest van den Boogaard
Report Type: Error Report
Opt Subject: UAX #42 (XML): possible typo

See http://www.unicode.org/reports/tr42/ Section 4.4.15 Indic Properties

The two XML attributes InSC and InMC are introduced by these two square-bracketed notes: [hst property, 37] [jamo property, 38]

The words "hst" and "jamo" look unrelated to the term they stand for. Also, they are a copy from the above section about Hangul, referencing Hangul_Syllable_Type and Jamo_Short_Name (making sense there as an abbreviation). I suggest a check on the correctness of these words in the Indic Properties section.

In general, I cannot assess the effect of such a typo, since I could not find the use or definition of the square-bracketed labels.

Ernest van den Boogaard

199 Proposed Update UAX #44: Unicode Character Database

No feedback was received via the reporting form this period.

200 Draft UTR #49: Unicode Character Categories

Date/Time: Wed Jul 13 20:04:58 CDT 2011
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Public Review Issue
Opt Subject: UTR #49: Unicode Character Categories

It looks like the subcategories for [Letter] are not very well formulated in the current CharacterCategories.txt data file, and are in fact inconsistent.

The most obvious level-2 subcategories should include [Consonant], [Vowel], and [Half-consonant]. Other distinctions like [Digraph] should be moved to a lower level.

Note that [Consonant] has been applied to the full basic Arabic abjad, but not to the similar Hebrew abjad.

In fact, it should also distinguish between true [Consonant]s and [Half-consonant]s, the latter including letters that can act either as consonants (acting like a mute or stop consonant with a default inherent or implied vowel, possibly modified by acting as a holder for an optional vocalic diacritic/mark) or as vowels (e.g. Alef and Yod in Arabic or Hebrew; Y in Latin; RA and LA in Indic scripts), depending on their context.

Yes, it may be fuzzy for some languages using the same script (e.g. W in German is undoubtedly a consonant, but in many languages it is most often a gliding consonant; or V in Roman Latin, where there was no distinction from U); but at least, categorizing such letters as [Half-consonant] will signal the ambiguity of their use.

Then the third level should be for case distinctions: [Lowercase], [Uppercase], [Titlecase], and [Uncased] (in scripts that have case distinctions).

The last level can then be used for [Ligatured] (such as Œ and Æ; even if they are still considered plain letters, this allows specific languages to treat them as letter pairs for collation purposes), [Digraph] (such as IJ), and [Final] (e.g. Greek final sigma).

The content of this (informative) file should also be consistent with the content of the DUCET (which obviously contains case distinctions at the third level). However, secondary differences exposed in the DUCET (e.g. for diacritic differences) should probably not be categorized.

And like the DUCET, it should be tailorable in applications or for specific languages (for example in the CLDR database), so that these categories are just the defaults used when there is no tailoring.

Subject: Re: PRI #200: Draft UTR #49, Unicode Character Categories
Date: Thu, 14 Jul 2011
From: Andrew West

On 14 July 2011 00:03, <announcements@unicode.org> wrote:
> The Unicode Technical Committee has posted a new issue for public review and
> comment. Details are on the following web page:
> PRI #200 Draft UTR #49: Unicode Character Categories
> This document presents an approach to the categorization of Unicode
> characters, and documents data files that implementers can use for defining
> and labeling Unicode character categories.

==General Rant==

I like the idea of categorizing characters hierarchically, but any categorization scheme is necessarily subjective to a greater or lesser degree, and I do not think that the Unicode Consortium should be pushing one particular hierarchical categorization model as the definitive categorization of Unicode characters. It seems to me that this is one of several recent expansions to the scope of the Unicode Character Database (ScriptExtensions.txt is another example) that are neither necessary nor particularly helpful.

==Specific Comment==

There are 18 top-level categories:


What are the differences between [Ideograph] and [Ideogram], and between [Logograph] and [Logogram]? Even if UTR #49 does give distinctly different definitions for each of these four top-level categories, it will not be obvious to most users of Categories.txt what the difference is, as the -graph/-gram versions are synonymous in general use:

< http://en.wikipedia.org/wiki/Logogram > < http://en.wikipedia.org/wiki/Ideogram >


Date/Time: Tue Jul 19 17:54:40 CDT 2011
Contact: amaithianantha@yahoo.co.in
Name: A.R.Amaithi Anantham
Report Type: Error Report
Opt Subject: Draft UTR#49. Unicode Character Catagories


The Tamil text can be simply explained as follows:

For this, first of all, forget the Tamil alphabet for a moment. Look at the world. What you see are the creatures of God. They are classified as: 1. bodies with a soul, with one to six senses; 2. dead bodies, and land, water, fire, air, and sky; and 3. the soul, which is not actually seen by the naked eye, but realized.

A) Now, forget the world and look at the Tamil alphabet: 1. vowel-consonants (bodies with a soul); 2. consonants (dead bodies); 3. vowels (the soul). Now you feel that text containing the Tamil alphabet reflects the very nature of the world.

B) In a Tamil word, you cannot see any vowel anywhere except in the first place of the word.

C) You cannot see a consonant in the first place of any word. What does all this mean?

1. Souls with bodies exist in the world, and similarly, vowels with consonants exist in the text.

2. If the first letter in a word is a vowel, it can be thought of as follows:

A consonant will take the vowel on its right side, and a vowel-consonant is created. Therefore a sole soul is left without a consonant to combine with. This is similar to a soul, in the world, without any body to take that soul.

D. The same consonant does not occur side by side with itself in a word, but two different consonants can exist side by side. Further, no more than two different consonants can exist anywhere in a text.

E. Only one consonant can join together with one soul.

This is the fundamental nature of Tamil, which is one of the classical languages of the world.

In the above circumstances, the categories of Letter should include vowel-consonant.

So, the categories of Letter shall be as follows:

1) Letter > Vowel-Consonant
2) Letter > Vowel
3) Letter > Vowel > Dependent (i.e. Indic matras)
4) Letter > Consonant > Dependent > Subjoined

Therefore, code points are to be allotted for vowel-consonants also.

Yours Sincerely
A.R.Amaithi Anantham

Date/Time: Wed Jul 20 20:09:18 CDT 2011
Contact: unicode@behdad.org
Name: Behdad Esfahbod
Report Type: Public Review Issue
Opt Subject: UTR#49 categories for Arabic Letters

Currently all Arabic letters are categorized as Letter_Consonant. That is wrong for any non-consonant letter, such as ARABIC LETTER ALEF. Normally the consonant vs. vowel distinction is not made at the letter level in Arabic, so I suggest removing any sub-categorization of Letter for all Arabic characters.

Date/Time: Thu Jul 21 12:54:10 CDT 2011
Contact: Bob_Hallissy@sil.org
Name: Bob Hallissy
Report Type: Public Review Issue
Opt Subject: PRI 200 -- suggested changes for Arabic

In http://www.unicode.org/reports/tr49/Categories.txt I suggest that these:

0618	Mn	[Diacritic]	[Annotation]	[X]	[X]	ARABIC SMALL FATHA
0619	Mn	[Diacritic]	[Annotation]	[X]	[X]	ARABIC SMALL DAMMA
061A	Mn	[Diacritic]	[Annotation]	[X]	[X]	ARABIC SMALL KASRA
0656	Mn	[Diacritic]	[X]	[X]	[X]	ARABIC SUBSCRIPT ALEF
0657	Mn	[Diacritic]	[X]	[X]	[X]	ARABIC INVERTED DAMMA
0659	Mn	[Diacritic]	[X]	[X]	[X]	ARABIC ZWARAKAY
065C	Mn	[Diacritic]	[X]	[X]	[X]	ARABIC VOWEL SIGN DOT BELOW
065D	Mn	[Diacritic]	[X]	[X]	[X]	ARABIC REVERSED DAMMA
065E	Mn	[Diacritic]	[X]	[X]	[X]	ARABIC FATHA WITH TWO DOTS
065F	Mn	[Diacritic]	[X]	[X]	[X]	ARABIC WAVY HAMZA BELOW

should be classified as vowels similar to:

064E	Mn	[Mark]	[Vowel]	[Point]	[X]	ARABIC FATHA
064F	Mn	[Mark]	[Vowel]	[Point]	[X]	ARABIC DAMMA
0650	Mn	[Mark]	[Vowel]	[Point]	[X]	ARABIC KASRA
0652	Mn	[Mark]	[Vowel]	[Point]	[X]	ARABIC SUKUN
0670	Mn	[Mark]	[Vowel]	[Point]	[X]	ARABIC LETTER SUPERSCRIPT ALEF

Conversely, I suggest

0651	Mn	[Mark]	[Vowel]	[Point]	[X]	ARABIC SHADDA

should NOT be classified as a vowel since it (a) modifies the consonant and (b) can coexist with vowels. Rather it should be

0651	Mn	[Diacritic]	[X]	[X]	[X]	ARABIC SHADDA

Evidence for vowel classification:

Re 0656 and 0657,  L2/01-425 says:
> > U+0656 ARABIC SUBSCRIPT ALEF [is] Used to indicate a long /i:/ vowel, or /i/ as contrasted with /e/.
> > U+0657 ARABIC TURNED [sic] DAMMA [is] Used to indicate a long /u:/ vowel, or /u/ contrasted with /o/
Re U+0659 ARABIC ZWARAKAY, L2/03-144R (N2581R2) quotes:
> > And for this vowel, we prescribe the a sign like “ ¯ ”, that is a horizontal 
> > zebar. Zebar or fatha has some curvature [sic], but this is straight. Zebar: 
> > “   ”, zwarakay: “ ¯ ”. Observing this vowel is highly necessary, or otherwise 
> > the meaning will become wrong altogether.
065A, 065B, and 065C are called vowels in their name, and L2/03-168 backs this up in their description:
> > Several new signs have been used in African languages to 
> > represent vowel sounds not present in standard Arabic.
Re 065D and 065E, L2/04-025R (N2723) says:
> > The remaining two proposed characters are combining marks used 
> > to indicate vowels in extended Arabic-based writing systems.
Re 065F, L2/09-215 (N3673) says:
> > This combining mark is used to indicate a common vowel in 
> > Kashmiri, and can appear under many base characters.
The other 3 are less clear -- and I don't claim to be an Arabist -- but L2/06-358R (N3185R) says:
> > are smaller versions of the simple harakat ARABIC FATAH, ARABIC 
> > DAMMA, and ARABIC KASRA. Readers who do not know the finer points 
> > of the grammar of the standard Arabic language make use of these 
> > characters to pronounce some initial ALEFs correctly. For example, 
> > see Figure 2, verses 2 and 4 from Sura 83. Here, verse 2 begins 
> > with an ALEF bearing a SMALL FATHA, while verse 4 starts with an 
> > ALEF bearing a normal FATHA. 

so it sounds to me like these are used as alternates for Fatah, Damma, and Kasra which are classified as vowels.

I realize that any given categorization might be good for some purposes and not for others, so I will state my bias: I'm trying to figure out what combinations of Arabic characters are likely to be meaningful and thus need to be accounted for in font logic. E.g., it is unlikely that we would need to handle multiple vowel marks on a single base.


Bob Hallissy

From: Tamil Virtual Academy [tamilvu@yahoo.com]
Date: Monday, July 25, 2011 3:08 AM
Subject: Draft UTR#49. Unicode Character Catagories - Error Report

Dear Sir, We saw the Unicode Technical Report #49 for the categorization of Unicode characters. It indicates only two categories, namely vowels and consonants. However, the Tamil language has three categories, namely vowels, consonants, and vowel-consonants. The vowel-consonants are very important for the Tamil language; most Tamil words consist of vowel-consonants. Therefore, the Vowel-Consonant category should be included in the Unicode character categories and provided a code point for quick rendering of Tamil words. Thanking you,

Yours faithfully,

Dr. P.R. Nakkeeran,
Tamil Virtual Academy
(Erstwhile Tamil Virtual University)
[Government of Tamil Nadu]
Gandhi Mandapam Road,
Anna University Campus,
Kottur, Chennai – 600 025.
Ph : 044-2230 1017

Date/Time: Tue Jul 26 01:42:55 CDT 2011
Contact: behnam@esfahbod.info
Name: Behnam Esfahbod
Report Type: Public Review Issue
Opt Subject: Comments on TR49, Character Categories

Dear Sir/Madam,

There are a few problems with the way Categories are defined for the characters of Arabic script.

1. Obsolete Arabic Characters

Obsolete Arabic characters (the Arabic Presentation Forms-A and -B blocks) are not intended to be used in new text. Unfortunately, there has never been a good method to detect such obsolete characters.

One problem is that the categories currently set for these characters are very different from the ones set for the desired Arabic characters (the Arabic and Arabic Supplement blocks).

Another problem is that there is no way for an application, such as a character map, to use character properties to avoid showing these obsolete characters in their Arabic block. The ideal solution would be to have another property (orthogonal to the categories) for "obsolete" and other similar statuses of characters, but the current Categories data set for Arabic characters is not useful enough.

2. Kuranic Arabic Characters

There are a few characters in the Arabic block that are intended to be used only in Kuranic text. Most of these characters have no application in normal text in any of the languages that use the Arabic script, including Arabic, Persian, Urdu, Kurdish, Azari, and Pashto. It would be very useful to have different subcategories for these characters.

For example, U+0618 ARABIC SMALL FATHA is a Kuranic "harakat", which has a very similar appearance to U+064E ARABIC FATHA; the only difference is their size, which, relative to the size of Arabic letters, is small anyway. In a character map application, the average user would not notice the difference and would pick the wrong character by accident. With a different subcategory, the application can make sure the user understands the difference.

3. Arabic Letters

The current categories set for Arabic letters are Letter > Consonant, which is not quite correct. First, it is not clear what "Consonant" means here, and why Latin, Greek, and Cyrillic letters do not have any subcategories set, while in Arabic and Syriac all the letters are in the subcategory Consonant. Second, not all Arabic letters are consonants, and in fact there is no clear line separating consonant letters from the rest, as some characters have different sounds in different languages.

4. Arabic Form Shaping

There are two characters defined to control the Arabic contextual joining algorithm in rendering engines (U+206C INHIBIT ARABIC FORM SHAPING and U+206D ACTIVATE ARABIC FORM SHAPING). Although these characters are not widely used, it would be useful to have a separate subcategory for these two characters, such as "Shaping". Their subcategory should not be set to "Joining", which is used for the widely used U+200C ZWNJ and U+200D ZWJ characters.

Best regards,
-Behnam Esfahbod

Date/Time: Tue Jul 26 13:17:07 CDT 2011
Contact: doug@ewellic.org
Name: Doug Ewell
Report Type: Public Review Issue
Opt Subject: Support for Draft UTR #49

I support the proposed Unicode Technical Report #49, "Unicode Character Categories."

Anyone who has ever tried to categorize characters knows that no solution can be perfect, or suit all needs, but UTR #49 is an excellent effort that carefully considers different users' needs and its own limitations.

I'm not sure why the "first and second key" paragraphs in Section 2.4 are being deleted. To me these are critical to understanding the nature of the TR, particularly the caveat that the categories are not meant to be normative or permanent.

Date: 2011/08/02
From: Martin Hosken

Here is my review of UTR#49.

1. Actual use cases are not presented. Why do applications need this information? What uses is the information needed for?

2. It's bad enough that sequences can change the components to have a different general category (e.g. virama/halant/sakot + consonant type sequences), but that gets multiplied more so when it comes to whether that sequence now represents a vowel or a consonant or a tone.

3. While in many cases such information is agreed across all writing systems using a script, there are many cases where there is no such agreement. For example in some scripts, the same diacritics have different meanings/categories in different languages.

Realising that the information can never be sufficiently correct in all situations re-raises question 1. Why is it wanted? What in UTR#49 limits the expectations of the quality of this data? There are reasons why the headings in the charts are informative and editorial, and that is because readers are aware that they are somewhat arbitrary and are merely there to show the internal structure of the charts. Anything more is placing too heavy a semantic burden upon them.

Yours, Martin

201 Draft UTR #45: U-Source Ideographs

No feedback was received via the reporting form this period.

Feedback on Encoding Proposals

Date/Time: Tue May 17 17:57:47 CDT 2011
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Other Question, Problem, or Feedback
Opt Subject: L2/11-102R: Disunify Tarot cards

The problem with disunifying the 56 playing cards from the Tarot minor arcana is that there are *lots* of alternatives for how playing cards should look. Apart from the fact that there are hundreds of very different Tarot Minor Arcana decks, from complex pictures of all sorts to purely iconic ones, there are also a lot of different national variants of the Anglo-French poker/bridge deck.

For example, in southern and eastern Germany, the suits are Hearts, Bells, Leaves, Acorns; in other parts of the country, Anglo-French symbols are used but with German colors (green Spades, yellow Diamonds); and in Italy the suits are Coins, Swords, Cups, Batons (the same as the Tarot deck) -- yet these decks are isomorphic to the Anglo-French deck or subsets of it. Spain and different Italian regions also have their own pip designs and color schemes.

I don't think it's reasonable to suppose that Unicode is going to encode all these various decks separately. Therefore, the wording should be changed to make clear that what you see is *not* necessarily what I will see, that the 14 cards of four suits are symbolic and schematic only. I'd propose something like this:

The most usual appearance of these characters is as the Anglo-French-style playing cards used with international bridge or poker. However, in different countries, both the suits and the colors may be substantially different, and when used to represent the cards of divination Tarot decks, the visual appearance is usually very different and much more complex.

No one should expect reliable interchange of the exact appearance of these characters without additional information (such as a font) or agreement between sender and receiver. Without such information or agreement, the glyphs have only a symbolic and schematic equivalence to particular varieties of actual playing cards.

Date/Time: Tue Jun 7 20:20:46 CDT 2011
Contact: fantasai@inkedblade.net
Name: Elika J. Etemad
Report Type: Error Report
Opt Subject: ScriptExtensions

North Indic fractions and Aegean numbers and measures should be listed in ScriptExtensions.txt. See http://www.unicode.org/mail-arch/unicode-ml/y2011-m06/0016.html for discussion thread.

Date/Time: Thu Jul 7 09:27:07 CDT 2011
Contact: elena@hotmail.ru
Name: Elena
Report Type: Feedback on an Encoding Proposal
Opt Subject: Right Russian quotation mark should be defined.

In the file http://www.unicode.org/charts/PDF/U2000.pdf you define the character 201E as "low double comma quotation mark". Yes, this is used as an opening double quotation mark in the Russian language. But you do not define the corresponding closing double quotation mark. The corresponding closing mark looks like 201C when seen separately, but it is a different character, because it is a _closing_ mark (not an opening one, as in English). When 201C is used as a substitute, the space between symbols looks bad in some fonts. Another problem is that it makes it impossible to automatically substitute quotation marks with another type (like 00AB and 00BB) when 201E is used as both a Russian and an English quotation mark in the same text. I suggest introducing the corresponding closing quotation mark. Thank you.

Date/Time: Tue Jul 12 17:06:19 CDT 2011
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/11-267 Proposal to encode svara markers for the Jaiminiya Archika

It is unclear why the proposed U+1CF8 VEDIC TONE RING ABOVE is not to be unified with the ordinary U+030A COMBINING RING ABOVE. If this is not to be done, some justification should be given for the disunification.

Closed Public Review Issues

No feedback was received via the reporting form this period.

Other Reports - UTS #10 (Sorting)

Date/Time: Fri May 20 08:53:24 CDT 2011
Contact: Sergiusz.Wolicki@oracle.com
Name: Sergiusz Wolicki
Report Type: Error Report
Opt Subject: UTS #10, 3.3.2 Contractions -- error or unclear

Dear Sirs,

In UTS #10 "Unicode Collation Algorithm", section 3.3.2 "Contractions", there is the following paragraph:

"Any character (such as soft hyphen) that is not completely ignorable between two characters of a contraction will cause them to sort as separate characters. Thus a soft hyphen can be used to separate and cause distinct weighting of sequences such as Slovak ch or Danish aa that would normally be weighted as units."

When looking into DUCET, I see:

00AD ; [.0000.0000.0000.0000] # [00AD] SOFT HYPHEN

which means, as I understand, that the soft hyphen is completely ignorable on all levels. This seems to contradict the paragraph above: "...(such as soft hyphen) that is _not_ completely ignorable..."
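The intended behavior can be sketched with a toy key builder (not a conformant UCA implementation; the weights and contraction are invented for illustration): contraction matching operates on the raw character sequence, so an intervening character, even a completely ignorable one, prevents the contraction from matching, while the ignorable itself contributes no weights to the key.

```python
# Toy primary-key sketch for a Slovak-like tailoring where "ch" is a
# contraction sorting as a unit. All weights are invented.
WEIGHTS = {"ch": 0x1100, "c": 0x1010, "h": 0x1020}
IGNORABLE = {"\u00AD"}  # soft hyphen: [.0000.0000.0000.0000]

def primary_key(s):
    key, i = [], 0
    while i < len(s):
        if s[i:i + 2] in WEIGHTS:       # longest-match contraction first
            key.append(WEIGHTS[s[i:i + 2]])
            i += 2
        elif s[i] in IGNORABLE:
            i += 1                       # ignorable: no weight emitted
        else:
            key.append(WEIGHTS[s[i]])
            i += 1
    return key

print(primary_key("ch"))        # [0x1100]: weighted as a unit
print(primary_key("c\u00ADh"))  # [0x1010, 0x1020]: contraction broken
```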

Is there an error in the cited paragraph, or does "completely ignorable between two characters" mean something other than being ignorable on all levels? If the latter is the case, some explanation would be welcome to clarify and remove confusion. For example, is the paragraph meant to say:

"Any character, except non-starters described in the next paragraph, inserted between characters of a contraction, will cause those characters to sort as separate characters. Thus, a character ignorable on all levels, such as soft hyphen, can be used to separate and cause distinct weighting of sequences such as Slovak ch or Danish aa that would normally be weighted as units."

Thanks and best regards,

Sergiusz Wolicki
Oracle Development - Server Globalization Technology

Date/Time: Mon Jun 13 10:02:25 CDT 2011
Contact: makholm@octoshape.com
Name: Henning Makholm
Report Type: Error Report
Opt Subject: UTS #10 section 6.3.1 is misleading

The optimization described in section 6.3.1 ("Contiguous Weight Ranges") of UTS#10 revision 22 is inapplicable to the current DUCET -- and presumably also to systematically created tailorings of it.

The only collation elements in the 6.0.0 DUCET whose secondary weight differs from 0020 are the primary ignorables. So the primary value 0000 has 316 distinct associated secondary values (which are too many to be compressed into a single byte), and all other primary values have only the secondary value 0020.
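The tally above can be reproduced mechanically. A minimal sketch, assuming a handful of invented allkeys.txt-style lines in place of the real DUCET, that counts the distinct secondary weights seen with each primary weight:

```python
import re
from collections import defaultdict

# A few lines in the allkeys.txt syntax (invented sample, not the real DUCET).
SAMPLE = """
00AD  ; [.0000.0000.0000.0000] # SOFT HYPHEN
0300  ; [.0000.0035.0002.0300] # COMBINING GRAVE ACCENT
0301  ; [.0000.0032.0002.0301] # COMBINING ACUTE ACCENT
0061  ; [.15EF.0020.0002.0061] # LATIN SMALL LETTER A
0062  ; [.1605.0020.0002.0062] # LATIN SMALL LETTER B
"""

# Capture the primary and secondary weight of each collation element,
# whether non-variable "[." or variable "[*".
ELEMENT = re.compile(r"\[[.*]([0-9A-F]{4})\.([0-9A-F]{4})")

secondaries = defaultdict(set)
for line in SAMPLE.splitlines():
    data = line.split("#")[0]              # strip the trailing comment
    for primary, secondary in ELEMENT.findall(data):
        secondaries[primary].add(secondary)

for primary in sorted(secondaries):
    print(primary, sorted(secondaries[primary]))
```

Run against the full 6.0.0 allkeys.txt, this kind of tally is what yields the figures cited above: many secondaries under primary 0000, and only 0020 elsewhere.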

It appears that section 6.3.1 is a leftover from revision 6 (and earlier) whose DUCET contained special single collation elements for precomposed characters. The number 18 in the second paragraph of 6.3.1 must refer to the number of different secondary weights in the SECOND collation element for existing precomposed characters that decompose to something starting with "O" or "o". But these secondary weights cannot conformantly be associated with the primary weight 1724 anyway, because then the string <U+006F U+030A> (latin small letter o with ring above) would not sort correctly between U+01D2 (latin small letter o with caron) and U+00F6 (latin small letter o with diaeresis).

I suggest that the section be removed completely from the standard.

Date/Time: Mon Jun 13 10:59:27 CDT 2011
Contact: makholm@octoshape.com
Name: Henning Makholm
Report Type: Error Report
Opt Subject: UTS#10: Fix collation order for U+214D AKTIESELSKAB

DUCET currently maps U+214D AKTIESELSKAB to a variable-weighted collation element. I think it should instead collate like "A/S" <U+0041 U+002F U+0053> with a tertiary difference.

This would provide consistency with other signs with a similar structure, such as U+2100, U+2101, U+2106, U+2107.

One difference is that U+2100 etc. all have compatibility decompositions to their ASCII equivalents (making their collation expansions automatic). It is not clear to me why U+214D doesn't. It seems to have been an explicit decision by the UTC (minutes of 2004-11-18), but the reasoning has apparently not been made public. Comments in JTC1/SC2/WG2 working document N2887 imply that it was felt that it must decompose to something involving individual superscript/subscript characters or not at all, which is strange because the existing decompositions of U+2100 etc. go directly to ASCII.

In any case, it is now too late to give U+214D an official decomposition -- which would break normalization stability -- but the DUCET, being less encumbered by stability guarantees, can and should be amended to fix the omission at least partially.

Other Reports - UTR #25 (Math)

Date/Time: Fri Jun 10 11:04:15 CDT 2011
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Error Report
Opt Subject: UTR #25 - duplicate (or incorrect) entry in MathClass-12.txt


I note that the third line of data (not counting comment lines and empty lines) is probably wrong, and duplicates the entry for U+0021, instead of assigning a Math Class to U+0022:

0020;S
0021;N
0021;N   # Most probably 0022? If so, is it class N (Normal)?
0023;N
[...]

Or maybe this is just a duplicate line (if U+0022, i.e. the ASCII double quote, should be excluded from mathematical notation, due to its various interpretations in ASCII and many legacy fonts, and possible confusion with double primes).

This data file would be easier to check for errors if it consistently noted all unbroken ranges of two or more Unicode characters assigned to the same Math Class, using the ".." notation (already used for Basic Latin letters and basic decimal digits), notably to make it easier to find ranges of reserved/unassigned code points, or ranges of code points assigned to Unicode characters but currently not to any Math Class.

It would also make the file somewhat shorter measured in bytes, and much shorter measured in number of lines. Nothing would change for parsing this file, as the range notation is already used in it.
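The suggested compaction is mechanical; a minimal sketch (the entries below are illustrative, not taken from the real file) that collapses consecutive code points sharing a Math Class into the ".." range notation:

```python
def compact(entries):
    """Collapse sorted (codepoint, math_class) pairs into 'start..end;class'
    lines, using a range only when two or more consecutive code points
    share the same class."""
    out, start, prev, cls = [], None, None, None
    for cp, c in entries:
        if cls == c and cp == prev + 1:
            prev = cp                      # extend the current run
            continue
        if start is not None:
            out.append(fmt(start, prev, cls))
        start, prev, cls = cp, cp, c
    if start is not None:
        out.append(fmt(start, prev, cls))
    return out

def fmt(start, end, cls):
    rng = f"{start:04X}" if start == end else f"{start:04X}..{end:04X}"
    return f"{rng};{cls}"

entries = [(0x30, "N"), (0x31, "N"), (0x32, "N"), (0x3A, "P"), (0x3B, "P"), (0x3D, "R")]
print(compact(entries))   # ['0030..0032;N', '003A..003B;P', '003D;R']
```

Because existing consumers already accept the range notation, output produced this way stays parseable by the same code that reads the current file.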

Date/Time: Fri Jun 10 11:51:24 CDT 2011
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Error Report
Opt Subject: Parsing problems for MathClassEx-12.txt

MathClassEx-12.txt causes various parsing problems, because of the presence in field 3 (labelled "char" in its header line) of an unescaped version of the associated character. When this character is a semicolon (;) or a hash (#), this causes tricky parsing problems.

My opinion is that this "char" field is only informative and should always be put at the end of the line, after a comment hash (#) followed by a regular SPACE (for convenient display when the character is combining) and preferably preceded by a SPACE or TAB (for convenient alignment in most cases). The line can then be parsed by first dropping everything after the first hash (if there is one), then splitting the fields on every semicolon, and finally trimming blanks in the fields.
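The parsing scheme just described is simple enough to state in code; a sketch, using a sample line in the reformatted layout this report goes on to propose:

```python
def parse_line(line):
    """Drop everything after the first '#', split the rest on ';',
    and trim whitespace; return None for blank or comment-only lines."""
    data = line.split("#", 1)[0]
    if not data.strip():
        return None
    return [field.strip() for field in data.split(";")]

line = "0021;N;excl,fact;ISONUM;Factorial spacing;EXCLAMATION MARK # !"
print(parse_line(line))
# ['0021', 'N', 'excl,fact', 'ISONUM', 'Factorial spacing', 'EXCLAMATION MARK']
print(parse_line("# a comment-only line"))   # None
```

With the char moved into the comment, this two-step parse never has to special-case a data character that happens to be ';' or '#'.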

Also, if a character has several entity names that are synonyms (such as U+0021, whose character name is definitely not "FACTORIAL"; that is another usage note), these names should simply be comma-separated, to avoid the clash in the interpretation of the other fields. The same comma separation could be used when the character is part of several entity sets.

For example:

#code point;class;char;entity name(s);entity set(s);note/description;CHARACTER NAME
0021;N;!;excl;ISONUM;Factorial spacing;EXCLAMATION MARK
0024;N;$;dollar;ISONUM;;DOLLAR SIGN
0025;N;%;percnt;ISONUM;;PERCENT SIGN
002A;N;*;ast;ISONUM;[high, not /ast];ASTERISK
002B;V;+;plus;;;PLUS SIGN
002D;N;-;;;(deprecated for math) ;HYPHEN-MINUS
002E;P;.;period;ISONUM;period;FULL STOP
002F;B;/;sol;ISONUM;No extra spacing, stretchy;SOLIDUS
0030..0039;N;0..9;;;;DIGIT 0..9

should better be formatted as:

#code point;class;entity name(s);entity set(s);note/description;CHARACTER NAME # char
0020;S;;;;SPACE #
0021;N;excl,fact;ISONUM;Factorial spacing;EXCLAMATION MARK # !
0024;N;dollar;ISONUM;;DOLLAR SIGN # $
0025;N;percnt;ISONUM;;PERCENT SIGN # %
002A;N;ast;ISONUM;[high, not /ast];ASTERISK # *
002B;V;plus;;;PLUS SIGN # +
002C;P;comma;ISONUM;;COMMA # ,
002D;N;;;(deprecated for math) ;HYPHEN-MINUS # -
002E;P;period;ISONUM;period;FULL STOP # .
002F;B;sol;ISONUM;No extra spacing, stretchy;SOLIDUS # /
0030..0039;N;;;;DIGIT 0..9 # 0..9
003A;P;colon;ISONUM;;COLON # :
003D;R;equals;ISONUM;;EQUALS SIGN # =
0040;N;commat;ISONUM;;COMMERCIAL AT # @

Also, the (optional) note/description could be brought at end as well, after the char field already in the comments. In that case the format would become:

#code point;class;entity name(s);entity set(s);CHARACTER NAME # char;opt.note/description
0020;S;;;SPACE #  ;
0021;N;excl,fact;ISONUM;EXCLAMATION MARK # !;Factorial spacing
0024;N;dollar;ISONUM;DOLLAR SIGN # $
0025;N;percnt;ISONUM;PERCENT SIGN # %
002A;N;ast;ISONUM;ASTERISK # *;[high, not /ast]
002B;V;plus;;PLUS SIGN # +
002C;P;comma;ISONUM;COMMA # ,
002D;N;;;HYPHEN-MINUS # -;(deprecated for math)
002E;P;period;ISONUM;FULL STOP # .;period
002F;B;sol;ISONUM;SOLIDUS # /;No extra spacing, stretchy
0030..0039;N;;;DIGIT 0..9 # 0..9
003A;P;colon;ISONUM;COLON # :
003D;R;equals;ISONUM;EQUALS SIGN # =
0040;N;commat;ISONUM;COMMERCIAL AT # @

The separator between the "char" and optional note/description is not necessarily a semicolon, it could be as well a space, probably more readable:

#code point;class;entity name(s);entity set(s);CHARACTER NAME # char opt.note/description
0020;S;;;SPACE #  
0021;N;excl,fact;ISONUM;EXCLAMATION MARK # ! Factorial spacing
0024;N;dollar;ISONUM;DOLLAR SIGN # $
0025;N;percnt;ISONUM;PERCENT SIGN # %
002A;N;ast;ISONUM;ASTERISK # * [high, not /ast]
002B;V;plus;;PLUS SIGN # +
002C;P;comma;ISONUM;COMMA # ,
002D;N;;;HYPHEN-MINUS # - (deprecated for math)
002E;P;period;ISONUM;FULL STOP # . period
002F;B;sol;ISONUM;SOLIDUS # / No extra spacing, stretchy
0030..0039;N;;;DIGIT 0..9 # 0..9
003A;P;colon;ISONUM;COLON # :
003D;R;equals;ISONUM;EQUALS SIGN # =
0040;N;commat;ISONUM;COMMERCIAL AT # @

Date/Time: Fri Jun 10 12:25:28 CDT 2011
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Error Report
Opt Subject: Parsing problems for MathClassEx-12.txt

Anyway, I see no real difference between the content of MathClassEx-12.txt and MathClass-12.txt, because all the "extra" fields are only informative and may deviate from the standard:

- the "char" field is not needed at all (it is just there for convenience when editing or viewing the file in an editor);
- the Unicode character name is normative in the main UCD file, and is only here for convenience;
- the entity name(s) and entity set(s) should be taken from the much better "SGML.txt" file in the "/Public/MAPPINGS/VENDORS/MISC" directory, which centralizes these SGML entity names and entity sets.

The only important fields are those that assign a specific Math Class to a subset of the UCS repertoire used in MathML (only the first two fields).

In other words, the extended file "MathClassEx-12.txt" should never become normative; it should only serve as the source from which the simpler "MathClass-12.txt" is derived. However, if the extended file is used as the editable source for creating "MathClass-12.txt" through a simple automated tool (which should be able to compact it a lot automatically using range notations, even between lines not using ranges in the extended file), it should really be parsable without any ambiguities, and not be used as the reference file for implementing MathML processors.

Having two files for exactly the same purpose will just be a source of errors. I recommend using only one, and if you want extra informative fields, put them all at the end of the line after the hash (#) separator.

Maybe the SGML mappings file should also become part of the UNIDATA directory itself, instead of remaining under the informative/unsupported MAPPINGS directory, because the SGML entity names (maybe not the entity sets, which are historic and probably deprecated, except where an entity set refers to the name of a sub-DTD in which these entities are defined by a reference standard) are actually part of modern standards that ARE supported by Unicode liaison members (notably in implementations of HTML and MathML).

In fact, I would really suggest a new separate file in the UCD only for the entity names supported in standards based on or derived from SGML (with entity sets listed only in an informative comment field)...

And possibly another separate file for the character names used and recommended in TrueType/OpenType fonts, for compatibility with Type 1 PostScript fonts and processors such as PDF readers. Adobe documents such a list for all character/glyph names, plus some synonyms found in legacy fonts; all other glyphs are now assigned names algorithmically, like "uniXXXX" or "uXXXXXX", with optional but recommended dot-extensions, and with underscores separating the character-name parts when a glyph is generated by a combination of characters (search for "Adobe Glyph Names").

Other Reports

Date/Time: Wed Jun 29 16:22:31 CDT 2011
Contact: lorna_priest@sil.org
Name: Lorna Priest
Report Type: Error Report
Opt Subject: Chapter 10: Lepcha ra and ya

There seems to be an error in two places in the Lepcha section of Chapter 10. The -ra and -ya codepoints are getting mixed up.

Under Medials they are correct (page 325).

Under Retroflex Consonants the LEPCHA SUBJOINED LETTER RA is incorrectly given the codepoint of U+1C24 in two places. It should be U+1C25.

On page 326, Table 10-2 lists medial -ra as U+1C24 and medial -ya as U+1C25. The code points (and glyphs) are wrong: medial -ya is U+1C24 and medial -ra is U+1C25. I think the Class order in the chart is correct according to the original Unicode proposal; it is the Example and Encoding columns that are incorrect.

Then, a question. In the Unicode proposal L2/05-158R it gives the correct order as
and Table 10-2 would indicate the order to be:

I'm wondering if this was deliberately changed or if this is also an error.


Date/Time: Mon Jul 4 18:18:10 CDT 2011
Contact: ran.arigur@gmail.com
Name: Ran Ari-Gur
Report Type: Error Report
Opt Subject: Small inconsistency: noBreak is a valid LCTAG.

The NamesList.html file has an LCTAG production, used for compatibility formatting tags. That production is given as "sequence of lowercase ASCII letters". However, NamesList.txt uses the compatibility formatting tag <noBreak> (with a capital B) several times, and that accords with the Standard (Section 17.1, page 552). I don't think <noBreak> is a problem, so I think NamesList.html should instead give LCTAG as something like "sequence of ASCII letters, starting with a lowercase letter".
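The suggested change to the production amounts to a small pattern change; a sketch, where both regexes are one possible reading of the two prose descriptions, not text taken from NamesList.html:

```python
import re

# "sequence of lowercase ASCII letters" (current wording)
LCTAG_CURRENT = re.compile(r"[a-z]+\Z")
# "sequence of ASCII letters, starting with a lowercase letter" (suggested)
LCTAG_SUGGESTED = re.compile(r"[a-z][A-Za-z]*\Z")

for tag in ("font", "noBreak"):
    print(tag,
          bool(LCTAG_CURRENT.match(tag)),     # noBreak fails the current rule
          bool(LCTAG_SUGGESTED.match(tag)))   # both pass the suggested rule
```

Under the current production <noBreak> is not a valid LCTAG even though NamesList.txt uses it; the loosened production accepts it while still rejecting tags that start with an uppercase letter.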