L2/07-163

Comments on Public Review Issues
(January 31, 2007 - May 11, 2007)

The sections below contain comments received on the open Public Review Issues as of May 11, 2007, since the previous cumulative document was issued prior to UTC #110 (January 2007).

96 Allowing Joiner Characters in Identifiers
97 Proposed Draft UTR #38, A User's Guide to the Unihan Database
101 Proposal to Encode an External Link Sign
102 Proposed Update to UAX #15: Unicode Normalization Forms
103 Proposed Update to UAX #29: Text Boundaries
104 Proposed Update to UAX #31: Identifier and Pattern Syntax
105 Proposed Update to UAX #14: Line Breaking Properties
106 Proposed Update to UAX #11: East Asian Width
Other Reports
Closed Public Review Issues

96 Allowing Joiner Characters in Identifiers

Date/Time: Tue May 1 02:59:00 CST 2007
Contact: andrewcwest@gmail.com
Name: Andrew West

> C. Mongolian Separators (NNBSP or MVSs)

The use of the initialism "MVS" is ambiguous. I assume that it is meant to stand for "Mongolian Variation Selector" here. However, "MVS" is normally used as an abbreviation for U+180E MONGOLIAN VOWEL SEPARATOR, whereas U+180B..U+180D are abbreviated as "FVS1", "FVS2" and "FVS3", where "FVS" stands for "Free Variation Selector" (see the code chart annotations).

> $MS = [\u202F \u180B \u180C \u180D]

I wonder if U+180E MONGOLIAN VOWEL SEPARATOR should also be considered to be a format character, as it has an effect on the shaping of adjacent characters similar to that of NNBSP. MVS [180E] is certainly required for use in identifiers, but it is not clear from PRI #96 whether it is allowed or excluded.
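
As a minimal sketch (not part of PRI #96), the membership question can be made concrete in Python. The names MS and in_ms are hypothetical; the set simply transcribes the $MS definition quoted above and shows that U+180E falls outside it as written:

```python
import unicodedata

# Transcription of the quoted definition: $MS = [\u202F \u180B \u180C \u180D]
MS = {0x202F, 0x180B, 0x180C, 0x180D}


def in_ms(code_point: int) -> bool:
    """Return True if the code point belongs to the quoted $MS set."""
    return code_point in MS


# List the four members plus U+180E, the character the comment asks about.
for cp in (0x202F, 0x180B, 0x180C, 0x180D, 0x180E):
    name = unicodedata.name(chr(cp))
    print(f"U+{cp:04X} {name}: in $MS = {in_ms(cp)}")
```

Whether U+180E should be added to the set is exactly the open question; the sketch only shows that the definition as quoted does not cover it.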

97 Proposed Draft UTR #38, A User's Guide to the Unihan Database

Contact: mpsuzuki@hiroshima-u.ac.jp
Name: suzuki toshiya

Dear Sirs,

I have 2 questions on Unihan.txt

(1) kIRG_GSource has unknown codepoint syntax.

In the definition of Syntax for kIRG_GSource: (4K|BK|CH|CY|FZ(_BK)?|HC|HZ|KX|[0135789ES]-[0-9A-F]{4}),
0-xxxx will mean G0 GB2312-80
1-xxxx will mean G1 GB12345-90 with 58 Hong Kong and 92 Korean "Idu" characters
3-xxxx will mean G3 GB7589-87 unsimplified forms
5-xxxx will mean G5 GB7590-87 unsimplified forms
7-xxxx will mean G7 General Purpose Hanzi List for Modern Chinese Language, and General List of Simplified Hanzi
S-xxxx will mean GS Singapore characters
8-xxxx will mean G8 GB8685-88
E-xxxx will mean GE GB16500-95,
but 9-xxxx is not explained. An entry may have been dropped from the list of GB standards.
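
The gap can be reproduced mechanically. The following is a minimal sketch, assuming the syntax and the prefix list exactly as quoted above; the names G_SOURCE, DOCUMENTED_PREFIXES and undocumented_prefixes are hypothetical, and "0000" is only a placeholder code value:

```python
import re
import string

# Syntax quoted above for kIRG_GSource.
G_SOURCE = re.compile(r"(4K|BK|CH|CY|FZ(_BK)?|HC|HZ|KX|[0135789ES]-[0-9A-F]{4})")

# Prefixes that the list above explains: 0, 1, 3, 5, 7, S, 8 and E.
DOCUMENTED_PREFIXES = set("01357S8E")


def undocumented_prefixes() -> set:
    """Return prefixes the syntax accepts but the list does not explain."""
    candidates = string.digits + string.ascii_uppercase
    accepted = {p for p in candidates if G_SOURCE.fullmatch(f"{p}-0000")}
    return accepted - DOCUMENTED_PREFIXES


print(undocumented_prefixes())
```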

(2) kIRG_JSource looks as if it has duplicated tags.

kIRG_JSource codepoint syntax is defined as: Syntax: ([0134A]|3A)-[0-9A-F]{4}

But in the list of JIS standards, JIS X 0213:2000 is listed twice, so it is difficult to understand the difference between 3-xxxx and 4-xxxx. Was J4 intended to be JIS X 0213:2004?
    * J0 JIS X 0208:1990
    * J1 JIS X 0212:1990
    * J3 JIS X 0213:2000
    * J4 JIS X 0213:2000
    * JA Unified Japanese IT Vendors Contemporary Ideographs, 1993
    * J3A JIS X 0213:2004 level-3 
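
The duplication can be checked mechanically. This is a minimal sketch, assuming the tag list exactly as quoted above; the names J_SOURCE, J_STANDARDS and duplicated_standards are hypothetical:

```python
import re
from collections import defaultdict

# Syntax quoted above for kIRG_JSource.
J_SOURCE = re.compile(r"([0134A]|3A)-[0-9A-F]{4}")

# Tag-to-standard list as quoted above.
J_STANDARDS = {
    "J0": "JIS X 0208:1990",
    "J1": "JIS X 0212:1990",
    "J3": "JIS X 0213:2000",
    "J4": "JIS X 0213:2000",
    "JA": "Unified Japanese IT Vendors Contemporary Ideographs, 1993",
    "J3A": "JIS X 0213:2004 level-3",
}


def duplicated_standards(standards: dict) -> dict:
    """Group tags by standard name and keep names listed more than once."""
    by_name = defaultdict(list)
    for tag, name in standards.items():
        by_name[name].append(tag)
    return {name: tags for name, tags in by_name.items() if len(tags) > 1}


print(duplicated_standards(J_STANDARDS))
```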
Regards, mpsuzuki

101 Proposal to Encode an External Link Sign

No feedback was received via the reporting form this period.

102 Proposed Update to UAX #15: Unicode Normalization Forms

Date/Time: Thu Mar 22 06:25:39 CST 2007
Contact: Per.Mildner@sics.se
Name: Per Mildner

Feedback on http://www.unicode.org/reports/tr15/tr15-28.html

"Figure 7 shows a sample of the how the composition process works. The dark green cubes represent starters, and the light gray cubes "

1. No cubes are "dark green"; probably "dark grey" was intended. No cubes look "light gray"; they look white.

2. The arrows are not explained.

3. The figure is incomprehensible. Perhaps example characters should be added to the boxes, or some such.

"This table is constructed on the premise that the text is being normalized (thus the processing is in the middle of R1.2 or R2.2), and that thus the first character has thus been composed if possible" perhaps the last "thus" should be removed?

"If there are no such characters, then it is possible for it to be to be added or omitted from the composition exclusion table" should remove one "to be"

"21.1 Buffering with Unicode Normalization" The "Decomposition" starting with "u" shows buffer positions 0, 1, and 32. Should this not be 0, 1 and 2?

Name: Ienup Sung
Date: 2007-04-06 23:19:59 -0700
Subject: Compatibility decomposition at UAX #15

Revision 27 of UAX #15 and the proposed update (revision 28) both give the following descriptions of canonical decomposition and compatibility decomposition in Section 10:

Canonical decomposition is the process of taking a string, recursively replacing composite characters using the Unicode canonical decomposition mappings (including the algorithmic Hangul canonical decomposition mappings; see Section 16, Hangul), and putting the result in canonical order.

Compatibility decomposition is the process of taking a string, replacing composite characters using both the Unicode canonical decomposition mappings and the Unicode compatibility decomposition mappings, and putting the result in canonical order.

I think the compatibility decomposition definition above is a possible source of confusion for less careful readers, since it differs from D65 (or D20 in older versions): D65 Compatibility decomposition: The decomposition of a character that results from recursively applying both the compatibility mappings and the canonical mappings found in the Unicode Character Database, and those described in Section 3.12, Conjoining Jamo Behavior, until no characters can be further decomposed, and then reordering nonspacing marks according to Section 3.11, Canonical Ordering Behavior.

I'd like to propose changing the UAX #15 text for compatibility decomposition to something like:

Compatibility decomposition is the process of taking a string, recursively replacing composite characters using both the Unicode canonical decomposition mappings (including the algorithmic Hangul canonical decomposition mappings) and the Unicode compatibility decomposition mappings, and putting the result in canonical order.
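
The recursive behaviour can be observed with Python's standard unicodedata module. This is a minimal sketch for illustration, not part of the proposed wording; the helper name code_points is hypothetical. U+1E9B has a canonical mapping to U+017F U+0307, and U+017F in turn has a compatibility mapping to U+0073, so the compatibility decomposition has to apply both kinds of mapping repeatedly:

```python
import unicodedata


def code_points(text: str) -> str:
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# U+1E9B LATIN SMALL LETTER LONG S WITH DOT ABOVE
sample = "\u1e9b"

nfd = unicodedata.normalize("NFD", sample)    # canonical mappings only
nfkd = unicodedata.normalize("NFKD", sample)  # canonical and compatibility mappings

print(code_points(nfd))   # U+017F U+0307
print(code_points(nfkd))  # U+0073 U+0307

# The algorithmic Hangul mappings apply under compatibility decomposition too.
print(code_points(unicodedata.normalize("NFKD", "\uac00")))  # U+1100 U+1161
```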

Ienup

103 Proposed Update to UAX #29: Text Boundaries

No feedback was received via the reporting form this period.

104 Proposed Update to UAX #31: Identifier and Pattern Syntax

Date/Time: Thu Mar 22 07:46:26 CST 2007
Contact: Per.Mildner@sics.se
Name: Per Mildner

http://www.unicode.org/reports/tr31/tr31-8.html

"If an implementation needs to ensure full for canonical equivalence of identifiers ..." full for what?

105 Proposed Update to UAX #14: Line Breaking Properties

Date/Time: Thu Mar 29 11:54:09 CST 2007
Contact: mattias.ellert@tsl.uu.se
Name: Mattias Ellert

U+02DF (˟) should have the same line breaking category as U+02C8 (ˈ), i.e. BB, since they have the same basic usage, i.e. to mark the stress of the following syllable. See for example the scanned page of a Swedish-English dictionary found here:

http://www3.tsl.uu.se/~ellert/10646/prisma.jpg

Here U+02C8 and U+02DF are used to mark the two different kinds of stress in the Swedish language.

106 Proposed Update to UAX #11: East Asian Width

No feedback was received via the reporting form this period.

Other Reports

Date/Time: Thu Mar 22 05:58:04 CST 2007
Contact: Per.Mildner@sics.se
Name: Per Mildner
Subject: Corrigenda does not say what version of TUS it applies to

The page http://www.unicode.org/errata/index.html lists corrigenda (#1 to #5), but only a few of these mention which versions of TUS they apply to. For instance, when reading TUS 5.0, do I need to care about Corrigendum #5, or has the problem already been fixed in TUS 5? There is no easy way to tell.

Date/Time: Tue Jan 30 12:58:18 CST 2007
Contact: Antoine10646@Leca-Marti.org
Name: Antoine Leca
Report Type: Public Review Issue #100
Opt Subject: Giving U+00B7 MIDDLE DOT the ID_Continue Property

Dear Sirs:

Please do so.

The change is as simple as giving U+00B7 the property Other_ID_Continue. Changing the "Po" category might also be a possible and effective solution, however it is probably much less easy to enact; still, I consider it should be done if at all possible.
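
For illustration only: Python 3 defines its identifiers in terms of the Unicode identifier properties, so its built-in str.isidentifier method can show the kind of behaviour requested here. The sketch assumes a current Python 3 release, whose Unicode data postdates this comment; the sample words are illustrative and not taken from the PRI:

```python
# Catalan "col·legi" written with U+00B7 MIDDLE DOT between the two l's.
word = "col\u00b7legi"

# Middle dot inside the identifier (continue position).
print(word.isidentifier())

# Middle dot at the start of the identifier (start position).
print("\u00b7legi".isidentifier())

# The same word written with U+002E FULL STOP instead.
print("col.legi".isidentifier())
```

Only the first form is accepted: the middle dot may continue an identifier but not start one, and U+002E is not usable in either position.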

This change has already been made in the ISO standard for one of the main programming languages, namely ISO/CEI 1989:2003 (COBOL).

Despite the "legacy" qualification used in UAX31, this character is the one used to encode the orthographical feature invented by Pompeu Fabra at the beginning of the 20th century. U+002E "." is used concurrently for the same purpose, but it is clear to anybody that the latter cannot be used in identifiers; furthermore, using U+002E is seen as inferior and can readily be qualified as "legacy". Also regarding "legacy": the alternative, i.e. using U+013F Ŀ and U+0140 ŀ, was dropped in Spain around 1980, when usage and typewriter keyboard layouts evolved from occasionally having a Ŀ key (usually in the lower right corner) to the present situation (acknowledged in the names list of the Unicode Standard by the compatibility mappings of 013F and 0140), where essentially all keyboards in Spain (the layout is different from the one used in Latin America) show the "punt mig" or "punt volat" (flying dot) in the shift position of the 3 key.

Here it is important to remember that Spain has more than 40 million inhabitants, with a high standard of living, which means computers are very common; and Catalan is the official language of about one quarter of Spain (Catalonia, Valencia and Balears), where it is taught and used in business and legal affairs.

When I asked ten years ago (while preparing C99) about the reason for this exception, the only reason I was given was that for Americans · means the multiplication operator, and that this character should therefore be avoided. (To me this seemed about the same as prohibiting Ø because it could be interpreted as the empty set; however, I am a young European engineer without a university diploma, so I could not argue at the time.) I now read that UAX31 addresses this issue and firmly recommends U+2219 or U+22C5 for this mathematical use, so this should no longer be an issue.

The handling of this character has historically been chaotic (for example, when Catalonia enabled the use of Catalan for the civil register, a bug prevented the registration of names with ·, despite its frequent use in first names). Stories of this kind, along with those involving Ñ, which are also quite frequent here in Spain, are often used as grounds to bash the poorly internationalized applications sold in foreign countries. It should not be the same for Unicode, at least if it is possible to avoid it; and here it seems pretty clear that the easiest and cleanest move is to make U+00B7 a possible ID_Continue character.

Best regards,

-- Antoine Leca
Corbera (Valencia, Spain)

Date/Time: Thu May 3 02:58:22 CST 2007
Contact: vunzndi@vfemail.net
Name: John Knightley
Subject: PRI 98: Combined registration of the Adobe-Japan1 collection and of sequences in that collection

Dear Unicode,

Regarding this collection, a large number of the characters included would, under the normal understanding of Annex S, be given a different code point from that suggested in pri98-partialcharts.pdf.

Including pri98 under Amendment 4, which n3256.pdf suggests might be the case, would be very premature and IMHO might well cause long-term problems.

Below are some examples of characters that would not normally be unified under the principles described in Annex S.

page 4

U+56C0 VS19-20096 would not be unified with the other two glyphs shown

U+56C3 VS17-4453 would not be unified with VS18-20097

page 6

U+5EA7 the three glyphs shown would not normally be unified

page 13

U+75D9 VS17-5746 and VS18-20176: the non-identical components here are ones which Annex S S1.4.3 lists as two components that should not be unified.

Rather than list more examples it would seem best to wait for a reply. The shortness of the list here does not mean that there are no other cases, but rather that a reply is required to know best how to resolve this issue.

To repeat myself a little, VS18-20176 is a character which according to Annex S S1.4.3 would not be unified with U+75D9; all that is required is evidence of its existence as a character, and a proposal to have it encoded would in the due process of time lead to it being assigned a code point other than U+75D9. There are other unencoded characters in pri98; IMHO the correct procedure for these is to encode them, and not encoding them means that the proposed Adobe-Japan1 IVD collection will not correctly support in Unicode plain text the distinctions which are made by the Adobe-Japan1 character collection.

Yours sincerely
John Knightley
