L2/06-274

Comments on Public Review Issues
(May 13, 2006 - August 3, 2006)

The sections below contain comments received on the open Public Review Issues as of August 3, 2006, since the previous cumulative document was issued prior to UTC #107 (May 2006).

Contents:

75 Proposed Update UTR #25, Unicode Support for Mathematics
83 Changing Glyph for U+047C/U+047D Cyrillic Omega with Titlo
90 Unicode 5.0 Beta 2
91 Proposed Update to UAX #9: The Bidirectional Algorithm
93 Representation of Malayalam /au/ Vowel in Traditional and Reformed Orthography

75 Proposed Update UTR #25, Unicode Support for Mathematics

No feedback was received via the reporting form this period.

83 Changing Glyph for U+047C/U+047D Cyrillic Omega with Titlo

Date/Time: Tue Jul 25 07:48:25 CDT 2006
Contact: [email protected]
Name:
Report Type: Public Review Issue
Opt Subject: 83 Changing Glyph for U+047C/U+047D Cyrillic Omega with Titlo

Strongly against.

Reason: the current name is "omega with titlo", and the glyph is an omega with titlo (a bit ugly, but still recognizable). The proposed new glyph is not an omega with titlo. It is what is known as "omega s velikim apostrofom" (omega with great apostrophe): a quite different object. Both omega with titlo and omega with great apostrophe are real glyphs, used in the same books for different purposes: omega with titlo is the numerical sign for 800 (rarely used, because that role is mostly played by the letter "ot"), while omega with great apostrophe is the letter used in the exclamations "o!" and "ole!".

90 Unicode 5.0 Beta 2

From: Nick Nicholas
Date: 2006-06-16 19:48:35 -0700
Subject: Re: Draft 3 of beta code charts (final review)

I recommend annotating 1DC0 and 1DC1 as "Ancient editorial sign", to prevent confusion with dialytika: as the name indicates, these are not diaereses, but in fact scribal deletions of erroneous acute and grave.

Date/Time: Sun Jun 18 18:33:47 CDT 2006
Contact: [email protected]
Name: Karl Pentzlin
Report Type: Error Report
Opt Subject: Glyph of U+0347 in 5.0 Draft 3 beta code chart

Sirs, I think the glyph of U+0347 COMBINING EQUALS SIGN BELOW in the Unicode 5.0 Draft 3 beta chart at http://www.unicode.org/Public/5.0.0/charts/CodeCharts-5.0.0d3.pdf is somewhat misleading. The two dashes are significantly wider than the dotted circle and are almost as wide as the dashes in U+0333. Looking at the "Handbook of the International Phonetic Association", Cambridge 1999 (Reprint of 2003), p.191, I find the glyph shown there significantly narrower than the p or b bowl with which the diacritic is shown, and even somewhat narrower than the v with which it is shown. Shown together with an m, it is less than half as wide as the m. It is about as wide as the diacritic U+033C shown there. Therefore, I assume that the dashes in the reference glyph of U+0347 in the Unicode charts should be considerably narrower, about as wide as the interior of the dotted circle or the glyph part below the dotted circle of U+033C.

- Karl Pentzlin

Date/Time: Sat Jul 8 14:35:46 CDT 2006
Contact: [email protected]
Name: Kent Karlsson
Report Type: Error Report
Opt Subject: Uppercase of dotless j is J

UnicodeData.txt:

change: 0237;LATIN SMALL LETTER DOTLESS J;Ll;0;L;;;;;N;;;;;

to: 0237;LATIN SMALL LETTER DOTLESS J;Ll;0;L;;;;;N;;;004A;;004A

since the uppercase of dotless j is ordinary capital J (compare dotless i). See http://www.evertype.com/standards/iso10646/pdf/dotless-j.pdf.
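
As an illustration only (not part of the submitted report), the following Python sketch shows how the proposed change would surface to a program that reads UnicodeData.txt. It assumes the usual 15-field, semicolon-separated layout in which fields 12, 13, and 14 hold the simple uppercase, lowercase, and titlecase mappings; the helper name `simple_case_mappings` is hypothetical.

```python
def simple_case_mappings(line):
    """Return the simple case mappings from one UnicodeData.txt line.

    Assumes the 15-field layout: field 0 is the code point, and fields
    12, 13, 14 are the simple uppercase, lowercase, titlecase mappings.
    An empty field means "no mapping" and is reported as None.
    """
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")

    def to_code_point(field):
        return int(field, 16) if field else None

    return {
        "code": int(fields[0], 16),
        "upper": to_code_point(fields[12]),
        "lower": to_code_point(fields[13]),
        "title": to_code_point(fields[14]),
    }


# The two lines quoted in the report above.
current = "0237;LATIN SMALL LETTER DOTLESS J;Ll;0;L;;;;;N;;;;;"
proposed = "0237;LATIN SMALL LETTER DOTLESS J;Ll;0;L;;;;;N;;;004A;;004A"

for label, line in (("current", current), ("proposed", proposed)):
    print(label, simple_case_mappings(line))
```

With the current line all three mappings come back empty; with the proposed line the uppercase and titlecase fields both point at U+004A.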

Date/Time: Sat Jul 8 17:04:24 CDT 2006
Contact: [email protected]
Name: Christopher Yeleighton
Report Type: Error Report
Opt Subject: Intervals in Id_Start are not separated

# DerivedCoreProperties-4.1.0.txt
# Date: 2005-03-10, 02:04:29 GMT [MD]
Lines 2656-2658:
212A..212D    ; ID_Start # L&   [4] KELVIN SIGN..BLACK-LETTER CAPITAL C
212E          ; ID_Start # So       ESTIMATED SYMBOL
212F..2131    ; ID_Start # L&   [3] SCRIPT SMALL E..SCRIPT CAPITAL F

Problem: These intervals are not separated. I convert them to open-close intervals in order to use the standard algorithm std::upper_bound on them and I get the sequence
(020463 ,020465]
(020465 ,020471]
(020471 ,020472]
My strategy was to test whether the upper bound position is odd, but this works only when the intervals are separated. When they are not, as is the case here, the upper bound position for the value at the joint is arbitrarily selected from the two matching values and my algorithm fails miserably. I have to glue the contiguous intervals together, making (020463 ,020472] of the three. I thought it would not do any harm to introduce this simple correction into the database by replacing the three lines quoted above with just one line. This was only an example; there are several such joints throughout the properties ID_Start and ID_Continue.
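
The gluing step described above can be sketched as follows. This is an illustrative Python sketch, not the commenter's C++ code: `bisect.bisect_right` plays the role of `std::upper_bound`, the ranges are stored as half-open [start, end) boundaries (a different convention from the open-close intervals quoted above), and all function names are hypothetical.

```python
import bisect


def parse_ranges(lines, prop):
    """Collect inclusive (first, last) code point ranges for one property."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        span, name = (field.strip() for field in data.split(";"))
        if name != prop:
            continue
        first, _, last = span.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo  # single code point line
        ranges.append((lo, hi))
    return sorted(ranges)


def merge_adjacent(ranges):
    """Glue together sorted ranges that touch or overlap."""
    merged = []
    for lo, hi in ranges:
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def flatten(ranges):
    """Turn inclusive ranges into a flat list of [start, end) boundaries."""
    flat = []
    for lo, hi in ranges:
        flat.extend((lo, hi + 1))
    return flat


def contains(flat, code_point):
    """A code point is inside a range when an odd number of boundaries
    are less than or equal to it."""
    return bisect.bisect_right(flat, code_point) % 2 == 1


sample = [
    "212A..212D    ; ID_Start # L&   [4] KELVIN SIGN..BLACK-LETTER CAPITAL C",
    "212E          ; ID_Start # So       ESTIMATED SYMBOL",
    "212F..2131    ; ID_Start # L&   [3] SCRIPT SMALL E..SCRIPT CAPITAL F",
]
merged = merge_adjacent(parse_ranges(sample, "ID_Start"))
table = flatten(merged)
print([(hex(lo), hex(hi)) for lo, hi in merged])
```

Merging at load time keeps the lookup table free of duplicate boundary values without requiring any change to the data file itself.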

BTW: Why can a number letter form an identifier? Are you sure this is correct?

Date/Time: Sun Jul 9 17:51:30 CDT 2006
Contact: [email protected]
Name: Kent Karlsson
Report Type: Error Report
Opt Subject: SpecialCasing.txt

Some changes are needed in SpecialCasing.txt in order to 1) preserve canonical equivalence across case mapping in the tr, az, and lt locales, 2) provide a better way of preserving canonical equivalence in other cases, and 3) correct the case mapping of small i with circumflex (which is used in Turkish).

new conditions: After_J (similar to After_I), Not_More_Above (negation of More_Above).

retired condition: Not_Before_Dot (the more general Not_More_Above is used instead).

SpecialCasing.txt:
	delete: 0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
	new: 0307; ; 0307; 0307; After_I Not_More_Above; # COMBINING DOT ABOVE
	reason: better way to preserve canonical equivalence over case mapping, does not introduce i with dot above needlessly

	new: 0307; ; 0307; 0307; After_J Not_More_Above; # COMBINING DOT ABOVE
	reason: preserves canonical equivalence over case mapping, yet does not introduce j with dot above needlessly

      tr/az
	replace: 0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I
	replace: 0049; 0131; 0049; 0049; az Not_Before_Dot; # LATIN CAPITAL LETTER I
	replacement: 0049; 0131; 0049; 0049; tr Not_More_Above; # LATIN CAPITAL LETTER I
	replacement: 0049; 0131; 0049; 0049; az Not_More_Above; # LATIN CAPITAL LETTER I
	reason: generalisation needed for small i with circumflex (used in Turkish) and other (possible) cases

	replace: 0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
	replace: 0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE
	replace: 0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
	replace: 0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE
	replacement: 0307; ; 0307; 0307; tr After_I Not_More_Above; # COMBINING DOT ABOVE
	replacement: 0307; ; 0307; 0307; az After_I Not_More_Above; # COMBINING DOT ABOVE
	replacement: 0130; 0069; 0130; 0130; tr Not_More_Above; # LATIN CAPITAL LETTER I WITH DOT ABOVE
	replacement: 0130; 0069; 0130; 0130; az Not_More_Above; # LATIN CAPITAL LETTER I WITH DOT ABOVE
	reason: to match the above replacement (note that the old version also missed a Not_Before_Dot condition)

	replace: 0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
	replace: 0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
	replacement: 0069; 0069; 0130; 0130; tr Not_More_Above; # LATIN SMALL LETTER I
	replacement: 0069; 0069; 0130; 0130; az Not_More_Above; # LATIN SMALL LETTER I
	reason: the condition Not_More_Above is needed for 1) small i with circumflex (used in Turkish), and
		2) for preserving canonical equivalence for various "i with above mark" characters also for the tr and az locales
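
The effect of the proposed Not_More_Above condition on small i can be sketched in Python as follows. This is only an illustration under stated assumptions: Not_More_Above is the condition proposed in this report (it is not a published SpecialCasing.txt condition), More_Above is taken to mean "followed by a combining class 230 mark with no intervening character of combining class 0 or 230", only the small i rule is modelled, and the function names are hypothetical.

```python
import unicodedata


def more_above(text, index):
    """True if the character at index is followed by a combining mark of
    class 230 (Above), with no intervening character of class 0."""
    for ch in text[index + 1:]:
        ccc = unicodedata.combining(ch)
        if ccc == 230:
            return True
        if ccc == 0:
            return False
    return False


def tr_upper(text):
    """Uppercase text, mapping i to U+0130 only when no accent above
    follows (the proposed 'tr Not_More_Above' rule); every other
    character uses Python's default uppercase mapping."""
    out = []
    for index, ch in enumerate(text):
        if ch == "i" and not more_above(text, index):
            out.append("\u0130")  # LATIN CAPITAL LETTER I WITH DOT ABOVE
        else:
            out.append(ch.upper())
    return "".join(out)


precomposed = "\u00EE"  # LATIN SMALL LETTER I WITH CIRCUMFLEX
decomposed = unicodedata.normalize("NFD", precomposed)  # i + U+0302

upper_precomposed = tr_upper(precomposed)
upper_decomposed = tr_upper(decomposed)

# Canonical equivalence is preserved if both results normalize alike.
print(unicodedata.normalize("NFC", upper_precomposed)
      == unicodedata.normalize("NFC", upper_decomposed))
```

Without the condition, the decomposed spelling would uppercase to U+0130 followed by the circumflex, which is not canonically equivalent to the uppercase of the precomposed spelling.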

     lt
	new: 1E2C; 1E2D 0307; 1E2C; 1E2C; lt More_Above; # LATIN CAPITAL LETTER I WITH TILDE BELOW
	new: 1ECA; 1ECB 0307; 1ECA; 1ECA; lt More_Above; # LATIN CAPITAL LETTER I WITH DOT BELOW
	new: 00CE; 0049 0307 0302; 00CE; 00CE; lt; # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
	new: 00CF; 0049 0307 0308; 00CF; 00CF; lt; # LATIN CAPITAL LETTER I WITH DIAERESIS
	new: 0130; 0049 0307 0307; 0130; 0130; lt; # LATIN CAPITAL LETTER I WITH DOT ABOVE
	new: 0128; 0049 0307 0303; 0128; 0128; lt; # LATIN CAPITAL LETTER I WITH TILDE
	new: 012A; 0049 0307 0304; 012A; 012A; lt; # LATIN CAPITAL LETTER I WITH MACRON
	new: 012C; 0049 0307 0306; 012C; 012C; lt; # LATIN CAPITAL LETTER I WITH BREVE
	new: 01CF; 0049 0307 030C; 01CF; 01CF; lt; # LATIN CAPITAL LETTER I WITH CARON
	new: 0208; 0049 0307 030F; 0208; 0208; lt; # LATIN CAPITAL LETTER I WITH DOUBLE GRAVE
	new: 020A; 0049 0307 0311; 020A; 020A; lt; # LATIN CAPITAL LETTER I WITH INVERTED BREVE
	new: 1E2E; 0049 0307 0308 0301; 1E2E; 1E2E; lt; # LATIN CAPITAL LETTER I WITH DIAERESIS AND ACUTE
	new: 1EC8; 0049 0307 0309; 1EC8; 1EC8; lt; # LATIN CAPITAL LETTER I WITH HOOK ABOVE
	new: 0134; 006A 0307 0302; 0134; 0134; lt; # LATIN CAPITAL LETTER J WITH CIRCUMFLEX
	reason: needed for preserving canonical equivalence across case mapping also in the lt locale

Date/Time: Sun Jul 16 04:23:00 CDT 2006
Contact: [email protected]
Name: Kent Karlsson
Report Type: Error Report
Opt Subject: case mapping (again)

To finish off my little series "dotting the i's", here are two consistency suggestions:

1) in UnicodeData.txt:

025F;LATIN SMALL LETTER DOTLESS J WITH STROKE;Ll;0;L;;;;;N;LATIN SMALL LETTER DOTLESS J BAR;;;;

should have an uppercase of 0248, LATIN CAPITAL LETTER J WITH STROKE.

Compare softdotted and dotless [×] i and j and softdotted j-bar.

2) in SpecialCasing.txt
Replace: # Remove DOT ABOVE after "i" with upper or titlecase
Replace: 0307; 0307; ; ; lt After_Soft_Dotted; # COMBINING DOT ABOVE

Replacement: # Remove DOT ABOVE after "i" or "j" before more accents above with upper or titlecase
Replacement: 0307; 0307; ; ; lt After_Small_i More_Above; # COMBINING DOT ABOVE
Replacement: 0307; 0307; ; ; lt After_Small_j More_Above; # COMBINING DOT ABOVE

Reason: the corresponding lowercasing operation is limited to I and J and when there are more accents above; the uppercasing operation should be made as similar as possible.
Note that NONE of my suggestions regarding case mappings affect CaseFolding.txt, which is now, IIUC, under a stability policy.

Date/Time: Wed Jul 19 03:36:23 CDT 2006
Contact: [email protected]
Name: Stefano Priore
Report Type: Error Report
Opt Subject: Incorrect capitalization of SI code points

I think that the capitalization rules for the following code points (as found in CaseFolding.txt) require corrections.

1) MICRO SIGN (U+00B5)

This code point maps to GREEK SMALL LETTER MU. Even though the glyphs are visually equivalent, the case folding changes the semantics of the symbol: because of its low position in the BMP, and as implied by its name (MICRO SIGN), many legacy text documents use this code point to represent the SI unit multiplier 10^-6 (one millionth).

Proposed action: U+00B5 maps to itself

2) OHM SIGN (U+2126), KELVIN SIGN (U+212A), ANGSTROM SIGN (U+212B)

The case folding rules for these symbols are plain wrong: as implied by their names, these code points are used to represent SI units and, as such, they should never be case folded, since the operation replaces them with symbols that have a different meaning or no meaning in an SI context.

Proposed action: code points map to themselves
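
For reference, the foldings objected to above can be observed with Python's built-in `str.casefold()`, which applies Unicode case folding. This sketch only displays the existing behaviour; it does not implement the proposed action, and the helper name `describe_folding` is hypothetical.

```python
import unicodedata

# MICRO SIGN, OHM SIGN, KELVIN SIGN, ANGSTROM SIGN
SI_SIGNS = ["\u00B5", "\u2126", "\u212A", "\u212B"]


def describe_folding(ch):
    """Return the character name and the names of its case-folded form."""
    folded = ch.casefold()
    return unicodedata.name(ch), [unicodedata.name(c) for c in folded]


for ch in SI_SIGNS:
    name, folded_names = describe_folding(ch)
    print(f"U+{ord(ch):04X} {name} -> {', '.join(folded_names)}")
```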

Date/Time: Wed Jul 19 03:42:03 CDT 2006
Contact: [email protected]
Name: Stefano Priore
Report Type: Error Report
Opt Subject: Parenthesized Latin letters have no capital form correspondence

There are no code points to represent the capital forms of the parenthesized Latin letters (U+249C ... U+24B5). Doesn't this contradict what is stated in the case folding policy?

Date/Time: Wed Jul 19 04:08:47 CDT 2006
Contact: [email protected]
Name: Stefano Priore
Report Type: Error Report
Opt Subject: Supplemental code points for numbers in "Enclosed Alphanumerics" block

The Enclosed Alphanumerics block contains various code points used to represent numbers from zero to twenty in different graphical shapes.

However, these sets do not have a one-to-one correspondence: the adoption of the following code points would make these number sets "orthogonal".

PARENTHESIZED DIGIT ZERO
DIGIT ZERO FULL STOP
DOUBLE CIRCLED DIGIT ZERO
DOUBLE CIRCLED NUMBER ELEVEN ... DOUBLE CIRCLED NUMBER TWENTY
NEGATIVE CIRCLED DIGIT ONE ... NEGATIVE CIRCLED DIGIT NINE
NEGATIVE CIRCLED NUMBER TEN

Even if this looks like a minor issue, it would help programmers of text applications to provide a coherent (both in appearance and semantics) set of numeric "bullet points" for lists, tables, etc.

Of course, should one of these ranges be augmented (due to future additions or discovery of other glyphs buried in some legacy/compatibility block), it would be useful to keep the one-to-one correspondence.

Date/Time: Wed Jul 26 07:51:57 CDT 2006
Contact: <[email protected]>
Name: SADAHIRO Tomoyuki
Report Type: Error Report
Opt Subject: WordBreakProperty-5.0.0.txt doesn't list NBSP as ALetter.

According to the current UAX#29 [1], its proposed update [2] and the file WordBreakProperty-4.1.0.txt [3], U+00A0 (NO-BREAK SPACE) belongs to ALetter. But WordBreakProperty-5.0.0.txt [4] doesn't list NBSP as ALetter.

[1] http://www.unicode.org/reports/tr29/
[2] http://www.unicode.org/reports/tr29/tr29-10.html
[3] http://www.unicode.org/Public/4.1.0/ucd/auxiliary/WordBreakProperty.txt
[4] http://www.unicode.org/Public/5.0.0/ucd/auxiliary/WordBreakProperty.txt

91 Proposed Update to UAX #9: The Bidirectional Algorithm

Date/Time: Sat Jul 8 14:19:20 CDT 2006
Contact: [email protected]
Name: Kent Karlsson
Report Type: Error Report
Opt Subject: bidi mirroring and certain quote marks
------------------------------------------------------------------
ALTERNATIVE 1 (preferred)

	make 2018-201F non-bidimirrored (as they were before 5.0.0)
	(as a consequence, remove them from BidiMirroring.txt)

Making them mirrored at this point in time seems needlessly disruptive. There will be a long period during which the display of quote marks in bidi texts will be unreliable, and it is an open question whether the quote marks in existing texts will be changed (which is unlikely to happen, except for texts that are meticulously maintained). In addition, do the mirrors *HIGH-REVERSED-6 QUOTATION MARK or *LOW-REVERSED-9 QUOTATION MARK exist at all in (bidi) typeset texts? If not, that is further evidence NOT to mirror these quote marks. Furthermore, WITH the mirroring (and without allocating new characters), it is going to be very difficult to get the non-mirrored variety of LEFT SINGLE/DOUBLE QUOTATION MARK and SINGLE/DOUBLE LOW-9 QUOTATION MARK (and at least the latter is used, non-mirrored AFAIK, in Hebrew). Getting them in bidi texts will in practice require the use of the RLO bidi control.

-----------------------------------------------------------------
ALTERNATIVE 2 (must be done in case 2018-201F are kept as BidiMirrored)

	fix the data in BidiMirroring.txt

> 2018; 2019 # [BEST FIT] LEFT SINGLE QUOTATION MARK
	2018 does not have its mirror encoded as a character, 2019 is not
	an approximate mirror, but a SINGLE HIGH-REVERSED-6 QUOTATION MARK
	would be an exact mirror.

> 2019; 2018 # [BEST FIT] RIGHT SINGLE QUOTATION MARK
Change to:
  2019; 201B # RIGHT SINGLE QUOTATION MARK---------------------exact fit
  201B; 2019 # SINGLE HIGH-REVERSED-9 QUOTATION MARK-----------exact fit

> 201C; 201D # [BEST FIT] LEFT DOUBLE QUOTATION MARK
	201C does not have its mirror encoded as a character, 201D is not
	an approximate mirror, but a DOUBLE HIGH-REVERSED-6 QUOTATION MARK
	would be an exact mirror.

> 201D; 201C # [BEST FIT] RIGHT DOUBLE QUOTATION MARK
Change to:
  201D; 201F # RIGHT DOUBLE QUOTATION MARK---------------------exact fit
  201F; 201D # DOUBLE HIGH-REVERSED-9 QUOTATION MARK-----------exact fit

> # 201A; SINGLE LOW-9 QUOTATION MARK
	a SINGLE LOW-REVERSED-9 QUOTATION MARK would be a mirror character

> # 201B; SINGLE HIGH-REVERSED-9 QUOTATION MARK
	201B does have a mirror character, see above

> # 201E; DOUBLE LOW-9 QUOTATION MARK
	a DOUBLE LOW-REVERSED-9 QUOTATION MARK would be a mirror character

> # 201F; DOUBLE HIGH-REVERSED-9 QUOTATION MARK
	201F does have a mirror character, see above

As noted in alternative 1, keeping these quote marks as mirrored may imply the need to allocate the four (presently unencoded) characters mentioned above.

93 Representation of Malayalam /au/ Vowel in Traditional and Reformed Orthography

Date/Time: Wed May 17 12:29:24 CDT 2006
Contact: [email protected]
Name: Vinod Balakrishnan
Report Type: Feedback on an Encoding Proposal
Opt Subject: pr-93

Option B at http://www.unicode.org/review/pr-93.html looks ideal.
Understanding the need for backward compatibility, can the Unicode Consortium state that 0D4C is deprecated, or remove it some time in the future? This could resolve some people's confusion about having two spellings for the same vowel sign.
-Vinod
