L2/06-202

Date/Time:    Sun May 14 15:59:48 CDT 2006
Contact:      kent.karlsson@streamserve.com
Name:         Kent Karlsson
Report Type:  Public Review Issue
Opt Subject:  90 Unicode 5.0 Beta 2 


----------------------------------------------------

Unibook app:
============

The Unicode 5.0.0 beta app does not list any of the "Hangul Syllables"
as having canonical decompositions, which they have...
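
For reference, the Hangul syllable decompositions are derived arithmetically rather than listed one by one in UnicodeData.txt, which is possibly (a guess on my part) why the app misses them. Below is a minimal Python sketch of that arithmetic, using the constants from the standard's section on conjoining jamo behavior. The function name decompose_hangul is just an illustrative choice, and the sketch gives the full decomposition (L V or L V T), not the pairwise mapping.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable decomposition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables in total


def decompose_hangul(cp):
    """Return the full canonical decomposition of a code point as a list.

    Code points outside the Hangul Syllables block are returned unchanged.
    """
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        return [cp]
    lead = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    trail = T_BASE + s_index % T_COUNT
    # A trailing index of 0 means the syllable has no trailing consonant.
    return [lead, vowel] if trail == T_BASE else [lead, vowel, trail]


if __name__ == "__main__":
    for cp in (0xAC00, 0xD4DB):
        parts = " ".join(f"{p:04X}" for p in decompose_hangul(cp))
        nfd = " ".join(f"{ord(c):04X}" for c in unicodedata.normalize("NFD", chr(cp)))
        print(f"{cp:04X} -> {parts} (NFD from unicodedata: {nfd})")
```

A chart application could presumably call something like this for the AC00..D7A3 range instead of looking for an explicit decomposition field.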

---------------------------------------------------

UNICODEDATA
===========

All arrows should have the Sm general category. So should all chars in
Misc. Math Symbols-A and -B. (This implies having the "Math" derived property.)

Since 2320 and 2321 are mirrored, should not also 23B2 and 23B3 be mirrored?
I don't like any of these to be formally "mirrored", but I like consistency
anyhow.
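
A quick way to compare the mirrored flags of these four characters is sketched below in Python. Note the assumption: the standard unicodedata module reports the data of whatever Unicode version the Python build ships with, not the 5.0 beta files, so the flags printed may differ from the beta. The helper name mirrored_report is only illustrative.

```python
import unicodedata

# Code points taken from the comment above.
CODE_POINTS = [0x2320, 0x2321, 0x23B2, 0x23B3]


def mirrored_report(code_points):
    """Map each code point to (character name, Bidi_Mirrored flag as 0 or 1)."""
    return {
        cp: (unicodedata.name(chr(cp), "<unnamed>"), unicodedata.mirrored(chr(cp)))
        for cp in code_points
    }


if __name__ == "__main__":
    print("Unicode version in this Python build:", unicodedata.unidata_version)
    for cp, (name, flag) in mirrored_report(CODE_POINTS).items():
        print(f"{cp:04X} {name}: mirrored={flag}")
```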

The characters in the "Supplemental punctuation" block have surprising/mixed
"mirrored" properties.

FD3E, FD3F: since you dare change the mirroring property for the commonly used
quotation marks, I don't see why these two rarely used characters can't be made
mirrored. I would do the opposite to what you've done: mirror those two, but no
change in the common quotation marks w.r.t. mirroring. Note that FD3E/FD3F are
of an open/close nature, while that does not hold for the quotation marks, whose
use is language dependent and which therefore have no regular open/close reading.

-------------------------------------------------------------

LINEBREAK
=========

0E2F and 0EAF should both have the BA, break after, linebreak property.

1A1F;AL # BUGINESE END OF SECTION
It seems strange that an "end of section" has lb prop AL (but I don't know
for sure that it is wrong) instead of BA. 

Do Tagalog/Hanunoo/Buhid use spaces between words? No, I still
very much dislike SA (as any different from AL), but again, I like
consistency.

None of the combining characters should have gotten the SA property.

These should have lb EX:
203C;NS # DOUBLE EXCLAMATION MARK
203D;NS # INTERROBANG
2047;NS # DOUBLE QUESTION MARK
2048;NS # QUESTION EXCLAMATION MARK
2049;NS # EXCLAMATION QUESTION MARK

On the other hand NS and EX are very similar, and maybe should be merged.

These should be OP:
00A1;AI # INVERTED EXCLAMATION MARK
00BF;AI # INVERTED QUESTION MARK

Not sure why these have lb EX, instead of IS or PO:
060C;EX # ARABIC COMMA
061B;EX # ARABIC SEMICOLON
061E;EX # ARABIC TRIPLE DOT PUNCTUATION MARK
066A;EX # ARABIC PERCENT SIGN
06D4;EX # ARABIC FULL STOP

Commas in general have a strange mixture of lb property settings:
055D;AL # ARMENIAN COMMA
060C;EX # ARABIC COMMA
07F8;IS # NKO COMMA
1363;AL # ETHIOPIC COMMA
1802;BA # MONGOLIAN COMMA
1808;BA # MONGOLIAN MANCHU COMMA
3001;CL # IDEOGRAPHIC COMMA
FE10;IS # PRESENTATION FORM FOR VERTICAL COMMA
FE11;CL # PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA
FE50;CL # SMALL COMMA
FE51;ID # SMALL IDEOGRAPHIC COMMA
FF0C;CL # FULLWIDTH COMMA
FF64;CL # HALFWIDTH IDEOGRAPHIC COMMA
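
The mixture is easier to see when the entries are grouped by class. Here is a small Python sketch that does this for the comma lines quoted above (the data is copied from this report, not read from the LineBreak file; the function name group_by_class is only illustrative).

```python
from collections import defaultdict

# The comma entries quoted above, in "code;class # name" form.
COMMA_LINES = """\
055D;AL # ARMENIAN COMMA
060C;EX # ARABIC COMMA
07F8;IS # NKO COMMA
1363;AL # ETHIOPIC COMMA
1802;BA # MONGOLIAN COMMA
1808;BA # MONGOLIAN MANCHU COMMA
3001;CL # IDEOGRAPHIC COMMA
FE10;IS # PRESENTATION FORM FOR VERTICAL COMMA
FE11;CL # PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA
FE50;CL # SMALL COMMA
FE51;ID # SMALL IDEOGRAPHIC COMMA
FF0C;CL # FULLWIDTH COMMA
FF64;CL # HALFWIDTH IDEOGRAPHIC COMMA
""".splitlines()


def group_by_class(lines):
    """Group 'code;class # name' lines into {class: [(code, name), ...]}."""
    groups = defaultdict(list)
    for line in lines:
        data, _, name = line.partition("#")
        code, lb_class = data.strip().split(";")
        groups[lb_class].append((code, name.strip()))
    return dict(groups)


if __name__ == "__main__":
    for lb_class, entries in sorted(group_by_class(COMMA_LINES).items()):
        codes = ", ".join(code for code, _ in entries)
        print(f"{lb_class}: {codes}")
```

The same grouping can be applied to the full stop and semicolon lists below.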

So do full stops; some are even AL, which I find particularly surprising:
002E;IS # FULL STOP
0589;IS # ARMENIAN FULL STOP
06D4;EX # ARABIC FULL STOP
0701;AL # SYRIAC SUPRALINEAR FULL STOP
0702;AL # SYRIAC SUBLINEAR FULL STOP
1362;AL # ETHIOPIC FULL STOP
166E;AL # CANADIAN SYLLABICS FULL STOP
1803;BA # MONGOLIAN FULL STOP
1809;BA # MONGOLIAN MANCHU FULL STOP
2CF9;BA # COPTIC OLD NUBIAN FULL STOP
2CFE;BA # COPTIC FULL STOP
3002;CL # IDEOGRAPHIC FULL STOP
FE12;CL # PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP
FE52;CL # SMALL FULL STOP
FF0E;CL # FULLWIDTH FULL STOP
FF61;CL # HALFWIDTH IDEOGRAPHIC FULL STOP

And semicolons (but I don't know what reversed semicolon is used for):
003B;IS # SEMICOLON
061B;EX # ARABIC SEMICOLON
1364;AL # ETHIOPIC SEMICOLON
204F;AL # REVERSED SEMICOLON
FE14;IS # PRESENTATION FORM FOR VERTICAL SEMICOLON
FE54;NS # SMALL SEMICOLON
FF1B;NS # FULLWIDTH SEMICOLON

Control characters (good that VT got lb prop BK):
NBH (0083) should have the lb value WJ, like WJ and ZWNBSP.
BPH (0082) should have the lb value ZW, like ZWSP.

NL and BK are the same, so there's no need for two lb values.
So I suggest merging NL and BK to just BK, i.e. let NEL have BK.

---------------------------------------------------------------

NAMESLIST and charts:
=====================

The odd text "* distinguish from the following" should be removed from
the names-list at "0304	COMBINING MACRON" (that text is not given at
many other similar places).

-----

The chart glyphs for 0340 and 0341 are different from their canonical
equivalents, which they should not be.
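
To see which characters these two are canonically equivalent to, one can normalize them. Here is a minimal Python sketch, assuming the unicodedata module that ships with Python (it says nothing about glyphs, of course, only about the mappings; the helper name canonical_target is illustrative).

```python
import unicodedata


def canonical_target(cp):
    """Return the NFD form of a code point as a list of code points."""
    return [ord(c) for c in unicodedata.normalize("NFD", chr(cp))]


if __name__ == "__main__":
    for cp in (0x0340, 0x0341):
        target = canonical_target(cp)
        names = ", ".join(unicodedata.name(chr(t)) for t in target)
        codes = " ".join(f"{t:04X}" for t in target)
        print(f"{cp:04X} {unicodedata.name(chr(cp))} -> {codes} ({names})")
```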

------

"Combining diacritical marks" headings:
	Ordinary diacritics: remove heading
	Additions: remove heading
	Vietnamese...: keep heading (only due to the "deprecated")
	Additions for Greek: -> "Additions" (just to terminate the "deprecated" section)
	Additions for IPA: remove heading

-----

G, K, L, N, R w. comma below (Latvian & Livonian)
=================================================

These appear in the NameList like:
0157	LATIN SMALL LETTER R WITH CEDILLA
	* Latvian
	: 0072 0327

This is problematic since 0327 is COMBINING CEDILLA, while the
chart glyphs actually have a comma below. As for comma below,
NamesList has the entry
0326	COMBINING COMMA BELOW
	* Romanian, Latvian, Livonian

Note that Latvian (and Livonian) is listed there.

Does this mean that G/K/L/N/R are encoded for 6937 compatibility only,
and that the preferred encoding is <R, COMBINING COMMA BELOW>, etc.?
In any case, with the current chart glyphs (which are often emulated
in fonts) there is too little (or no) glyph distinction between
<letter WITH CEDILLA>, <letter, COMBINING CEDILLA>, and
<letter, COMBINING COMMA BELOW>.
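
The encoding side of the problem can be checked mechanically: 0157 is canonically equivalent to <r, COMBINING CEDILLA>, but not to <r, COMBINING COMMA BELOW>. Here is a minimal Python sketch of that check, assuming the unicodedata module that ships with Python; the helper names are illustrative only.

```python
import unicodedata

R_CEDILLA = "\u0157"        # LATIN SMALL LETTER R WITH CEDILLA
R_COMMA_BELOW = "r\u0326"   # r followed by COMBINING COMMA BELOW


def nfd_codes(text):
    """Return the NFD form of text as a list of 4-digit hex code points."""
    return [f"{ord(c):04X}" for c in unicodedata.normalize("NFD", text)]


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are equal."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    print("0157 NFD:", " ".join(nfd_codes(R_CEDILLA)))
    print("r + 0326 NFD:", " ".join(nfd_codes(R_COMMA_BELOW)))
    print("equivalent:", canonically_equivalent(R_CEDILLA, R_COMMA_BELOW))
```

So text using the comma-below sequence will not compare equal to text using the precomposed characters, even though the glyphs may look the same.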

The chart glyphs for the <G/K/L/N/R WITH CEDILLA> should each get a
cedilla (not a comma below), and they should get NamesList entries like:
0157	LATIN SMALL LETTER R WITH CEDILLA
	* encoded for compatibility with ISO/IEC 6937
	* the preferred representation for use in Latvian and Livonian is: 0072 0326
	: 0072 0327

[for R/r the comment should refer just to Livonian, not Latvian]

....since comma below is the glyph actually used in Latvian/Livonian, the
encoding should correspond to that. <G/K/L/N/R WITH COMMA BELOW> should
perhaps also be listed as named sequences.

In
0123	LATIN SMALL LETTER G WITH CEDILLA
	* Latvian
	* there are three major glyph variants

The comment about three glyph variants is fuzzy; which three are they?
The one I know about is displaying the comma below (so this comment really
applies to <g, COMBINING COMMA BELOW>, not 0123) as a turned comma above.

------

ENG...

014A	LATIN CAPITAL LETTER ENG (Sami)
	* glyph may also have appearance of large form of the small letter

I think that comment applies to some African languages, not Sami (but I'm not
entirely sure). In addition, similar differences have led to separate representation
of some other letters (e.g. eth vs. d with stroke; Icelandic vs. Sami).

------

After the last line of 
	= latin small letter apostrophe n (1.0)
	* Afrikaans
	* this is not actually a single letter
add (after perhaps deleting "this is not actually a single letter")
	* legacy compatibility character for ISO/IEC 6937
	* preferred representation: 2019 006E

Note that 'n (here using the ASCII fallback) is short for "en" (IIRC,
"an", "one"), and 't is short for "het" (the) in Dutch (and Limburgian).
Not sure what 's is short for..., but it does occur in Dutch.

----

 	= mho
->	= sometimes called mho (reversal of the letters in ohm)

----

> 	= German Mark currency symbol, before WWII
I'd prefer a year number, or expand the abbreviation.

----

> 	* Mongolian, Chinese, Tibetan, Sanskrit
and similar. These repeated comments are tedious; does that detail really need
to be noted in the charts? It's not done in similar cases for other scripts.

-----------------------------------------------

Joining data for Phags-Pa is missing. [I know it is missing for Mongolian,
but I've complained about that before to no effect.]

-------------------------------------------------