L2/06-202

Date/Time: Sun May 14 15:59:48 CDT 2006
Contact: kent.karlsson@streamserve.com
Name: Kent Karlsson
Report Type: Public Review Issue
Opt Subject: 90 Unicode 5.0 Beta 2

----------------------------------------------------

Unibook app:
============

The Unicode 5.0.0 beta app does not list any of the "Hangul Syllables"
as having canonical decompositions, which they have...

---------------------------------------------------

UNICODEDATA
===========

All arrows should have the Sm general category. So should all chars in
Misc. Math Symbols-A and -B. (This implies having the "Math" derived property.)

Since 2320 and 2321 are mirrored, should 23B2 and 23B3 not also be mirrored?
I don't like any of these to be formally "mirrored", but I like consistency anyhow.

The "Supplemental punctuation" block characters have surprising/mixed
"mirrored" properties.

FD3E, FD3F: since you dare change the mirroring property for the commonly
used quotation marks, I don't see why these two rarely used characters
can't be made mirrored. I would do the opposite of what you've done:
mirror those two, but make no change to the common quotation marks w.r.t. mirroring.
Note that FD3E/FD3F are of an open/close nature, while that does not hold
for the quotation marks, the use of which is language dependent, and which
therefore have no regular open/close reading.

-------------------------------------------------------------

LINEBREAK
=========

0E2F and 0EAF should both have the BA, break after, linebreak property.

1A1F;AL # BUGINESE END OF SECTION
It seems strange that an "end of section" has lb prop AL (but I don't know
for sure that it is wrong) instead of BA.

Do Tagalog/Hanunoo/Buhid use spaces between words?

No, I still very much dislike SA (as any different from AL), but again,
I like consistency. None of the combining characters should have gotten
the SA property.

These should have lb EX:
203C;NS # DOUBLE EXCLAMATION MARK
203D;NS # INTERROBANG
2047;NS # DOUBLE QUESTION MARK
2048;NS # QUESTION EXCLAMATION MARK
2049;NS # EXCLAMATION QUESTION MARK
On the other hand NS and EX are very similar, and maybe should be merged.

These should be OP:
00A1;AI # INVERTED EXCLAMATION MARK
00BF;AI # INVERTED QUESTION MARK

Not sure why these have lb EX, instead of IS or PO:
060C;EX # ARABIC COMMA
061B;EX # ARABIC SEMICOLON
061E;EX # ARABIC TRIPLE DOT PUNCTUATION MARK
066A;EX # ARABIC PERCENT SIGN
06D4;EX # ARABIC FULL STOP

Commas in general have a strange mixture of lb property settings:
055D;AL # ARMENIAN COMMA
060C;EX # ARABIC COMMA
07F8;IS # NKO COMMA
1363;AL # ETHIOPIC COMMA
1802;BA # MONGOLIAN COMMA
1808;BA # MONGOLIAN MANCHU COMMA
3001;CL # IDEOGRAPHIC COMMA
FE10;IS # PRESENTATION FORM FOR VERTICAL COMMA
FE11;CL # PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA
FE50;CL # SMALL COMMA
FE51;ID # SMALL IDEOGRAPHIC COMMA
FF0C;CL # FULLWIDTH COMMA
FF64;CL # HALFWIDTH IDEOGRAPHIC COMMA

So do full stops, some are even AL which I find particularly surprising:
002E;IS # FULL STOP
0589;IS # ARMENIAN FULL STOP
06D4;EX # ARABIC FULL STOP
0701;AL # SYRIAC SUPRALINEAR FULL STOP
0702;AL # SYRIAC SUBLINEAR FULL STOP
1362;AL # ETHIOPIC FULL STOP
166E;AL # CANADIAN SYLLABICS FULL STOP
1803;BA # MONGOLIAN FULL STOP
1809;BA # MONGOLIAN MANCHU FULL STOP
2CF9;BA # COPTIC OLD NUBIAN FULL STOP
2CFE;BA # COPTIC FULL STOP
3002;CL # IDEOGRAPHIC FULL STOP
FE12;CL # PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP
FE52;CL # SMALL FULL STOP
FF0E;CL # FULLWIDTH FULL STOP
FF61;CL # HALFWIDTH IDEOGRAPHIC FULL STOP

And semicolons (but I don't know what reversed semicolon is used for):
003B;IS # SEMICOLON
061B;EX # ARABIC SEMICOLON
1364;AL # ETHIOPIC SEMICOLON
204F;AL # REVERSED SEMICOLON
FE14;IS # PRESENTATION FORM FOR VERTICAL SEMICOLON
FE54;NS # SMALL SEMICOLON
FF1B;NS # FULLWIDTH SEMICOLON

Control characters (good that VT got lb prop BK):
NBH (0083) should have
the lb value WJ, like WJ and ZWNBSP.
BPH (0082) should have the lb value ZW, like ZWSP.

NL and BK are the same, so there's no need for two lb values.
So I suggest merging NL and BK to just BK, i.e. let NEL have BK.

---------------------------------------------------------------

NAMESLIST and charts:
=====================

The odd text "* distinguish from the following" should be removed
from the names-list at "0304 COMBINING MACRON" (that text is not given
at many other similar places).

-----

The chart glyphs for 0340 and 0341 are different from their canonical
equivalents, which they should not be.

------

"Combining diacritical marks" headings:
  Ordinary diacritics: remove heading
  Additions: remove heading
  Vietnamese...: keep heading (only due to the "deprecated")
  Additions for Greek: -> "Additions" (just to terminate the "deprecated" section)
  Additions for IPA: remove heading

-----

G, K, L, N, R w. comma below (Latvian & Livonian)
=================================================

These appear in the NamesList like:

0157 LATIN SMALL LETTER R WITH CEDILLA
    * Latvian
    : 0072 0327

This is problematic since 0327 is COMBINING CEDILLA, while the chart glyphs
actually have a comma below.

As for comma below, NamesList has the entry

0326 COMBINING COMMA BELOW
    * Romanian, Latvian, Livonian

Note that Latvian (and Livonian) is listed there. Does this mean that
G/K/L/N/R are encoded for 6937 compatibility only, and that the preferred
encoding is , etc.?

In any case, with the current chart glyphs (which are often emulated in fonts)
there is too little (or no) glyph distinction between , , and .
The chart glyphs for the should each get a cedilla (not a comma below),
and they should get NamesList entries like:

0157 LATIN SMALL LETTER R WITH CEDILLA
    * encoded for compatibility with ISO/IEC 6937
    * the preferred representation for use in Latvian and Livonian is: 0072 0326
    : 0072 0327

[for R/r the comment should refer just to Livonian, not Latvian]

....since COMMA BELOW is the glyph actually used in Latvian/Livonian,
the encoding should correspond to that.

should perhaps also be listed as named sequences.

In

0123 LATIN SMALL LETTER G WITH CEDILLA
    * Latvian
    * there are three major glyph variants

The comment about three glyph variants is fuzzy; which three are they?
The one I know about is displaying the comma below (so this comment really
applies to , not 0123) as a turned comma above.

------

ENG...

014A LATIN CAPITAL LETTER ENG (Sami)
    * glyph may also have appearance of large form of the small letter

I think that comment applies to some African languages, not Sami (but I'm
not entirely sure). In addition, similar differences have led to separate
representation of some other letters (e.g. eth vs. d with stroke;
Icelandic vs. Sami).

------

After the last line of

    = latin small letter apostrophe n (1.0)
    * Afrikaans
    * this is not actually a single letter

add (after perhaps deleting "this is not actually a single letter")

    * legacy compatibility character for ISO/IEC 6937
    * preferred representation: 2019 006E

Note that 'n (here using the ASCII fallback) is short for "en"
(IIRC, "an", "one"), and 't is short for "het" (the) in Dutch
(and Limburgian). Not sure what 's is short for..., but it does
occur in Dutch.

----

    = mho
->
    = sometimes called mho (reversal of the letters in ohm)

----

>     = German Mark currency symbol, before WWII

I'd prefer a year number, or expand the abbreviation.

----

>     * Mongolian, Chinese, Tibetan, Sanskrit and similar.

These repeated comments are tedious; does that detail really need to be noted in the charts?
It's not done in similar cases for other scripts.

-----------------------------------------------

Joining data for Phags-Pa is missing. [I know it is missing for Mongolian,
but I've complained about that before to no effect.]

-------------------------------------------------
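
Illustration for the Unibook comment on Hangul Syllables: their canonical
decompositions are derived arithmetically rather than listed one by one in the
data file, which may be why the beta app shows none (that cause is a guess, not
something confirmed). A minimal Python sketch of the arithmetic, using the
constants from the conjoining jamo section of the Unicode Standard; the function
name decompose_hangul is made up for illustration only.

```python
# Constants for Hangul syllable decomposition, as given in the Unicode Standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT  # 11172 precomposed syllables in total


def decompose_hangul(ch: str) -> str:
    """Return the full canonical decomposition of one Hangul syllable.

    Hypothetical helper for illustration. Characters outside the
    Hangul Syllables block are returned unchanged.
    """
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch
    lead = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    # A trailing-consonant index of 0 means the syllable has no trailing jamo.
    trail = chr(T_BASE + t_index) if t_index else ""
    return lead + vowel + trail


if __name__ == "__main__":
    for cp in (0xAC00, 0xAC01, 0xD7A3):
        jamos = " ".join(f"{ord(j):04X}" for j in decompose_hangul(chr(cp)))
        print(f"{cp:04X} -> {jamos}")
```

The result can be compared with unicodedata.normalize("NFD", ch) from the Python
standard library, which should give the same jamo sequence for every character
in AC00..D7A3.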