UTC/1999-017

From: mark.davis@us.ibm.com
Sent: Wednesday, June 02, 1999 5:23 PM
To: Multiple Recipients of Unicore
Subject: Data cross-checks (for Agenda)

I run a series of tools over the Unicode database to check various consistency conditions. Some examples of these checks are:

- are canonical decompositions in canonical order?
- do lowercase letters have uppercase mappings?
- if you take a lowercase letter, then uppercase it, then lowercase it, do you get the same thing back?
- are various correspondences between BIDI properties and General Category properties preserved?
etc.

Some of these conditions are violated by the current database. In many cases, those violations are well understood, and generally agreed upon (exact assignment of properties and behavior is often a balancing act among a number of different factors). In others, they may just be anomalies that have slipped through the cracks over the years.

Unicode 3.0 will be an important milestone in terms of data consistency, so I want to bring some of these issues up at the meeting. We may end up deciding not to change some of them, but we should at least discuss them before finalizing the data.

1. Mu

Kent already submitted a document about this, with some good information from Ken in response, so I won't discuss that here.

2. Georgian

When you look at the data file, you find the following oddity (repeated for all the Georgian letters):

Letter 10A0 is uppercase, and has letter 10D0 as its lowercase.
Letter 10D0 is lowercase, but does not have any uppercase.

(From the data file:
10A0;GEORGIAN CAPITAL LETTER AN;Lu;0;L;;;;;N;;Khutsuri;;10D0;
10D0;GEORGIAN LETTER AN;Ll;0;L;;;;;N;GEORGIAN SMALL LETTER AN;;;;)

This is because what is called "CAPITAL" is really an archaic form; Modern Georgian is caseless. However, the assignment of properties does not really recognize that, and leads to problems in practice.
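As a rough illustration of the kind of cross-check described at the top of this note, here is a minimal Python sketch run against just the two Georgian records quoted above. It assumes the usual semicolon-separated layout of the data file, with the uppercase, lowercase, and titlecase mappings in the last three fields. The names parse, mapped, and check are made up for this example and are not the actual tools.

```python
# Field positions in a semicolon-separated data-file record (assumed layout).
CODE, NAME, CATEGORY = 0, 1, 2
UPPER, LOWER, TITLE = 12, 13, 14

# The two records quoted in the text above.
RECORDS = """\
10A0;GEORGIAN CAPITAL LETTER AN;Lu;0;L;;;;;N;;Khutsuri;;10D0;
10D0;GEORGIAN LETTER AN;Ll;0;L;;;;;N;GEORGIAN SMALL LETTER AN;;;;
"""


def parse(text):
    """Build a dict from code point (hex string) to its list of fields."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        table[fields[CODE]] = fields
    return table


def mapped(table, code, field):
    """Return the mapping stored in `field`, or the code itself if it is empty."""
    return table[code][field] or code


def check(table):
    """Collect (code, message) pairs for two simple consistency conditions."""
    problems = []
    for code in sorted(table):
        category = table[code][CATEGORY]
        # Condition 1: a lowercase letter should have an uppercase mapping.
        if category == "Ll" and not table[code][UPPER]:
            problems.append((code, "Ll without uppercase mapping"))
        # Condition 2: upper -> lower -> upper should return the original.
        if category == "Lu":
            lower = mapped(table, code, LOWER)
            # Only follow the mapping when its target is in the sample.
            if lower in table and mapped(table, lower, UPPER) != code:
                problems.append((code, "upper -> lower -> upper does not round-trip"))
    return problems


if __name__ == "__main__":
    for code, message in check(parse(RECORDS)):
        print(code, message)
```

On these two records the sketch reports 10A0 as failing the upper/lower round trip and 10D0 as a lowercase letter with no uppercase mapping.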
- Since modern Georgian is caseless, it is more important to recognize this property than to recognize (just part of) archaic practice.
- Common practice does a caseless match with the following condition: toUppercase(a) == toUppercase(b). With Georgian, you get the following failure:

    b = toLowercase(a);
    if (toUppercase(b) == toUppercase(a)) {
        // you'd think this must be true, but it is not.
    }

Recognizing that it should be more important to handle modern Georgian correctly than to handle both suboptimally, we should:

a. change the property of 10D0 to Lo (from Ll)
b. change the case mapping of 10A0 from 10D0 to be no mapping.
c. IF we want to recognize the lowercase and uppercase mappings for archaic contexts, we can do that by adding them to the SpecialCasing file with a condition tag "ARCHAIC".

3. Missing Titlecase

The titlecase is missing from the following two characters, although we do have the uppercase. They need to be added.

Old:
0345;COMBINING GREEK YPOGEGRAMMENI;Mn;240;NSM;;;;;N;GREEK NON-SPACING IOTA BELOW;;0399;;
New:
0345;COMBINING GREEK YPOGEGRAMMENI;Mn;240;NSM;;;;;N;GREEK NON-SPACING IOTA BELOW;;0399;;0399

Old:
1FBE;GREEK PROSGEGRAMMENI;Ll;0;L;03B9;;;;N;;;0399;;
New:
1FBE;GREEK PROSGEGRAMMENI;Ll;0;L;03B9;;;;N;;;0399;;0399

4. GREEK SMALL LETTERS WITH YPOGEGRAMMENI

We have gotten several bug reports on this. We have agreed on the treatment of iota-subscript with casing, but we still have one remaining problem in the database. Letters like

1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;Lu;0;L;1F08 0345;;;;N;;;;1F80;

are marked as uppercase, when by definition they are really titlecase.
This affects the following:

1F88, GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
..1F8F, GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
1F98, GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI
..1F9F, GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
1FA8, GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI
..1FAF, GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
1FBC, GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
1FCC, GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
1FFC, GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI

5. Uncased characters that have case mappings

The following characters are not marked as cased characters, but do have a case mapping. In this case, it is probably ok, but I mention it here for comparison.

(The format here is not the standard database format: C (general category), L (lower), T (title), U (upper) are used for clarity.)

2160; C: Nl; L: 2170; T: 2160; U: 2160; ROMAN NUMERAL ONE
..216F; C: Nl; L: 217F; T: 216F; U: 216F; ROMAN NUMERAL ONE THOUSAND
2170; C: Nl; L: 2170; T: 2160; U: 2160; SMALL ROMAN NUMERAL ONE
..217F; C: Nl; L: 217F; T: 216F; U: 216F; SMALL ROMAN NUMERAL ONE THOUSAND
24B6; C: So; L: 24D0; T: 24B6; U: 24B6; CIRCLED LATIN CAPITAL LETTER A
..24CF; C: So; L: 24E9; T: 24CF; U: 24CF; CIRCLED LATIN CAPITAL LETTER Z
24D0; C: So; L: 24D0; T: 24B6; U: 24B6; CIRCLED LATIN SMALL LETTER A
..24E9; C: So; L: 24E9; T: 24CF; U: 24CF; CIRCLED LATIN SMALL LETTER Z

6. There are a couple of minor BIDI property issues. I will pass those by the bidi crowd first.

7. Cased characters that are missing case mappings.

The following characters are marked as having case, but do not have case mappings. If we anticipate perhaps sometime in the future adding case mappings, they probably should be left alone. However, what of characters like 00AA or the KAI symbol?
I am not making any recommendations, but people should look this over to see if they spot anything that looks odd to them.

(The format here is not the standard database format: C (general category), L (lower), T (title), U (upper) are used for clarity.)

00AA; C: Ll; L: 00AA; T: 00AA; U: 00AA; FEMININE ORDINAL INDICATOR
00B5; C: Ll; L: 00B5; T: 00B5; U: 00B5; MICRO SIGN
00BA; C: Ll; L: 00BA; T: 00BA; U: 00BA; MASCULINE ORDINAL INDICATOR
0138; C: Ll; L: 0138; T: 0138; U: 0138; LATIN SMALL LETTER KRA
0180; C: Ll; L: 0180; T: 0180; U: 0180; LATIN SMALL LETTER B WITH STROKE
018D; C: Ll; L: 018D; T: 018D; U: 018D; LATIN SMALL LETTER TURNED DELTA
019A; C: Ll; L: 019A; T: 019A; U: 019A; LATIN SMALL LETTER L WITH BAR
019B; C: Ll; L: 019B; T: 019B; U: 019B; LATIN SMALL LETTER LAMBDA WITH STROKE
019E; C: Ll; L: 019E; T: 019E; U: 019E; LATIN SMALL LETTER N WITH LONG RIGHT LEG
01AB; C: Ll; L: 01AB; T: 01AB; U: 01AB; LATIN SMALL LETTER T WITH PALATAL HOOK
01BA; C: Ll; L: 01BA; T: 01BA; U: 01BA; LATIN SMALL LETTER EZH WITH TAIL
0250; C: Ll; L: 0250; T: 0250; U: 0250; LATIN SMALL LETTER TURNED A
0251; C: Ll; L: 0251; T: 0251; U: 0251; LATIN SMALL LETTER ALPHA
0252; C: Ll; L: 0252; T: 0252; U: 0252; LATIN SMALL LETTER TURNED ALPHA
0255; C: Ll; L: 0255; T: 0255; U: 0255; LATIN SMALL LETTER C WITH CURL
0258; C: Ll; L: 0258; T: 0258; U: 0258; LATIN SMALL LETTER REVERSED E
025A; C: Ll; L: 025A; T: 025A; U: 025A; LATIN SMALL LETTER SCHWA WITH HOOK
025C; C: Ll; L: 025C; T: 025C; U: 025C; LATIN SMALL LETTER REVERSED OPEN E
025D; C: Ll; L: 025D; T: 025D; U: 025D; LATIN SMALL LETTER REVERSED OPEN E WITH HOOK
025E; C: Ll; L: 025E; T: 025E; U: 025E; LATIN SMALL LETTER CLOSED REVERSED OPEN E
025F; C: Ll; L: 025F; T: 025F; U: 025F; LATIN SMALL LETTER DOTLESS J WITH STROKE
0261; C: Ll; L: 0261; T: 0261; U: 0261; LATIN SMALL LETTER SCRIPT G
0262; C: Ll; L: 0262; T: 0262; U: 0262; LATIN LETTER SMALL CAPITAL G
0264; C: Ll; L: 0264; T: 0264; U: 0264; LATIN SMALL LETTER RAMS HORN
0265; C: Ll; L: 0265; T: 0265; U: 0265; LATIN SMALL LETTER TURNED H
0266; C: Ll; L: 0266; T: 0266; U: 0266; LATIN SMALL LETTER H WITH HOOK
0267; C: Ll; L: 0267; T: 0267; U: 0267; LATIN SMALL LETTER HENG WITH HOOK
026A; C: Ll; L: 026A; T: 026A; U: 026A; LATIN LETTER SMALL CAPITAL I
026B; C: Ll; L: 026B; T: 026B; U: 026B; LATIN SMALL LETTER L WITH MIDDLE TILDE
026C; C: Ll; L: 026C; T: 026C; U: 026C; LATIN SMALL LETTER L WITH BELT
026D; C: Ll; L: 026D; T: 026D; U: 026D; LATIN SMALL LETTER L WITH RETROFLEX HOOK
026E; C: Ll; L: 026E; T: 026E; U: 026E; LATIN SMALL LETTER LEZH
0270; C: Ll; L: 0270; T: 0270; U: 0270; LATIN SMALL LETTER TURNED M WITH LONG LEG
0271; C: Ll; L: 0271; T: 0271; U: 0271; LATIN SMALL LETTER M WITH HOOK
0273; C: Ll; L: 0273; T: 0273; U: 0273; LATIN SMALL LETTER N WITH RETROFLEX HOOK
0274; C: Ll; L: 0274; T: 0274; U: 0274; LATIN LETTER SMALL CAPITAL N
0276; C: Ll; L: 0276; T: 0276; U: 0276; LATIN LETTER SMALL CAPITAL OE
0277; C: Ll; L: 0277; T: 0277; U: 0277; LATIN SMALL LETTER CLOSED OMEGA
0278; C: Ll; L: 0278; T: 0278; U: 0278; LATIN SMALL LETTER PHI
0279; C: Ll; L: 0279; T: 0279; U: 0279; LATIN SMALL LETTER TURNED R
027A; C: Ll; L: 027A; T: 027A; U: 027A; LATIN SMALL LETTER TURNED R WITH LONG LEG
027B; C: Ll; L: 027B; T: 027B; U: 027B; LATIN SMALL LETTER TURNED R WITH HOOK
027C; C: Ll; L: 027C; T: 027C; U: 027C; LATIN SMALL LETTER R WITH LONG LEG
027D; C: Ll; L: 027D; T: 027D; U: 027D; LATIN SMALL LETTER R WITH TAIL
027E; C: Ll; L: 027E; T: 027E; U: 027E; LATIN SMALL LETTER R WITH FISHHOOK
027F; C: Ll; L: 027F; T: 027F; U: 027F; LATIN SMALL LETTER REVERSED R WITH FISHHOOK
0281; C: Ll; L: 0281; T: 0281; U: 0281; LATIN LETTER SMALL CAPITAL INVERTED R
0282; C: Ll; L: 0282; T: 0282; U: 0282; LATIN SMALL LETTER S WITH HOOK
0284; C: Ll; L: 0284; T: 0284; U: 0284; LATIN SMALL LETTER DOTLESS J WITH STROKE AND HOOK
0285; C: Ll; L: 0285; T: 0285; U: 0285; LATIN SMALL LETTER SQUAT REVERSED ESH
0286; C: Ll; L: 0286; T: 0286; U: 0286; LATIN SMALL LETTER ESH WITH CURL
0287; C: Ll; L: 0287; T: 0287; U: 0287; LATIN SMALL LETTER TURNED T
0289; C: Ll; L: 0289; T: 0289; U: 0289; LATIN SMALL LETTER U BAR
028C; C: Ll; L: 028C; T: 028C; U: 028C; LATIN SMALL LETTER TURNED V
028D; C: Ll; L: 028D; T: 028D; U: 028D; LATIN SMALL LETTER TURNED W
028E; C: Ll; L: 028E; T: 028E; U: 028E; LATIN SMALL LETTER TURNED Y
028F; C: Ll; L: 028F; T: 028F; U: 028F; LATIN LETTER SMALL CAPITAL Y
0290; C: Ll; L: 0290; T: 0290; U: 0290; LATIN SMALL LETTER Z WITH RETROFLEX HOOK
0291; C: Ll; L: 0291; T: 0291; U: 0291; LATIN SMALL LETTER Z WITH CURL
0293; C: Ll; L: 0293; T: 0293; U: 0293; LATIN SMALL LETTER EZH WITH CURL
0294; C: Ll; L: 0294; T: 0294; U: 0294; LATIN LETTER GLOTTAL STOP
0295; C: Ll; L: 0295; T: 0295; U: 0295; LATIN LETTER PHARYNGEAL VOICED FRICATIVE
0296; C: Ll; L: 0296; T: 0296; U: 0296; LATIN LETTER INVERTED GLOTTAL STOP
0297; C: Ll; L: 0297; T: 0297; U: 0297; LATIN LETTER STRETCHED C
0298; C: Ll; L: 0298; T: 0298; U: 0298; LATIN LETTER BILABIAL CLICK
0299; C: Ll; L: 0299; T: 0299; U: 0299; LATIN LETTER SMALL CAPITAL B
029A; C: Ll; L: 029A; T: 029A; U: 029A; LATIN SMALL LETTER CLOSED OPEN E
029B; C: Ll; L: 029B; T: 029B; U: 029B; LATIN LETTER SMALL CAPITAL G WITH HOOK
029C; C: Ll; L: 029C; T: 029C; U: 029C; LATIN LETTER SMALL CAPITAL H
029D; C: Ll; L: 029D; T: 029D; U: 029D; LATIN SMALL LETTER J WITH CROSSED-TAIL
029E; C: Ll; L: 029E; T: 029E; U: 029E; LATIN SMALL LETTER TURNED K
029F; C: Ll; L: 029F; T: 029F; U: 029F; LATIN LETTER SMALL CAPITAL L
02A0; C: Ll; L: 02A0; T: 02A0; U: 02A0; LATIN SMALL LETTER Q WITH HOOK
02A1; C: Ll; L: 02A1; T: 02A1; U: 02A1; LATIN LETTER GLOTTAL STOP WITH STROKE
02A2; C: Ll; L: 02A2; T: 02A2; U: 02A2; LATIN LETTER REVERSED GLOTTAL STOP WITH STROKE
02A3; C: Ll; L: 02A3; T: 02A3; U: 02A3; LATIN SMALL LETTER DZ DIGRAPH
02A4; C: Ll; L: 02A4; T: 02A4; U: 02A4; LATIN SMALL LETTER DEZH DIGRAPH
02A5; C: Ll; L: 02A5; T: 02A5; U: 02A5; LATIN SMALL LETTER DZ DIGRAPH WITH CURL
02A6; C: Ll; L: 02A6; T: 02A6; U: 02A6; LATIN SMALL LETTER TS DIGRAPH
02A7; C: Ll; L: 02A7; T: 02A7; U: 02A7; LATIN SMALL LETTER TESH DIGRAPH
02A8; C: Ll; L: 02A8; T: 02A8; U: 02A8; LATIN SMALL LETTER TC DIGRAPH WITH CURL
02A9; C: Ll; L: 02A9; T: 02A9; U: 02A9; LATIN SMALL LETTER FENG DIGRAPH
02AA; C: Ll; L: 02AA; T: 02AA; U: 02AA; LATIN SMALL LETTER LS DIGRAPH
02AB; C: Ll; L: 02AB; T: 02AB; U: 02AB; LATIN SMALL LETTER LZ DIGRAPH
02AC; C: Ll; L: 02AC; T: 02AC; U: 02AC; LATIN LETTER BILABIAL PERCUSSIVE
02AD; C: Ll; L: 02AD; T: 02AD; U: 02AD; LATIN LETTER BIDENTAL PERCUSSIVE
03D2; C: Lu; L: 03D2; T: 03D2; U: 03D2; GREEK UPSILON WITH HOOK SYMBOL
03D3; C: Lu; L: 03D3; T: 03D3; U: 03D3; GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
03D4; C: Lu; L: 03D4; T: 03D4; U: 03D4; GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
03D7; C: Ll; L: 03D7; T: 03D7; U: 03D7; GREEK KAI SYMBOL
207F; C: Ll; L: 207F; T: 207F; U: 207F; SUPERSCRIPT LATIN SMALL LETTER N
2102; C: Lu; L: 2102; T: 2102; U: 2102; DOUBLE-STRUCK CAPITAL C
2107; C: Lu; L: 2107; T: 2107; U: 2107; EULER CONSTANT
210A; C: Ll; L: 210A; T: 210A; U: 210A; SCRIPT SMALL G
210B; C: Lu; L: 210B; T: 210B; U: 210B; SCRIPT CAPITAL H
210C; C: Lu; L: 210C; T: 210C; U: 210C; BLACK-LETTER CAPITAL H
210D; C: Lu; L: 210D; T: 210D; U: 210D; DOUBLE-STRUCK CAPITAL H
210E; C: Ll; L: 210E; T: 210E; U: 210E; PLANCK CONSTANT
210F; C: Ll; L: 210F; T: 210F; U: 210F; PLANCK CONSTANT OVER TWO PI
2110; C: Lu; L: 2110; T: 2110; U: 2110; SCRIPT CAPITAL I
2111; C: Lu; L: 2111; T: 2111; U: 2111; BLACK-LETTER CAPITAL I
2112; C: Lu; L: 2112; T: 2112; U: 2112; SCRIPT CAPITAL L
2113; C: Ll; L: 2113; T: 2113; U: 2113; SCRIPT SMALL L
2115; C: Lu; L: 2115; T: 2115; U: 2115; DOUBLE-STRUCK CAPITAL N
2118; C: Ll; L: 2118; T: 2118; U: 2118; SCRIPT CAPITAL P
2119; C: Lu; L: 2119; T: 2119; U: 2119; DOUBLE-STRUCK CAPITAL P
211A; C: Lu; L: 211A; T: 211A; U: 211A; DOUBLE-STRUCK CAPITAL Q
211B; C: Lu; L: 211B; T: 211B; U: 211B; SCRIPT CAPITAL R
211C; C: Lu; L: 211C; T: 211C; U: 211C; BLACK-LETTER CAPITAL R
211D; C: Lu; L: 211D; T: 211D; U: 211D; DOUBLE-STRUCK CAPITAL R
2124; C: Lu; L: 2124; T: 2124; U: 2124; DOUBLE-STRUCK CAPITAL Z
2126; C: Lu; L: 2126; T: 2126; U: 2126; OHM SIGN
2128; C: Lu; L: 2128; T: 2128; U: 2128; BLACK-LETTER CAPITAL Z
212A; C: Lu; L: 212A; T: 212A; U: 212A; KELVIN SIGN
212B; C: Lu; L: 212B; T: 212B; U: 212B; ANGSTROM SIGN
212C; C: Lu; L: 212C; T: 212C; U: 212C; SCRIPT CAPITAL B
212D; C: Lu; L: 212D; T: 212D; U: 212D; BLACK-LETTER CAPITAL C
212E; C: Ll; L: 212E; T: 212E; U: 212E; ESTIMATED SYMBOL
212F; C: Ll; L: 212F; T: 212F; U: 212F; SCRIPT SMALL E
2130; C: Lu; L: 2130; T: 2130; U: 2130; SCRIPT CAPITAL E
2131; C: Lu; L: 2131; T: 2131; U: 2131; SCRIPT CAPITAL F
2133; C: Lu; L: 2133; T: 2133; U: 2133; SCRIPT CAPITAL M
2134; C: Ll; L: 2134; T: 2134; U: 2134; SCRIPT SMALL O
2139; C: Ll; L: 2139; T: 2139; U: 2139; INFORMATION SOURCE

Mark
___
Mark Davis, IBM Center for Java Technology, Cupertino
(408) 777-5850 [fax: 5891], mark.davis@us.ibm.com, president@unicode.org
http://maps.yahoo.com/py/maps.py?Pyt=Tmap&addr=10275+N.+De+Anza&csz=95014