L2/09-257
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Date/Time: Thu Jul 30 08:28:21 CDT 2009
Contact: emmanuel@vallois.name
Name: Emmanuel Vallois
Report Type: Public Review Issue
Opt Subject: PRI 141, Proposed Update UAX #38: Unicode Han Database (Unihan)

Subject: Public Review Issue 141, Proposed Update UAX #38: Unicode Han Database (Unihan)

In this review, I will make a few editorial comments, then concentrate on the consistency of the Syntax fields with one another, between the Syntax and the Description fields, and between the Syntax and the actual data in the database.

Editorial
¯¯¯¯¯¯¯¯¯

The last paragraph of 2.1 and the second paragraph of 2.2 are redundant, as both say the database is in UTF-8. I would favor removing the former.

kSemanticVariant, Description: question marks appear following code points instead of the corresponding characters.

kPseudoGB1: should the first sentence read "A "GB 12345-90" code point assigned to this character" rather than "A "GB 12345-90" code point assigned this character"? (I admit my English is not perfect, so this could be correct.)

Syntax fields
¯¯¯¯¯¯¯¯¯¯¯¯¯

This is a review of the Syntax field of each property, taking into account whether it is consistent with the Description field and whether the description says enough about it.

General comments: common potential issues.

1. The period should be escaped with a backslash in the Syntax (i.e. replace . with \.)
Applies to kCheungBauerIndex, kCihaiT, kCowles, kDaeJaweon, kDefinition, kFennIndex, kFourCornerCode, kHanYu, kIICore, kIRGDaeJaweon, kIRGHanyuDaZidian, kKangXi, kRSAdobe_Japan1_6, kRSJapanese, kRSKangXi, kRSKanWa, kRSKorean, kRSUnicode, kSBGY.

2. The beginning ^ and ending $ are missing; add them to be consistent with the Syntax field of the other properties.
Applies to kHangul, kHanyuPinlu, kHanyuPinyin, kHDZRadBreak, kIRG_GSource, kIRG_USource, kIRG_VSource, kMandarin, kSemanticVariant, kSpecializedSemanticVariant, kTaiwanTelegraph, kVietnamese.

2bis.
kFourCornerCode and kXerox are missing the ending $.

3. How about regular expressions and Normalization Form C? The second paragraph of 2.2 Unihan.zip states that each file is in Normalization Form C (NFC), so whenever possible only precomposed characters are used. The Syntax of properties whose text includes diacritics (kHanyuPinlu, kHanyuPinyin, kMandarin, kTang, kVietnamese and kXHC1983) therefore has a potential issue. The expressions are shorter and easier to read when stated as if the text were in NFD, with base letters and diacritics listed separately, but this does not reflect exactly what is in the file. So should the expressions remain as they are (some could be simplified anyway), for the sake of simplicity, or should we list every possible character? That would mean a list of about 25 or more precomposed characters for kHanyuPinyin, and maybe more for kVietnamese.
In my comments below, I will assume for kHanyuPinyin, kTang, kVietnamese and kXHC1983 that the expressions are matched against the NFD form, as this keeps the regular expressions readable, whereas kHanyuPinlu and kMandarin can be changed with minimal impact to match NFC.

4. Numbers without leading zeros. There is a well-defined way to express this, and it is applied correctly in the syntax of kLau, kKarlgren and kMeyerWempe, but not in the syntax of other properties, such as the radical-stroke counts kRS* (except kRSAdobe_Japan1_6).

Specific comments:

Taking fields in the order in which they appear in UAX #38, i.e.
alphabetical order:

kCantonese: could be stricter, such as ^[a-z]{2,6}[1-6]$ (pronunciations have a minimum/maximum length).

kCheungBauer: Specify that the / in the third part of the Syntax is used to separate possible tones, as in
U+55D7 kCheungBauer 030/10;RBBB;gut6/4/2,gwat1

kDaeJaweon: The Description does not indicate why the Syntax ends with [0158]; it only says "the final digit in the position being "0" for characters actually in the dictionary and "1" for characters not found in the dictionary and assigned a "virtual" position in the dictionary." So where do 5 and 8 come from?

kDefinition: Syntax should read ^[^\t"]*$, to exclude tabs and double quotes, as well as to give a clue that the definition is single-line (no line breaks allowed); see also general comment 1.

kFenn: The Description does not explain what the small "a" that can follow the number means (the a? in the Syntax).

kGSR: In the Syntax, the asterisk (*) before the end must be a question mark (?), since at most one apostrophe can end the value of the property.

kHanyu: The Description is more precise than the Syntax, saying that ""XY" is the zero-padded number of the character on the page [01..32]"; therefore the regular expression should be
^[1-8][0-9]{4}\.[0-3][0-9][0-3]$

kHanyuPinlu: Given that the Unihan files are in NFC (see general comment 3), U+0308 cannot occur, and the Syntax should be
^[a-z\x{FC}]+[1-5]\([0-9]+\)$
Moreover, while the Delimiter is correct, the Description incorrectly says that "the list elements are "comma + space" delimited"; it should read "the list elements are space delimited".

kHanyuPinyin: There are several issues with the syntax here:
1. With the assumption I made in general comment 3, to keep the Syntax self-consistent, remove \x{FC}, because it is precomposed and can be obtained with u + \x{308}.
2. Independently, the syntax given, even with its missing right parenthesis at the end, is incorrect. With correction of 1.
above applied, the regular expression of the Syntax should be
^(\d{5}\.\d{2}0,)*\d{5}\.\d{2}0:([a-z\x{300}-\x{302}\x{304}\x{308}\x{30C}]+,)*[a-z\x{300}-\x{302}\x{304}\x{308}\x{30C}]+$
3. The presence of U+0302 COMBINING CIRCUMFLEX ACCENT is surprising here, as it is not customarily used in pinyin. In the database, only U+6B38 and U+8A92 were found to use it, each with a double diacritic (U+0302 + U+0304 and U+0302 + U+030C), which also seems strange to me. It could be right or it could be an input error.

kHDZRadBreak: The Syntax is too loose and is not consistent with the description and the content of the database. The code point between square brackets is the code point of the radical, so we can be more precise, and since the character on which the break occurs is in the dictionary, the last digit is always 0. See also my comment on kHanyu above. So the Syntax should be
^[\x{2F00}-\x{2FD5}]\[U\+2F[0-9A-D][0-9A-F]\]:[1-8][0-9]{4}\.[0-3][0-9]0$

kIRGDaeJaweon: The Description does not say why the syntax allows this property to have the value 0000.555

kIRGDaiKanwaZiten: the Syntax could be more precise; indeed the data matches ^[0-9]{5}'?$

kIRG_GSource: the syntax can be made more precise:
• BK and CH are either alone or followed by six decimal digits
• The syntax says FZ could be followed by _BK, but I checked that this never happens in the current (5.2.0d1) version of the database, so the "(_BK)?" part could be deleted.
• KX is always followed by six decimal digits, the first never greater than 1
• The end is missing: the listed standards starting with G are given without the initial G and followed by a dash and four hexadecimal digits.
^4K|BK|((BK|CH)[0-9]{6})|CH|CY|FZ(_BK)?|HC|([HX]C[0-9]{6})|HZ|(KX[01][0-9]{5})|((JZ|ZJW|ZFY|CYY|HZ|FZ)[0-9]{5})|(H[0-9]{6})|([0135789ES]-[0-9A-F]{4})$

kIRG_KSource: The Syntax mentions:
• a K5 source, present in the database, not mentioned in the description
• a K7 source, missing in the database.
-> alter the Syntax to omit the 7.
-> complete the description

kIRG_TSource: The sources TC, TD and TE are not documented in the description.

kIRG_VSource: The source V4 is not documented in the description.

kMandarin: See general comment 3. If the normalization form is taken into account, \x{308} should be replaced by \x{DC} in the syntax.

kMatthews: As I said in general comment 4, the fact that the index is an integer that is not zero-padded should be expressed as [1-9][0-9]{0,3}
-> the resulting expression is ^[1-9][0-9]{0,3}(a|.5)?$
The meaning of the .5 (virtual position or added later?) and of the "a" suffix is not documented.

kMorohashi: Many entries in the database have a value of 99999, which I guess means that the character is not in the dictionary. So either mention this in the description, or remove all the superfluous entries from the database.

kRSAdobe_Japan1_6: Syntax: The plus sign should be escaped with \ as it is a special character in regular expressions, and as noted in general comment 1, the period should also be escaped for the same reason.
--> ^[CV]\+[0-9]{1,5}\+[1-9][0-9]{0,2}\.[1-9][0-9]?\.[0-9]{1,2}$

kRSJapanese, kRSKangXi, kRSKanWa, kRSKorean: See general comment 4 (no leading zero). The syntax should be
^[1-9][0-9]{0,2}\.[0-9]{1,2}$
The Description is misleading: there is no apostrophe in the database, nor in the syntax.

kRSUnicode: See general comment 4 (no leading zero). The syntax should be
--> ^[1-9][0-9]{0,2}'?\.[0-9]{1,2}$

kSBGY: See my comment on kHanyu. The Description specifies that ""XY" is the zero-padded number of the character on the page [01..73]", so the syntax becomes
^[0-9]{3}.[0-7][0-9]$

kSemanticVariant, kSpecializedSemanticVariant: The Syntax is inexact; it does not properly take into account what the description specifies about the colon. It should be
^U\+2?[0-9A-F]{4}(