L2/09-257

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

Date/Time:    Thu Jul 30 08:28:21 CDT 2009
Contact:      emmanuel@vallois.name
Name:         Emmanuel Vallois
Report Type:  Public Review Issue
Opt Subject:  PRI 141, Proposed Update UAX #38: Unicode Han Database (Unihan)

Subject: Public Review Issue 141, Proposed Update UAX #38: Unicode Han Database (Unihan)

In this review, I make a few editorial comments, then concentrate on the consistency
of the Syntax fields with one another, with the Description fields, and with the actual
data in the database.

Editorial
¯¯¯¯¯¯¯¯¯
The last paragraph of 2.1 and the second paragraph of 2.2 are redundant, as both say that
the database is in UTF-8. I would favor removing the former.
kSemanticVariant, Description: question marks appear after the code points instead of the
corresponding characters.
kPseudoGB1: should the first sentence read "A "GB 12345-90" code point assigned to
this character" rather than "A "GB 12345-90" code point assigned this character"?
(My English is not perfect, so the current wording may be correct.)

Syntax fields
¯¯¯¯¯¯¯¯¯¯¯¯¯
This is a review of the Syntax field of each property, checking whether it is consistent
with the Description field and whether the Description says enough about it.

General comments: Common potential issues.
1. The period should be escaped with a backslash in the Syntax (i.e. replace . with \.)
Applies to kCheungBauerIndex, kCihaiT, kCowles, kDaeJaweon, kDefinition, kFennIndex,
kFourCornerCode, kHanYu, kIICore, kIRGDaeJaweon, kIRGHanyuDaZidian, kKangXi,
kRSAdobe_Japan1_6, kRSJapanese, kRSKangXi, kRSKanWa, kRSKorean, kRSUnicode, kSBGY.
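
As a minimal Python sketch of why the escape matters: an unescaped period matches any
character, so a malformed value passes the check. Both patterns are simplified
illustrations rather than the exact UAX #38 Syntax, and the sample values are made up.

```python
import re

# Radical-stroke style pattern, without and with the escaped period.
# Simplified for illustration; not the exact Syntax of any property.
UNESCAPED = re.compile(r"^[0-9]{1,3}.[0-9]{1,2}$")
ESCAPED = re.compile(r"^[0-9]{1,3}\.[0-9]{1,2}$")

def accepts(pattern, value):
    """Return True if the anchored pattern matches the value."""
    return pattern.match(value) is not None

# Hypothetical values: the first is well formed, the second is not.
good = "120.5"
bad = "120x5"

print(accepts(UNESCAPED, good), accepts(UNESCAPED, bad))  # True True
print(accepts(ESCAPED, good), accepts(ESCAPED, bad))      # True False
```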

2. The leading ^ and trailing $ are missing; add them for consistency with the Syntax
field of the other properties.
Applies to kHangul, kHanyuPinlu, kHanyuPinyin, kHDZRadBreak, kIRG_GSource, kIRG_USource,
kIRG_VSource, kMandarin, kSemanticVariant, kSpecializedSemanticVariant, kTaiwanTelegraph,
kVietnamese.
2bis. kFourCornerCode and kXerox are missing only the trailing $.
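
A small Python sketch of what the anchors change, assuming a validator that searches the
value for the pattern. The four-digit pattern and the sample value are illustrative
assumptions, not taken from the database.

```python
import re

# The same four-digit pattern, without and with anchors.
UNANCHORED = re.compile(r"[0-9]{4}")
ANCHORED = re.compile(r"^[0-9]{4}$")

value = "123456"  # hypothetical value, too long for a four-digit code

print(UNANCHORED.search(value) is not None)  # True: "1234" is found inside
print(ANCHORED.search(value) is not None)    # False: the whole value must match
```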

3. How do the regular expressions relate to Normalization Form C? The second paragraph of
2.2 Unihan.zip states that each file is in Normalization Form C (NFC), so precomposed
characters are used whenever possible. The Syntax of properties whose text includes
diacritics (kHanyuPinlu, kHanyuPinyin, kMandarin, kTang, kVietnamese and kXHC1983)
therefore has a potential issue. The expressions are shorter and easier to read when
written as if the text were in NFD, with base letters and diacritics listed separately,
but this does not reflect exactly what is in the file. So should the expressions remain
as they are (some could be simplified anyway), for the sake of simplicity, or should they
list every possible character? That would mean a list of about 25+ precomposed characters
for kHanyuPinyin, and maybe more for kVietnamese.
In my comments below, I assume for kHanyuPinyin, kTang, kVietnamese and kXHC1983 that
values are matched in NFD form, as this keeps the regular expressions readable, whereas
kHanyuPinlu and kMandarin can be changed with minimal impact to match NFC.
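
The mismatch can be sketched in Python with the standard unicodedata module. The pattern
below is a simplified, NFD-style character class written for this illustration (it is not
the Syntax of any property), and the reading is an arbitrary pinyin syllable.

```python
import re
import unicodedata

# "NFD-style" pattern: base letters plus combining marks
# (grave, acute, macron, diaeresis, caron).
NFD_STYLE = re.compile(r"^[a-z\u0300\u0301\u0304\u0308\u030C]+$")

# "l" followed by U+01DC (u with diaeresis and grave), a precomposed letter.
reading = "l\u01DC"

nfc = unicodedata.normalize("NFC", reading)
nfd = unicodedata.normalize("NFD", reading)

print(len(nfc), len(nfd))                 # 2 4
print(NFD_STYLE.match(nfc) is not None)   # False: precomposed letter not listed
print(NFD_STYLE.match(nfd) is not None)   # True: l, u, U+0308, U+0300
```

Under this assumption, a checker would have to normalize each value to NFD before
applying an NFD-style expression to the NFC data files.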

4. Numbers without leading zeros. There is a well-defined way to express this, and it is
applied correctly in the Syntax of kLau, kKarlgren and kMeyerWempe, but not in the Syntax
of other properties, such as the radical-stroke counts kRS* (except kRSAdobe_Japan1_6).
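
A short Python sketch of the difference, using a loose digit-count pattern and the
[1-9][0-9]{0,3} form. The sample values are made up for illustration.

```python
import re

LOOSE = re.compile(r"^[0-9]{1,4}$")         # allows leading zeros
STRICT = re.compile(r"^[1-9][0-9]{0,3}$")   # first digit cannot be zero

values = ["7", "0007", "1234", "0"]  # hypothetical index values

# Each tuple is (value, accepted by LOOSE, accepted by STRICT).
results = [(v, bool(LOOSE.match(v)), bool(STRICT.match(v))) for v in values]
for row in results:
    print(row)
```

Only "7" and "1234" pass the strict form; "0007" and "0" pass the loose form alone.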


Specific comments:
Taking fields in the order in which they appear in UAX #38, i.e. alphabetical order:

	kCantonese:
The Syntax could be stricter, for example ^[a-z]{2,6}[1-6]$ (pronunciations have a minimum and a maximum length)

	kCheungBauer:
Specify that the / in the third part of the Syntax separates alternative tones, as in U+55D7:
kCheungBauer	030/10;RBBB;gut6/4/2,gwat1
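
A minimal Python sketch of that reading of the value, assuming that "/" lists alternative
tones for the preceding syllable. The function name and the variable names for the three
parts are labels chosen for this illustration.

```python
import re

def expand_readings(field):
    """Expand 'gut6/4/2,gwat1' into one reading per syllable and tone."""
    readings = []
    for item in field.split(","):
        # A syllable followed by one tone digit and optional "/tone" alternatives.
        m = re.fullmatch(r"([a-z]+)([1-6](?:/[1-6])*)", item)
        if m is None:
            raise ValueError(f"unexpected reading: {item!r}")
        syllable, tones = m.groups()
        readings.extend(syllable + tone for tone in tones.split("/"))
    return readings

value = "030/10;RBBB;gut6/4/2,gwat1"  # the U+55D7 value quoted above
radical_strokes, cangjie, reading_field = value.split(";")

print(expand_readings(reading_field))  # ['gut6', 'gut4', 'gut2', 'gwat1']
```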

	kDaeJaweon:
The Description does not indicate why the Syntax ends with [0158]; it only says "the final
digit in the position being "0" for characters actually in the dictionary and "1" for characters
not found in the dictionary and assigned a "virtual" position in the dictionary."
So where do 5 and 8 come from?

	kDefinition:
The Syntax should read ^[^\t"]*$, which excludes tabs and double quotes and gives a clue
that the definition is on a single line (no line breaks allowed); see also general comment 1.

	kFenn:
The Description does not explain the meaning of the lowercase "a" that may follow the
number (the a? in the Syntax).

	kGSR:
In the Syntax, the asterisk (*) before the end should be a question mark (?), since at most
one apostrophe can end the value of the property.

	kHanYu:
The Description is more precise than the Syntax: it says that ""XY" is the zero-padded
number of the character on the page [01..32]", so the regular expression
should be ^[1-8][0-9]{4}\.[0-3][0-9][0-3]$
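
As a sketch only: the expression above still accepts XY values such as 00 or 39. If the
[01..32] range were to be enforced exactly, one possible form is shown below in Python.
The second pattern is my own assumption, not something stated in UAX #38, and the sample
values are made up.

```python
import re

PROPOSED = re.compile(r"^[1-8][0-9]{4}\.[0-3][0-9][0-3]$")
# Hypothetical stricter variant: XY limited to 01..32.
RANGE_01_32 = re.compile(r"^[1-8][0-9]{4}\.(0[1-9]|[12][0-9]|3[0-2])[0-3]$")

values = ["10001.010", "10001.320", "10001.000", "10001.390"]

# Each tuple is (value, accepted by PROPOSED, accepted by RANGE_01_32).
results = [(v, bool(PROPOSED.match(v)), bool(RANGE_01_32.match(v))) for v in values]
for row in results:
    print(row)
```

The last two values pass the first pattern but not the range-limited one.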

	kHanyuPinlu:
Given that the Unihan files are in NFC (see general comment 3), U+0308 cannot occur,
and the Syntax should be ^[a-z\x{FC}]+[1-5]\([0-9]+\)$
Moreover, while the Delimiter is correct, the Description incorrectly says that "the
list elements are "comma + space" delimited"; it should read "the list elements are
space delimited".

	kHanyuPinyin:
Several issues with the Syntax here:
1. Under the assumption made in general comment 3, and to keep the Syntax self-consistent,
remove \x{FC}, because it is precomposed and can be obtained as u + \x{308}.
2. Independently, the Syntax as given is incorrect, even once its missing right parenthesis
at the end is restored.
   With the correction of 1. above applied, the regular expression of the Syntax should be
   ^(\d{5}\.\d{2}0,)*\d{5}\.\d{2}0:([a-z\x{300}-\x{302}\x{304}\x{308}\x{30C}]+,)*[a-z\x{300}-\x{302}\x{304}\x{308}\x{30C}]+$
3. The presence of U+0302 COMBINING CIRCUMFLEX ACCENT is surprising here, as it is not
customarily used in pinyin. In the database, only U+6B38 and U+8A92 were found to use it, in
a double diacritic (U+0302 + U+0304 and U+0302 + U+030C), which also seems strange to me. It
could be right, or it could be an input error.

	kHDZRadBreak:
The Syntax is too loose and is not consistent with the Description and the content of the database.
The code point between square brackets is the code point of the radical, so we can be more precise,
and since the character on which the break occurs is in the dictionary, the last digit is always 0.
See also my comment on kHanYu above.
So the Syntax should be ^[\x{2F00}-\x{2FD5}]\[U\+2F[0-9A-D][0-9A-F]\]:[1-8][0-9]{4}\.[0-3][0-9]0$
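
A minimal Python sketch of this expression, with one extra cross-check that goes beyond
what a regular expression can state: the radical character must be the code point given
in brackets. Python's re module does not accept \x{...}, so \uXXXX escapes are used
instead. The function name and the sample values are illustrative assumptions.

```python
import re

PATTERN = re.compile(
    r"^([\u2F00-\u2FD5])\[U\+(2F[0-9A-D][0-9A-F])\]:"
    r"[1-8][0-9]{4}\.[0-3][0-9]0$"
)

def check(value):
    """Match the pattern, then compare the radical with the bracketed code point."""
    m = PATTERN.match(value)
    if m is None:
        return False
    radical, hex_code = m.groups()
    return ord(radical) == int(hex_code, 16)

print(check("\u2F00[U+2F00]:10001.010"))  # True
print(check("\u2F00[U+2F01]:10001.010"))  # False: radical and code point differ
print(check("\u2F00[U+2F00]:10001.011"))  # False: last digit is not 0
```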

	kIRGDaeJaweon:
The Description does not say why the Syntax allows the value of this property to be 0000.555

	kIRGDaiKanwaZiten:
The Syntax could be more precise; the data actually matches ^[0-9]{5}'?$

	kIRG_GSource:
The Syntax can be made more precise:
• BK and CH are either alone or followed by six decimal digits
• According to the Syntax, FZ may be followed by _BK, but I checked that this never happens
in the current (5.2.0d1) version of the database, so the "(_BK)?" part could be deleted.
• KX is always followed by six decimal digits, the first never greater than 1
• The last part is missing: the standards whose identifiers start with G are listed without
the initial G, followed by a dash and four hexadecimal digits.
^(4K|BK|((BK|CH)[0-9]{6})|CH|CY|FZ(_BK)?|HC|([HX]C[0-9]{6})|HZ|(KX[01][0-9]{5})|((JZ|ZJW|ZFY|CYY|HZ|FZ)[0-9]{5})|(H[0-9]{6})|([0135789ES]-[0-9A-F]{4}))$
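
One point to watch with a long alternation like this: ^ and $ bind more tightly than |, so
without an enclosing group they apply only to the first and last alternatives. A small
Python sketch with two of the source codes, using a made-up malformed value:

```python
import re

UNGROUPED = re.compile(r"^BK|CH$")    # parsed as (^BK) or (CH$)
GROUPED = re.compile(r"^(BK|CH)$")    # anchors apply to every alternative

value = "BK-extra"  # hypothetical malformed value

print(UNGROUPED.search(value) is not None)  # True: "^BK" alone matches
print(GROUPED.search(value) is not None)    # False
```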

	kIRG_KSource:
The Syntax mentions:
• a K5 source, present in the database, not mentioned in the description
• a K7 source, missing in the database.
-> alter the Syntax to omit the 7.
-> complete the description

	kIRG_TSource:
The sources TC, TD and TE are not documented in the description.

	kIRG_VSource:
The source V4 is not documented in the description.

	kMandarin:
See general comment 3. If the normalization form is taken into account, \x{308} should be
replaced by \x{DC} in the Syntax.

	kMatthews:
As noted in general comment 4, the fact that the index is an integer without zero padding
should be expressed as [1-9][0-9]{0,3}
-> the resulting expression is ^[1-9][0-9]{0,3}(a|\.5)?$
The meanings of the .5 suffix (virtual position or added later?) and of the "a" suffix are not documented.

	kMorohashi:
Many entries in the database have a value of 99999, which I guess means that the character
is not in the dictionary. Either mention this in the Description, or remove all the
superfluous entries from the database.
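
A sketch of how such entries could be counted in Python, assuming the usual data file
layout of tab-separated code point, property and value, with comment lines starting
with "#". The function name and the sample lines are made up for illustration and are
not real data.

```python
def count_placeholder_entries(lines, field="kMorohashi", placeholder="99999"):
    """Count lines whose property is `field` and whose value is `placeholder`."""
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) == 3 and parts[1] == field and parts[2] == placeholder:
            count += 1
    return count

# Hypothetical excerpt, not taken from the database.
sample = [
    "# made-up sample",
    "U+3400\tkMorohashi\t99999",
    "U+4E00\tkMorohashi\t00001",
    "U+4E00\tkTotalStrokes\t1",
]
count = count_placeholder_entries(sample)
print(count)  # 1
```

With a real file, the open file object can be passed directly as `lines`.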

	kRSAdobe_Japan1_6:
Syntax: The plus sign should be escaped with \ as it is a special character for regular
expressions, and as noted in general comment 1., the period should also be escaped for
the same reason.
--> ^[CV]\+[0-9]{1,5}\+[1-9][0-9]{0,2}\.[1-9][0-9]?\.[0-9]{1,2}$

	kRSJapanese, kRSKangXi, kRSKanWa, kRSKorean:
See General comment 4. (no leading zero). The Syntax should be ^[1-9][0-9]{0,2}\.[0-9]{1,2}$
The Description is misleading: there is no apostrophe in the database, nor in the Syntax.

	kRSUnicode:
See General comment 4. (no leading zero). The Syntax should be --> ^[1-9][0-9]{0,2}'?\.[0-9]{1,2}$

	kSBGY:
See my comment on kHanYu; the Description specifies that ""XY" is the zero-padded number of
the character on the page [01..73]", so the Syntax becomes ^[0-9]{3}\.[0-7][0-9]$

	kSemanticVariant, kSpecializedSemanticVariant:
The Syntax is inexact: it does not properly reflect what the Description specifies
about the colon.
It should be ^U\+2?[0-9A-F]{4}(<k[A-Za-z0-9]+:T?B?Z?(,k[A-Za-z0-9]+:T?B?Z?)*)?$
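
A quick Python sketch of what this expression accepts and rejects. The sample values are
invented for illustration (they are not taken from the database); they only show how the
colon is treated.

```python
import re

PATTERN = re.compile(
    r"^U\+2?[0-9A-F]{4}"
    r"(<k[A-Za-z0-9]+:T?B?Z?(,k[A-Za-z0-9]+:T?B?Z?)*)?$"
)

samples = [
    "U+5B78",                           # bare code point
    "U+5B78<kMatthews:T",               # one source with a flag
    "U+5B78<kMatthews:T,kMeyerWempe:",  # two sources, the second without flags
    "U+5B78<kMatthews",                 # source without a colon
]

# Each tuple is (value, accepted by PATTERN).
results = [(v, bool(PATTERN.match(v))) for v in samples]
for row in results:
    print(row)
```

The first three values are accepted; the last is rejected because the expression requires
a colon after every source name.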

	kSimplifiedVariant:
Syntax should be like other Variant properties, so ^U\+2?[0-9A-F]{4}$

	kTang:
Syntax: same potential issue as kHanyuPinlu on normalization form
Syntax: a backslash must precede each x, and the asterisk should be escaped, as it is a
special character (maybe not necessary, but for clarity).
Resulting Syntax: ^\*?[A-Za-z()\x{E6}\x{251}\x{259}\x{25B}\x{300}\x{30C}]+$

	kVietnamese:
See General comment 3. (normalization form potential issue).

	kXHC1983:
See General comment 3. (normalization form potential issue).
The Description is inaccurate: it reports "Each pīnyīn reading is preceded by the character’s
location(s) in the dictionary, separated from the reading by ": " (colon,space)", while in
the database (and in the Syntax) they are separated by a colon only.
The page part of the Syntax is imprecise, although the precise format is given in the
Description; why not copy it from there?
As in kTang, backslashes are missing before each x.
Also, my comment 1. on kHanyuPinyin applies here.
Resulting Syntax: ^[0-9]{4}\.[0-9]{3}\*?(,[0-9]{4}\.[0-9]{3}\*?)*:[a-z\x{300}\x{301}\x{304}\x{308}\x{30C}]+$

	kZVariant:
The Description should specify what the optional part after the colon is (another property
that is the source?), or refer to the other k*Variant properties.

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --