L2/13-017

Source: Michel Suignard
Date: January 22, 2013
Subject: Unihan data file change

Please add the following message/content to the UTC agenda. Note that this is not a Unicode 6.3 issue.

Issue statements:
---------------------
1) The 10646 normative CJK data files cannot be used directly for character chart production. This creates a small issue on the 10646 production side (solved by running a special Perl script) and a bigger issue on the Unicode side: Unihan.zip cannot be trusted, so the Unicode charts are produced with a file which is not synchronized with the Unihan data.

2) Synchronization between Unicode and 10646. Someone has to take the ISO/IEC 10646 data files and input them into the Unihan database. It has been suggested that tools could be created or improved to make the process smoother. However, the regex for the CJK sources is inherently unstable, and such tools would have to be rechecked and updated every time a new data file drop is made from the 10646 side.

Proposed solutions:
-----------------------
On the 10646 side, create a new data file compatible with Unihan, containing the following items:

kIRG_Gsource
kIRG_Hsource
kIRG_Msource
kIRG_Tsource
kIRG_Jsource
kIRG_Ksource
kIRG_KPsource
kIRG_Vsource
kIRG_Usource
kRSUnicode
kCompatibilityVariant
kIICore

Notes:
kRSUnicode would be augmented on the 10646 side to contain RS variants, so that it is identical to the current Unihan values.
The kIICore regex would be changed to [ABC]{1}[GHJKMPT]{1,7} (there are 7 possible sources: G, T, J, H, K, M, and KP (noted as P), plus a priority value (A to C)).
kCompatibilityVariant was recently deprecated, but because it is a normative value for 10646, it would need to be changed to something like 'derived'.

Benefits
----------
It would greatly simplify life for everyone if the normative data files for 10646 were in the same format as Unihan. There would be no need to run grep or any format conversion tool.
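
To illustrate why a shared format removes the conversion step, here is a minimal Python sketch, not part of the proposal itself. It assumes the new file uses the tab-separated Unihan layout (code point, field name, value, with '#' comment lines). The function names and the sample records are hypothetical, and the kIICore pattern is the one given in the notes above, written with the standard {1,7} range quantifier.

```python
import re

# kIICore pattern from the notes: one priority letter (A to C)
# followed by one to seven source letters.
IICORE_RE = re.compile(r"[ABC]{1}[GHJKMPT]{1,7}")


def parse_unihan(text):
    """Parse Unihan-style lines into {(code_point, field): value}."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code_point, field, value = line.split("\t", 2)
        entries[(code_point, field)] = value
    return entries


def invalid_iicore(entries):
    """Return code points whose kIICore value does not match the pattern."""
    return sorted(
        code_point
        for (code_point, field), value in entries.items()
        if field == "kIICore" and not IICORE_RE.fullmatch(value)
    )


# Hypothetical sample records, for illustration only.
SAMPLE = (
    "# illustrative records only\n"
    "U+4E00\tkIICore\tAGTJHKMP\n"
    "U+4E01\tkIICore\tD1\n"
    "U+4E00\tkRSUnicode\t1.0\n"
)

if __name__ == "__main__":
    parsed = parse_unihan(SAMPLE)
    print(invalid_iicore(parsed))
```

Because both sides would read the same layout, the same parser could serve the 10646 file and the Unihan-generated file without an intermediate conversion.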

There are some significant benefits on the 10646 side:
- Three files (CJKU_SR.txt, CJKC_SR.txt, and IICORE.txt) are replaced with a single one.
- There is no need to run a Perl script or a similar conversion utility to convert the format.
- The file can be used directly by Unibook (a consequence of the previous benefit).

Because the change is mostly editorial (none of the changes affect normative values, just how they are presented in linked data files), it does not require heavy lifting on the WG2 side. It can be proposed as project editor input. And because Perl scripts already exist to go both ways, it is trivial to do the conversion to this proposed new data file format.

On the Unicode side:
With the 10646 file identical to the Unihan-generated data file, it becomes trivial to determine synchronicity, and the charts for both Unicode and 10646 can be safely produced with the same file.

Notes:
--------
To some degree, what UTC does with the new file is up to the group: the new file can be used as a mere input to the Unihan database, or we can go beyond that. Changing the script to use the new file is pretty easy; I don't like the verbose aspect of Unihan files, but they are very easy to regex. This does open some perspectives for improving synchronicity, but that is up to UTC and can be processed on its own terms.

Michel