From: James Kass (thunder-bird@earthlink.net)
Date: Fri Oct 26 2007 - 00:20:39 CDT
John H. Jenkins wrote,
> There is actually considerable room for improvement.
There is always room for improvement in any system.
> First of all, the experience of Extension C showed that there was a
> serious QA problem in the IRG. The amount of effort involved in
> identifying unifiable pairs entirely by hand left the whole process
> error-prone. This has largely been corrected with Extension D work.
To save those unfamiliar with the abbreviation the trouble of looking
it up, "QA" means quality assurance in this case. During the review
period and prior to formal encoding of Extension C, some problems with
Ext. C were brought to the attention of IRG. IRG responded admirably,
resulting in better QC (quality control) for future work.
This goes to show that public review periods are essential; they might
even be considered part of the QA/QC process.
> Secondly, the whole issue of "distinct ideographs" is getting nastier
> and nastier as the IRG has to deal with increasingly rare characters
> of uncertain provenance and meaning. So long as the IRG continues to
> treat each "distinct" ideograph as something that needs independent
> encoding, this is going to be a problem that plagues us.
As you may know, I've been studying and trying to get a solid
understanding of CJK unification. Something I'm having trouble
grasping is why identical/otherwise-unifiable pairs are considered
non-unifiable if they come from two different sources with two
apparently different meanings. After all, UNIHAN.TXT contains
many single characters with more than one definition, just
as English has many words with more than one meaning.
(Examples exist, like U+3ADA (㫚) and U+66F6 (曶).)
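One rough way to check the multiple-definition claim is to scan the kDefinition field and count the senses per character. The sketch below assumes the tab-separated "U+XXXX<TAB>field<TAB>value" line layout and assumes that major senses within kDefinition are separated by semicolons. The function name and the sample lines are hypothetical; the glosses are placeholders, not the real entries for these characters.

```python
from typing import Dict, Iterable, List


def parse_definitions(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each code point to its list of kDefinition senses.

    Assumes tab-separated lines and semicolon-separated senses.
    """
    senses: Dict[str, List[str]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t", 2)
        if len(parts) != 3 or parts[1] != "kDefinition":
            continue  # ignore every field except kDefinition
        code_point, _, value = parts
        senses[code_point] = [s.strip() for s in value.split(";") if s.strip()]
    return senses


# Placeholder data for illustration only; NOT actual Unihan content.
SAMPLE = [
    "# sample lines with made-up glosses",
    "U+3ADA\tkDefinition\tgloss A; gloss B",
    "U+66F6\tkDefinition\tgloss C",
    "U+66F6\tkOtherField\tplaceholder",
]

definitions = parse_definitions(SAMPLE)
# Characters that carry more than one sense in this sample.
multi_sense = [cp for cp, s in definitions.items() if len(s) > 1]
print(multi_sense)
```

Run against the real file (for example via open(..., encoding="utf-8")), the same function would give a count of characters with more than one listed sense, provided the format assumptions hold.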
So, if a rare character has uncertain provenance and meaning, but
it is unifiable, shouldn't it just be unified? And, if that character
is not unifiable, but it exists in texts (however obscure) that
someone may wish to reproduce electronically (for posterity,
perhaps), shouldn't it be encoded?
> If, for example, we'd had the concept of variant selectors an
> established part of the standard during the Extension B work, the IRG
> could have saved literally thousands of code points which are now
> dedicated to obscure variants found in the Hanyu Da Zidian. If we
> abandon the idea that every distinct ideograph requires separate
> encoding, we could speed up the whole process, improve the quality of
> work, and -- most important -- make implementation much simpler.
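For readers unfamiliar with the mechanism John mentions, here is a minimal sketch of how a variation sequence looks in text: a base character followed by a variation selector. It covers the two main selector ranges, U+FE00..U+FE0F and U+E0100..U+E01EF. The helper names are hypothetical, and pairing U+66F6 with U+E0100 is only an illustration; it is not claimed to be a registered sequence.

```python
# The two main code point ranges of Unicode variation selectors.
VS_RANGES = (
    (0xFE00, 0xFE0F),    # VS1..VS16
    (0xE0100, 0xE01EF),  # VS17..VS256 (supplementary range)
)


def is_variation_selector(ch: str) -> bool:
    """Return True if ch falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in VS_RANGES)


def strip_variation_selectors(text: str) -> str:
    """Drop variation selectors, leaving only the base characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


base = "\u66F6"                 # a single encoded ideograph
variant = base + "\U000E0100"   # hypothetical pairing: base + VS17

# The variant costs no new code point for the ideograph itself:
# it is the same base character plus a selector.
print(len(variant))
print(strip_variation_selectors(variant) == base)
```

The point of the design is that software which does not care about the glyph distinction can ignore the selector and still match the base character, whereas a separately encoded variant needs its own mapping data.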
We seem to have drifted off-topic for this thread. I thought
about changing the thread title to "CJK unification and variation
selectors", but that might get me started on VS characters again.
Is it really possible to speed up the process of encoding an
open-ended set?
Best regards,
James Kass