L2/02-454
From: John H. Jenkins
Date: 2002-12-09 19:04:02 -0800
Subject: Report on IRG #20

The 20th meeting of the IRG was held in Hanoi, Vietnam, from 18 to 21 November 2002. I attended as the liaison from Unicode, and Hideki Hiura of Sun attended for L2. Other delegations were from the People's Republic of China, Taipei Computer Association (for Taiwan), Hong Kong SAR, Macao SAR, Japan, the Republic of Korea (South Korea), the Democratic People's Republic of Korea (North Korea), Vietnam, and Singapore.

A full set of the documents distributed at the meeting is available at the IRG's Web site.

This was one of the most productive IRG meetings I've attended. Except for plenaries, primarily on the first and final days, the delegates were split into two groups: an Editorial Committee, and an ad hoc group which came up with resolutions to problems raised by the Editorial Committee. The former had a hard time of it; they were supposed to go through the revisions sent in by member bodies to the current Extension C1 data set, but there were innumerable minor problems that kept them from making as much progress as they would have liked.

One of the major problems plaguing the Editorial Committee was the lack of a uniform process for determining the radical of a character, its stroke count, and the shape of its first stroke. This makes it difficult for IRG editors to look up characters either in the Extension C1 data or in the currently standardized Unihan repertoire. Another was tracking whether two variant component shapes should be considered "the same" or "different" in order to maintain consistency with previous decisions. For the latter problem, Annex S of 10646 provides some guidance, as it has examples of unifiable and non-unifiable shapes, but the examples in Annex S are not exhaustive. Annex S, moreover, has not been updated in the light of Extension B work.
To resolve these problems, the IRG adopted the concept of an "IRG Editorial Radical" for a character and created two standing documents to cover stroke counts and unification examples.

IRG Editorial Radicals

The IRG Editorial Radical is accepted as being an artificial construct, providing a uniform, algorithmic way of determining a character's radical which does not necessarily reflect the actual radical which would be used for the character in a dictionary. The latter depends in many cases on a knowledge of the character's meaning, and for the obscure characters found in Extension C, this knowledge may not be general among IRG members. For example, the character U+6C39 氹 is the name of one of the islands making up the Macao SAR and as such is classified under the water radical. If one didn't know its meaning, however, one would tend to look it up under the "second" radical. For IRG purposes, the latter is more useful.

The IRG Editorial Radical is found by writing the character using an Ideographic Description Sequence. The sequence should be as short as possible while including at least one radical. The first radical in the IDS is taken as the IRG Editorial Radical. If a character cannot be written using an IDS, then the first stroke of the character is mapped to one of the first five KangXi radicals (horizontal line, dot, and so on), which is then used as the character's IRG Editorial Radical.

New Standing Documents

The first of the new standing documents is a set of common components found in ideographs, together with an accepted (if arbitrary) stroke count and first stroke shape. Thus, whenever the grass radical is found as a component of an ideograph, it is counted as having four strokes, even if the actual shape of the glyph uses the three-stroke form. This list is not exhaustive, and may be added to as needed by the Editorial Committee.

The second standing document is an extension of the unification and non-unification component examples from Annex S.
It is intended to be the current list of components which the IRG has identified as being unifiable (or not). Again, it is not intended to be exhaustive and may be added to as needed by the Editorial Committee.

Handling Doubtful Unifications

Another major result of the ad hoc group was a refocusing of IRG unification procedures. In the past, IRG editors have tended to adopt a policy of "when in doubt, don't unify." That is, unless they *know* that a pair of ideographs should be unified, they tend not to unify them. This has been altered to "when in doubt, unify," on the basis that an incorrect unification is an easier problem to fix later on than an incorrect disunification.

It is (among other things) in response to this that the South Korean delegation has removed some 3500 characters from their Extension C1 proposal.

Standard Unihan Subset

The IRG has also reported favorably to WG2 on the need for a standard subset of Unihan to be used throughout East Asia. Such a subset would probably run to about 7,000 or 8,000 characters and will include virtually every character needed for most modern texts. This should be a tremendous advantage to implementers, as it allows fonts, user interfaces, input methods, and so on to be optimized for that relatively small number of ideographs which are genuinely useful. The IRG also adopted a procedure which may be used to create this subset.

Action Item for UTC: We need to determine if there is a Unicode contribution to this set. In our case, we should start with EACC and add or remove characters according to whether we feel they are genuinely useful for modern texts.

Administrative Matters

All IRG members are to appoint an editor for their delegation. This is to be the individual who is responsible for the quality of their submissions to the IRG and who should be able to answer questions about their submission if they come up.

Action Item for UTC: I've been acting in this role, although not formally appointed. I assume the UTC and L2 would want to make this formal.
Mr. Zhang was approved for another term as the IRG's rapporteur, but he really, really, REALLY doesn't want to keep doing it. He hopes that he can retire after one more year, and that someone will volunteer in the near future to be his successor so that there can be a transition period.

The hosts did an outstanding job, given the infrastructure problems in Hanoi. The only real difficulty we faced was Internet access being limited to a small number of dial-up lines with a maximum speed of 28000 baud, but that was to be expected under the circumstances. Otherwise, they managed things very well, including an interesting tour of Hanoi.

One could tell we were a bunch of character geeks; when we toured the Ethnological Museum, we tended to walk past displays of clothing, housing, games, and so on, but lovingly crowded around all the glass cases full of writing materials. And, of course, there's the lake in central Hanoi with a tiny island in the middle, on top of which sits a "turtle pagoda" honoring the legend of the giant turtle who there received from one of Vietnam's kings his magical sword (rather like the Lady of the Lake and Excalibur, with the difference that there isn't a stuffed Lady of the Lake in a temple on the shore).

Bulldog Award

James Seng, the entire delegation of Singapore, was one of the hardest workers at the meeting. Not only did he have to split himself in two in order to attend both the ad hoc group and the Editorial Committee, but he also made important contributions. He wrote up an "Annex S for Dummies" document, which he'll continue to work on. Its purpose is to explain in less formal language how the IRG does its unification work, how the submission process works, and so on. He also did considerable behind-the-scenes work on other issues, such as the advisability of separating pure strokes from the current set of ideographs in Extension C1 and submitting them to WG2 to be part of a "stroke block" in 10646.
And, of course, apologies for the lateness of this summary; but Hideki and I both seem to have acquired interesting microbes during our stay in Hanoi which have kept us at less than peak efficiency since we got home.

==========
John H. Jenkins