Report about Beijing

L2/00-108

Report on the WG2 meeting #38 in Beijing, March 21-24, 2000

March 30, 2000

From: Kenneth Whistler [kenw@sybase.com]

The voluminous meeting minutes that Uma keeps for the WG2 meetings will no doubt be available for review soon, but in the meantime, here are the highlights of the meeting, with a focus on issues from the U.S. national body perspective.

A. Disposition of Comments for the CD for 10646-2

This was the main piece of work for the meeting, and took up most of the meeting time. Working through everything was rather complicated, but in the end, little actually changed in the encoding.

The most important thing to note is that no chunks were removed from the CD. Ireland had asked for Byzantine and Western musical symbols to be removed; Sweden asked for the math alphanumerics and all of Plane 14 to be removed; Germany asked for Deseret and math alphanumerics to be removed; Finland asked for the 302 Japanese JIS X 0213 characters to be removed (actually moved to the BMP); Canada asked for those and 1739 other Han characters to be moved to the BMP. If all of these requests had been accommodated, there would not have been much left (except a still enormous chunk of Han characters). In the end, none of these requests was accommodated, since all such large removals could have jeopardized the vote of some other NB.

The changes to the code tables that were agreed on were as follows:

1. The Etruscan glyph at 1031B will be reversed.

2. Gothic precomposed letter 1033A EIS WITH DIAERESIS will be removed and the rest of the Gothic letters moved up. (Interestingly, the request to remove this precomposed letter was from two European NB’s.)

3. The relevant dotted circles will be added to the Western musical symbols, per the U.S. comments.

4. The name of 1D191 VOID NOTEHEAD was questioned, with the suggestion that it be changed to WHITE NOTEHEAD. (The problem is the semantic clash between VOID NOTEHEAD and NULL NOTEHEAD, as well as the usual naming convention in 10646 of using FILLED and WHITE to indicate black versus outlined versions of things.)

5. The committee agreed to the editor’s suggestion to add back the 5 missing straight-barred capital theta symbols, which were a mistake in generating the charts from the original document.

6. The IRG editor’s report on Vertical Extension B was endorsed by China, Japan, and the U.S., and was adopted intact.

The big issue of whether to peel out some chunk of unified ideographs from Extension B to encode on the BMP was focussed on the Canadian proposal, WG2 N2183. That proposal suggested that 2041 characters be removed from Extension B, to be encoded on the BMP. That total of 2041 included 302 from JIS X 0213, 168 from Korea, and 1571 from Hong Kong (1088 GCCS + 483 HKSAR KX column).
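As a quick cross-check of the figures quoted from N2183, the breakdown can be tallied in a few lines of Python. This is only an arithmetic sketch; the dictionary keys are illustrative labels, not names taken from the proposal itself.

```python
# Character counts per source, as quoted above for WG2 N2183.
# The keys are illustrative labels chosen for this sketch.
sources = {
    "JIS X 0213": 302,
    "Korea": 168,
    "Hong Kong GCCS": 1088,
    "Hong Kong HKSAR KX column": 483,
}

# Hong Kong subtotal (GCCS plus the HKSAR KX column).
hong_kong = sources["Hong Kong GCCS"] + sources["Hong Kong HKSAR KX column"]

# Grand total proposed for removal from Extension B.
total = sum(sources.values())

print(hong_kong, total)
```

The subtotal and total agree with the 1571 and 2041 stated in the proposal summary.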

The proposal basically failed to garner any consent. China flatly opposed it. Korea refused to take a position. Japan said, almost literally, “Of course, Japan would not refuse if a present were given to us, but we cannot take a position for or against this proposal.” The U.S. stated its agreed-upon position against the proposal. TCA did not speak in favor of the proposal. Ireland spoke against it—largely because of the large inflation in the size of the repertoire now asked for (from the original Japan 302 up to 2041, which would have spilled greatly beyond the only big open chunk on the BMP which could be used to encode these). The U.K. was initially tentatively for, but quickly waffled back when they heard the strong objection from China. Faced with no clear support from the IRG countries, Canada basically folded their cards and the meeting went on to other issues.

China has committed to supplying the revised Vertical Extension B, incorporating the IRG Editor’s report (and with all the characters resorted, to close up any gaps) by the middle of April. It was quite clear that any agreement to remove a chunk from Vertical Extension B would have jeopardized the FCD for 10646-2, since the issue would have had to go back to the IRG in June, even if IBM Canada could supply the exact list of characters almost immediately, and since the country in charge of the technology needed to actually finish the charts and tables was basically opposed to doing that work. But Michel is supposed to issue the FCD for 10646-2 in May, well before the IRG could even start work on the separation.

By the way, WG2 also agreed to eliminate the anachronistic gap of 0000..00FF at the beginning of Plane 2. That apparently was stuck in as the result of an inconclusive discussion in Copenhagen about whether or not to allow room for possible encoding of new CJK symbols on Plane 2. That decision was snuffed in Beijing, and the recoded Extension B for the FCD will start at 20000, instead of 20100.

The other major change for 10646-2 was the addition of the TCA proposed repertoire of compatibility Han characters from CNS-11643. (See the next major topic.)

The big oversight that I know about is that the U.S. failed to ask for the replacement of the open-face italic math alphabet with the bold fraktur math alphabet. This was a mistake and misunderstanding that barbara beeton had pointed out, but which Michel and I forgot to bring up in the midst of all the heated CJK discussions. Mea culpa. At this point, I think we can still get this fixed if we remember to put in a strong U.S. comment to this effect in the FCD comments. Even though it will mean changing the names and glyphs for 52 characters, it will be clear to everyone that this is really a minor fix in the context of the entire math alphanumerics section.

In addition to the encoding and/or glyph changes summarized above, there were a large number of individually small changes made to the text of the draft standard. These were mostly just small clarifications, and have no noticeable impact on the technical content of the standard.

B. CJK Compatibility Characters

There were 3 major contributions on this topic. WG2 N2159 was a request from TCA to encode 527 CNS 11643:1992 compatibility characters on Plane 2. WG2 N2196 was a contribution from Japan trying to spell out all the principles needed for dealing with compatibility characters, including the principles needed for considerations of unification of compatibility characters. WG2 N2197 was a request from Japan to encode 61 JIS X 0213 compatibility characters in the BMP. (The last is the revised version of the 56 that the UTC already approved for encoding in the BMP and that L2 endorsed.)

I represented the U.S. stated position that WG2 should request that the IRG consider these (along with other candidates from the Hong Kong repertoires) for unification. The IRG representatives pushed back. Michel, speaking as editor, stated the position that it probably wasn’t worth bothering with because of the “minuscule” chance that there were any characters in common, anyway. I said not so. So overnight I did the first level of unification and demonstrated that just between the two proposals in hand, N2159 and N2197, there were 18 candidates for unification between the JIS X 0213 set and the CNS 11643 set, plus 4 more in the JIS X 0213 set that could be unified with existing radical symbols, plus 1 more in the JIS X 0213 set that might be unified with an already encoded compatibility character (FA25).

The next day, China, TCA, Japan, and the U.S. held an ad hoc to sort this out. The urgency for a decision was driven by the same considerations as for splitting Vertical Extension B. Anything taken to the IRG would have to wait until June to even get started, which would preclude anything getting into the FCD for 10646-2, unless the entire schedule were to be reworked.

Some in the ad hoc wanted to argue principles first, but I objected that we might get nowhere that way, and suggested we work from the concrete examples I had identified in the proposals. That quickly led us to a rather productive and detailed discussion of some of the problems involved in trying to unify compatibility characters. There was no question regarding the unified character identity of the candidates; the issues really come down to different standards having different criteria for how they separate out and codify particular distinct glyphs. That creates really significant problems when you try to unify the compatibility characters. It quickly became evident that there was no way the ad hoc could propose and justify a unification for the two proposals (let alone other candidate sets) during the WG2 meeting. That left only two viable options:

1. Remand the whole thing to the IRG, to establish principles, gather repertoires, and decide on unifications among the compatibility sets; or

2. Simply treat each compatibility repertoire from a separate source as a distinct entity and encode each repertoire separately.

In the end, the ad hoc recommended course #2 as more practical, because of the urgency of dealing with the mapping issues for both JIS X 0213 and CNS 11643:1992, and WG2 adopted that position.

So the upshot is:

The 527 CNS 11643:1992 compatibility characters are to be added to the FCD 10646-2, to be encoded in Plane 2, at 2F800..2FA0E. And the discussion of Plane 2 in Part 2 will be modified, to account for an area from 2F800..2FFFD which is preallocated to CJK compatibility characters. That, presumably, is where the Hong Kong compatibility characters will end up later.

The 61 JIS X 0213 compatibility characters were accepted into the WG2 “bucket” for addition in a future amendment to 10646-1, with encodings as proposed: FA30..FA6B. This was in accord with the U.S. position (following UTC action on this set). I personally protested about the 4 radical additions, which will get us another claw radical, another one-dot running radical, and, yes folks, two(!) more grass radicals. But I really could not hold to that position for the U.S., since we had already agreed in principle to the additions (sight unseen) beyond the 56 that we had already accepted.

Japan has an action item to significantly revise N2197 to provide the kind of detailed information about the mappings that TCA provided for the CNS characters in N2159.
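The Plane 2 figures quoted above for the CNS 11643:1992 set can be checked with a little code point arithmetic. The following is a minimal sketch; the helper names `span` and `plane` are hypothetical and exist only for this illustration.

```python
def span(first: int, last: int) -> int:
    """Number of code points in the inclusive range first..last."""
    return last - first + 1


def plane(code_point: int) -> int:
    """Plane number of a code point (each plane holds 0x10000 code points)."""
    return code_point >> 16


# Ranges as stated in the report.
CNS_FIRST, CNS_LAST = 0x2F800, 0x2FA0E      # CNS 11643:1992 compatibility characters
AREA_FIRST, AREA_LAST = 0x2F800, 0x2FFFD    # area preallocated to CJK compatibility

cns_count = span(CNS_FIRST, CNS_LAST)
area_size = span(AREA_FIRST, AREA_LAST)
remaining = area_size - cns_count           # room left for later repertoires

print(cns_count, area_size, remaining, plane(CNS_FIRST))
```

The range 2F800..2FA0E holds exactly the 527 characters requested, and under these figures the preallocated area leaves room for further compatibility repertoires.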

C. Math Symbols Proposal

The major contribution for encoding the additional math symbols came in as WG2 N2191R, from the U.S. I spoke for the proposal, and there was essentially no opposition from anyone else. WG2 had seen earlier draft tables for this, so they knew this was coming. Several NB’s asked to review the charts further, and I have provided them with the list of all the glyph problems we know about in what we were able to prepare in N2191R.

For strategic reasons, the convenor chose not to try to press for any new amendment for 10646-1 to come out of the Beijing meeting, including this repertoire. Instead, the other NB’s will get to chew over the working document, and in Greece in September, it is anticipated that we can use the math symbols proposal as the basis for a large, “omnibus” amendment that can pick up all the other accepted characters sitting in WG2’s bucket at that time, to progress to PDAM balloting.

I am incorporating NB feedback and responses on technical questions from the AMS folks into a revised version of the document that Asmus and I will bring into the April UTC/L2 meeting for recertification by the UTC, so that we can be clear about just what we are carrying in to Greece for initiating the amendment.

D. Other Firm Additions to the BMP

A number of other small additions to the BMP were approved. The UTC will need to discuss and approve most of these.

1. 8 Cyrillic Sami characters. (WG2 N2173)

These are the revision for the 6 that were pulled at the last minute from 10646-1:2000 (and Unicode 3.0), plus 2 extras. They are encoded in the holes in the extended Cyrillic area.

2. U+17DD KHMER SIGN LAAK (WG2 N2164)

This is a distinct form of the Khmer “etc.” sign. The same document asked for the removal of 4 characters not used in modern Khmer and the four “Bauhahn” characters that he claims usage for in Pali transcription/transliteration. The “no removals” policy was explained again, and no removals were done.

3. U+20B0 GERMAN PENNY SYMBOL (WG2 N2188)

This is the one that the UTC has already adopted.

4. U+20B1 PESO SIGN (cap P with two bars) (WG2 N2156)

This is the form used in the Philippines.

5. 5 additional Yi radicals (WG2 N2207)

These were the 5 omitted radicals that we could not get agreement on at the London meeting. China acquiesced this time to the Irish proposal, so they go in, in the holes in the existing Yi radical chart.

6. 23CD (??) SQUARE FOOT (WG2 N2184)

214A PROPERTY LINE

These are the characters from Asmus’ proposal. SQUARE INCH was not encoded, as there is still no evidence for it. The encoding for SQFT is problematical, as the one suggested during the meeting has a clash. This will need to be worked out together with the math symbols proposal.

E. Tentative Additions to the BMP (WG2 N2195) for CJK symbols

In N2195, Japan presented more evidence of usage for most of the symbols from JIS X 0213 that the UTC had queried. Of these, the committee came to the general consensus that items 1 to 12 in N2195 had sufficient justification and should be accepted for encoding. Two of these were already in the math symbols proposal, and the mandate of the WG2 resolution was to work the remaining ones into the context of the math symbols proposal, based on feedback to the expert’s group on this topic. So I have an action item to provide suggested encodings for these and add them to the math symbols proposal, pending UTC consideration and feedback from Japan in particular. The additions are:

1. A double plus sign and a triple plus sign

2. 15 dentist symbols (which will get treated like more box-drawing characters)

3. Katakana double hyphen (U+30AF)

4. left and right white parentheses

5. another iteration mark (pending clarification from Japan)

6. the masu mark

7. two katakana vertical digraphs (koto and yori)

8. the part alternation mark (marks the start of a song)

9. a white and black sesame dot

For a probable total of 27 more characters.

The U.S. has an action item to respond to the Japanese reaction regarding the rising and falling tone letters. (Item 3.3 in N2195)

Japan considers the precomposed Kana for Ainu to be back in their court again.

And all the circled numbers are an open issue still. Japan wants a usable composition method for representing these, if the UTC refuses to accede to any more circled numbers.

F. Architectural Change to 10646: Limit to 10FFFF

The U.S. proposal to eliminate the private use planes and groups above 10FFFF was accepted! The text proposed in WG2 N2175 will be incorporated into a future amendment to 10646-1 -- presumably the same amendment that will incorporate the text changes that Michel is gathering that result from adding Part 2 to the standard.
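The report does not spell out why 10FFFF is the chosen ceiling. One well-known property of that value is that it is the highest code point addressable with a UTF-16 surrogate pair, which the following sketch illustrates. The function name `from_surrogates` is hypothetical; the surrogate ranges and the decoding formula are those of UTF-16.

```python
# UTF-16 surrogate ranges.
HIGH_FIRST, HIGH_LAST = 0xD800, 0xDBFF
LOW_FIRST, LOW_LAST = 0xDC00, 0xDFFF


def from_surrogates(high: int, low: int) -> int:
    """Decode a UTF-16 surrogate pair into a code point."""
    if not (HIGH_FIRST <= high <= HIGH_LAST and LOW_FIRST <= low <= LOW_LAST):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - HIGH_FIRST) << 10) + (low - LOW_FIRST)


# The largest pair gives the largest reachable code point.
max_code_point = from_surrogates(HIGH_LAST, LOW_LAST)

# Planes 0 through 16 inclusive.
plane_count = (max_code_point >> 16) + 1

print(hex(max_code_point), plane_count)
```

With the limit in place, the code space consists of 17 planes (0 through 16), all of which are reachable from UTF-16.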

G. Editorial correction to 10646-1: Khmer

A major error in the Khmer charts was pointed out by the Cambodian delegation. 8 of the left-side or two-part vowels were printed wrong in 10646-1:2000 and in the Unicode Standard, Version 3.0. An editorial correction was initiated to start the fix on the 10646 side, and appropriate action was also started to fix the Unicode book.

H. New collection identifiers for 10646-1 (WG2 N2211)

WG2 agreed to add new collection identifiers for the four MES repertoires for Europe: MES-1, MES-2, MES-3A, MES-3B. But WG2 did not agree to a further request to break up collection 63 ALPHABETIC PRESENTATION FORMS into two further collection identifiers (for the left-to-right ones and the right-to-left ones). Nor, after some discussion, did it come to any consensus for starting down the road to creating collections for particular languages.

I. Roadmap Updates

The roadmap documents that Michael Everson maintains underwent review. New versions are being created, and there are now roadmaps for each plane of 10646-2 -- not just Plane 1. The revised documents are being sent to SC2 as part of the WG2 guidelines for its work, so we are moving, very gradually, towards general SC2 acceptance of the concept of preallocation for future script additions. As long as we don’t press too hard on this, I think this fits nicely with the UTC desire to find a way to preallocate character properties across unassigned code points. There is a slow convergence of views here that will get us where we want to go, I believe.

J. Things that did *not* happen

1. There was a major discussion of the Democratic People’s Republic of Korea (DPRK) proposal to reencode Korean. The 7(!) member delegation from the DPRK gave a formal presentation arguing their position. All the other national bodies explained why moving and renaming the characters wasn’t going to happen. And the DPRK was invited to provide a more limited proposal just asking for new characters to cover any ancient jamos and symbols that they needed. The DPRK was not happy about this, but stuck it out and kept arguing their position. Toward the end it was revealed that the NP proposal had already failed at the JTC1 level, so much of this was moot, except for its communication value in explaining to the DPRK why they couldn’t do what they wanted.

2. There were no Armenian representatives, but the Armenian proposal to reencode Armenian was discussed—and, of course, nothing happened there, either, except for an action item to communicate to the Armenians once again the policy about not moving or renaming characters.

3. The Arabic presentation forms for Uighur issue was finally closed out. The Chinese Uighur expert was present, and he was eventually convinced by demonstrations of on-the-fly shaping in Arabic Windows, which Mike Ksar had conveniently running on his laptop to show. WG2 went on record as not wanting to encode any more Arabic presentation forms: “WG2 resolves not to add any more Arabic presentation forms to the standard...” Hurrah!

4. WG2 considered the Japanese proposal to encode “ng” and “NG” as characters for Tagalog (WG2 N2165) and (sensibly) rejected it.

K. SC2 Liaison to Unicode

Mike Ksar was nominated to serve as the SC2 liaison to the Unicode Consortium.

I forgot to mention one significant item: Japan's proposal to add Annexes to TR 15285 (the Character-Glyph Model).

The document that L2 has already seen, and which got roundly criticized at the last UTC/L2 meeting, is WG2 N2148.

At the WG2 meeting, Japan brought in 3 new contributions, N2198, N2199, and N2206, and stated that those three should be considered as complete replacements for N2148, which was submitted hastily as just a placeholder document for the agenda topic.

N2198 is a framework paper, trying to describe the perceived problem and suggesting two new Annexes to TR 15285.

N2199 is a sketchy draft for a new Annex-X, “Requirements for Coded elements.” As explained at the meeting, this is basically aimed at addressing the acceptance and education issues regarding 10646 in South and Southeast Asian countries in particular. The document lays out, very cursorily, a kind of typology for how to go about encoding a new script, based on existing models in the standard. It is intended to counter the tendency for groups new to the standard to simply bring in their complete list of glyphs and expect that to be appropriate for encoding of their script.

N2206 is a sketchy draft for a new Annex-Y, “Why Input Assistance”. This is intended to describe the concept of input methods, and to explain, using terminology of “user-friendliness” and “machine-friendliness”, why the requirements for user-friendliness in input don’t necessarily match the encoding of the characters themselves.

What Japan seems to be aiming for here is some kind of vehicle in an ISO document that they can use to support their government-funded initiative to assist countries of Southeast Asia and Oceania in computer development. Sato-san, in particular, is facing a lot of pushback in explaining 10646 to the various groups he encounters, and is trying to fend off rejection of 10646 and the tendency for local groups to come up with “better” local solutions that tend to be the simplistic one character equals one glyph kind of encoding.

WG2 took no action, other than suggesting that the NB’s review the documents.

Given this tactical explanation from Japan, I think L2 should take another look at what Japan is proposing to do, and see if there is any way we can assist in addressing the underlying need, if not necessarily the exact mechanism of adding these particular annexes to TR 15285.

--Ken