L2/98-352

November 9, 1998

CEN/TC304 PT Guide on Character Sets
Comments by Ken Whistler

Mats Linder,

I have looked over the draft Guide at http://www.stri.is/TC304/GUIDE/ and have some feedback for you that may be of relevance to the discussion at your open meeting.

In section 4.3 re the universal character set standard, you do mention that 10646-1 has been subject to a number of amendments, but here and elsewhere in the guide do not, I think, give sufficient emphasis to the importance and impact of these on procurement issues. Formally, an amendment comes into effect when it is published, and quite a number of amendments to 10646-1 have been published. But with a standard as complex as 10646-1, it is quite impossible for either vendors or procurers to obtain the original published standard and all the published amendments and to make coherent sense of that stack of documents.

One critical example for European procurement: anyone missing Amendment 18 (which just entered DAM voting, so is not even submitted for final publication) would be missing the EURO SIGN. And I hardly think that any European procuring agency thinking about the UCS would find that desirable!

You do mention the Unicode Standard (and it should be called the "Unicode Standard", and not "UNICODE"). I think it is worth pointing out, in addition to mentioning that "every attempt is made to ensure that they are kept in line", that the Unicode Standard, Version 2.0 (the currently published book), is exactly, code-for-code, identical to ISO/IEC 10646-1:1993 plus the first 7 published Amendments to 10646-1. That is an important stake in the ground, both for vendors and procuring agencies, since it marks a defined content, at a defined time, with a defined set of Amendments to the IS.

The Unicode Standard, Version 2.1 (defined by the published book and the Unicode Technical Report #8, available online), contains two more characters from Amendment 18 to 10646-1: U+20AC EURO SIGN and U+FFFC OBJECT REPLACEMENT CHARACTER, both of which are key, required characters for vendors and for the software that procuring agencies would want to acquire.

The Unicode Standard, Version 3.0, which is currently undergoing editing in preparation for publication, will be exactly, code-for-code, identical to the republished ISO/IEC 10646-1:1999 (~2000?), which will roll in the first 31 (!) published Amendments to 10646-1. When this becomes available, it will be the next major stake in the ground for vendors and procuring agencies.

Not to mention at least some of these details in a guide for Europe regarding procuring issues for character sets would be an unfortunate omission, in my opinion.

******

The following sentence in section 4.3 is misleading:

"It is planned that the BMP will contain, with the exception of the Chinese and Japanese ideographs, the characters, including combining characters, needed to write all the known living languages of the world."

The fact that some more Chinese characters are planned for encoding on Plane 2 of 10646 (in ISO/IEC 10646-2, when that becomes available) should not detract from the fact that 27,486 Han characters (for both Chinese and Japanese) are *already* encoded on the BMP. This is far in excess of any of the individual national language character sets widely implemented in various Asian countries, and is more than sufficient for writing Chinese and Japanese (which typically only require a few thousand characters for >99.999% of all characters used). The mechanisms required for dealing with gaiji (including corporate characters, individual personal names, etc.) are not in principle any different from what is already required in the various national Asian character sets.

There is no reason (except residual rhetoric) to stick a line in a procurement guide that is likely to be read by a procurement officer along the lines of: "I've heard that Unicode doesn't work for Japanese, and this European procurement guide confirms that."

******

In section 6.3 Code structure interoperability

The statement "It is believed that the supply industry is favouring the use of signatures." should be qualified. The vendors have more or less agreed on the use of the UCS signatures for plain text files, which are a form of interchange. However, the more general practice is the use of "higher-level protocols" specified by the standards in which Unicode/10646 data is made use of. Examples are the specification of UTF-8 as a MIME charset; specification of Unicode (UTF-16) as the character encoding for the String object in Java; specification of Unicode as the reference character set for HTML and XML, etc. In these contexts, no use of signatures is implied or expected, nor are the escape code mechanisms of identification from 10646 used. The latter are only intended for interoperability with implementations making direct use of the facilities of ISO 2022.

In any case, it is unlikely that a procuring agency should know about or be concerned with usages such as signatures. They should, however, know what other protocols and standards the character set is supposed to interoperate with and choose accordingly. An obvious example would be an expectation of an implementation to interoperate with Java.

******

In general, I applaud the focus of the Guide on *repertoire* specifications, rather than on encoding or character set specifications. Repertoire is what an end user can reasonably be concerned about, except at the point where data must be interchanged with some particular protocol.
And at that point, if a specification is properly qualified to require support for interchange formats X, Y, Z, then vendors can be expected to structure their applications appropriately to do just that.

******

In particular, in section 7.6, I think it may be overkill for a procurement specification to be specifying a requirement that a product "shall support both the normal ordering and reverse ordering of octets" and "shall support the use of all signatures as specified in ISO/IEC 10646-1:1993". These decisions depend rather dramatically on what kind of application you are talking about. An application that sits on top of Windows, for instance, and which doesn't touch the Internet, may never see a "normal" ordering of octets for Unicode -- everything will be in LSB order. And that might be perfectly o.k. for the operation of that application.

It is far more important, and relevant, for example, to specify whether a product must support UTF-8 in interchange -- which is not something I see addressed here.

******

Since you reference (correctly) the IBM CDRA document, it would make sense for you to also reference the Unicode Standard itself in this document:

The Unicode Standard, Version 2.0
Addison-Wesley Developers Press, 1996
ISBN 0-201-48345-9

The Unicode Standard, Version 2.1
The Unicode Consortium, 1998
http://www.unicode.org/unicode/reports/tr8.html

Sincerely,

--Ken Whistler,
Technical Director, Unicode, Inc.