L2/98-352
November 9, 1998

CEN/TC304 PT Guide on Character Sets
Comments by Ken Whistler 


Mats Linder,

I have looked over the draft Guide at http://www.stri.is/TC304/GUIDE/
and have some feedback for you that may be of relevance to the
discussion at your open meeting.

In section 4.3 re the universal character set standard, you do mention
that 10646-1 has been subject to a number of amendments, but here and
elsewhere in the guide do not, I think, give sufficient emphasis to the
importance and impact of these on procurement issues. Formally, an
amendment comes into effect when it is published, and quite a number
of amendments to 10646-1 have been published. But with a standard
as complex as 10646-1, it is quite impossible for either vendors or
procurers to obtain the original published standard and all the published
amendments and to make coherent sense of that stack of documents.

One critical example for European procurement: anyone missing Amendment 18
(which has just entered DAM voting, and so has not even been submitted for
final publication)
would be missing the EURO SIGN. And I hardly think that any European
procuring agency thinking about the UCS would find that desirable!

You do mention the Unicode Standard (and it should be called the "Unicode
Standard", and not "UNICODE"). I think it is worth pointing out, in
addition to mentioning that "every attempt is made to ensure that they
are kept in line", that the Unicode Standard, Version 2.0 (the
currently published book), is exactly, code-for-code, identical to
ISO/IEC 10646-1:1993 plus the first 7 published Amendments to 10646-1.
That is an important stake in the ground, both for vendors and
procuring agencies, since it marks a defined content, at a defined time,
with a defined set of Amendments to the IS. The Unicode Standard,
Version 2.1 (defined by the published book and the Unicode Technical
Report #8, available online), contains two more characters from
Amendment 18 to 10646-1: U+20AC EURO SIGN and U+FFFC OBJECT REPLACEMENT CHARACTER,
both of which are key, required characters for vendors and for the
software that procuring agencies would want to acquire. The Unicode
Standard, Version 3.0, which is currently undergoing editing in preparation
for publication, will be exactly, code-for-code, identical to the
republished ISO/IEC 10646-1:1999 (~2000?), which will roll in the
first 31 (!) published Amendments to 10646-1. When this becomes available,
it will be the next major stake in the ground for vendors and procuring
agencies.

Not to mention at least some of these details in a guide for Europe
regarding procuring issues for character sets would be an unfortunate
omission, in my opinion.

******

The following sentence in section 4.3 is misleading:

"It is planned that the BMP will contain, with the exception of the
Chinese and Japanese ideographs, the characters, including combining
characters, needed to write all the known living langauges of the world."

The fact that some more Chinese characters are planned for encoding
on Plane 2 of 10646 (in ISO/IEC 10646-2, when that becomes available)
should not detract from the fact that 27,486 Han characters (for both
Chinese and Japanese) are *already* encoded on the BMP. This is
far in excess of any of the individual national language character
sets widely implemented in various Asian countries, and is more than
sufficient for writing Chinese and Japanese (which typically only
require a few thousand characters for >99.999% of all characters used).
The mechanisms required for dealing with gaiji (including corporate
characters, individual personal names, etc.) are not in principle
any different from what is already required in the various national
Asian character sets. There is no reason (except residual rhetoric)
to stick a line in a procurement guide that is likely to be read
by a procurement officer along the lines of: "I've heard that
Unicode doesn't work for Japanese, and this European procurement
guide confirms that."

******

In section 6.3 Code structure interoperability

The statement "It is believed that the supply industry is favouring
the use of signatures." should be qualified. The vendors have more
or less agreed on the use of the UCS signatures for plain text files,
which are a form of interchange. However, the more general practice
is the use of "higher-level protocols" specified by the standards
that make use of Unicode/10646 data. Examples are the
specification of UTF-8 as a MIME charset; specification of Unicode
(UTF-16) as the character encoding for the String object in Java;
specification of Unicode as the reference character set for HTML
and XML, etc. In these contexts, no use of signatures is implied
or expected, nor are the escape code mechanisms of identification
from 10646 used. The latter are only intended for interoperability
with implementations making direct use of the facilities of ISO 2022.
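
To make the plain text case concrete, here is a minimal sketch in Python
of how a reader of a plain text file might check for a signature. The
function name detect_signature and the labels are hypothetical, chosen
only for this illustration; the byte sequences are the standard constants
from Python's codecs module. Data arriving through a higher-level
protocol would normally carry no signature, and the function then simply
reports None.

```python
import codecs

# Signatures checked in this order. The UCS-4 entries come first because
# the little-endian UCS-4 signature (FF FE 00 00) begins with the same two
# octets as the little-endian UCS-2/UTF-16 signature (FF FE).
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "UCS-4 big-endian"),
    (codecs.BOM_UTF32_LE, "UCS-4 little-endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UCS-2/UTF-16 big-endian"),
    (codecs.BOM_UTF16_LE, "UCS-2/UTF-16 little-endian"),
]


def detect_signature(data: bytes):
    """Return a label for the leading signature, or None if there is none."""
    for signature, label in SIGNATURES:
        if data.startswith(signature):
            return label
    return None
```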

In any case, it is unlikely that a procuring agency needs to know about
or be concerned with usages such as signatures. They should, however,
know what other protocols and standards the character set is
supposed to interoperate with and choose accordingly. An obvious
example would be an expectation of an implementation to interoperate
with Java.

******

In general, I applaud the focus of the Guide on *repertoire*
specifications, rather than on encoding or character set
specifications. Repertoire is what an end user can reasonably
be concerned about, except at the point where data must be
interchanged with some particular protocol. And at that point,
if a specification is properly qualified to require support for
interchange formats X, Y, Z, then vendors can be expected to
structure their applications appropriately to do just that.
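
A repertoire requirement of this kind is easy to state and to test
independently of any encoding. The following is only a sketch, in
Python: the repertoire shown (the graphic characters of ISO/IEC 8859-1
plus the EURO SIGN) and the names LATIN1_PLUS_EURO and
outside_repertoire are assumptions made up for the example, not
anything defined in the Guide.

```python
# Hypothetical repertoire for illustration: the graphic characters of
# ISO/IEC 8859-1 (0x20-0x7E and 0xA0-0xFF) plus U+20AC EURO SIGN.
LATIN1_PLUS_EURO = (
    {chr(cp) for cp in range(0x20, 0x7F)}
    | {chr(cp) for cp in range(0xA0, 0x100)}
    | {"\u20ac"}
)


def outside_repertoire(text: str, repertoire: set) -> list:
    """Return the distinct characters of text not in repertoire, sorted."""
    return sorted({ch for ch in text if ch not in repertoire})
```

An empty result means the text stays within the specified repertoire,
whatever encoding is later used to interchange it.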

******

In particular, in section 7.6, I think it may be overkill for
a procurement specification to be specifying a requirement
that a product "shall support both the normal ordering and
reverse ordering of octets" and "shall support the use of
all signatures as specified in ISO/IEC 10646-1:1993". These
decisions depend rather dramatically on what kind of application
you are talking about. An application that sits on top of
Windows, for instance, and which doesn't touch the Internet,
may never see a "normal" ordering of octets for Unicode --
everything will be in LSB order. And that might be perfectly o.k.
for the operation of that application.
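
The difference between the two orderings can be seen with a single
character. This sketch, in Python, decodes 16-bit data in either octet
order, using a signature when one is present and otherwise falling back
on an order the caller supplies. The function name decode_utf16 and its
default parameter are hypothetical; an application that only ever sees
LSB-first data would simply pass "utf-16-le" as the default.

```python
import codecs


def decode_utf16(data: bytes, default: str = "utf-16-be") -> str:
    """Decode 16-bit data, honouring a leading signature if present."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # No signature: the octet order must be known from context.
    return data.decode(default)


# U+20AC EURO SIGN is the octets 20 AC in normal (MSB-first) order
# and AC 20 in reverse (LSB-first) order.
euro_msb_first = "\u20ac".encode("utf-16-be")
euro_lsb_first = "\u20ac".encode("utf-16-le")
```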

It is far more important, and relevant, for example, to specify
whether a product must support UTF-8 in interchange -- which is not something
I see addressed here.
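
For reference, a small sketch in Python of what UTF-8 interchange looks
like at the octet level for the two characters discussed earlier. The
helper name utf8_octets is made up for this example; it relies only on
the built-in str.encode.

```python
def utf8_octets(ch: str) -> str:
    """Return the UTF-8 encoding of ch as space-separated hex octets."""
    return " ".join(f"{octet:02X}" for octet in ch.encode("utf-8"))


euro = utf8_octets("\u20ac")         # U+20AC EURO SIGN
obj_repl = utf8_octets("\ufffc")     # U+FFFC OBJECT REPLACEMENT CHARACTER
```

UTF-8 is a sequence of single octets, so the question of normal versus
reverse octet ordering does not arise for it.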

******

Since you reference (correctly) the IBM CDRA document, it would
make sense for you to also reference the Unicode Standard itself
in this document:

The Unicode Standard, Version 2.0
Addison-Wesley Developers Press, 1996
ISBN 0-201-48345-9

The Unicode Standard, Version 2.1
The Unicode Consortium, 1998
http://www.unicode.org/unicode/reports/tr8.html

Sincerely,

--Ken Whistler, Technical Director, Unicode, Inc.