From: Kenneth Whistler (firstname.lastname@example.org)
Date: Fri May 16 2003 - 15:33:45 EDT
Philippe Verdy stated:
> Unicode only defines codepoints, not their serialization into
> code units and not technical aspect such as byte order (which
> is important for UTF-16 and UTF-32, also used to encode subsets
> or sursets of Unicode such as the old UCS2 (which is just a
> restriction of Unicode to the BMP but does not define a specific
Doug Ewell already responded to some of the issues in this post,
but a few more issues need some rectification.
In the above paragraph, I think there is a confusion which results
from unclear usage of the phrase "Unicode defines...".
If understood as "the character encoding of the Unicode Standard
only defines code points...", then that is correct. The character
encoding per se is just the assignment of code points to abstract
characters.
If, however, understood as "The Unicode Standard only defines
code points, not their serialization into code units..." then
that is clearly incorrect on several grounds.
First, the Unicode Standard *does* also define encoding forms
and their code units, and also defines encoding schemes and
the byte serializations they use.
Second, code points are not *serialized* into code units.
Serialization is an issue for encoding schemes, and is the
serialization of the code units into byte sequences. Again,
see Chapter 3 of The Unicode Standard, Version 4.0 for all
the details.
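To illustrate the distinction (my sketch, not part of the original
post): an encoding *form* maps a code point to code units, and an
encoding *scheme* serializes those code units into bytes, which is
where byte order enters the picture.

```python
# One code point, U+10400 DESERET CAPITAL LETTER LONG I:
cp_string = "\U00010400"

# Encoding form: UTF-16 represents this code point as two 16-bit
# code units (a surrogate pair), D801 and DC00.
be_bytes = cp_string.encode("utf-16-be")  # scheme UTF-16BE: big-endian bytes
le_bytes = cp_string.encode("utf-16-le")  # scheme UTF-16LE: little-endian bytes

# Recover the code units from the big-endian serialization:
code_units = [int.from_bytes(be_bytes[i:i + 2], "big")
              for i in range(0, len(be_bytes), 2)]
print([hex(u) for u in code_units])  # ['0xd801', '0xdc00'] - the code units
print(be_bytes.hex())                # 'd801dc00' - one byte serialization
print(le_bytes.hex())                # '01d800dc' - the other byte order
```

The code units are identical in both cases; only the encoding
scheme (the serialization of those units into bytes) differs.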
> One could argue that all *precisely defined* legacy character
> encodings (this includes the new GB2312 encoding)
As Doug pointed out, Philippe probably means GB 18030 here.
GB 2312 is an *old* character encoding standard. It was published
in 1980.
> that work on subsets of Unicode are Unicode conformant,
This is a misapplication of the term "Unicode-conformant".
Legacy character encoding standards outside the context
of the Unicode Standard (and, indeed, often published before
there even was a Unicode Standard), cannot be conformant
to the Unicode Standard.
What I think Philippe is trying to indicate here is that
other character encodings which have repertoires that are
strict subsets of the Unicode Standard can *interoperate*
with implementations of the Unicode Standard.
> as they are encoding forms for their equivalent Unicode
> strings. However they must be considered as distinct
> encodings and character sets, because they cannot represent
> exactly all Unicode strings (including its non normalized forms).
There should be no question that other character encodings
are distinct character encodings. ;-)
The point seems to be that other legacy character encodings
have only a subset of the character repertoire of the
Unicode Standard, and thus cannot represent all Unicode strings.
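A quick sketch of that subset relationship (my example, using
Latin-1 as the legacy encoding): round-tripping works only for
characters inside the legacy repertoire.

```python
# All characters of this string fall within Latin-1's repertoire,
# so it round-trips losslessly:
text_ok = "café"
assert text_ok.encode("latin-1").decode("latin-1") == text_ok

# U+20AC EURO SIGN is outside the Latin-1 repertoire, so this
# Unicode string simply cannot be represented in that encoding:
text_bad = "café \u20ac"
try:
    text_bad.encode("latin-1")
except UnicodeEncodeError as e:
    print("not representable:", e.object[e.start])  # the offending character
```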
> However ISO2022 is conforming with Unicode,
This is *not* the case.
> and can be seen as an alternative for general purpose Unicode
> encoding forms,
This is also *not* the case.
> because of its ability to switch to many
> encoding forms including UTF* encoding forms.
I think what Philippe is trying to claim here is that by
use of ISO 2022 (and multiple, individual character sets
registered for use with ISO 2022, including ISO/IEC 10646,
of course), one can represent a large number of characters.
That is certainly true. And since ISO/IEC 10646, including
UTF-8 or UTF-16, can be used in the ISO 2022 framework, it
is trivially true that one can represent all Unicode
characters in an ISO 2022 framework. One can simply announce
UTF-8, e.g. with:
ESC %/I (0x1B 0x25 0x2F 0x49)
and then merrily continue with a UTF-8 data stream for
as long as one likes.
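The announcement described above can be sketched in a few lines
(my illustration; the escape sequence is the one given in the post,
but this is framing only, not a full ISO 2022 implementation):

```python
# ESC % / I (0x1B 0x25 0x2F 0x49) announces a switch to UTF-8 as an
# "other coding system"; everything after it is a plain UTF-8 stream.
ANNOUNCE_UTF8 = b"\x1b%/I"  # 0x1B 0x25 0x2F 0x49

payload = "Unicode \u2665".encode("utf-8")
stream = ANNOUNCE_UTF8 + payload

print(stream.hex(" "))               # announcement bytes, then UTF-8 bytes
print(stream[4:].decode("utf-8"))    # the UTF-8 data after the announcement
```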
> The difference
> is that its full implementation is extremely complex as it is
> based on a repertoire of encodings not defined by Unicode, and
> requires a lot of specific parsers for each supported subsets
> and subencoding.
That certainly seems true to me. Nobody is going to dispute
that ISO 2022 implementations have complex character handling.
This archive was generated by hypermail 2.1.5 : Fri May 16 2003 - 16:19:57 EDT