From: Kenneth Whistler (email@example.com)
Date: Thu Aug 07 2003 - 20:21:25 EDT
An anonymous wag who picks the nits even finer than I did
wishes the following clarification to be posted regarding
an assertion I made about what Unicode code points are
------------- Begin Forwarded Message -------------
> So, yeah, basically every sequence of code points "assigned to
> abstract characters" is "legal" for interchange. What you cannot
> interchange are code points with gc=Cs (U+D800..U+DFFF) or
> code points with gc=Cn (noncharacters and reserved).
You *can* interchange reserved characters. You *should* not originate
them, but if you are passed a string with them, you should preserve
them, and pass them on. And in most circumstances you can depend on
them being preserved. For noncharacters you can interchange, but
should not depend on them being preserved.
You *can* also interchange Cs characters; just not within conformant
UTF encoding scheme/forms. But it is perfectly legal for me to have a
record with a field containing an *arbitrary Unicode code point*,
serialize that record, and send it off.
---------------End Forwarded Message ------------------
I concur with the general intent of this clarification, but
this is definitely in the gray area as regards exactly what
the conformance claims for the standard mean.
It is certainly good practice and the most robust approach
to an implementation for it to behave the way suggested here,
but note also the following letter of the law from 10646,
to which the Unicode Standard itself claims conformance:
2.2 Conformance of information interchange
A code-character-data-element (CC-data-element) within coded
information for interchange is in conformance with ISO/IEC
a) all the coded representations of graphic characters
within that CC-data-element conform to clauses 6 and
b) all the graphic characters represented within that
CC-data-element are taken from those within an identified
subset (clause 12)
7. General requirements for the UCS
b. Code positions to which a character is not allocated,
except for the positions reserved for private use characters
or for transformation formats, are reserved for future
standardization and shall not be used for any other purpose.
2.2.a and 7.b imply that it is not conformant to interchange
reserved code points, and 2.2.b implies that what you can
interchange are only the assigned characters from a subset
(in the Unicode case, of course, the subset of the whole).
So the way I would summarize this is:
I. Reserved code points
A conformant implementation should not originate them, but
because conformant implementations may be designed to work
with multiple versions of the standard and may encounter
uplevel data, good implementation practice is to follow the
Unicode recommendations about not munging uninterpreted
code points and about passing them along unharmed.
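To make "passing them along unharmed" concrete, here is a minimal sketch in Python of a text filter that removes control characters but leaves uninterpreted (gc=Cn) code points alone. The function name strip_controls is made up for this illustration, and the choice of U+0378 as an example of a reserved code point is an assumption: it is unassigned in the Unicode versions I am aware of, but a later version could assign it, which is exactly the uplevel-data situation described above.

```python
import unicodedata


def strip_controls(text: str) -> str:
    """Drop control characters (gc=Cc) and keep everything else.

    Code points this Python build does not know (gc=Cn) are deliberately
    passed through untouched rather than deleted or replaced.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


# Assumption: U+0378 is reserved (unassigned) in the Unicode data that
# ships with this Python version.
sample = "a\x00\u0378b"
cleaned = strip_controls(sample)
```

The filter tests only for the categories it actually wants to remove, so anything it cannot interpret survives by default.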
II. Noncharacter code points
These cannot be used in open interchange, although they can,
of course, be used in "internal" interchange, which is
essentially a private agreement (perhaps with oneself) regarding
what noncharacter usage those code points have. No external
recipient can interpret them, nor is an external recipient
obliged to preserve them if received.
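For anyone who wants to tell these categories apart in code, the following is a rough sketch in Python, not a definitive implementation. The names is_noncharacter and classify are hypothetical. The noncharacter test uses the fixed set U+FDD0..U+FDEF plus the last two code points of each plane; everything else relies on the General_Category reported by the unicodedata module, so the "reserved" answer depends on which Unicode version the Python build carries.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify(cp: int) -> str:
    """Sort a code point into the categories discussed in this thread."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if is_noncharacter(cp):
        return "noncharacter"
    gc = unicodedata.category(chr(cp))
    if gc == "Cs":
        return "surrogate"
    if gc == "Cn":
        # Unassigned in this Python build's Unicode version; a later
        # version of the standard may assign it.
        return "reserved"
    if gc == "Co":
        return "private use"
    return "assigned"
```

Noncharacters are checked first because they also report gc=Cn, and the point of the summary is that they are treated differently from ordinary reserved code points.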
III. Surrogate code points
I would claim, contra the above, that these *cannot* be
interchanged in conformance with the standard -- at all.
If one is attempting to interchange arbitrary Unicode code
points, including Cs code points (U-0000D800..U-0000DFFF),
this cannot be done with a well-formed encoding form, and
thus cannot be done in conformance with the standard.
If one claims to be *interchanging* such code points in
the context of a Unicode string (which does not, of course,
have to be well-formed to constitute a "Unicode string" by
the definition in the standard), then such interchange
is effectively a protocol built on top of the standard,
rather than something in conformance with the standard.
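As an illustration of the distinction, here is a small Python sketch. Python strings may hold a lone surrogate code point, but its strict UTF-8 codec refuses to serialize one. The "surrogatepass" error handler is Python's own opt-in escape hatch, and I use it here only as an example of a private convention layered on top of the encoding form; the variable names are made up for the illustration.

```python
lone = chr(0xD800)  # a Python string holding a single surrogate code point

# The strict UTF-8 codec rejects the surrogate.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Opting in to "surrogatepass" produces bytes, but they are not
# well-formed UTF-8, so a strict decoder on the other end rejects them.
private_bytes = lone.encode("utf-8", errors="surrogatepass")
try:
    private_bytes.decode("utf-8")
    decodes_strictly = True
except UnicodeDecodeError:
    decodes_strictly = False

# Only a receiver that agreed to the same convention gets the value back.
round_trip = private_bytes.decode("utf-8", errors="surrogatepass")
```

Both ends have to agree on the non-default handler for the value to survive, which is what makes this a private protocol rather than conformant interchange.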
At any rate, that is how *I* would pick the nits.