RE: Communicator Unicode

From: Glenn Adams (
Date: Wed Oct 01 1997 - 07:37:18 EDT

At 07:26 AM 9/29/97 -0700, you wrote:
>>> You can't. A given entity must all be in a single encoding of
>>> the document character set.
>>Gavin is incorrect. Since it is clear here that it is the entity's
>>storage object being referred to, the encoding of the storage object has
>>no necessary relationship to the document character set. Furthermore,
>>the encoding of the entity as processed by an HTML parser also has no
>>necessary relationship to the document character set. For all intents
>>and purposes, the document character set is only useful in HTML for
>>determining how to interpret numeric character references.
>Correct me if I'm wrong, but doesn't the document character set define
>the repertoire of characters that are legal within a document, and
>what roles they should play (here I am actually using "document
>character set" to include the syntax character set)? To me this means
>that the entity must, in some way, encode characters from the document
>character set.

Yes, you are correct that the document character set defines the permissible
repertoire; but this has no necessary relationship to how the characters of
that repertoire are encoded, either in the storage object for the purpose
of interchange, or in the parser for the purpose of processing. An entity
whose storage object was encoded with EBCDIC could be processed by a parser
whose internal processing code was ASCII and still be conformant against a
DTD whose document character set is ISO/IEC 10646. Clearly such an entity could
not contain characters outside of the ASCII subset of the 10646 repertoire.
Also, clearly, any numeric character reference found in this document which
was outside the range &#0; to &#127; could not be represented for further
processing; however, it could be validated as conformant or non-conformant
(i.e., either being an SGML character or NONSGML). This situation might
produce a warning by a conformant processor but not a reportable markup
error.

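
To make the scenario above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a conformant SGML parser: the function name process_entity is hypothetical, cp037 is picked as one EBCDIC variant, the upper bound used for the 10646 document character set is an assumption, and the SGML/NONSGML distinction is reduced to a simple range check.

```python
import re

# Assumption: code-point ceiling taken for the ISO/IEC 10646 document character set.
DOC_CHARSET_MAX = 0x10FFFF
# The parser's internal processing code is ASCII, so only 0..127 is representable.
INTERNAL_MAX = 0x7F

NCR = re.compile(r"&#([0-9]+);")

def process_entity(storage: bytes, storage_encoding: str = "cp037"):
    """Decode a storage object, then resolve numeric character references."""
    # The storage encoding (EBCDIC here) is independent of the document character set.
    text = storage.decode(storage_encoding)
    warnings = []

    def resolve(match):
        code = int(match.group(1))
        # Validity is judged against the document character set, not the encoding.
        if code > DOC_CHARSET_MAX:
            raise ValueError(f"&#{code}; is not in the document character set")
        # Valid reference, but the ASCII internal code cannot represent it.
        if code > INTERNAL_MAX:
            warnings.append(f"&#{code}; is valid but not representable in ASCII")
            return match.group(0)  # leave the reference unresolved
        return chr(code)

    return NCR.sub(resolve, text), warnings

# An entity whose storage object is EBCDIC-encoded.
source = "caf&#233; &#65;".encode("cp037")
result, warnings = process_entity(source)
```

Here &#65; resolves to an ASCII character, while &#233; is accepted as valid against the document character set but yields a warning rather than an error.
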
>There is only ever a single document character set in SGML, HTML, and
>XML. I stand by my claim that you cannot mix "charsets" or "character
>sets" in a single entity.

Since what Murray is talking about is character encoding system(s) as
employed in the representation of storage objects and not the repertoire
abstraction as employed by the document character set, you two are talking
about completely different things. SGML (and HTML) clearly permits
multiple character encoding systems in the representation of an entity's
storage object. See ISO 8879 E.3, Device-Independent Code Extension, which
explains how to use general-purpose ISO 2022 in a single entity.

