RE: Communicator Unicode

From: Martin J. Dürst (mduerst@ifi.unizh.ch)
Date: Tue Sep 30 1997 - 06:17:00 EDT


On Mon, 29 Sep 1997, Gavin Nicol wrote:

> >> You can't. A given entity must all be in a single encoding of
> >> the document character set.

Glenn Adams wrote:

> >Gavin is incorrect. Since it is clear here that it is the entity's
> >storage object being referred to, the encoding of the storage object has
> >no necessary relationship to the document character set. Furthermore,
> >the encoding of the entity as processed by an HTML parser also has no
> >necessary relationship to the document character set. For all intents
> >and purposes, the document character set is only useful in HTML for
> >determining how to interpret numeric character references.
>
> Correct me if I'm wrong, but doesn't the document character set define
> the repertoire of characters that are legal within a document, and
> what roles they should play (here I am actually using "document
> character set" to include the syntax character set)? To me this means
> that the entity must, in some way, encode characters from the document
> character set.
>
> There is only ever a single document character set in SGML, HTML, and
> XML. I stand by my claim that you cannot mix "charsets" or "character
> sets" in a single entity.
>
> I know that SGML doesn't say anything about handling of non-SGML
> characters, but I do not believe that this detracts from the overall
> argument.

There are several separate issues here. The document character set of
HTML is ISO 10646/Unicode (it used to be Latin-1). This is one single
document character set for all of HTML all over the world (as Latin-1
is a subset of ISO 10646/Unicode).
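
A minimal sketch of what this means in practice, assuming Python and its
standard html module as a stand-in for an HTML parser (the helper name
resolve_ncr is hypothetical): Latin-1 byte values coincide with the first
256 code points of ISO 10646/Unicode, and a numeric character reference
is resolved against the document character set, whatever encoding the
markup itself arrived in.

```python
import html

# Latin-1 is the first 256 code points of ISO 10646/Unicode, so every
# Latin-1 byte value decodes to the character with the same number.
latin1_is_subset = all(
    bytes([b]).decode("latin-1") == chr(b) for b in range(256)
)


def resolve_ncr(number: int) -> str:
    """Resolve a decimal numeric character reference such as &#233;.

    Hypothetical helper: the reference is plain ASCII markup, and the
    number is looked up in the document character set (Unicode).
    """
    return html.unescape(f"&#{number};")


e_acute = resolve_ncr(233)    # a code point that is also in Latin-1
nichi = resolve_ncr(26085)    # U+65E5, outside Latin-1
```

The same reference yields the same character whether the surrounding
document is transmitted as Latin-1, Shift_JIS, or UTF-8.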

For the transmission of HTML, in particular with HTTP, the encoding
of the document is indicated with a "charset" parameter. In this
case, there is one encoding per document, or one encoding per "serve",
or even one encoding per connection step via proxies if we have
a transcoding proxy. But at a single place and time, a single
HTML document is in this case always encoded in exactly one encoding,
describable with a single "charset" parameter. There are absolutely
no mechanisms, and none are being discussed, to "switch" charsets in
the middle of an HTML document. This is what I think Murray was most
interested in.
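
The "one charset per serve" model can be sketched as follows, in Python.
This is only an illustration under stated assumptions: the function
names are hypothetical, the fallback to ISO-8859-1 when no "charset"
parameter is present is an assumption of the sketch, and the standard
email.message module is used merely as a convenient header parser.

```python
from email.message import Message


def charset_of(content_type: str, default: str = "iso-8859-1") -> str:
    """Return the "charset" parameter of a Content-Type header value.

    The default used when the parameter is absent is an assumption.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_document(content_type: str, body: bytes) -> str:
    """Decode the whole body with the single encoding the header names."""
    return body.decode(charset_of(content_type))


def transcode(content_type: str, body: bytes, target: str):
    """What a transcoding proxy might do: one encoding in, one out."""
    text = decode_document(content_type, body)
    return f"text/html; charset={target}", text.encode(target)


original = "<p>\u65e5\u672c\u8a9e</p>"
served = ("text/html; charset=Shift_JIS", original.encode("shift_jis"))
proxied = transcode(*served, "utf-8")
```

At each step the document is described by exactly one "charset" value;
the bytes differ between served and proxied, but the decoded characters
are the same.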

However, and I think this is what Glenn was referring to,
HTML as such, as an SGML DTD, does not constrain the transmission
encoding discussed in the last paragraph to be a single encoding
per document. HTML as an SGML application only defines the document
character set, and is therefore in theory completely independent
of any character encoding used for transmission and storage, including
whatever weird conventions people come up with. Some of these
conventions in turn might be formalised and registered as "charset"s
and therefore used with HTTP and so on; others, however, may not,
or only with great difficulty. So in theory, an HTML document
can be stored in whatever form, including switching of character
encodings in the middle of the document, in whatever weird ways.
But this is theory; as far as I know, it is not really relevant
in practice.
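
To make the theoretical case concrete, here is a purely hypothetical
storage convention, sketched in Python: the stored form is a list of
(encoding, bytes) segments, so the encoding switches in the middle of
the document. Nothing like this is defined by HTML, SGML, or HTTP; the
segment format and the function name are invented for illustration.

```python
def decode_segments(segments):
    """Decode each stored segment with its own encoding and join them.

    The result is one sequence of characters from the single document
    character set, which is all an SGML/HTML parser would ever see.
    """
    return "".join(raw.decode(encoding) for encoding, raw in segments)


# Hypothetical stored form: the encoding changes between segments.
stored = [
    ("ascii", b"<p>"),
    ("iso-8859-1", "\u00e9".encode("iso-8859-1")),
    ("shift_jis", "\u65e5\u672c".encode("shift_jis")),
    ("ascii", b"</p>"),
]

document = decode_segments(stored)
```

Once decoded, the document is indistinguishable from one that was stored
in a single encoding; it is only the stored bytes, taken as a whole, that
do not fit the one-"charset"-label model used on transmission.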

Regards, Martin.


