Re: How to use UTF-16 in HTML pages (was: Polish codepages)

From: John Cowan (jcowan@reutershealth.com)
Date: Tue Feb 15 2000 - 13:01:27 EST


Otto Stolz wrote:

> The HTML 4.01 standard <http://www.w3.org/TR/REC-html40/charset.html> says:
> > The "charset" parameter identifies a character encoding, which is a method
> > of converting a sequence of bytes into a sequence of characters.

Certainly one must announce that the charset (= character encoding scheme)
is UTF-16. But "sequence of bytes into sequence of characters" is explicitly
meant to allow for a transformation other than one byte to one character;
it handles ISO-2022-JP, for example, and UTF-16 is not different in
principle.
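
To make the point concrete, here is a minimal sketch (the function name byte_lengths and the sample string are mine, chosen only for illustration) showing that several charsets map a different number of bytes onto the same three characters:

```python
def byte_lengths(text, encodings=("iso2022_jp", "utf-16-be", "utf-8")):
    """Return {encoding: number of bytes needed to encode `text`}."""
    lengths = {}
    for name in encodings:
        data = text.encode(name)
        # Each charset is still a bytes -> characters mapping: it round-trips.
        assert data.decode(name) == text
        lengths[name] = len(data)
    return lengths


if __name__ == "__main__":
    sample = "日本語"  # three characters
    for name, size in byte_lengths(sample).items():
        print(f"{name}: {len(sample)} characters, {size} bytes")
```

ISO-2022-JP spends extra bytes on escape sequences and UTF-16 uses two octets per code unit, but in both cases the receiver is simply converting a byte sequence into a character sequence.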

> Note: both say "bytes" or "octets", not "16-bit units".

Irrelevant, as 2 octets make a 16-bit unit.
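
A small sketch of that pairing, assuming nothing beyond the standard library (code_units is a hypothetical helper name). The only thing the receiver has to settle is the byte order:

```python
def code_units(data, byteorder):
    """Pair up octets into 16-bit units; a trailing odd octet is ignored."""
    return [
        int.from_bytes(data[i:i + 2], byteorder)
        for i in range(0, len(data) - 1, 2)
    ]


if __name__ == "__main__":
    octets = b"\x00\x3c\x00\x4d"  # "<M" as UTF-16 in big-endian order
    print([hex(u) for u in code_units(octets, "big")])
    print([hex(u) for u in code_units(octets, "little")])
```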
 
> Now, I cannot see how this could work with the HTML Meta tag: To even
> recognize the Meta tag, the receiver has to know the transfer encoding.

Rather, say that the receiver needs to know *something* about the character
encoding scheme, not necessarily everything about it. ("Transfer encoding"
is generally reserved for things like Base64 and Quoted-Printable.)

XML, which is far more rigorously defined than HTML, provides a normative
inline specification of the character encoding scheme, and a non-normative
method of figuring out what it is without an infinite regress, essentially
as follows:

> Note that the Meta tag scheme (though a kludge) still works with ASCII
> supersets, such as ISO 8859 and UTF-8, as the Meta tag is entirely in
> ASCII, hence can be recognized by the client.

The Meta tag can be recognized in 99.99% of all cases by scanning for
"<META" (case-folded) in ASCII, EBCDIC (any variant), UTF-16-BE, and
UTF-16-LE; for completeness, one could add UTF-32 in various
byte orderings. False matches are conceivable but not likely.

Having learned that much, one can then recognize letters and digits
and other characters of the invariant repertoire, and parse the Meta
tag to determine the exact character encoding scheme.
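
The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a tested sniffer: the function names are hypothetical, "cp037" stands in for "EBCDIC (any variant)", and the charset regex is deliberately simplistic. Matches are required to start on a code-unit boundary, because otherwise the UTF-16-BE pattern also matches UTF-16-LE data one byte early (and vice versa).

```python
import re

# Candidate encoding families and their code-unit widths in bytes.
# "cp037" is only one EBCDIC variant, used here as a stand-in (assumption).
CANDIDATES = {
    "ascii": 1, "cp037": 1,
    "utf-16-be": 2, "utf-16-le": 2,
    "utf-32-be": 4, "utf-32-le": 4,
}


def meta_pattern(encoding):
    """Byte regex for "<meta" in either letter case, as encoded in `encoding`."""
    parts = []
    for ch in "<meta":
        variants = sorted({ch.lower().encode(encoding), ch.upper().encode(encoding)})
        parts.append(b"(?:" + b"|".join(re.escape(v) for v in variants) + b")")
    return re.compile(b"".join(parts))


def first_aligned_match(pattern, data, width):
    """Offset of the first match that starts on a code-unit boundary, else None."""
    pos = 0
    while (m := pattern.search(data, pos)) is not None:
        if m.start() % width == 0:
            return m.start()
        pos = m.start() + 1
    return None


def guess_family(data):
    """Step 1: return the candidate whose "<meta" appears earliest, or None."""
    best = None
    for encoding, width in CANDIDATES.items():
        offset = first_aligned_match(meta_pattern(encoding), data, width)
        if offset is not None and (best is None or offset < best[0]):
            best = (offset, encoding)
    return best[1] if best else None


CHARSET_RE = re.compile(
    r"""<meta[^>]*?charset\s*=\s*["']?([A-Za-z0-9._:-]+)""", re.IGNORECASE
)


def sniff_charset(data):
    """Step 2: decode with the guessed family and read the declared charset."""
    family = guess_family(data)
    if family is None:
        return None
    text = data.decode(family, errors="replace")
    match = CHARSET_RE.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    page = ('<html><head><META http-equiv="Content-Type" '
            'content="text/html; charset=UTF-16"></head></html>')
    for enc in CANDIDATES:
        raw = page.encode(enc)
        print(enc, "->", guess_family(raw), sniff_charset(raw))
```

The guessed family only has to be good enough to read the invariant repertoire; the charset named inside the tag is what the receiver would then use to decode the whole document.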

> The only way I can imagine to work is to announce UTF-16 (such as any
> other non-ASCII-superset) out-of-band, i. e. in the HTTP entity header
> (content-type field, charset subfield).

Of course that is the best way.
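
For comparison, the out-of-band route needs no guessing at all. A minimal sketch using Python's standard email.message module to read the charset parameter (the helper names and the ISO-8859-1 fallback are my assumptions for illustration):

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def decode_body(body, content_type, default="iso-8859-1"):
    """Decode an entity body using the announced charset.

    The fallback to ISO-8859-1 when no charset is given is an assumption
    made for this sketch.
    """
    charset = charset_from_content_type(content_type) or default
    return body.decode(charset)


if __name__ == "__main__":
    body = "<p>zażółć gęślą jaźń</p>".encode("utf-16")
    print(decode_body(body, "text/html; charset=UTF-16"))
```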

-- 

Schlingt dreifach einen Kreis vom dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,           || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)


