RE: help

From: Mike Brown (
Date: Tue Aug 29 2000 - 00:49:35 EDT

> Are there any HTML pages in the Unicode character set
> i.e The entire HTML page is in Unicode ( including the
> tags , attributes ) .

Based on the way you asked your question, I think some clarifications are in
order.

An HTML or XML document exists in abstract form as a sequence of abstract
characters from a very large subset of the repertoire covered by Unicode
(actually, ISO/IEC 10646-1). The document may exist in tangible form, for
storage or transmission, as a sequence of bits, which in turn may be grouped
into bytes or other fixed bit widths.

The procedure for mapping the abstract Unicode characters to certain
bit/byte/whatever sequences is an encoding "scheme". A mapping of particular
characters to particular bit/byte/whatever sequences is a "character set".
Most character sets map single characters to single octets (8-bit bytes),
but it is not uncommon for characters to be mapped to sequences of more than
one octet (UTF-8 and Shift-JIS, for example). Unicode Technical Report #17
describes a number of intermediate layers of abstraction, but for your
purposes, you are probably only concerned with these kinds of character
sets.
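
As a rough illustration of single-octet versus multi-octet mappings, the
Python sketch below encodes two sample characters under three charset names.
The helper name encoded_forms and the choice of sample characters are my own
assumptions for the example; the codec names are the ones Python's standard
library accepts.

```python
def encoded_forms(ch, charsets):
    """Map each charset name to the bytes it assigns to ch, or None if it has none."""
    forms = {}
    for name in charsets:
        try:
            forms[name] = ch.encode(name)
        except UnicodeEncodeError:
            # This character set does not cover the character at all.
            forms[name] = None
    return forms

charsets = ["iso-8859-1", "utf-8", "shift_jis"]

# U+00E9 LATIN SMALL LETTER E WITH ACUTE
e_acute = encoded_forms("\u00e9", charsets)

# U+3042 HIRAGANA LETTER A
hiragana_a = encoded_forms("\u3042", charsets)

for label, forms in (("U+00E9", e_acute), ("U+3042", hiragana_a)):
    for name, data in forms.items():
        print(label, name, data.hex() if data is not None else "not representable")
```

The same abstract character comes out as one octet in one character set, two
or three octets in another, and is simply absent from a third.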

The IANA maintains a list of character sets approved for use on the Internet.
The character set names and aliases in this list are what may go in "charset"
parameters of MIME and HTTP "Content-Type" headers, or in the "encoding"
attribute in the prolog of an XML entity.
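
To make the "charset" parameter concrete, here is a minimal sketch that pulls
it out of a Content-Type header value using Python's standard email package.
The function name charset_of and the sample header values are hypothetical,
chosen only for illustration.

```python
from email.message import Message
from typing import Optional

def charset_of(content_type: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()

print(charset_of("text/html; charset=UTF-8"))
print(charset_of('text/html; charset="ISO-8859-1"'))
print(charset_of("text/html"))
```

Whatever name comes back is only a label; it is up to the receiver to know
which mapping of characters to bytes that label stands for.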

Right now the Unicode character repertoire is expressed in terms of "Unicode
values" which are a sequence of 1 or 2 values that are 16 bits wide, and
notated like "U+1234" in print, leading people to believe that there is a
"Unicode character set" that maps abstract characters to specific bit
sequences. The reality, as expressed in UTR #17, is that 16 bit code values
may manifest in different ways in different computer architectures. The
issues basically boil down to matters of endianness when the values are
split into 8-bit chunks, and additional bits that might be added to the
beginnings of encoded documents to signify this situation (byte order
marks).
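
The endianness issue can be seen directly with a small Python sketch. It
assumes Python's built-in "utf-16-be", "utf-16-le" and "utf-16" codecs as
stand-ins for the serializations discussed here; the variable names are mine.

```python
import codecs

ch = "\u1234"  # the character notated U+1234

# The same 16-bit code value split into octets in the two possible orders.
big_endian = ch.encode("utf-16-be")
little_endian = ch.encode("utf-16-le")

# A byte order mark placed in front tells the reader which order was used.
marked = codecs.BOM_UTF16_LE + little_endian

# The generic "utf-16" decoder consumes the mark and picks the matching order.
recovered = marked.decode("utf-16")

# A character outside the first 65,536 code points takes two 16-bit values.
pair = "\U00010000".encode("utf-16-be")

print(big_endian.hex(), little_endian.hex(), marked.hex(), pair.hex())
```

Without the mark, or some outside agreement, a receiver cannot tell the octets
12 34 from the octets 34 12 just by looking at them.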

Since Unicode covers every abstract character, *every* character set maps
some subset of Unicode's repertoire to bit/byte sequences; so in a sense,
all encoded documents are "in the Unicode character set". The Unicode
Standard, certain IETF RFCs, and certain amendments to ISO/IEC 10646-1
define a few encoding schemes / transformation formats that effectively map
the entire Unicode repertoire to bit/byte sequences. There are a few
character sets implied or specified by these schemes/formats, and these do
appear in the IANA's character set list:

ISO-10646-UTF-1 or csISO10646UTF1
UNICODE-1-1 or csUnicode11
UNICODE-1-1-UTF-7 or csUnicode11UTF7

The first 4 are deprecated and all but abandoned, and you will have a hard
time finding any UTF-16, UTF-16BE, UTF-16LE encoded HTML documents, because,
as someone else pointed out, few browsers support them. You can find UTF-8
encoded HTML documents pretty easily, though. Any document consisting purely
of ASCII bytes will do, even if it uses &#1234; or &SGMLentity; references
to non-ASCII characters. This is because UTF-8 is a superset of ASCII
(0x00-0x7F).
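
A short Python sketch of that claim: a page made only of ASCII bytes decodes
identically as ASCII and as UTF-8, and its character references still resolve
to non-ASCII characters afterwards. The sample markup is made up for the
example, and html.unescape is used here only as a convenient stand-in for what
a browser does with such references.

```python
import html

# A made-up fragment: every byte is below 0x80, yet it refers to two
# non-ASCII characters through a numeric and a named reference.
page_bytes = b"<p>Cyrillic: &#1234; and Latin: &eacute;</p>"

only_ascii = all(b < 0x80 for b in page_bytes)

# Decoding as ASCII or as UTF-8 gives exactly the same text.
as_ascii = page_bytes.decode("ascii")
as_utf8 = page_bytes.decode("utf-8")

# Resolving the references yields the actual non-ASCII characters.
resolved = html.unescape(as_utf8)

print(only_ascii, as_ascii == as_utf8, resolved)
```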

If you want to actually see some non-ASCII characters represented as UTF-8
byte sequences in pages that declare themselves to be UTF-8 encoded, have a
look through the HTML at

Perhaps after reading this you may decide you don't really want to see HTML
"in the Unicode character set" at all :)

   - Mike
Mike J. Brown, software engineer at My XML/XSL resources: in Denver, Colorado, USA
