Re: browsers and unicode surrogates

From: Tex Texin (
Date: Mon Apr 22 2002 - 02:50:53 EDT

Jungshik Shin,

Hi! Just a couple of minor comments.

Opera 6 lists UTF-16 as an encoding. Netscape 6.2 lists UTF-16LE. IE 6
does not list any UTF other than UTF-8. I haven't noticed any encodings
becoming available or unavailable when I access pages with different
encodings, but maybe there is a configuration setting that determines
this behavior.

According to Tom Gewecke, none of the browsers on the Mac display these.
I have been adding info to the table on the introduction page as I get
browser reports.

These pages are just for testing purposes. Some of the browser folks
found the UTF-8 supplementary plane pages useful and I thought the
UTF-16 page would be interesting and useful as well. The UTF-32 page
was a lark.

The pages have BOMs and the encoding is declared in the META statement.
Since the UTF-32 encoding is very new in the IANA registry I didn't
expect it to be meaningful yet to most products.
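
As a rough illustration of what a BOM gives a user agent to work with,
here is a minimal Python sketch that guesses the encoding from the first
bytes of a page. The name sniff_bom is hypothetical, and this is not a
claim about how any of the browsers above actually do it; a real user
agent would also consult the HTTP headers and the META declaration.

```python
import codecs

# UTF-32 BOMs must be tested before UTF-16 ones: the UTF-32LE BOM
# (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

For example, sniff_bom(codecs.BOM_UTF16_BE + "hi".encode("utf-16-be"))
returns ("utf-16-be", 2), and a page with no BOM returns (None, 0).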

As for HTML, the HTML 4.01 spec states in section 5.2.1:

"When HTML text is transmitted in UTF-16 (charset=UTF-16), text data
should be transmitted in network byte order ("big-endian", high-order
byte first) in accordance with [ISO10646], Section 6.3 and [UNICODE],
clause C3, page 3-1.

Furthermore, to maximize chances of proper interpretation, it is
recommended that documents transmitted as UTF-16 always begin with a
ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF, also called
Byte Order Mark (BOM))..."

It goes on to say the user agent should be provided the encoding via
HTTP. The META statement is kind of a fallback in case HTTP does not
provide the encoding.

Also note that even though the document is encoded as UTF-16, the
network may transmit it in another encoding.
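
To make the quoted recommendation concrete (network byte order plus a
leading BOM), here is a small Python sketch. The sample text is my own
choice for illustration: an "A" followed by U+10400, a supplementary
character that UTF-16 has to represent as a surrogate pair.

```python
import codecs

# 'A' plus U+10400, a character outside the BMP (sample text is an assumption).
text = "A\U00010400"

# The "utf-16-be" codec writes big-endian code units but no BOM,
# so the BOM (FE FF) is prepended explicitly.
payload = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# FE FF = BOM, 00 41 = 'A', D8 01 DC 00 = surrogate pair for U+10400
print(payload.hex())  # feff0041d801dc00

# The generic "utf-16" decoder reads the BOM to pick the byte order
# and drops it from the decoded text.
assert payload.decode("utf-16") == text
```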

In looking at the HTML 4.01 spec to quote the above, I noted an
interesting sentence:
"The META declaration must only be used when the character encoding is
organized such that ASCII-valued bytes stand for ASCII characters (at
least until the META element is parsed)."

I am surprised by the "must only be used". It seems I am not conforming
by including a META statement in the UTF-16 HTML page. I should either
remove the statement or encode the HTML up to and including that
statement as ASCII. I'll check on this.
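
A quick Python sketch of why the spec draws that line: in UTF-16 the
ASCII-valued bytes no longer stand alone, so a parser scanning the raw
bytes for an ASCII "<meta" will not find it. The META string here is
just an example, not a copy of what is in my test pages.

```python
# Example META declaration (an assumption for illustration only).
meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-16">'

utf8_bytes = meta.encode("utf-8")
utf16_bytes = meta.encode("utf-16-be")

# In UTF-8, ASCII characters are single ASCII-valued bytes,
# so a byte-level search for "<meta" succeeds.
found_in_utf8 = b"<meta" in utf8_bytes

# In UTF-16, each ASCII character is paired with a 00 byte,
# so the same byte-level search fails.
found_in_utf16 = b"<meta" in utf16_bytes

print(found_in_utf8, found_in_utf16)  # True False
print(utf16_bytes[:10])               # b'\x00<\x00m\x00e\x00t\x00a'
```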

tex wrote:
> On Fri, 19 Apr 2002, Tom Gewecke wrote:
> > >I have added a couple more variations of the Unicode supplementary
> > >characters example page, for utf-16 and utf-32.
> >
> > I had the impression that it was not really practical to use web pages with
> > these encodings over the internet, because they do not preserve ascii and
> > are not compatible with html. Could someone enlighten me on this?
> UTF-16 and UTF-32 have drawbacks you mentioned and may not be
> practical. (Personally, I would never put up html files in those encodings
> other than for testing purpose.) Nonetheless, neither of them is forbidden
> by any standard. Actually, W3 html standard explicitly mentions them
> as possible encodings for html files.
> With BOM at the beginning, Netscape 4.x, Netscape 6.x/Mozilla and MS
> IE 5.x/6.x can handle them without much problem except that support
> for characters above BMP varies from browser to browser as Tex tried to
> demonstrate in his test pages. IIRC, none of those browsers has UTF-16
> and UTF-32 'visible' in 'Encoding' menu. UTF-16 and UTF-32 entries in
> 'Encoding' menu get 'exposed' only when users try to view UTF-16 and
> UTF-32 encoded pages.
> Jungshik Shin

Tex Texin                    Director, International Business    the Progress Company
Tel: +1-781-280-4271
"The world writes in my database!" Progress Exchange 2002
Globalization Empowerment for Progress users
A compelling demonstration for Unicode:
