Re: browsers and unicode surrogates

From: Steffen Kamp
Date: Fri Apr 19 2002 - 17:25:26 EDT

>I have added a couple more variations of the Unicode supplementary
>characters example page, for utf-16 and utf-32.

I am not sure if your UTF-16 and UTF-32 test pages really conform to the
HTML standard. The server states a content type of "text/html" without
charset information. From the content type a browser should therefore
expect pure ASCII - at least until the META tag defining the document's
character encoding.

From the HTML 4.01 specification <
charset.html>, section 5.2.2:

"The META declaration must only be used when the character encoding is
organized such that ASCII-valued bytes stand for ASCII characters (at
least until the META element is parsed)."
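
As a rough illustration of why UTF-16 falls outside this rule, here is a
minimal Python sketch. The sample markup and the helper name
ascii_scan_finds_meta are made up for this example, and the scan is far
simpler than what a real browser does. It only shows that a byte-wise
search for an ASCII "<meta" fails once every character is two bytes wide.

```python
# Hypothetical sample document; only the META declaration matters here.
SAMPLE = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=utf-16"></head><body></body></html>'
)


def ascii_scan_finds_meta(data: bytes) -> bool:
    """Return True if a byte-wise, ASCII-only scan can see a META tag."""
    return b"<meta" in data.lower()


utf8_bytes = SAMPLE.encode("utf-8")    # ASCII-valued bytes stand for ASCII characters
utf16_bytes = SAMPLE.encode("utf-16")  # BOM first, then two bytes per character

print(ascii_scan_finds_meta(utf8_bytes))   # the META tag is visible as ASCII
print(ascii_scan_finds_meta(utf16_bytes))  # 0x00 bytes break up "<meta"
print(utf16_bytes.count(b"\x00"))          # one zero byte per ASCII character
```

In the UTF-16 bytes each ASCII character is paired with a 0x00 byte, so
the scan never sees the five contiguous bytes of "<meta".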

Your documents, however, just start with a BOM and I couldn't find
anything stating that a BOM would be a valid way of specifying the
character encoding.
Although some browsers seem to guess the character encoding from an
available BOM, I wouldn't expect them to do so when there usually are
other ways of determining this information.
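
For reference, the kind of BOM guessing I mean could look roughly like
the following Python sketch. The function name sniff_bom and the returned
encoding labels are my own choices for illustration, not taken from any
browser or specification; the BOM constants come from Python's standard
codecs module.

```python
import codecs
from typing import Optional

# Longer BOMs are checked first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Guess an encoding from a leading BOM, or return None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


doc = codecs.BOM_UTF16_LE + "<html>".encode("utf-16-le")
print(sniff_bom(doc))        # guessed from the BOM alone
print(sniff_bom(b"<html>"))  # no BOM, so no guess
```

A guess like this is only a heuristic: it says nothing when the BOM is
missing, and it cannot tell a UTF-32 LE BOM from a UTF-16 LE document
whose first character is U+0000.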

To get a second opinion I asked's online validation service to
check your UTF-16 document with auto detection of the character encoding.
The Validator complained about the BOM as well as (not surprisingly) a
lot of ASCII zero (0x00) characters.
However, when given an ASCII-only document with a META tag specifying
UTF-16 as the encoding (just for testing), it says that it does not yet
support this encoding, so I don't fully trust the validator in this case.


Steffen Kamp
