Re: browsers and unicode surrogates

From: Steffen Kamp (steffen@ic.ac.uk)
Date: Fri Apr 19 2002 - 17:25:26 EDT

Previous message: Misha.Wolf@reuters.com: "LAST Call for Papers - 22nd Unicode Conference - Sep 2002 - San Jose, CA"
Maybe in reply to: Tex Texin: "browsers and unicode surrogates"
Next in thread: Stefan Persson: "Re: browsers and unicode surrogates"
Reply: Stefan Persson: "Re: browsers and unicode surrogates"
Reply: Tex Texin: "Re: browsers and unicode surrogates"
Reply: Martin Duerst: "Re: browsers and unicode surrogates"

>I have added a couple more variations of the Unicode supplementary
>characters example page, for utf-16 and utf-32.

I am not sure whether your UTF-16 and UTF-32 test pages really conform to
the HTML standard. The server sends a content type of "text/html" without
charset information. From the content type alone, a browser should
therefore expect pure ASCII, at least until it reaches the META tag
defining the document's character encoding.

From the HTML 4.01 specification
<http://www.w3.org/TR/html4/charset.html>, section 5.2.2:

"The META declaration must only be used when the character encoding is
organized such that ASCII-valued bytes stand for ASCII characters (at
least until the META element is parsed)."
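To illustrate why that restriction matters, here is a minimal Python sketch. It assumes a deliberately naive parser that looks for the META element by scanning raw bytes as if they were ASCII; the function name `ascii_prescan_finds_meta` and the sample markup are made up for this example and are not taken from any real browser.

```python
# Hypothetical sample markup, not copied from the test pages.
html = '<meta http-equiv="Content-Type" content="text/html; charset=utf-16">'


def ascii_prescan_finds_meta(data: bytes) -> bool:
    """Naive byte-level search, as a parser assuming ASCII-valued bytes would do."""
    return b"<meta" in data.lower()


utf8_bytes = html.encode("utf-8")
# Python's "utf-16" codec writes a BOM followed by 16-bit code units.
utf16_bytes = html.encode("utf-16")

print(ascii_prescan_finds_meta(utf8_bytes))   # True
print(ascii_prescan_finds_meta(utf16_bytes))  # False
```

In the UTF-16 case every ASCII character is stored as two bytes, so the byte sequence for "<meta" never appears contiguously and the declaration cannot be found without already knowing the encoding.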

Your documents, however, just start with a BOM, and I couldn't find
anything stating that a BOM is a valid way of specifying the
character encoding.
Although some browsers seem to guess the character encoding from a BOM
when one is present, I wouldn't expect them to do so when there usually
are other ways of determining this information.
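For reference, the kind of BOM-based guessing some browsers appear to do could look roughly like the sketch below. This is an assumption about the general technique, not a description of any particular browser; `sniff_bom` is a hypothetical helper name, and only the BOM constants from Python's standard `codecs` module are real.

```python
import codecs

# Longer BOMs are checked first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return an encoding name guessed from a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


sample = codecs.BOM_UTF16_LE + "<html>".encode("utf-16-le")
print(sniff_bom(sample))      # utf-16-le
print(sniff_bom(b"<html>"))   # None
```

A guess like this is only a heuristic: a document without a BOM yields no answer, and the result still has to be reconciled with any charset information from the server.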

To get a second opinion I asked w3.org's online validation service to
check your UTF-16 document with auto detection of the character encoding.
(<http://validator.w3.org/check?uri=http://www.i18nguy.com/unicode/plane1-utf-16.html&charset=(detect+automatically)&doctype=Inline>)
The Validator complained about the BOM as well as (not surprisingly) a
lot of ASCII zero (0x00) characters.
However, when I gave the validator an ASCII-only document with a META tag
specifying UTF-16 as the encoding (just for testing), it said that it does
not yet support this encoding, so I don't fully trust the validator in
this case.
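The two complaints are what one would expect from any tool that reads UTF-16 bytes as if they were ASCII. The following Python sketch counts what such a tool would see; the sample page and the helper `ascii_view` are hypothetical and say nothing about how the validator is actually implemented.

```python
import codecs

# Hypothetical stand-in for a test page, encoded as big-endian UTF-16 with a BOM.
page = "<html><head><title>test</title></head></html>"
raw = codecs.BOM_UTF16_BE + page.encode("utf-16-be")


def ascii_view(data: bytes) -> dict:
    """Count the bytes an ASCII-assuming reader would find objectionable."""
    return {
        "non_ascii_bytes": sum(b > 0x7F for b in data),  # here: the two BOM bytes
        "nul_bytes": data.count(0),                      # one per ASCII character
    }


report = ascii_view(raw)
print(report["non_ascii_bytes"])            # 2
print(report["nul_bytes"] == len(page))     # True
```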

Steffen

-- 
Steffen Kamp
mailto:steffen@ic.ac.uk
http://homepage.mac.com/earthlingsoft
