Re: UTF-8 and UTF-16 issues

From: Tony Graham
Date: Tue Jun 20 2000 - 12:20:41 EDT

At 19 Jun 2000 19:03 -0800, Tony Graham wrote:
> According to Appendix F, Autodetection of Character Encodings
> (Non-Normative), beginning a parsed entity with the UTF-8 BOM counts
> as:
> other: UTF-8 without an encoding declaration, or else the data
> stream is corrupt, fragmentary, or enclosed in a wrapper of some
> kind

Oops. The XML Recommendation errata change the list of
significant byte patterns to include:

With a Byte Order Mark:
 00 00 FE FF: UCS-4, big-endian machine (1234 order)
 FF FE 00 00: UCS-4, little-endian machine (4321 order)
 FE FF 00 ##: UTF-16, big-endian
 FF FE ## 00: UTF-16, little-endian

UTF-8 with the BOM (bytes EF BB BF) is (non-normatively) okay according
to the XML Recommendation. Success with XML processors may vary,
however, since this wasn't decided until May 1999 and, it seems, wasn't
added to the published errata until January of this year.
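
For anyone who wants to check what their own files start with, here is a
rough Python sketch of BOM sniffing along the lines of the byte patterns
above. The function name sniff_bom and the returned labels are my own
invention, not anything from the Recommendation, and the sketch
simplifies the UTF-16 cases by testing only the two BOM bytes rather
than the full four-byte patterns. It also does not look at any encoding
declaration.

```python
def sniff_bom(data: bytes):
    """Return (label, bom_length) for a leading BOM, or (None, 0).

    Hypothetical helper for illustration only; the labels are informal.
    """
    # Order matters: the four-byte UCS-4 patterns must be tested before
    # the two-byte UTF-16 ones, because FF FE 00 00 also starts with FF FE.
    patterns = [
        (b"\x00\x00\xfe\xff", "UCS-4 big-endian"),
        (b"\xff\xfe\x00\x00", "UCS-4 little-endian"),
        (b"\xef\xbb\xbf", "UTF-8"),
        (b"\xfe\xff", "UTF-16 big-endian"),
        (b"\xff\xfe", "UTF-16 little-endian"),
    ]
    for bom, label in patterns:
        if data.startswith(bom):
            return label, len(bom)
    # No BOM found: the entity may still be UTF-8 or carry an
    # encoding declaration, which this sketch does not inspect.
    return None, 0
```

Calling sniff_bom on the first few bytes of an entity tells you how many
bytes to skip before the text proper begins. A processor that predates
the erratum may not skip the three UTF-8 bytes at all, which is one way
the "success may vary" shows up in practice.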


