Re: UTF-8 and UTF-16 issues

From: Tony Graham
Date: Tue Jun 20 2000 - 12:20:41 EDT

At 19 Jun 2000 19:03 -0800, Tony Graham wrote:
> According to Appendix F, Autodetection of Character Encodings
> (Non-Normative), beginning a parsed entity with the UTF-8 BOM counts
> as:
> other: UTF-8 without an encoding declaration, or else the data
> stream is corrupt, fragmentary, or enclosed in a wrapper of some
> kind

Oops. The XML Recommendation errata changes the list of
significant byte patterns to include:

With a Byte Order Mark:
 00 00 FE FF: UCS-4, big-endian machine (1234 order)
 FF FE 00 00: UCS-4, little-endian machine (4321 order)
 FE FF 00 ##: UTF-16, big-endian
 FF FE ## 00: UTF-16, little-endian
 EF BB BF: UTF-8

So UTF-8 with the BOM is (non-normatively) okay according to the XML
Recommendation. Success with XML processors may vary, however, since
this wasn't decided until May 1999 and, it seems, wasn't added to the
published errata until January of this year.
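
For anyone who wants to see what a processor has to do with those
patterns, here is a rough Python sketch of BOM-only sniffing. It is
an illustration, not code from any real XML processor: the function
name sniff_bom and the returned labels are made up for this example,
and it only looks at the mark itself rather than the bytes after it
(the ## positions in the list). The UTF-8 case relies on the fact
that the BOM character U+FEFF encodes as EF BB BF in UTF-8.

```python
import codecs


def sniff_bom(data: bytes) -> str:
    """Guess an encoding from a leading Byte Order Mark only.

    Hypothetical helper: the labels returned are illustrative, and
    input with no recognised mark is reported as "other".
    """
    # Test the four-byte UCS-4 marks first: FF FE 00 00 would
    # otherwise be taken for the two-byte UTF-16 little-endian mark.
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "UCS-4, big-endian"
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "UCS-4, little-endian"
    if data.startswith(codecs.BOM_UTF8):  # EF BB BF
        return "UTF-8"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16, big-endian"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16, little-endian"
    return "other"


if __name__ == "__main__":
    # A UTF-8 entity that begins with the BOM, as discussed above.
    print(sniff_bom(codecs.BOM_UTF8 + b"<doc/>"))
```

A processor that predates the erratum would presumably have no
UTF-8 branch and would fall through to "other", which is why
results with a UTF-8 BOM may differ from one processor to the next.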


Tony Graham
Mulberry Technologies, Inc.
17 West Jefferson Street    Direct Phone: 301/315-9632
Suite 207                   Phone: 301/315-9631
Rockville, MD 20850         Fax: 301/315-8285
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
