Re: UTF-8 and UTF-16 issues

From: Tony Graham (tgraham@mulberrytech.com)
Date: Tue Jun 20 2000 - 12:20:41 EDT


At 19 Jun 2000 19:03 -0800, Tony Graham wrote:
> According to Appendix F, Autodetection of Character Encodings
> (Non-Normative), beginning a parsed entity with the UTF-8 BOM counts
> as:
>
> other: UTF-8 without an encoding declaration, or else the data
> stream is corrupt, fragmentary, or enclosed in a wrapper of some
> kind

Oops. Erratum E44 to the XML Recommendation, at
http://www.w3.org/XML/xml-19980210-errata#E44, changes the list of
significant byte patterns to include:

With a Byte Order Mark:
 00 00 FE FF: UCS-4, big-endian machine (1234 order)
 FF FE 00 00: UCS-4, little-endian machine (4321 order)
 FE FF 00 ##: UTF-16, big-endian
 FF FE ## 00: UTF-16, little-endian
 EF BB BF: UTF-8

So UTF-8 with the BOM is (non-normatively) okay according to the XML
Recommendation. Success with XML processors may vary, however, since
this wasn't decided until May 1999 and, it seems, wasn't added to the
published errata until January of this year.
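
For anyone who wants to check what their files actually start with,
here is a rough Python sketch of the "With a Byte Order Mark" patterns
listed above. It is only an illustration, not what any particular XML
processor does. The function name detect_bom and the returned labels
are made up for this example, and reading "##" as "any non-zero byte"
is my assumption rather than something the list above spells out.

```python
from typing import Optional


def detect_bom(data: bytes) -> Optional[str]:
    """Guess an encoding family from a leading Byte Order Mark.

    Hypothetical helper: covers only the five BOM patterns quoted in
    this message and returns None for everything else.
    """
    head = data[:4]

    # Check the four-byte UCS-4 marks first, because FF FE 00 00 also
    # begins with the two-byte UTF-16 little-endian mark FF FE.
    if head == b"\x00\x00\xfe\xff":
        return "UCS-4 big-endian (1234)"
    if head == b"\xff\xfe\x00\x00":
        return "UCS-4 little-endian (4321)"

    # FE FF 00 ## and FF FE ## 00. Assumption of this sketch: "##" is
    # treated as any non-zero byte.
    if len(head) == 4:
        if head[:2] == b"\xfe\xff" and head[2] == 0 and head[3] != 0:
            return "UTF-16 big-endian"
        if head[:2] == b"\xff\xfe" and head[2] != 0 and head[3] == 0:
            return "UTF-16 little-endian"

    # EF BB BF is the UTF-8 encoding of U+FEFF.
    if head[:3] == b"\xef\xbb\xbf":
        return "UTF-8"

    return None
```

Calling detect_bom(open(path, "rb").read(4)) on a file, where path is
whatever document you are testing, shows whether it carries one of
these marks. A result of None only means no listed BOM was found; the
entity may still be UTF-8 or UTF-16 without one.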

Regards,

Tony Graham
======================================================================
Tony Graham mailto:tgraham@mulberrytech.com
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9632
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================


