Re: UTF-8 and UTF-16 issues

From: Tony Graham (tgraham@mulberrytech.com)
Date: Mon Jun 19 2000 - 23:16:17 EDT


At 19 Jun 2000 14:48 -0800, Markus Scherer wrote:
> > the BOM was intended to be used in 16-bit encodings like UTF-16, not in
> > UTF-8.
>
> it is still useful to use the signature byte sequences in all
> unicode encodings. the xml spec, for example lists them as a help
> for the parser. if it is not generally recommended, then it should
> be. for utf-8 and scsu it just indicates these encodings without
> also needing to indicate endianness. the signature still serves a
> purpose.

The XML Recommendation requires the BOM at the start of UTF-16 encoded
entities so that processors can differentiate between UTF-8 and UTF-16
documents. From section 4.3.3, Character Encoding in Entities:

   Entities encoded in UTF-16 must begin with the Byte Order Mark
   described by ISO/IEC 10646 Annex E and Unicode Appendix B (the ZERO
   WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding
   signature, not part of either the markup or the character data of
   the XML document. XML processors must be able to use this character
   to differentiate between UTF-8 and UTF-16 encoded documents.

The XML Recommendation is silent on the use of the BOM with UTF-8
encoded documents, and I suspect that the BOM at the beginning of a
UTF-8 encoded XML document or external parsed entity will confuse many
XML processors.

According to Appendix F, Autodetection of Character Encodings
(Non-Normative), beginning a parsed entity with the UTF-8 BOM counts
as:

   other: UTF-8 without an encoding declaration, or else the data
   stream is corrupt, fragmentary, or enclosed in a wrapper of some
   kind
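
As a rough illustration of the signature check discussed above, here
is a minimal Python sketch. The function name sniff_signature and the
labels it returns are my own inventions for this example, not anything
from the XML Recommendation, and the sketch covers only the three
signatures mentioned in this thread (it ignores UCS-4 and the "<?xml"
byte patterns that Appendix F also lists).

```python
import codecs

# Signatures discussed in this thread, paired with hypothetical labels.
# UCS-4/UTF-32 and the "<?xml" patterns from Appendix F are left out.
SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_signature(data: bytes):
    """Return (label, signature length), or (None, 0) if none matches."""
    for bom, label in SIGNATURES:
        if data.startswith(bom):
            return label, len(bom)
    return None, 0
```

A caller would skip the returned number of bytes before handing the
rest of the entity to the parser, which is what a processor does with
the BOM in a UTF-16 encoded document.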

If an XML processor does not recognise the byte sequence as a BOM and
discard it, as it would the BOM in a UTF-16 encoded document, then it
will either treat the entity as UTF-8 without an encoding declaration
or consider the UTF-8 BOM a fatal error and stop processing. Even in
the first case, if the entity is an XML document (not an external
parsed entity), the processor will still report an error, since an
initial ZERO WIDTH NO-BREAK SPACE character does not match the
production for an XML document.

It's probably safest to leave the BOM off UTF-8 encoded XML documents.
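
For data that already carries the signature, one way to follow this
advice is to strip the three bytes before the document reaches the
parser. The Python sketch below is only an illustration; the helper
name strip_utf8_signature is hypothetical. It also shows why an
unprepared processor trips: a plain UTF-8 decode keeps U+FEFF as the
first character.

```python
import codecs


def strip_utf8_signature(data: bytes) -> bytes:
    """Drop a leading EF BB BF if present; otherwise return data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


doc = codecs.BOM_UTF8 + b"<doc/>"

# A plain UTF-8 decode keeps the signature as a leading U+FEFF character.
as_plain_utf8 = doc.decode("utf-8")

# Stripping first leaves only the markup.
stripped = strip_utf8_signature(doc).decode("utf-8")
```

Python's built-in "utf-8-sig" codec does the same stripping when
decoding, so doc.decode("utf-8-sig") is an alternative to the helper.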

Regards,

Tony Graham
======================================================================
Tony Graham mailto:tgraham@mulberrytech.com
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9632
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================


