RE: How to distinguish UTF-8 from Latin-* ?

From: Michael Kaplan (Trigeminal Inc.) (v-michka@microsoft.com)
Date: Sun Jun 18 2000 - 15:09:51 EDT


> if it is xml, then have a look at the xml spec (with the errata list!!).
> it is very clearly specified how to figure that all out there.
> <?xml version="1.0" encoding="utf-8"?>...
>
Actually, the XML spec is quite clear that neither UTF-16 nor UTF-8 requires
the encoding tag. The encoding of an XML document is determined in one of the
following ways:

1) Starts with a byte order mark for big-endian/little-endian Unicode
(UTF-16) -- go with the byte order mark

2) No encoding information -- UTF-8 can be assumed (often the content is just
ASCII, which is also valid UTF-8, so this works)

3) Any other encoding, use the encoding tag as Marcus mentions
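
The three cases above can be sketched in a few lines of Python. This is only
a rough illustration, not the spec's full autodetection procedure: the
function name guess_xml_encoding is made up for this example, the encoding
tag is only looked for in ASCII-compatible byte streams, and UTF-32 or EBCDIC
input is not handled at all.

```python
import re

# Hypothetical helper for illustration; not part of any XML library.
# Matches an encoding pseudo-attribute inside a leading <?xml ... ?> declaration.
_ENCODING_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def guess_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML byte stream using the three cases above."""
    # Case 1: a UTF-16 byte order mark decides the matter.
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    # Case 3: an explicit encoding tag in the XML declaration.
    match = _ENCODING_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # Case 2: no encoding information, so assume UTF-8.
    return "utf-8"

if __name__ == "__main__":
    print(guess_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
    print(guess_xml_encoding(b'<?xml version="1.0"?><a/>'))
```

The byte order mark is checked first because a UTF-16 file does not contain
the ASCII bytes of "<?xml" contiguously, so the encoding tag could not be
found by the pattern anyway.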

For more detail, see 4.3.3 of the spec, at
http://www.w3.org/TR/REC-xml#charencoding and note the language in regard to
the encoding tag:

Although an XML processor is required to read only entities in the UTF-8 and
UTF-16 encodings, it is recognized that other encodings are used around the
world, and it may be desired for XML processors to read entities that use
them. Parsed entities which are stored in an encoding other than UTF-8 or
UTF-16 must begin with a text declaration containing an encoding declaration

Clearly, we are being told that support for encodings other than UTF-8 and
UTF-16 is not a requirement of an XML processor. Unfortunately, most of the
ones out there do not understand the encoding tag, cannot read UTF-16 files,
and destroy UTF-8 outside of the ASCII range.

Michael


